From coreweave-pack
Sets up GPU monitoring for CoreWeave Kubernetes clusters using DCGM exporter metrics and Prometheus alerts for utilization, memory usage, temperature, and inference pod health.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
CKS clusters come with DCGM exporter pre-installed. Key metrics:
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | GPU framebuffer memory free (MiB) |
| DCGM_FI_DEV_POWER_USAGE | Power consumption (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
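To read any of these metrics programmatically, one option is the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The following is a minimal sketch, not part of the skill itself: the base URL `http://prometheus:9090`, the helper names, and the `gpu`/`Hostname` label values in the sample payload are assumptions for illustration, and the label set on a real cluster may differ.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical response body, shaped like a real instant-vector result.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "0", "Hostname": "node-a"},
       "value": [1700000000.0, "87"]},
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "1", "Hostname": "node-a"},
       "value": [1700000000.0, "12.5"]}
    ]
  }
}
""")

if __name__ == "__main__":
    print(build_query_url("http://prometheus:9090", "DCGM_FI_DEV_GPU_UTIL"))
    for labels, value in parse_instant_vector(SAMPLE_RESPONSE):
        print(labels.get("Hostname"), labels.get("gpu"), value)
```

In practice the URL would be fetched with an HTTP client and the decoded JSON body passed to `parse_instant_vector`; the sample payload here only stands in for that response.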
groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"
      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
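
To sanity-check the thresholds before loading the rules, the three conditions can be mirrored in a few lines of Python and run against sample readings. This is only an illustrative sketch: the function names and the `util`/`fb_used`/`fb_free` dictionary keys are hypothetical, and it evaluates a single snapshot, so the `for:` durations that Prometheus applies before an alert fires are not modelled.

```python
from statistics import mean
from typing import Dict, List


def memory_fraction(used_mib: float, free_mib: float) -> float:
    """Used share of framebuffer memory: FB_USED / (FB_USED + FB_FREE)."""
    total = used_mib + free_mib
    if total <= 0:
        raise ValueError("used + free must be positive")
    return used_mib / total


def breached_conditions(gpus: List[Dict[str, float]],
                        inference_replicas_available: int) -> List[str]:
    """Return the alert names whose threshold is crossed in this snapshot."""
    breached = []
    # GPUUtilizationLow: average utilization across all GPUs below 20%.
    if gpus and mean(g["util"] for g in gpus) < 20:
        breached.append("GPUUtilizationLow")
    # GPUMemoryHigh: any single GPU above 95% framebuffer usage.
    if any(memory_fraction(g["fb_used"], g["fb_free"]) > 0.95 for g in gpus):
        breached.append("GPUMemoryHigh")
    # InferencePodDown: no available replicas for the inference deployment.
    if inference_replicas_available == 0:
        breached.append("InferencePodDown")
    return breached


if __name__ == "__main__":
    snapshot = [
        {"util": 10.0, "fb_used": 78000.0, "fb_free": 2000.0},
        {"util": 15.0, "fb_used": 1000.0, "fb_free": 79000.0},
    ]
    print(breached_conditions(snapshot, inference_replicas_available=1))
```

Note that `GPUUtilizationLow` averages over every GPU in the query result, so one busy GPU can mask several idle ones; grouping the expression by node or pod would be a separate design choice.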
For incident response, see coreweave-incident-runbook.