From coreweave-pack
Monitors CoreWeave Kubernetes events, GPU utilization, and inference service health. Tracks pod lifecycles and sends alerts via kubectl and Python scripts.
Install with:

`npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack`

This skill sets up GPU monitoring for CoreWeave Kubernetes clusters using DCGM exporter metrics and Prometheus alerts for utilization, memory usage, temperature, and inference pod health. It is limited to using the following tools:

```bash
# Watch GPU pod events. Field selectors are ANDed, so a comma-separated
# list cannot express OR over reasons; filter with grep instead.
kubectl get events --watch | grep -E 'Scheduled|Pulled|Failed'
# Monitor GPU utilization via exec
kubectl exec -it deployment/inference -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
# DCGM exporter for GPU metrics (pre-installed on CKS)
# Key metrics:
# DCGM_FI_DEV_GPU_UTIL - GPU utilization %
# DCGM_FI_DEV_FB_USED - GPU memory used
# DCGM_FI_DEV_POWER_USAGE - Power draw
```

A Python health check that posts a Slack alert when an inference deployment has fewer ready replicas than desired:

```python
import json
import subprocess

import requests

def check_inference_health(deployment: str, slack_url: str) -> None:
    # Fetch the deployment's current state as JSON; fail loudly if kubectl errors.
    result = subprocess.run(
        ["kubectl", "get", "deployment", deployment, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    deploy = json.loads(result.stdout)
    ready = deploy["status"].get("readyReplicas", 0)
    desired = deploy["spec"].get("replicas", 1)
    if ready < desired:
        requests.post(slack_url, json={
            "text": f"CoreWeave: {deployment} has {ready}/{desired} replicas ready"
        }, timeout=10)
```
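The DCGM metrics exported to Prometheus can also be polled directly over Prometheus' HTTP query API, for example to flag idle GPUs. A minimal sketch: the Prometheus URL, the `gpu` label, and the 30% idle threshold are assumptions, not CoreWeave defaults.

```python
import requests

# Assumed in-cluster Prometheus endpoint; adjust for your deployment.
PROM_URL = "http://prometheus.monitoring.svc:9090"
IDLE_THRESHOLD = 30.0  # percent; an arbitrary example threshold

def parse_idle_gpus(query_response: dict, threshold: float = IDLE_THRESHOLD) -> list:
    """Return (gpu, utilization) pairs below the threshold from an instant-query result."""
    idle = []
    for sample in query_response["data"]["result"]:
        gpu = sample["metric"].get("gpu", "unknown")
        util = float(sample["value"][1])  # value is [timestamp, value-as-string]
        if util < threshold:
            idle.append((gpu, util))
    return idle

def query_idle_gpus() -> list:
    """Run an instant query for DCGM_FI_DEV_GPU_UTIL (requires cluster network access)."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": "DCGM_FI_DEV_GPU_UTIL"},
        timeout=10,
    )
    resp.raise_for_status()
    return parse_idle_gpus(resp.json())
```

Keeping the parsing separate from the HTTP call makes the threshold logic easy to unit-test against a canned Prometheus response.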
For performance optimization, see coreweave-performance-tuning.
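The Prometheus alerts mentioned above might be expressed as alerting rules over the DCGM metrics; the following is a sketch only, where the group name, thresholds, and durations are assumptions rather than CoreWeave defaults.

```yaml
groups:
  - name: gpu-alerts  # assumed group name
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85  # threshold is an assumption
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} above 85C for 5 minutes"
      - alert: GPUIdle
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5  # threshold is an assumption
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} idle for 30 minutes"
```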