From coreweave-pack
Set up GPU monitoring and observability for CoreWeave workloads. Use when implementing GPU metrics dashboards, configuring alerts, or tracking inference latency and throughput. Trigger with phrases like "coreweave monitoring", "coreweave observability", "coreweave gpu metrics", "coreweave grafana".
Install with:

```
npx claudepluginhub flight505/skill-forge --plugin coreweave-pack
```
CoreWeave runs GPU-intensive workloads on Kubernetes where hardware failures, memory exhaustion, and underutilization directly impact cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade.
| Metric | Type | Target | Alert Threshold |
|---|---|---|---|
| GPU utilization | Gauge | > 60% | < 20% for 30m |
| GPU memory usage | Gauge | < 85% | > 95% for 5m |
| Inference latency p99 | Histogram | < 200ms | > 500ms |
| Job completion rate | Counter | > 99% | < 95% per hour |
| Pod restart count | Counter | 0 | > 3 in 15m |
| Node GPU temperature | Gauge | < 80°C | > 85°C for 10m |
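The snippets below assume a thin `metrics` client exposing `record` and `increment`. A minimal sketch backed by prom-client; the metric-name mapping and label handling are illustrative, not part of the skill itself:

```typescript
import * as client from 'prom-client';

const histograms = new Map<string, client.Histogram>();
const counters = new Map<string, client.Counter>();

// Prometheus metric names cannot contain dots, so map them to underscores.
const promName = (name: string) => name.replace(/\./g, '_');

export const metrics = {
  // Record an observed value (e.g. latency in ms) against a histogram.
  record(name: string, value: number, labels: Record<string, string>) {
    const key = promName(name);
    let h = histograms.get(key);
    if (!h) {
      h = new client.Histogram({ name: key, help: name, labelNames: Object.keys(labels) });
      histograms.set(key, h);
    }
    h.observe(labels, value);
  },
  // Increment a counter, e.g. completed or failed inference calls.
  increment(name: string, labels: Record<string, string>) {
    const key = promName(name);
    let c = counters.get(key);
    if (!c) {
      c = new client.Counter({ name: key, help: name, labelNames: Object.keys(labels) });
      counters.set(key, c);
    }
    c.inc(labels);
  },
};
```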
```typescript
// Wrap any inference call to record latency, completions, and errors.
async function trackInference<T>(model: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'ok' });
    metrics.increment('coreweave.inference.completed', { model });
    return result;
  } catch (err: any) {
    metrics.increment('coreweave.inference.errors', { model, error: err?.code ?? 'unknown' });
    throw err;
  }
}
```
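Usage is a one-line wrap around any model call; the model name and `generate()` here are placeholders for your own inference client:

```typescript
// `model.generate` stands in for whatever inference client you use.
const output = await trackInference('llama-3-70b', () => model.generate(prompt));
```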
```typescript
// Summarize cluster health from three Prometheus queries: DCGM GPU
// utilization, framebuffer memory pressure, and available inference pods.
async function coreweaveHealth(): Promise<Record<string, string>> {
  const gpu = await queryPrometheus('avg(DCGM_FI_DEV_GPU_UTIL)');
  const mem = await queryPrometheus('avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))');
  const pods = await queryPrometheus('kube_deployment_status_replicas_available{namespace="inference"}');
  return {
    gpu_utilization: gpu > 20 ? 'healthy' : 'underutilized',
    gpu_memory: mem < 0.9 ? 'healthy' : 'critical',
    inference_pods: pods > 0 ? 'healthy' : 'down',
  };
}
```
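`coreweaveHealth` assumes a `queryPrometheus` helper that runs an instant query and returns the first sample as a number. A sketch against the Prometheus HTTP API, where the `PROM_URL` endpoint is an assumption about your deployment:

```typescript
const PROM_URL = process.env.PROM_URL ?? 'http://prometheus:9090'; // assumed endpoint

// Run a PromQL instant query and return the first sample value.
async function queryPrometheus(query: string): Promise<number> {
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  if (!res.ok) throw new Error(`Prometheus query failed: ${res.status}`);
  const body: any = await res.json();
  // Instant query samples arrive as [timestamp, "value"] pairs.
  const result = body.data?.result?.[0];
  return result ? parseFloat(result.value[1]) : NaN;
}
```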
```typescript
// Declarative alert thresholds, mirroring the table above.
const alerts = [
  { metric: 'DCGM_FI_DEV_GPU_UTIL', condition: 'avg < 20', window: '30m', severity: 'warning' },
  { metric: 'gpu_memory_pct', condition: '> 0.95', window: '5m', severity: 'critical' },
  { metric: 'inference_latency_p99', condition: '> 500ms', window: '10m', severity: 'warning' },
  { metric: 'pod_restart_count', condition: '> 3', window: '15m', severity: 'critical' },
];
```
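These definitions are declarative; production setups would render them into Prometheus alerting rules. For a lightweight alternative, they can be polled in-process by reusing `queryPrometheus`. A sketch, where the PromQL expressions are assumptions about DCGM exporter metric names and only the two GPU alerts are mapped:

```typescript
// Fold each alert's window into a PromQL subquery; the latency and
// restart alerts need app-specific metrics and are omitted here.
const alertQueries: Record<string, string> = {
  DCGM_FI_DEV_GPU_UTIL: 'avg_over_time(avg(DCGM_FI_DEV_GPU_UTIL)[30m:]) < 20',
  gpu_memory_pct: 'avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95',
};

async function pollAlerts(): Promise<void> {
  for (const alert of alerts) {
    const query = alertQueries[alert.metric];
    if (!query) continue;
    // A PromQL comparison drops the sample when the condition is false,
    // so NaN (no result) means the alert is not firing.
    const value = await queryPrometheus(query);
    if (!Number.isNaN(value)) {
      console.warn(`[${alert.severity}] ${alert.metric} ${alert.condition} over ${alert.window}`);
    }
  }
}

setInterval(pollAlerts, 60_000); // evaluate every minute
```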
```typescript
// Emit one structured JSON log line per GPU event for log-based alerting.
function logGpuEvent(event: string, node: string, data: Record<string, any>) {
  console.log(JSON.stringify({
    service: 'coreweave', event, node,
    gpu_model: data.gpu_model, utilization: data.util,
    memory_pct: data.memPct, temperature: data.temp,
    timestamp: new Date().toISOString(),
  }));
}
```
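For example, an emission for a node running hot (all values illustrative):

```typescript
// Illustrative values for a node approaching the 85°C alert threshold.
logGpuEvent('thermal_warning', 'gpu-node-07', {
  gpu_model: 'H100', util: 92, memPct: 0.71, temp: 86,
});
```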
| Signal | Meaning | Action |
|---|---|---|
| GPU util < 20% sustained | Idle GPUs burning cost | Scale down or reassign workload |
| GPU memory > 95% | OOM imminent | Reduce batch size or add nodes |
| Pod CrashLoopBackOff | Driver or config failure | Check DCGM logs, restart node |
| Inference latency spike | Contention or throttling | Review GPU temp and queue depth |
| Node NotReady | Hardware or network issue | Cordon node, migrate pods |
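The last two rows lend themselves to automation. A sketch that shells out to kubectl to cordon and drain a failing node; it assumes kubectl is on the PATH with cluster credentials, and the function name is illustrative:

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);

// Equivalent of `kubectl cordon` + `kubectl drain`: stop scheduling onto the
// node, then evict its pods so they reschedule onto healthy GPU nodes.
async function cordonAndDrain(node: string): Promise<void> {
  await exec('kubectl', ['cordon', node]);
  await exec('kubectl', ['drain', node, '--ignore-daemonsets', '--delete-emptydir-data']);
  logGpuEvent('node_drained', node, {});
}
```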
For incident response, see coreweave-incident-runbook.