Skill

coreweave-incident-runbook

Incident response runbook for CoreWeave GPU workload failures. Use when inference services are down, when GPUs are unavailable, or when responding to production incidents on CoreWeave. Trigger with phrases like "coreweave incident", "coreweave outage", "coreweave runbook", "coreweave service down".

npx claudepluginhub flight505/skill-forge --plugin coreweave-pack

Tool Access

This skill is limited to using the following tools:

Read, Bash(kubectl:*), Grep



Stats

Parent Repo Stars: 0

Parent Repo Forks: 0

Last Commit: Apr 30, 2026


CoreWeave Incident Runbook

Triage Steps

# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
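During an incident it can help to capture all four checks in one file for the incident record. The following is a minimal sketch, not part of the original skill: the function name `triage_snapshot` and the default log file name are hypothetical, and the `app=inference` label is taken from the commands above.

```bash
# Sketch: run the four triage checks in order and save the combined output.
# Assumptions: triage_snapshot and the log file name are illustrative only.
triage_snapshot() {
  local out="${1:-triage-$(date +%Y%m%d-%H%M%S).log}"
  local pod
  {
    echo "== pods =="
    kubectl get pods -l app=inference -o wide
    echo "== recent events =="
    kubectl get events --sort-by=.lastTimestamp | tail -20
    echo "== gpu nodes =="
    kubectl get nodes -l gpu.nvidia.com/class -o wide
    echo "== nvidia-smi =="
    # Pick the first matching pod; skip the GPU check if none exists.
    pod="$(kubectl get pod -l app=inference -o name | head -1)"
    if [ -n "$pod" ]; then
      kubectl exec "$pod" -- nvidia-smi
    else
      echo "no pod matching app=inference; skipping nvidia-smi"
    fi
  } >"$out" 2>&1
  # Print the path so the caller can attach the file to the incident.
  echo "$out"
}
```

Calling `triage_snapshot` with no argument writes a timestamped file in the current directory; passing a path, for example `triage_snapshot /tmp/incident.log`, writes there instead.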

Common Incidents

Inference Service Down

Check pod status and events

If OOMKilled: the container exceeded its memory limit; reduce batch size, raise the memory limit, or upgrade GPU

If ImagePullBackOff: check registry credentials

If Pending: check GPU quota and availability
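The status-to-action mapping above can be sketched as a small lookup, so each failing pod is printed next to its first suggested action. This is an illustration under assumptions: the function names are hypothetical, and `ErrImagePull` is included alongside `ImagePullBackOff` because Kubernetes reports it for the same underlying pull failure.

```bash
# Sketch: map the STATUS column of `kubectl get pods` to the runbook's first action.
# Assumption: remediation_for and report_pods are illustrative names.
remediation_for() {
  case "$1" in
    OOMKilled)                     echo "reduce batch size or upgrade GPU" ;;
    ImagePullBackOff|ErrImagePull) echo "check registry credentials" ;;
    Pending)                       echo "check GPU quota and availability" ;;
    Running)                       echo "no action" ;;
    *)                             echo "no runbook entry; check pod events and logs" ;;
  esac
}

# Print one tab-separated line per inference pod: name, status, suggested action.
report_pods() {
  kubectl get pods -l app=inference --no-headers |
    while read -r name _ready status _rest; do
      printf '%s\t%s\t%s\n' "$name" "$status" "$(remediation_for "$status")"
    done
}
```

Running `report_pods` gives a quick overview; statuses outside the three listed cases fall through to a generic hint rather than a guess.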

GPU Node Failure

Pods managed by the Deployment are rescheduled automatically onto healthy nodes, provided GPU capacity is available

If no capacity: scale down non-critical workloads

Contact CoreWeave support for extended outages
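A rough sketch of the node-failure steps: list GPU nodes that are not plainly `Ready`, then free capacity by scaling a non-critical workload to zero. The function names and the deployment name `batch-eval` in the usage note are hypothetical; the node label is the one used in the triage commands.

```bash
# Sketch: print GPU nodes whose STATUS is anything other than exactly "Ready".
# This also flags cordoned nodes ("Ready,SchedulingDisabled"), which accept no new pods.
not_ready_nodes() {
  kubectl get nodes -l gpu.nvidia.com/class --no-headers |
    awk '$2 != "Ready" {print $1}'
}

# Sketch: scale a non-critical deployment to zero replicas to release its GPUs.
# Arguments: deployment name, then an optional namespace (defaults to "default").
scale_down() {
  local deploy="$1" ns="${2:-default}"
  kubectl -n "$ns" scale "deployment/$deploy" --replicas=0
}
```

For example, `not_ready_nodes` followed by `scale_down batch-eval` would release the GPUs held by a hypothetical `batch-eval` deployment so the inference pods can be rescheduled.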

Model Loading Failure

Check HuggingFace token secret exists

Verify model name spelling

Check PVC has sufficient storage

Review container logs for download errors
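The checks above could be run together along these lines. This is a sketch under stated assumptions: the secret name `hf-token` and PVC name `model-cache` are hypothetical defaults, the deployment name `inference` is taken from the Rollback section, and the grep patterns are a guess at common download errors rather than an exhaustive list.

```bash
# Sketch: check the token secret, the model PVC, and recent logs in one pass.
# Assumptions: hf-token and model-cache are placeholder names; pass your own.
check_model_loading() {
  local secret="${1:-hf-token}" pvc="${2:-model-cache}"

  if kubectl get secret "$secret" >/dev/null 2>&1; then
    echo "secret/$secret: present"
  else
    echo "secret/$secret: MISSING"
  fi

  # Shows the claim's STATUS and CAPACITY; it does not show how full the volume is.
  kubectl get pvc "$pvc"

  # Scan the last 200 log lines for likely auth, naming, or disk-space errors.
  kubectl logs deployment/inference --tail=200 |
    grep -Ei 'error|denied|unauthorized|not found|no space left' ||
    echo "no obvious download errors in the last 200 log lines"
}
```

A "not found" match often points at a misspelled model name, while "unauthorized" or "denied" suggests a missing or invalid token.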

Rollback

kubectl rollout undo deployment/inference
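A slightly fuller sketch of the rollback, assuming the same `deployment/inference` name: it prints the revision history first, optionally targets a specific revision, and then waits for the rollout to settle. The function name and the five-minute timeout are illustrative choices.

```bash
# Sketch: roll back deployment/inference and wait for the result.
# Argument: optional revision number; without it, kubectl reverts to the previous revision.
rollback_inference() {
  local revision="${1:-}"
  kubectl rollout history deployment/inference
  if [ -n "$revision" ]; then
    kubectl rollout undo deployment/inference --to-revision="$revision"
  else
    kubectl rollout undo deployment/inference
  fi
  # Block until the rollback finishes or the timeout expires.
  kubectl rollout status deployment/inference --timeout=5m
}
```

Checking `kubectl rollout status` afterwards matters because `rollout undo` returns as soon as the change is accepted, not when the pods are ready.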

Next Steps

For data handling, see coreweave-data-handling.