From coreweave-pack
Diagnoses and fixes CoreWeave Kubernetes errors: GPU scheduling failures, pending pods, CUDA OOM, NCCL timeouts, PVC mounting, image pull issues.
How this skill is triggered — by the user, by Claude, or both
Slash command
/coreweave-pack:coreweave-common-errorsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
```bash
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
Fix: Check GPU availability: kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB. Try a different GPU type or region.
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).
Fix: Create an imagePullSecret:
kubectl create secret docker-registry regcred \
--docker-server=ghcr.io \
--docker-username=$GH_USER \
--docker-password=$GH_TOKEN
NCCL error: unhandled system error
Fix: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.
Fix: Check storage class availability: kubectl get sc. Use CoreWeave storage classes like shared-hdd-ord1 or shared-ssd-ord1.
Fix: List valid GPU class labels:
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
Fix: Check Service and Endpoints:
kubectl get svc,endpoints <service-name>
For diagnostics, see coreweave-debug-bundle.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-packGuides triage and resolution of CoreWeave GPU incidents for inference services using kubectl checks, common fixes, and rollback procedures.
Diagnoses per-node issues on AWS HyperPod clusters (EKS or Slurm): unhealthy, unresponsive, stuck nodes. Covers EFA, GPU hardware (XID, ECC, NVLink, DCGM), Slurm node state, disk/memory pressure, lifecycle scripts, SSM agent, container runtime, kernel panics, pod networking. Read-only triage with suggested remediation commands.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.