From coreweave-pack
Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or you are hitting CUDA or NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".
Install:

```bash
npx claudepluginhub flight505/skill-forge --plugin coreweave-pack
```
Pod stuck Pending: insufficient GPU

```bash
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
```
Fix: Check that nodes of the requested class are available:

```bash
kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB
```

If none show up, try a different GPU type or region.
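To confirm a class is actually schedulable end to end, a throwaway pod can help. This is a minimal sketch: the pod name gpu-smoke-test and the CUDA image tag are assumptions, while the gpu.nvidia.com/class label and the nvidia.com/gpu resource name follow the commands above.

```bash
# Minimal GPU schedulability check (sketch; pod name and image are assumptions).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    gpu.nvidia.com/class: A100_PCIE_80GB   # class label from the check above
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed public CUDA base image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request exactly one GPU
EOF
kubectl get pod gpu-smoke-test -w    # watch it schedule, or stay Pending
kubectl delete pod gpu-smoke-test    # clean up afterwards
```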
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).
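Before resizing hardware, allocator tuning sometimes recovers enough headroom. PYTORCH_CUDA_ALLOC_CONF is a real PyTorch environment variable; the specific values below are illustrative starting points, not guaranteed fixes, and the launch command is a placeholder.

```bash
# Reduce fragmentation-driven OOMs (expandable_segments needs a recent PyTorch).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Older PyTorch: cap the largest block the caching allocator keeps instead.
# export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python train.py   # placeholder training entrypoint
```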
ImagePullBackOff on a private registry image

Fix: Create an imagePullSecret:

```bash
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username=$GH_USER \
  --docker-password=$GH_TOKEN
```
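The secret only takes effect once a pod or its service account references it. Patching the default service account, as sketched below, applies it namespace-wide; regcred matches the secret name created above.

```bash
# Attach the pull secret to the namespace's default service account,
# so every pod using that account can pull from the private registry.
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```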
NCCL error: unhandled system error
Fix: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.
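When the topology looks right but NCCL still fails, its own logging usually points at the failing transport. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; the launch command is a placeholder.

```bash
# Turn on NCCL's diagnostic logging before relaunching the job.
export NCCL_DEBUG=INFO              # per-rank init and transport selection logs
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus on startup and network paths
torchrun --nproc_per_node=8 train.py   # placeholder launch command
```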
PVC stuck Pending or referencing an unknown storage class

Fix: Check storage class availability:

```bash
kubectl get sc
```

Use CoreWeave storage classes such as shared-hdd-ord1 or shared-ssd-ord1.
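A minimal PVC sketch, assuming the shared-ssd-ord1 class from the list above exists in your region; the claim name, size, and access mode are illustrative, so check the class's supported modes before relying on them.

```bash
# Create a small test claim against a CoreWeave storage class (sketch).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-test                    # illustrative name
spec:
  accessModes: ["ReadWriteMany"]     # assumed for shared filesystem classes
  storageClassName: shared-ssd-ord1
  resources:
    requests:
      storage: 10Gi
EOF
kubectl get pvc data-test   # should move from Pending to Bound
```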
No nodes match the pod's GPU nodeSelector (often a mistyped class)

Fix: List the valid GPU class labels:

```bash
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
```
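To also see how many nodes back each class (a class can be valid yet fully allocated), a small variation of the same query:

```bash
# Count nodes per GPU class; "null" rows are nodes without the label.
kubectl get nodes -o json \
  | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' \
  | sort | uniq -c | sort -rn
```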
Service unreachable or connection refused

Fix: Check the Service and its Endpoints:

```bash
kubectl get svc,endpoints <service-name>
```
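An empty ENDPOINTS column usually means the Service's selector matches no ready pods. Comparing the selector against the pod labels, as below, confirms it; both are standard kubectl queries.

```bash
# Compare the Service's selector with the labels on your pods.
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'; echo
kubectl get pods --show-labels   # the selector must match ready pods
```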
For diagnostics, see coreweave-debug-bundle.