From coreweave-pack
Sets up distributed PyTorch GPU training on CoreWeave Kubernetes with multi-node DDP, Jobs, and PVCs for H100/A100 clusters and model fine-tuning.
Install via: `npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack`
Skills in this collection:

- Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.
- Manages CoreWeave persistent storage for ML training data and model artifacts using Kubernetes PVCs and Jobs. For large datasets, storage classes, and GPU data pipelines.
- Configures distributed training setups for ML models with PyTorch, TensorFlow, or scikit-learn. Generates code, configs, and best practices for multi-node training tasks.
- Provisions on-demand/reserved GPU clusters (H100/H200/B200) on Together AI with Kubernetes/Slurm orchestration, shared storage, and scaling for ML/HPC multi-node jobs.
This skill runs distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage. The examples below show a single-node, multi-GPU fine-tuning Job, the shared-storage PVCs it mounts, and commands for monitoring the run.
```yaml
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/myorg/trainer:latest
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"          # one worker process per GPU
            - "train.py"
            - "--model_name=meta-llama/Llama-3.1-8B"
            - "--batch_size=4"
            - "--epochs=3"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "64"
          volumeMounts:
            - name: data
              mountPath: /data              # training data PVC
            - name: checkpoints
              mountPath: /checkpoints       # model checkpoint PVC
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["A100_NVLINK_80GB"]   # NVLink (SXM4) A100 80GB nodes
```
```yaml
# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]      # shared across pods and nodes
  resources:
    requests:
      storage: 500Gi
  storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-checkpoints
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 200Gi
  storageClassName: shared-ssd-ord1
```
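Because both PVCs are `ReadWriteMany`, every pod in a multi-node run can mount the same `/data` and `/checkpoints` paths. A hypothetical dataset that streams pre-tokenized shards from the training-data volume (the `/data/tokens/*.pt` layout is an assumption, not part of the manifests above):

```python
# Hypothetical: read pre-tokenized .pt shards from the shared /data PVC
from pathlib import Path

import torch
from torch.utils.data import Dataset


class ShardedTokenDataset(Dataset):
    """One item per shard file; the /data/tokens/*.pt layout is assumed."""

    def __init__(self, root: str = "/data/tokens"):
        self.shards = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.shards)

    def __getitem__(self, idx: int):
        # Each shard is whatever torch.save wrote, e.g. a tensor of token ids
        return torch.load(self.shards[idx])
```

Wrapped in a `DistributedSampler` as in the training sketch, each rank reads a disjoint subset of shards from the same shared volume.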
```bash
# Watch training logs
kubectl logs -f job/llm-finetune

# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi

# Check training metrics
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- \
  tail -5 /checkpoints/training_log.json
```
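The last command assumes `training_log.json` holds one JSON record per line, so `tail -5` returns the most recent entries. A hypothetical rank-0 logging helper that produces such a file:

```python
# Hypothetical rank-0 metrics logger: one JSON record per line
import json
import time


def log_metrics(path: str, step: int, loss: float) -> None:
    record = {"step": step, "loss": float(loss), "time": time.time()}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


# e.g. inside the training loop, on rank 0 only:
# log_metrics("/checkpoints/training_log.json", step, loss.item())
```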
| Error | Cause | Solution |
|---|---|---|
| NCCL timeout | Network issue between GPUs | Use NVLink nodes (SXM4/SXM5) |
| OOMKilled | Batch size too large | Reduce batch size or use gradient accumulation (see the sketch below) |
| Checkpoint save failed | PVC full | Increase storage or prune old checkpoints |
| Job evicted | Preemption | Use on-demand nodes for training |
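The gradient-accumulation fix from the table replaces the inner loop of the `train.py` sketch above: gradients from several micro-batches are summed before each optimizer step, giving a larger effective batch without more GPU memory.

```python
# Gradient accumulation: effective batch = batch_size * accum_steps * world_size
accum_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.cuda(local_rank), y.cuda(local_rank)
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average out
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With DDP, wrapping the non-stepping iterations in `model.no_sync()` avoids redundant gradient all-reduces between optimizer steps.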
For troubleshooting, see coreweave-common-errors.