From coreweave-pack
Sets up distributed PyTorch GPU training on CoreWeave Kubernetes with multi-node DDP, Jobs, and PVCs for H100/A100 clusters and model fine-tuning.
Install via: `npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack`
Skills in this collection:

- Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.
- Manages CoreWeave persistent storage for ML training data and model artifacts using Kubernetes PVCs and Jobs. For large datasets, storage classes, and GPU data pipelines.
- Configures distributed training setups for ML models with PyTorch, TensorFlow, or scikit-learn. Generates code, configs, and best practices for multi-node training tasks.
- Provisions on-demand/reserved GPU clusters (H100/H200/B200) on Together AI with Kubernetes/Slurm orchestration, shared storage, and scaling for ML/HPC multi-node jobs.
This skill runs distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage. The examples below show a single-node, multi-GPU fine-tuning Job, the shared-storage PVCs it mounts, and commands for monitoring the run.
```yaml
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/myorg/trainer:latest
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"          # one worker process per GPU
            - "train.py"
            - "--model_name=meta-llama/Llama-3.1-8B"
            - "--batch_size=4"
            - "--epochs=3"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "64"
          volumeMounts:
            - name: data
              mountPath: /data              # training data PVC
            - name: checkpoints
              mountPath: /checkpoints       # model checkpoint PVC
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["A100_NVLINK_80GB"]   # NVLink (SXM4) A100 80GB nodes
```
```yaml
# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]      # shared across pods and nodes
  resources:
    requests:
      storage: 500Gi
  storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-checkpoints
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 200Gi
  storageClassName: shared-ssd-ord1
```
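Because both PVCs are `ReadWriteMany`, every pod in a multi-node run can mount the same `/data` and `/checkpoints` paths. A hypothetical dataset that streams pre-tokenized shards from the training-data volume (the `/data/tokens/*.pt` layout is an assumption, not part of the manifests above):

```python
# Hypothetical: read pre-tokenized .pt shards from the shared /data PVC
from pathlib import Path

import torch
from torch.utils.data import Dataset


class ShardedTokenDataset(Dataset):
    """One item per shard file; the /data/tokens/*.pt layout is assumed."""

    def __init__(self, root: str = "/data/tokens"):
        self.shards = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.shards)

    def __getitem__(self, idx: int):
        # Each shard is whatever torch.save wrote, e.g. a tensor of token ids
        return torch.load(self.shards[idx])
```

Wrapped in a `DistributedSampler` as in the training sketch, each rank reads a disjoint subset of shards from the same shared volume.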
```bash
# Watch training logs
kubectl logs -f job/llm-finetune

# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi

# Check training metrics
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- \
  tail -5 /checkpoints/training_log.json
```
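The last command assumes `training_log.json` holds one JSON record per line, so `tail -5` returns the most recent entries. A hypothetical rank-0 logging helper that produces such a file:

```python
# Hypothetical rank-0 metrics logger: one JSON record per line
import json
import time


def log_metrics(path: str, step: int, loss: float) -> None:
    record = {"step": step, "loss": float(loss), "time": time.time()}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


# e.g. inside the training loop, on rank 0 only:
# log_metrics("/checkpoints/training_log.json", step, loss.item())
```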
| Error | Cause | Solution |
|---|---|---|
| NCCL timeout | Network issue between GPUs | Use NVLink nodes (SXM4/SXM5) |
| OOMKilled | Batch size too large | Reduce batch size or use gradient accumulation (see the sketch below) |
| Checkpoint save failed | PVC full | Increase storage or prune old checkpoints |
| Job evicted | Preemption | Use on-demand nodes for training |
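The gradient-accumulation fix from the table replaces the inner loop of the `train.py` sketch above: gradients from several micro-batches are summed before each optimizer step, giving a larger effective batch without more GPU memory.

```python
# Gradient accumulation: effective batch = batch_size * accum_steps * world_size
accum_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.cuda(local_rank), y.cuda(local_rank)
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average out
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With DDP, wrapping the non-stepping iterations in `model.no_sync()` avoids redundant gradient all-reduces between optimizer steps.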
For troubleshooting, see coreweave-common-errors.