From coreweave-pack
Deploys GPU workloads on CoreWeave Kubernetes with kubectl: a vLLM inference server or a batch job. Intended for first GPU deploys, inference testing, and cluster access checks.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
Deploy your first GPU workload on CoreWeave: a simple inference service using vLLM or a batch CUDA job. CoreWeave runs Kubernetes on bare-metal GPU nodes with A100, H100, and L40 GPUs.
Provides Python helpers and Kubernetes patterns for CoreWeave GPU workloads, including affinity configs, resource limits, and inference clients for programmatic management.
Deploys vLLM OpenAI-compatible server to Kubernetes with GPU support, health probes, and services via YAML templates. Checks HF token secret and existing deployments before applying.
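Those pre-apply checks can be approximated in a few lines of Python; a minimal sketch, assuming the official kubernetes client, the default namespace, and the resource names used later in this guide (not the pack's actual implementation):

```python
# Sketch: verify the hf-token secret and any existing vllm-server deployment
# before applying the manifest. Assumes `pip install kubernetes` and a working
# kubeconfig; namespace and resource names are illustrative.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def preflight(namespace: str = "default") -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    try:
        core.read_namespaced_secret("hf-token", namespace)
        print("hf-token secret found")
    except ApiException as exc:
        if exc.status == 404:
            print("hf-token secret missing: create it before deploying")
        else:
            raise

    try:
        apps.read_namespaced_deployment("vllm-server", namespace)
        print("vllm-server deployment already exists: apply will update it")
    except ApiException as exc:
        if exc.status == 404:
            print("no existing vllm-server deployment: apply will create it")
        else:
            raise

if __name__ == "__main__":
    preflight()
```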
Complete coreweave-install-auth setup first, then create the deployment manifest:

# vllm-inference.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 48Gi
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: 32Gi
              cpu: "4"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["A100_PCIE_80GB"]
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm-server
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
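The affinity and resource blocks above are the kind of thing the pack's Python helpers generate; a minimal standalone sketch with illustrative helper names (plain dicts, so they can be dumped to YAML or passed to a Kubernetes client):

```python
# Sketch of helpers that build the node-affinity and GPU resource blocks used
# in the manifest above. Function names and defaults are illustrative, not the
# pack's actual API.
def gpu_node_affinity(gpu_class: str = "A100_PCIE_80GB") -> dict:
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "gpu.nvidia.com/class",
                                "operator": "In",
                                "values": [gpu_class],
                            }
                        ]
                    }
                ]
            }
        }
    }

def gpu_resources(gpus: int = 1,
                  mem_limit: str = "48Gi", mem_request: str = "32Gi",
                  cpu_limit: str = "8", cpu_request: str = "4") -> dict:
    return {
        "limits": {"nvidia.com/gpu": gpus, "memory": mem_limit, "cpu": cpu_limit},
        "requests": {"nvidia.com/gpu": gpus, "memory": mem_request, "cpu": cpu_request},
    }
```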
# Create HuggingFace token secret
kubectl create secret generic hf-token --from-literal=token="${HF_TOKEN}"
# Deploy
kubectl apply -f vllm-inference.yaml
kubectl get pods -w # Wait for Running state
# Port-forward and test
kubectl port-forward svc/vllm-server 8000:8000 &
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
# gpu-batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-benchmark
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: benchmark
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          command: ["python3", "-c"]
          args:
            - |
              import torch
              print(f"CUDA available: {torch.cuda.is_available()}")
              print(f"GPU: {torch.cuda.get_device_name(0)}")
              x = torch.randn(10000, 10000, device="cuda")
              y = torch.matmul(x, x)
              print(f"Matrix multiply result shape: {y.shape}")
              print("CoreWeave GPU test passed!")
          resources:
            limits:
              nvidia.com/gpu: 1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["A100_PCIE_80GB"]
kubectl apply -f gpu-batch-job.yaml
kubectl logs job/gpu-benchmark --follow
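To drive the batch job from scripts instead of tailing logs, a sketch that waits for completion with the kubernetes Python client (assumed dependency; namespace, timeout, and polling interval are illustrative):

```python
# Sketch: poll the gpu-benchmark Job until it succeeds or fails. Assumes
# `pip install kubernetes` and a working kubeconfig.
import time
from kubernetes import client, config

def wait_for_job(name: str = "gpu-benchmark", namespace: str = "default",
                 timeout_s: int = 1800) -> bool:
    config.load_kube_config()
    batch = client.BatchV1Api()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = batch.read_namespaced_job_status(name, namespace).status
        if status.succeeded:
            print("Job succeeded")
            return True
        if status.failed:
            print(f"Job failed ({status.failed} pod failures)")
            return False
        time.sleep(10)
    raise TimeoutError(f"Job {name} did not finish within {timeout_s}s")

if __name__ == "__main__":
    wait_for_job()
```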
| Error | Cause | Solution |
|---|---|---|
| Pod stuck Pending | No GPU capacity | Try different GPU type or check quota |
| nvidia-smi not found | Wrong base image | Use NVIDIA CUDA images |
| OOMKilled | Insufficient GPU memory | Use larger GPU (80GB A100) |
| Image pull error | Registry auth | Create imagePullSecret |
Proceed to coreweave-local-dev-loop for development workflow setup.