From coreweave-pack
Deploys KServe InferenceService on CoreWeave Kubernetes for GPU ML model serving with vLLM, autoscaling, scale-to-zero, and A100 affinity.
```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
```
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
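CKS ships with KServe support, so a quick sanity check before deploying is confirming that the InferenceService CRD is actually present. A minimal sketch, assuming kubectl is already configured for your CKS cluster (for example via coreweave-install-auth):

```bash
# Verify the KServe CRD is installed
kubectl get crd inferenceservices.serving.kserve.io

# Confirm the serving.kserve.io API group is being served
kubectl api-resources --api-group=serving.kserve.io
```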
With the coreweave-install-auth setup complete, define the service:

```yaml
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  annotations:
    autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/target: "1"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port"
          - "8080"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 48Gi
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: 32Gi
            cpu: "4"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.nvidia.com/class
                  operator: In
                  values: ["A100_PCIE_80GB"]
```
```bash
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
```
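If the service stays in a not-ready state, the usual culprit on GPU clusters is scheduling. A quick way to inspect the predictor pod and node capacity (the serving.kserve.io/inferenceservice label is what current KServe releases apply to predictor pods; adjust if yours differs):

```bash
# Find the predictor pod and check recent events
kubectl get pods -l serving.kserve.io/inferenceservice=llama-inference
kubectl describe pod -l serving.kserve.io/inferenceservice=llama-inference | tail -n 20

# Confirm nodes expose the requested GPU class used in the affinity rule
kubectl get nodes -L gpu.nvidia.com/class
```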
```yaml
# For dev/staging -- scale down to zero when idle
metadata:
  annotations:
    autoscaling.knative.dev/minScale: "0"   # Scale to zero
    autoscaling.knative.dev/maxScale: "3"
    autoscaling.knative.dev/scaleDownDelay: "5m"
```
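One way to switch an already-deployed service to this profile without re-applying the full manifest is a merge patch on the InferenceService annotations (this assumes, as above, that the annotations propagate to the underlying Knative revision):

```bash
kubectl patch inferenceservice llama-inference --type merge -p \
  '{"metadata":{"annotations":{"autoscaling.knative.dev/minScale":"0","autoscaling.knative.dev/maxScale":"3","autoscaling.knative.dev/scaleDownDelay":"5m"}}}'
```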
```bash
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
  -o jsonpath='{.status.url}')

curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```
| Error | Cause | Solution |
|---|---|---|
| InferenceService not ready | GPU not available | Check node capacity and affinity |
| Scale-to-zero cold start | First request after idle | Set minScale: 1 for production |
| Model loading timeout | Large model download | Pre-cache model in PVC (see sketch below) |
| OOMKilled | Model too large | Use multi-GPU or quantized model |
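For the pre-cache fix, the general shape is: create a PVC, download the weights into it once, then mount it into the predictor and point vLLM at the local path instead of the Hub ID. A minimal sketch with hypothetical names (model-cache, /models); set storageClassName to whatever shared storage class your CoreWeave cluster provides:

```yaml
# model-cache-pvc.yaml (hypothetical example)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]
  # storageClassName: <your CoreWeave shared storage class>
  resources:
    requests:
      storage: 100Gi
---
# Fragment only -- merge these fields into spec.predictor of inference-service.yaml
spec:
  predictor:
    containers:
      - name: kserve-container
        args: ["--model", "/models/Llama-3.1-8B-Instruct", "--port", "8080"]
        volumeMounts:
          - name: model-cache
            mountPath: /models
    volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
```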
For GPU training workloads, see coreweave-core-workflow-b.