From coreweave-pack
Deploys KServe InferenceService on CoreWeave Kubernetes for GPU ML model serving with vLLM, autoscaling, scale-to-zero, and A100 affinity.
```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack
```
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
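CKS ships with KServe support, so a quick sanity check before deploying is confirming that the InferenceService CRD is actually present. A minimal sketch, assuming kubectl is already configured for your CKS cluster (for example via coreweave-install-auth):

```bash
# Verify the KServe CRD is installed
kubectl get crd inferenceservices.serving.kserve.io

# Confirm the serving.kserve.io API group is being served
kubectl api-resources --api-group=serving.kserve.io
```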
With the coreweave-install-auth setup complete, define the service:

```yaml
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  annotations:
    autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/target: "1"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port"
          - "8080"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 48Gi
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: 32Gi
            cpu: "4"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.nvidia.com/class
                  operator: In
                  values: ["A100_PCIE_80GB"]
```
```bash
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
```
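If the service stays in a not-ready state, the usual culprit on GPU clusters is scheduling. A quick way to inspect the predictor pod and node capacity (the serving.kserve.io/inferenceservice label is what current KServe releases apply to predictor pods; adjust if yours differs):

```bash
# Find the predictor pod and check recent events
kubectl get pods -l serving.kserve.io/inferenceservice=llama-inference
kubectl describe pod -l serving.kserve.io/inferenceservice=llama-inference | tail -n 20

# Confirm nodes expose the requested GPU class used in the affinity rule
kubectl get nodes -L gpu.nvidia.com/class
```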
```yaml
# For dev/staging -- scale down to zero when idle
metadata:
  annotations:
    autoscaling.knative.dev/minScale: "0"   # Scale to zero
    autoscaling.knative.dev/maxScale: "3"
    autoscaling.knative.dev/scaleDownDelay: "5m"
```
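One way to switch an already-deployed service to this profile without re-applying the full manifest is a merge patch on the InferenceService annotations (this assumes, as above, that the annotations propagate to the underlying Knative revision):

```bash
kubectl patch inferenceservice llama-inference --type merge -p \
  '{"metadata":{"annotations":{"autoscaling.knative.dev/minScale":"0","autoscaling.knative.dev/maxScale":"3","autoscaling.knative.dev/scaleDownDelay":"5m"}}}'
```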
```bash
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
  -o jsonpath='{.status.url}')

curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```
| Error | Cause | Solution |
|---|---|---|
| InferenceService not ready | GPU not available | Check node capacity and affinity |
| Scale-to-zero cold start | First request after idle | Set minScale: 1 for production |
| Model loading timeout | Large model download | Pre-cache model in PVC (see sketch below) |
| OOMKilled | Model too large | Use multi-GPU or quantized model |
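For the pre-cache fix, the general shape is: create a PVC, download the weights into it once, then mount it into the predictor and point vLLM at the local path instead of the Hub ID. A minimal sketch with hypothetical names (model-cache, /models); set storageClassName to whatever shared storage class your CoreWeave cluster provides:

```yaml
# model-cache-pvc.yaml (hypothetical example)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]
  # storageClassName: <your CoreWeave shared storage class>
  resources:
    requests:
      storage: 100Gi
---
# Fragment only -- merge these fields into spec.predictor of inference-service.yaml
spec:
  predictor:
    containers:
      - name: kserve-container
        args: ["--model", "/models/Llama-3.1-8B-Instruct", "--port", "8080"]
        volumeMounts:
          - name: model-cache
            mountPath: /models
    volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
```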
For GPU training workloads, see coreweave-core-workflow-b.