Deploys vLLM OpenAI-compatible server to Kubernetes with GPU support, health probes, and services via YAML templates. Checks HF token secret and existing deployments before applying.
npx claudepluginhub vllm-project/vllm-skills --plugin vllm-skills
This skill uses the workspace's default tool permissions.
A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.
Requirements:
- The vllm/vllm-openai:latest image is used by default (the user can specify a different version).
- kubectl configured with access to a Kubernetes cluster.

Before deploying, check whether the hf-token Kubernetes secret exists in the target namespace:
kubectl get secret hf-token -n <namespace>
If the secret does not exist, create it with the user's Hugging Face token:
kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>
This is required for gated models (e.g., meta-llama/Meta-Llama-3.1-8B). For public models, the secret is optional but recommended to avoid rate limits.
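A quick optional sanity check (this assumes the secret was created with the HF_TOKEN key as above) that the secret actually holds a non-empty token:

# Decode the stored token and print only its first few characters to confirm it is set
kubectl get secret hf-token -n <namespace> -o jsonpath='{.data.HF_TOKEN}' | base64 -d | cut -c1-8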
Before applying, check if a vLLM deployment already exists:
kubectl get deployment vllm -n <namespace>
Apply the template YAML files to deploy vLLM:
kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>
Wait for the deployment to roll out:
kubectl rollout status deployment/vllm -n <namespace> --timeout=600s
Verify the pod is running and ready:
kubectl get pods -n <namespace> -l app=vllm
Confirm the pod shows READY 1/1 and STATUS Running. If the pod is not ready yet, wait and check again. If it's in CrashLoopBackOff or Error, check the logs with kubectl logs -n <namespace> -l app=vllm.
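Large models can take several minutes to download and load, so the pod may sit in Running but not ready for a while. Two commands that are useful while waiting (they assume the app=vllm label used by the templates):

# Watch pod status until READY shows 1/1 (Ctrl-C to stop watching)
kubectl get pods -n <namespace> -l app=vllm -w

# Follow the container logs to see model download and weight-loading progress
kubectl logs -f -n <namespace> -l app=vllm --tail=50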
Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):
**vLLM Deployment Successful!**
| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |
**To test the API, run these two commands in your terminal:**
**1. Open a port-forward** (this connects your local port <port> to the vLLM service inside the cluster):
kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>
**2. In a separate terminal**, send a test request to the OpenAI-compatible API:
curl -s http://localhost:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool
If everything is working, you'll get a JSON response with the model's reply.
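If the chat request fails, two lighter-weight checks (with the port-forward from step 1 still running) can narrow things down; vLLM's OpenAI-compatible server also exposes a /health endpoint and the standard /v1/models listing:

# Should print 200 once the server is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:<port>/health

# The returned model id should match <model>
curl -s http://localhost:<port>/v1/models | python3 -m json.tool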
The templates use the following defaults:
| Parameter | Default Value |
|---|---|
| Image | vllm/vllm-openai:latest |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| Port | 8000 |
| Replicas | 1 |
| GPU count | 1 |
| GPU memory utilization | 0.85 |
| Tensor parallel size | 1 |
| CPU request / limit | 12 / 128 |
| Memory request / limit | 100Gi / 400Gi |
| Shared memory (dshm) | 80Gi |
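To read back what is actually deployed (useful after edits, or to confirm the defaults above), the live values can be pulled from the Deployment. These jsonpath queries assume the vLLM container is the first container in the vllm Deployment, as in the templates:

# Image, vllm serve args, and resource requests/limits currently live in the cluster
kubectl get deployment vllm -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl get deployment vllm -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'
kubectl get deployment vllm -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'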
When the user requests changes, modify the template YAML files before applying. The following can be customized:
- Image: image: vllm/vllm-openai:<version> in templates/vllm-deployment.yaml (default: latest). Use a specific version tag like v0.17.1 if the user requests it.
- Model: the model name in the vllm serve command inside the Deployment args.
- Extra vLLM flags: appended to the vllm serve command in the Deployment args (e.g., --max-model-len 4096, --kv-cache-dtype fp8, --enforce-eager, --generation-config vllm).
- Replicas: replicas: in the Deployment spec.
- GPU count: nvidia.com/gpu in both requests and limits under resources.
- Tensor parallel size: the --tensor-parallel-size flag; keep it matched to the GPU count.
- CPU / memory: the cpu and memory values under requests and limits.
- Port: containerPort in the Deployment, port/targetPort in the Service, the port in all health probes (liveness, readiness, startup), AND --port <port> on the vllm serve command in args. All four must match.
- Namespace: the -n <namespace> flag on the kubectl commands.
- Shared memory: the sizeLimit of the dshm emptyDir volume.

Edit the template files using the Edit tool, then apply the modified templates.
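Before applying edited templates, a server-side dry run and a diff can catch YAML mistakes without touching the cluster (both are standard kubectl features; kubectl diff exits non-zero when the live objects would change):

# Validate the edited manifests against the API server without persisting anything
kubectl apply --dry-run=server -f templates/ -n <namespace>

# Show exactly what would change relative to the live objects
kubectl diff -f templates/ -n <namespace>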
To check the current state of all vLLM resources at any time:
kubectl get deployment,svc,pods -n <namespace> -l app=vllm
When the user asks to clean up or delete the vLLM deployment, run the following steps:
kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
kubectl delete -f templates/vllm-service.yaml -n <namespace>
If the user also wants the hf-token secret removed, delete it as well:
kubectl delete secret hf-token -n <namespace>
Verify that the resources are gone:
kubectl get deployment,svc,pods -n <namespace> -l app=vllm
Then print a summary to the user in this format:
vLLM deployment has been cleaned up from namespace <namespace>.
Deleted: Deployment/vllm, Service/vllm-svc
HF token secret: <kept/deleted>
Troubleshooting:
- Pod stuck in Pending: run kubectl describe pod <pod-name> to look for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
- OOMKilled: increase the memory limits in the Deployment, or use a smaller model.
- Pod crashing or failing startup probes: check kubectl logs <pod-name>. Ensure the hf-token secret exists for gated models. Increase failureThreshold on the startup probe if needed.
- Hugging Face authentication errors: run kubectl get secret hf-token -n <namespace>. Check the token is valid.
- GPU not detected: ensure the nvidia.com/gpu resource is requested and the NVIDIA device plugin is running on the node.
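For the Pending and GPU-related cases above, it helps to confirm that the nodes actually advertise nvidia.com/gpu capacity and that a device-plugin pod is running; the plugin's namespace varies by installation (for example kube-system for the plain device plugin, or the GPU Operator's own namespace), so adjust accordingly:

# Show the GPU capacity/allocatable lines each node reports
kubectl describe nodes | grep -A 5 'nvidia.com/gpu'

# Look for NVIDIA device-plugin / operator pods across all namespaces
kubectl get pods -A | grep -i nvidia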