From mlops-skills
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
npx claudepluginhub rnben/hermes-skills --plugin mlops-skillsThis skill uses the workspace's default tool permissions.
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.
Migrates code, prompts, and API calls from Claude Sonnet 4.0/4.5 or Opus 4.1 to Opus 4.5, updating model strings on Anthropic, AWS, GCP, Azure platforms.
Analyzes BMad project state from catalog CSV, configs, artifacts, and query to recommend next skills or answer questions. Useful for help requests, 'what next', or starting BMad.
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Installation:
pip install vllm
Basic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
Step 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--port 8000
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-metrics \
--metrics-port 9090 \
--port 8000 \
--host 0.0.0.0
Step 2: Test with limited traffic
Run load test before production:
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
Step 3: Enable monitoring
vLLM exposes Prometheus metrics on port 9090:
curl http://localhost:9090/metrics | grep vllm
Key metrics to monitor:
vllm:time_to_first_token_seconds - Latencyvllm:num_requests_running - Active requestsvllm:gpu_cache_usage_perc - KV cache utilizationStep 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
Step 5: Verify performance metrics
Check that deployment meets targets:
For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
Step 1: Prepare input data
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
Step 4: Process results
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
Fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
Step 1: Choose quantization method
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
Step 3: Launch with quantization flag
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAM
Step 4: Verify accuracy
Test outputs match expected quality:
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
Use vLLM when:
Use alternatives instead:
Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \
--gpu-memory-utilization 0.7 \
--max-model-len 4096
Or use quantization:
vllm serve MODEL --quantization awq
Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-caching
For long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefill
Issue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-code
Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512
Check GPU utilization with nvidia-smi - should be >80%.
Issue: Inference slower than expected
Verify tensor parallelism uses power of 2 GPUs:
vllm serve MODEL --tensor-parallel-size 4 # Not 3
Enable speculative decoding for faster generation:
vllm serve MODEL --speculative-model DRAFT_MODEL
Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs