LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Use this skill when:
• Designing LLM inference or serving infrastructure
• Choosing a serving framework or architecture (vLLM, TGI, TensorRT-LLM)
• Optimizing inference latency, throughput, or cost
• Scaling LLM deployments to production traffic
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Serving Stack │
├─────────────────────────────────────────────────────────────────────┤
│ Clients (API, Chat UI, Agents) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Load Balancer / API Gateway │ │
│ │ • Rate limiting • Authentication • Request routing │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Server │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Request │ │ Batching │ │ KV Cache │ │ │
│ │ │ Queue │──▶│ Engine │──▶│ Management │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Model Execution Engine │ │ │
│ │ │ • Tensor operations • Attention • Token sampling │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU/TPU Cluster │ │
│ │ • Model sharding • Tensor parallelism • Pipeline parallel │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
└── Need high throughput with many concurrent users?
├── Yes → vLLM (PagedAttention)
└── No
└── Need enterprise features + HF integration?
├── Yes → TGI
└── No
└── Simple local/edge deployment?
├── Yes → Ollama or llama.cpp
└── No → vLLM (general purpose)
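For the common vLLM path, a minimal offline-inference sketch looks roughly like this; the model name and sampling parameters are illustrative, and the same engine can also be exposed as an OpenAI-compatible HTTP server via `vllm serve`:

```python
from vllm import LLM, SamplingParams

# Example model; substitute whatever checkpoint you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain continuous batching in one sentence."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```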
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates difficulty to weights | Excellent | Moderate |
Quality vs. Efficiency Trade-off:
Quality ────────────────────────────────────────────▶ Efficiency
│ │
│ FP32 FP16 INT8+AWQ INT8+GPTQ INT4 INT2 │
│ ○───────○────────○──────────○──────────○──────○ │
│ │ │ │ │ │ │ │
│ Best Great Good Good Fair Poor │
│ │
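As a rough sketch of how quantization shows up at serving time, vLLM can load pre-quantized AWQ or GPTQ checkpoints directly; the repository name below is hypothetical and the exact options depend on the vLLM version:

```python
from vllm import LLM

# Assumes a pre-quantized AWQ checkpoint; the repo name is a placeholder.
llm = LLM(
    model="some-org/llama-2-13b-awq",  # hypothetical quantized checkpoint
    quantization="awq",                # or "gptq", matching the checkpoint format
    dtype="float16",
)
```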
Static Batching:
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80] ─┘
Problem: Short requests wait for long ones (head-of-line blocking)
Continuous Batching:
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
| Parameter | Description | Trade-off |
|---|---|---|
| max_batch_size | Maximum concurrent requests per batch | Memory vs. throughput |
| max_waiting_tokens | Tokens to wait before forcing a new batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in a batch | Memory vs. concurrency |
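Parameter names differ by framework; as one hedged example, vLLM-style knobs for continuous batching look roughly like this (the values are illustrative and should be tuned against GPU memory and traffic shape):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    max_num_seqs=64,              # cap on sequences scheduled per step (concurrency vs. memory)
    max_num_batched_tokens=8192,  # cap on tokens processed per step (latency vs. throughput)
    gpu_memory_utilization=0.90,  # fraction of GPU memory reserved for weights + KV cache
)
```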
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
For each new token generated:
• Attention must be computed against the K and V of ALL previous tokens
• Caching K and V avoids recomputing them, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)
Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
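A quick back-of-the-envelope estimator for KV cache size; the shapes below are illustrative for a 70B-class model with full multi-head attention (grouped-query attention shrinks this considerably):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Rough KV-cache size: 2 tensors (K and V) per layer per token, FP16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Illustrative shapes: 80 layers, 64 heads x 128 dims, 4K context, FP16 values.
per_request = kv_cache_bytes(80, 64, 128, 4096, 1)
print(f"{per_request / 1e9:.1f} GB per request")  # ~10.7 GB, same order as the ~8 GB above
```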
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed) │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE │
└──────────────────────────────────────────┘
PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
| Strategy | Description | Memory Savings |
|---|---|---|
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
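Several of these optimizations are exposed as engine flags; in vLLM, for example, prefix caching and KV cache quantization can be enabled roughly as follows (availability depends on version and hardware, so treat this as a sketch):

```python
from vllm import LLM

# Option availability varies by vLLM version and GPU; values are illustrative.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_prefix_caching=True,  # reuse KV blocks for shared prefixes (e.g. system prompts)
    kv_cache_dtype="fp8",        # quantize cached K/V values to shrink the cache
)
```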
Client Server
│ │
│──── POST /v1/chat/completions ─────▶│
│ (stream: true) │
│ │
│◀──── HTTP 200 OK ───────────────────│
│ Content-Type: text/event-stream│
│ │
│◀──── data: {"token": "Hello"} ──────│
│◀──── data: {"token": " world"} ─────│
│◀──── data: {"token": "!"} ──────────│
│◀──── data: [DONE] ──────────────────│
│ │
SSE Benefits:
• Standard HTTP, so it works with existing load balancers, proxies, and CDNs
• Built-in reconnection via the browser EventSource API
• Simple to implement for one-way, server-to-client token streaming
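A minimal client sketch for consuming an OpenAI-compatible SSE stream; the endpoint URL and model name are illustrative:

```python
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # illustrative endpoint
    json={
        "model": "example-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)
for raw in resp.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8")
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    # Each chunk carries an incremental delta in the OpenAI streaming format.
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```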
Client Server
│ │
│──── WebSocket Upgrade ─────────────▶│
│◀──── 101 Switching Protocols ───────│
│ │
│──── {"prompt": "Hello"} ───────────▶│
│ │
│◀──── {"token": "Hi"} ───────────────│
│◀──── {"token": " there"} ───────────│
│◀──── {"token": "!"} ────────────────│
│◀──── {"done": true} ────────────────│
│ │
WebSocket Benefits:
• Bidirectional, so the client can send follow-up messages or cancel mid-generation
• Single persistent connection suits multi-turn, interactive chat
• Lower per-message overhead once the connection is established
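A corresponding client sketch using the `websockets` library; the endpoint URL and message schema are hypothetical and simply mirror the exchange shown above:

```python
import asyncio
import json

import websockets  # pip install websockets

async def chat(prompt: str) -> None:
    # Hypothetical token-streaming endpoint matching the {"token": ...}/{"done": true} schema.
    async with websockets.connect("ws://localhost:8000/ws/generate") as ws:
        await ws.send(json.dumps({"prompt": prompt}))
        async for message in ws:
            event = json.loads(message)
            if event.get("done"):
                break
            print(event["token"], end="", flush=True)

asyncio.run(chat("Hello"))
```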
| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
10ms 10ms 10ms 10ms 10ms = 50ms total
Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
│
▼
Large Model: [Verify T1-T5 in one pass] (15ms)
Accept: T1, T2, T3 ✓ Reject: T4, T5 ✗
│
▼
             [Verification pass also yields the corrected T4; T5 continues in the next round]
Total: ~25ms vs. 50ms (≈2x speedup at 60% acceptance)
| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
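In vLLM, speculative decoding is configured by pairing a small draft model with the target model; argument names have changed across versions and the model pairing below is illustrative, so treat this as a sketch:

```python
from vllm import LLM

# Argument names vary across vLLM versions; model names are examples only.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",              # target (verifier) model
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",   # small draft model
    num_speculative_tokens=5,                                # speculation depth per step
    tensor_parallel_size=4,
)
```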
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Round-robin, Least-connections) │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (GPU×4) │ │ (GPU×4) │ │ (GPU×4) │
└─────────┘ └─────────┘ └─────────┘
| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│ Layer N │
│ GPU0 │ GPU1 │ GPU2 │ GPU3 │
│ 25% │ 25% │ 25% │ 25% │
└─────────────────────────────────────────┘
Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
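With vLLM, both forms of parallelism are set at engine startup; a sketch for sharding one large model across 8 GPUs (4-way tensor × 2-way pipeline) might look like this, with the model name as an example:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=4,    # split each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack across 2 pipeline stages
)
```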
| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
Monthly Cost = (Requests/month × Avg tokens/request × GPU-seconds/token × $/GPU-hour) / 3600
Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour
Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
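The same arithmetic as a small helper, reproducing the example above:

```python
def monthly_cost(requests_per_month: float, avg_tokens: float,
                 gpu_seconds_per_token: float, dollars_per_gpu_hour: float) -> float:
    """Convert GPU-seconds to GPU-hours (divide by 3600), then price them."""
    gpu_hours = requests_per_month * avg_tokens * gpu_seconds_per_token / 3600
    return gpu_hours * dollars_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # -> $2,778/month
```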
┌─────────────────────────────────────────────────────────┐
│ Router │
│ • Classify request complexity │
│ • Route to appropriate model │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Small │ │ Medium │ │ Large │
│ Model │ │ Model │ │ Model │
│ (7B) │ │ (13B) │ │ (70B) │
│ Fast │ │ Balanced│ │ Quality │
└─────────┘ └─────────┘ └─────────┘
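A deliberately simple routing sketch; the complexity score, tier names, and endpoints below are hypothetical placeholders for a real classifier:

```python
# Hypothetical model tiers and endpoints.
MODEL_TIERS = [
    ("small-7b",   "http://small:8000"),   # fast, cheap
    ("medium-13b", "http://medium:8000"),  # balanced
    ("large-70b",  "http://large:8000"),   # highest quality
]

def route(prompt: str) -> tuple[str, str]:
    """Pick a model tier from a crude complexity score (prompt length as a stand-in)."""
    score = len(prompt.split())
    if score < 50:
        return MODEL_TIERS[0]
    if score < 300:
        return MODEL_TIERS[1]
    return MODEL_TIERS[2]

model, endpoint = route("Summarize this paragraph in one sentence.")
```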
| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
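A minimal in-process sketch of an exact-match response cache with TTL; a production deployment would more likely back this with Redis or another shared store:

```python
import hashlib
import time

# In-memory cache: key -> (stored_at, response). Illustrative only.
_cache: dict[str, tuple[float, str]] = {}

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def get_cached(model: str, prompt: str, ttl_seconds: float = 3600) -> str | None:
    entry = _cache.get(cache_key(model, prompt))
    if entry is None:
        return None
    stored_at, response = entry
    if time.time() - stored_at > ttl_seconds:
        return None  # expired
    return response

def put_cached(model: str, prompt: str, response: str) -> None:
    _cache[cache_key(model, prompt)] = (time.time(), response)
```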
Related skills:
• ml-system-design - End-to-end ML pipeline design
• rag-architecture - Retrieval-augmented generation patterns
• vector-databases - Vector search for LLM context
• ml-inference-optimization - General inference optimization
• estimation-techniques - Capacity planning for LLM systems
Date: 2025-12-26