ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
Optimize ML inference performance using model compression, quantization, and caching. Use when reducing latency or model size for deployment.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Use this skill when:
Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Optimization Stack │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Model Level │ │
│ │ Distillation │ Pruning │ Quantization │ Architecture Search │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Compiler Level │ │
│ │ Graph optimization │ Operator fusion │ Memory planning │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Runtime Level │ │
│ │ Batching │ Caching │ Async execution │ Multi-threading │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Hardware Level │ │
│ │ GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|---|---|---|---|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
┌─────────────────────────────────────────────────────────────────────┐
│ Knowledge Distillation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Teacher Model│ (Large, accurate, slow) │
│ │ GPT-4 │ │
│ └──────────────┘ │
│ │ │
│ ▼ Soft labels (probability distributions) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Training Process │ │
│ │ Loss = α × CrossEntropy(student, hard_labels) │ │
│ │ + (1-α) × KL_Div(student, teacher_soft_labels) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Student Model │ (Small, nearly as accurate, fast) │
│ │ DistilBERT │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
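The combined loss in the diagram can be written directly in PyTorch. A minimal sketch, where the temperature T and the α weighting are illustrative values rather than prescribed settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5, T=2.0):
    """Combine hard-label cross-entropy with soft-label KL divergence."""
    # Standard cross-entropy against ground-truth labels
    ce = F.cross_entropy(student_logits, hard_labels)
    # KL divergence between temperature-softened student and teacher distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable to CE
    return alpha * ce + (1 - alpha) * kl
```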
Distillation Types:
| Type | Description | Use Case |
|---|---|---|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7] (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries
Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
│ C1│ C2│ C3│ C4│
└───┴───┴───┴───┘
After: ┌───┬───┬───┐
│ C1│ C3│ C4│ (Removed C2 entirely)
└───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
Pruning Decision Criteria:
| Method | Description | Effectiveness |
|---|---|---|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
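For illustration, a sketch of magnitude-based unstructured pruning using PyTorch's built-in pruning utilities; the toy model and the 50% sparsity level (mirroring the example above) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Magnitude-based unstructured pruning: zero out the 50% smallest weights by |value|
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# Verify the resulting sparsity
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"Sparsity: {zeros / total:.1%}")
```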
Precision Hierarchy:
FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████ (different mantissa/exponent)
INT8 (8 bits): ████████
INT4 (4 bits): ████
Binary (1 bit): █
Memory footprint and compute cost scale roughly proportionally with bit width.
Quantization Approaches:
| Approach | When Applied | Quality | Effort |
|---|---|---|---|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training with calibration | Better | Medium |
| QAT | During training | Best | High |
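A minimal sketch of the lowest-effort option, post-training dynamic quantization in PyTorch; the toy model and the choice to quantize only Linear layers are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization: weights stored as INT8, activations quantized at runtime
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, roughly 4x smaller Linear weights
```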
Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output
Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
| Optimization | Description | Speedup |
|---|---|---|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
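These optimizations are usually applied by an inference compiler rather than by hand. A sketch of enabling them through ONNX Runtime; the model path, input name, input shape, and provider list are assumptions that depend on how the model was exported:

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph-level optimizations (fusion, constant folding, layout, etc.)
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "model.onnx",  # assumed path to an exported ONNX model
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
outputs = session.run(None, {"input": x})               # input name depends on export
```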
| Framework | Vendor | Best For |
|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
No Batching:
Request 1: [Process] → Response 1 10ms
Request 2: [Process] → Response 2 10ms
Request 3: [Process] → Response 3 10ms
Total: 30ms, GPU underutilized
Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput
Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
Batching Parameters:
| Parameter | Description | Trade-off |
|---|---|---|
| batch_size | Maximum batch size | Throughput vs. latency |
| max_wait_time | Wait time for batch fill | Latency vs. efficiency |
| min_batch_size | Minimum before processing | Latency predictability |
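A minimal sketch of a dynamic batcher built on asyncio; the queue, the 5ms wait window, and the `run_model` batch function are illustrative assumptions, not any specific serving framework's API:

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_MS = 5

request_queue: asyncio.Queue = asyncio.Queue()

async def infer(x):
    """Client-facing call: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put((x, fut))
    return await fut

async def batcher(run_model):
    """Collect requests for up to MAX_WAIT_MS or MAX_BATCH_SIZE, then run one batch."""
    while True:
        x, fut = await request_queue.get()
        batch, futures = [x], [fut]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(request_queue.get(), timeout)
                batch.append(x)
                futures.append(fut)
            except asyncio.TimeoutError:
                break
        # run_model is assumed to map a list of inputs to a list of outputs
        for fut, result in zip(futures, run_model(batch)):
            fut.set_result(result)
```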
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Caching Layers │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Input Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache exact inputs → Return cached outputs │ │
│ │ Hit rate: Low (inputs rarely repeat exactly) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 2: Embedding Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache computed embeddings for repeated tokens/entities │ │
│ │ Hit rate: Medium (common tokens repeat) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 3: KV Cache (for transformers) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache key-value pairs for attention │ │
│ │ Hit rate: High (reuse across tokens in sequence) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 4: Result Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache semantic equivalents (fuzzy matching) │ │
│ │ Hit rate: Variable (depends on query distribution) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Semantic Caching for LLMs:
Query: "What's the capital of France?"
↓
Hash + Embed query
↓
Search cache (similarity > threshold)
↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
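A sketch of the semantic-cache lookup above using embedding cosine similarity; the `embed` function (assumed to return unit-norm vectors), the 0.92 threshold, and the `generate` callable are illustrative assumptions:

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed            # assumed: text -> unit-norm embedding vector
        self.threshold = threshold
        self.keys, self.values = [], []

    def get(self, query):
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.array(self.keys) @ q   # cosine similarity for unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.keys.append(self.embed(query))
        self.values.append(response)

def cached_generate(cache, generate, query):
    hit = cache.get(query)
    if hit is not None:
        return hit                      # Hit: return cached response
    response = generate(query)          # Miss: generate...
    cache.put(query, response)          # ...cache...
    return response                     # ...and return
```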
Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │ Total: 30ms
│10ms │ │15ms │ │5ms │
└─────┘ └─────┘ └─────┘
Pipelined:
Request 1: │Prep│Model│Post│
Request 2: │Prep│Model│Post│
Request 3: │Prep│Model│Post│
Throughput: 3x higher
Latency per request: Same
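A sketch of stage-level pipelining with worker threads and queues, so preprocessing for request N+1 overlaps model execution for request N; the three stage callables are placeholders:

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run one pipeline stage: consume from inbox, produce to outbox."""
    while True:
        item = inbox.get()
        if item is None:          # poison pill shuts the stage down
            outbox.put(None)
            break
        outbox.put(fn(item))

def run_pipeline(requests, preprocess, model, postprocess):
    q1, q2, q3, out = (queue.Queue() for _ in range(4))
    stages = [(preprocess, q1, q2), (model, q2, q3), (postprocess, q3, out)]
    threads = [threading.Thread(target=stage, args=s, daemon=True) for s in stages]
    for t in threads:
        t.start()
    for r in requests:
        q1.put(r)
    q1.put(None)
    results = []
    while (item := out.get()) is not None:
        results.append(item)
    return results
```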
| Hardware | Strengths | Limitations | Best For |
|---|---|---|---|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited models | Mobile, edge |
| CPU | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
| Optimization | Description | Impact |
|---|---|---|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
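Tensor Core FP16 execution, for example, is typically reached through mixed-precision inference. A minimal PyTorch sketch; the model and input are placeholders, and a CUDA device is assumed:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(32, 1024, device="cuda")

# Run inference under autocast so matmuls dispatch to FP16 Tensor Core kernels
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # torch.float16
```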
┌─────────────────────────────────────────────────────────────────────┐
│ Edge Deployment Constraints │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Resource Constraints: │
│ ├── Memory: 1-4 GB (vs. 64+ GB cloud) │
│ ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud) │
│ ├── Power: 5-15W (vs. 300W+ cloud) │
│ └── Storage: 16-128 GB (vs. TB cloud) │
│ │
│ Operational Constraints: │
│ ├── No network (offline operation) │
│ ├── Variable ambient conditions │
│ ├── Infrequent updates │
│ └── Long deployment lifetime │
│ │
└─────────────────────────────────────────────────────────────────────┘
| Strategy | Description | Use When |
|---|---|---|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
| Framework | Platform | Features |
|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
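As an example of the edge toolchain, a sketch of converting a SavedModel to an INT8-quantized TensorFlow Lite model; the paths, input shape, and representative-dataset generator are assumptions:

```python
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # assumed path
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative samples let the converter calibrate INT8 activation ranges
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]  # assumed input shape

converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```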
┌─────────────────────────────────────────────────────────────────────┐
│ Latency Breakdown Analysis │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Data Loading: ████████░░░░░░░░░░ 15% │
│ 2. Preprocessing: ██████░░░░░░░░░░░░ 10% │
│ 3. Model Inference: ████████████████░░ 60% │
│ 4. Postprocessing: ████░░░░░░░░░░░░░░ 8% │
│ 5. Response Serialization:███░░░░░░░░░░░░░░░ 7% │
│ │
│ Target: Model inference (60% = biggest optimization opportunity) │
│ │
└─────────────────────────────────────────────────────────────────────┘
| Tool | Use For |
|---|---|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
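A minimal sketch of profiling one inference pass with the PyTorch Profiler to produce the kind of breakdown shown above; the model and input are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
x = torch.randn(32, 512)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.inference_mode():
        model(x)

# Show which operators dominate inference time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```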
| Metric | Description | Target |
|---|---|---|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meet demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Bytes moved per second | < hardware limit |
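The latency metrics fall out directly from per-request timing samples; a small sketch (the throughput figure assumes requests were processed serially, which is an illustrative simplification):

```python
import numpy as np

def latency_report(latencies_ms):
    """Summarize request latencies (in milliseconds) into P50/P99 and throughput."""
    arr = np.asarray(latencies_ms)
    return {
        "p50_ms": float(np.percentile(arr, 50)),
        "p99_ms": float(np.percentile(arr, 99)),
        "throughput_rps": len(arr) / (arr.sum() / 1000),  # assumes serial execution
    }

print(latency_report([12.0, 15.1, 11.8, 14.2, 90.5, 13.0]))
```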
┌─────────────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Baseline │
│ └── Measure current performance (latency, throughput, accuracy) │
│ │
│ 2. Profile │
│ └── Identify bottlenecks (model, data, system) │
│ │
│ 3. Optimize (in order of effort/impact): │
│ ├── Hardware: Use right accelerator │
│ ├── Compiler: Enable optimizations (TensorRT, ONNX) │
│ ├── Runtime: Batching, caching, async │
│ ├── Model: Quantization, pruning │
│ └── Architecture: Distillation, model change │
│ │
│ 4. Validate │
│ └── Verify accuracy maintained, latency improved │
│ │
│ 5. Deploy and Monitor │
│ └── Track real-world performance │
│ │
└─────────────────────────────────────────────────────────────────────┘
High Impact
│
Compiler Opts ────┼──── Quantization
(easy win) │ (best ROI)
│
Low Effort ──────────────┼──────────────── High Effort
│
Batching ────┼──── Distillation
(quick win) │ (major effort)
│
Low Impact
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Request → ┌─────────┐ │
│ │ Router │ │
│ └─────────┘ │
│ │ │ │ │
│ ┌────────┘ │ └────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Tiny │ │ Small │ │ Large │ │
│ │ <10ms │ │ <50ms │ │<500ms │ │
│ └───────┘ └───────┘ └───────┘ │
│ │
│ Routing strategies: │
│ • Complexity-based: Simple→Tiny, Complex→Large │
│ • Confidence-based: Try Tiny, escalate if low confidence │
│ • SLA-based: Route based on latency requirements │
│ │
└─────────────────────────────────────────────────────────────────────┘
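A sketch of the confidence-based routing strategy; the tiny/large model callables, their return shapes, and the 0.9 threshold are illustrative assumptions:

```python
def route(request, tiny_model, large_model, threshold=0.9):
    """Confidence-based cascade: try the cheap model, escalate if it is unsure."""
    prediction, confidence = tiny_model(request)   # assumed: returns (label, score)
    if confidence >= threshold:
        return prediction                          # fast path
    return large_model(request)                    # slow path for hard inputs

# Illustrative stand-ins for real models
tiny = lambda x: ("positive", 0.95) if len(x) < 20 else ("unsure", 0.4)
large = lambda x: "positive"
print(route("great product", tiny, large))
```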
Query: "Translate: Hello"
│
├──▶ Small model (draft): "Bonjour" (5ms)
│
└──▶ Large model (verify): Check "Bonjour" (10ms parallel)
│
├── Accept: Return immediately
└── Reject: Generate with large model
Speedup: 2-3x when drafts are often accepted
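A highly simplified sketch of one draft-and-verify round with greedy (exact-match) acceptance; `draft_model` and `target_model` are hypothetical callables, and in a real implementation the target model scores all drafted positions in a single batched forward pass:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One round of speculative decoding with greedy verification."""
    draft = draft_model(prefix, k)        # small model drafts k tokens (cheap)
    # Assumed: one large-model call returns its own greedy choice for each of the
    # k+1 positions, i.e. target[i] is the token it would emit after prefix + draft[:i].
    target = target_model(prefix, draft)
    accepted = []
    for i in range(k):
        if draft[i] == target[i]:
            accepted.append(draft[i])     # draft token verified, keep it
        else:
            accepted.append(target[i])    # first mismatch: take the target's token
            break
    else:
        accepted.append(target[k])        # all drafts accepted: free bonus token
    return accepted
```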
Input → ┌────────┐
│ Filter │ ← Cheap filter (reject obvious negatives)
└────────┘
│ (candidates only)
▼
┌────────┐
│ Stage 1│ ← Fast model (coarse ranking)
└────────┘
│ (top-100)
▼
┌────────┐
│ Stage 2│ ← Accurate model (fine ranking)
└────────┘
│ (top-10)
▼
Output
Benefit: 10x cheaper, similar accuracy
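A sketch of the cascade above for a ranking workload; the filter, fast scorer, and accurate scorer are placeholder callables, and the 100/10 cutoffs mirror the diagram:

```python
def cascade_rank(candidates, cheap_filter, fast_score, accurate_score,
                 stage1_k=100, stage2_k=10):
    """Run progressively more expensive models on progressively fewer items."""
    survivors = [c for c in candidates if cheap_filter(c)]              # reject obvious negatives
    coarse = sorted(survivors, key=fast_score, reverse=True)[:stage1_k]  # coarse ranking
    return sorted(coarse, key=accurate_score, reverse=True)[:stage2_k]   # fine ranking
```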
llm-serving-patterns - LLM-specific serving optimization
ml-system-design - End-to-end ML pipeline design
quality-attributes-taxonomy - Performance as quality attribute
estimation-techniques - Capacity planning for ML systems
Date: 2025-12-26