End-to-end ML system design for production. Use when designing ML pipelines, feature stores, model training infrastructure, or serving systems. Covers the complete lifecycle from data ingestion to model deployment and monitoring.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
This skill provides frameworks for designing production machine learning systems, from data pipelines to model serving.
Keywords: ML pipeline, machine learning system, feature store, model training, model serving, ML infrastructure, MLOps, A/B testing ML, feature engineering, model deployment
Use this skill when:
- Designing end-to-end ML pipelines for production
- Planning feature stores and feature pipelines
- Building model training infrastructure
- Designing model serving and inference systems
- Setting up ML monitoring, drift detection, or A/B testing
┌─────────────────────────────────────────────────────────────────────────┐
│ ML SYSTEM LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Data │──▶│ Feature │──▶│ Model │──▶│ Model │──▶│ Monitor│ │
│ │ Ingestion│ │ Pipeline │ │ Training │ │ Serving │ │ & Eval │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Data │ │ Feature │ │ Model │ │ Inference│ │ Metrics│ │
│ │ Lake │ │ Store │ │ Registry │ │ Cache │ │ Store │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Component | Purpose | Examples |
|---|---|---|
| Data Ingestion | Collect raw data from sources | Kafka, Kinesis, Pub/Sub |
| Feature Pipeline | Transform raw data to features | Spark, Flink, dbt |
| Feature Store | Store and serve features | Feast, Tecton, Vertex AI |
| Model Training | Train and validate models | SageMaker, Vertex AI, Kubeflow |
| Model Registry | Version and track models | MLflow, Weights & Biases |
| Model Serving | Serve predictions | TensorFlow Serving, Triton, vLLM |
| Monitoring | Track model performance | Evidently, WhyLabs, Arize |
Problems without a feature store:
- Training/serving skew: features computed one way for training, another way online
- Duplicated feature engineering logic across teams and pipelines
- No point-in-time correctness, so training data can leak future information
- No central place to discover, version, or reuse existing features
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ OFFLINE STORE │ │ ONLINE STORE │ │
│ │ │ │ │ │
│ │ - Historical data │ │ - Low-latency │ │
│ │ - Training queries │ ────▶ │ - Point lookups │ │
│ │ - Batch features │ sync │ - Real-time serving│ │
│ │ │ │ │ │
│ │ (Data Warehouse) │ │ (Redis, DynamoDB) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ FEATURE REGISTRY ││
│ │ - Feature definitions - Version control ││
│ │ - Data lineage - Access control ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
| Type | Computation | Storage | Example |
|---|---|---|---|
| Batch | Scheduled (hourly/daily) | Offline → Online | User purchase count (30 days) |
| Streaming | Real-time event processing | Direct to online | Items in cart (current) |
| On-demand | Request-time computation | Not stored | Distance to nearest store |
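As a concrete illustration, here is a minimal batch feature view in the style of Feast's Python API. The source path, entity, and field names are illustrative, and the exact API surface varies between Feast versions:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Illustrative source: a parquet file of precomputed batch features.
user_purchases_source = FileSource(
    path="data/user_purchase_stats.parquet",  # hypothetical path
    timestamp_field="event_timestamp",
)

user = Entity(name="user_id", join_keys=["user_id"])

# Batch feature view: computed on a schedule, synced offline -> online.
user_purchase_stats = FeatureView(
    name="user_purchase_stats",
    entities=[user],
    ttl=timedelta(days=30),
    schema=[
        Field(name="purchase_count_30d", dtype=Int64),
        Field(name="avg_order_value_30d", dtype=Float32),
    ],
    source=user_purchases_source,
)
```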
TRAINING (Historical):
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Historical │───▶│ Point-in-Time│───▶│ Training │
│ Events │ │ Join │ │ Dataset │
└──────────────┘ └──────────────┘ └──────────────┘
│
Uses feature
definitions
│
SERVING (Real-time): ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Online │───▶│ Same Feature │───▶│ Prediction │
│ Store │ │ Definitions │ │ Request │
└──────────────┘ └──────────────┘ └──────────────┘
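Point-in-time correctness is the step teams most often get wrong when building training sets by hand. A minimal pandas sketch of the idea uses `merge_asof`, so each training event only sees the latest feature value at or before its own timestamp; the column names here are assumptions:

```python
import pandas as pd

# Labeled events: one row per prediction target, with the time it occurred.
events = pd.DataFrame({
    "user_id": [1, 2, 1],
    "event_timestamp": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-09"]),
    "label": [0, 1, 0],
}).sort_values("event_timestamp")

# Feature snapshots: one row per user per feature computation time.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-07", "2024-01-02"]),
    "purchase_count_30d": [3, 5, 1],
}).sort_values("event_timestamp")

# Point-in-time join: for each event, take the most recent feature row
# at or before the event time -- never a future value (no label leakage).
training_set = pd.merge_asof(
    events, features, on="event_timestamp", by="user_id", direction="backward"
)
print(training_set)
```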
┌───────────────────────────────────────────────────────────────────────┐
│ TRAINING PIPELINE │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Data │──▶│ Feature │──▶│ Model │──▶│ Model │ │
│ │ Loader │ │ Transform│ │ Train │ │ Validate │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Experiment │ │ Hyperparameter│ │ Checkpoint │ │ Model │ │
│ │ Tracking │ │ Tuning │ │ Storage │ │ Registry │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────┘
| Pattern | Use Case | Tools |
|---|---|---|
| Single-node | Small datasets, quick experiments | Jupyter, local GPU |
| Distributed data-parallel | Large datasets, same model | Horovod, PyTorch DDP |
| Model-parallel | Large models that don't fit in memory | DeepSpeed, FSDP, Megatron |
| Hyperparameter tuning | Automated model optimization | Optuna, Ray Tune |
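For the distributed data-parallel row above, a minimal PyTorch DDP training loop looks roughly like the following. The model and loader are assumed to be supplied by the caller, and the script is expected to be launched with `torchrun`; this is a sketch, not a full trainer:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: torch.nn.Module, loader, epochs: int = 1) -> None:
    # Assumes launch via `torchrun`, which sets RANK/WORLD_SIZE/LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank holds a full replica; DDP all-reduces gradients on backward().
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in loader:  # loader should shard data via DistributedSampler
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # gradient sync happens here
            optimizer.step()

    dist.destroy_process_group()
```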
Track for reproducibility:
| What to Track | Why |
|---|---|
| Hyperparameters | Reproduce training runs |
| Metrics | Compare model performance |
| Artifacts | Model files, datasets |
| Code version | Git commit hash |
| Environment | Docker image, dependencies |
| Data version | Dataset hash or snapshot |
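A sketch of logging all six of these with MLflow's tracking API; the tag values, metric numbers, and artifact path are illustrative:

```python
import subprocess

import mlflow

with mlflow.start_run(run_name="baseline-transformer"):
    # Hyperparameters: enough to reproduce the run.
    mlflow.log_params({"lr": 3e-4, "batch_size": 32, "epochs": 10})

    # Code version: tie the run to a git commit.
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    mlflow.set_tag("git_commit", commit)
    mlflow.set_tag("docker_image", "ml-train:1.4.2")        # illustrative
    mlflow.set_tag("dataset_version", "s3://bucket/train/v7")  # illustrative

    # Metrics per epoch, then final artifacts.
    for epoch, auc in enumerate([0.81, 0.84, 0.86]):
        mlflow.log_metric("val_auc", auc, step=epoch)
    mlflow.log_artifact("model.pt")  # assumes the file exists locally
```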
| Pattern | Latency | Throughput | Use Case |
|---|---|---|---|
| Online (REST/gRPC) | Low (<100ms) | Medium | Real-time predictions |
| Batch | High (hours) | Very high | Bulk scoring |
| Streaming | Medium | High | Event-driven predictions |
| Embedded | Very low | Varies | Edge/mobile inference |
┌─────────────────────────────────────────────────────────────────────┐
│ MODEL SERVING SYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Clients │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Load Balancer│ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ API Gateway │ │
│ │ - Authentication - Rate limiting - Request validation │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┼───────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Model A │ │ Model B │ │ Model C │ │
│ │ (v1.2) │ │ (v2.0) │ │ (v1.0) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │
│ └───────────────────────┼───────────────────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Feature Store │ │
│ │ (Online) │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
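A minimal online-serving endpoint under this architecture might look like the FastAPI sketch below. `lookup_features` and `model_predict` are hypothetical stand-ins for the online feature store read and the loaded model:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    user_id: int

def lookup_features(user_id: int) -> list[float]:
    # Hypothetical stand-in; in production this is a Redis/DynamoDB point read
    # against the online store, using the same definitions as training.
    return [float(user_id % 10), 0.42]

def model_predict(features: list[float]) -> float:
    # Hypothetical stand-in for the loaded model's forward pass.
    return sum(features) / (len(features) or 1)

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # 1. Point lookup against the online feature store.
    features = lookup_features(req.user_id)
    # 2. Run inference and return the score.
    return {"user_id": req.user_id, "score": model_predict(features)}
```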
| Technique | Performance Impact | Trade-off |
|---|---|---|
| Batching | Higher throughput; amortizes per-request overhead | Adds queueing latency for individual requests |
| Caching | 10-100x faster for repeated inputs | May serve stale predictions |
| Quantization | 2-4x faster | Slight accuracy loss |
| Distillation | Smaller, faster student model | Training overhead; some quality loss |
| GPU inference | 10-100x faster than CPU | Cost increase |
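To make the caching trade-off concrete, here is a small in-process TTL cache sketch: staleness is bounded by the TTL, and keying on the model version means a new deployment naturally invalidates old entries. Class and parameter names are invented for illustration:

```python
import hashlib
import time

class PredictionCache:
    """TTL cache for model outputs; staleness is bounded by ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, float]] = {}  # key -> (expiry, score)

    def _key(self, model_version: str, features: list[float]) -> str:
        # Key on the model version too, so a deploy invalidates old entries.
        raw = f"{model_version}:{features}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_compute(self, model_version, features, predict_fn):
        key = self._key(model_version, features)
        hit = self._store.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]  # fresh cached prediction
        score = predict_fn(features)
        self._store[key] = (time.monotonic() + self.ttl, score)
        return score
```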
┌─────────────────────────────────────────────────────────────────────┐
│ A/B TESTING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Traffic │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Experiment Assignment │ ◀─────── Experiment Config │
│ │ - User bucketing │ - Allocation % │
│ │ - Feature flags │ - Target segments │
│ └──────────┬───────────┘ - Guardrails │
│ │ │
│ ┌────────┴────────┐ │
│ ▼ ▼ │
│ ┌────────┐ ┌────────┐ │
│ │Control │ │Treatment│ │
│ │Model A │ │Model B │ │
│ └────┬───┘ └────┬───┘ │
│ │ │ │
│ └────────┬───────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Metrics Logger │ │
│ └────────┬───────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Statistical │ ─────▶ Decision: Ship / Iterate / Kill │
│ │ Analysis │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
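Experiment assignment must be deterministic (a user always sees the same arm) and independent across experiments. A common approach is salted hashing, sketched below; the function and experiment names are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_pct: float) -> str:
    """Deterministic bucketing: the same user always lands in the same arm."""
    # Salt with the experiment ID so buckets are independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% granularity
    return "treatment" if bucket < treatment_pct * 100 else "control"

# 10% of users get the treatment Model B; the rest stay on the control Model A.
variant = assign_variant("user-42", "ranker-v2-rollout", treatment_pct=10.0)
```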
| Metric Type | Examples | Purpose |
|---|---|---|
| Model metrics | AUC, RMSE, precision/recall | Model quality |
| Business metrics | CTR, conversion, revenue | Business impact |
| Guardrail metrics | Latency, error rate, engagement | Prevent regressions |
| Segment metrics | Metrics by user segment | Detect heterogeneous effects |
| Category | Metrics | Alert Threshold |
|---|---|---|
| Data quality | Missing values, schema drift | >1% change |
| Feature drift | Distribution shift (PSI, KL) | PSI >0.2 |
| Prediction drift | Output distribution shift | Depends on use case |
| Model performance | Accuracy, AUC (when labels available) | >5% degradation |
| Operational | Latency, throughput, errors | SLO violations |
┌─────────────────────────────────────────────────────────────────────┐
│ DRIFT DETECTION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Training Data Production Data │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Reference │ │ Current │ │
│ │ Distribution │ │ Distribution │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ └──────────────┬──────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Statistical Test │ │
│ │ - PSI (Population Stability Index) │
│ │ - KS Test │
│ │ - Chi-squared │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Drift Score │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ ▼ ▼ ▼ │
│ No Drift Warning Critical │
│ (< 0.1) (0.1-0.2) (> 0.2) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Continue Investigate Retrain │
│ │
└─────────────────────────────────────────────────────────────────────┘
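A minimal NumPy implementation of PSI against the thresholds above; the bin count and epsilon are conventional choices, not mandated values:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    # Bin edges come from the reference (training) distribution.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)

    # Small epsilon avoids log(0) and division by zero in empty bins.
    eps = 1e-6
    ref_pct, cur_pct = ref_pct + eps, cur_pct + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
score = psi(rng.normal(0, 1, 50_000), rng.normal(0.3, 1, 50_000))
# Interpret with the thresholds above: <0.1 stable, 0.1-0.2 warning, >0.2 critical.
```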
Recommendation system components needed (a retrieve-then-rank sketch follows this list):
- Candidate Generation (retrieve 100s-1000s)
- Ranking Model (score and sort)
- Feature Store (user features, item features)
- Real-time personalization (recent behavior)
- A/B testing infrastructure
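A sketch of the two-stage pattern from this list, using a dot-product scan as a stand-in for an approximate-nearest-neighbor index and an injected `rank_fn` as the heavier ranking model (all names are illustrative):

```python
import numpy as np

def recommend(user_vec: np.ndarray, item_vecs: np.ndarray,
              rank_fn, n_candidates: int = 500, k: int = 20) -> np.ndarray:
    """Two-stage recommendation: cheap retrieval, then expensive ranking."""
    # Stage 1 -- candidate generation: approximate relevance via dot product.
    # In production this is an ANN index lookup, not a full scan.
    scores = item_vecs @ user_vec
    candidates = np.argpartition(-scores, n_candidates)[:n_candidates]

    # Stage 2 -- ranking: a heavier model scores only the candidates.
    ranked = sorted(candidates,
                    key=lambda i: rank_fn(user_vec, item_vecs[i]),
                    reverse=True)
    return np.array(ranked[:k])
```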
Fraud detection components needed:
- Real-time feature computation
- Low-latency model serving (<50ms)
- High recall focus (can't miss fraud)
- Explainability for compliance
- Human-in-the-loop review
- Feedback loop for labels
Search ranking components needed:
- Two-stage ranking (retrieval + ranking)
- Feature store for query/document features
- Low latency (<200ms end-to-end)
- Learning to rank models
- Click-through rate prediction
- A/B testing with interleaving
Training time estimation:
- Dataset size: 100M examples
- Model: Transformer (100M params)
- GPU: A100 (80GB, 312 TFLOPS)
- Batch size: 32
- Training steps: Dataset / batch = 3.1M steps
- Time per step: ~100ms
- Total time: ~86 hours single GPU
- With 8 GPUs (data parallel): ~11 hours
Inference estimation:
- QPS: 10,000
- Model latency: 20ms
- Batch size: 1 (real-time)
- GPU utilization: 50% (latency constraint)
- Requests per GPU/sec: 25
- GPUs needed: 10,000 / 25 = 400 GPUs
- With batching (batch 8): 100 GPUs (4x reduction)
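Both estimates reduce to a few lines of arithmetic, reproduced here for checking:

```python
# Back-of-envelope reproduction of the two estimates above.

# Training: 100M examples, batch 32, ~100 ms/step.
steps = 100_000_000 / 32             # ~3.1M steps per epoch
train_hours = steps * 0.100 / 3600   # ~87 h on one GPU
print(f"single GPU: {train_hours:.0f} h, 8-GPU data parallel: {train_hours / 8:.0f} h")

# Inference: 10k QPS, 20 ms/request, 50% usable utilization.
per_gpu_qps = (1 / 0.020) * 0.5      # 25 requests/s per GPU
print(f"GPUs at batch size 1: {10_000 / per_gpu_qps:.0f}")
# Batching to 8 gives ~4x effective throughput (not 8x, since batched
# requests queue behind each other), hence ~100 GPUs.
```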
Related skills:
- llm-serving-patterns - LLM-specific serving and optimization
- rag-architecture - Retrieval-Augmented Generation patterns
- vector-databases - Vector search and embeddings
- ml-inference-optimization - Latency and cost optimization
- estimation-techniques - Back-of-envelope calculations
- quality-attributes-taxonomy - NFR definitions

Commands:
- /sd:ml-pipeline <problem> - Design ML system interactively
- /sd:estimate <scenario> - Capacity calculations

Agents:
- ml-systems-designer - Design ML architectures
- ml-interviewer - Mock ML system design interviews