From grimoire
Designs production ML pipelines with automated training, validation, deployment, and monitoring. Useful when moving ML systems from experimentation to reliable production.
How this skill is triggered — by the user, by Claude, or both
Slash command
/grimoire:design-ml-pipeline
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Design a production ML pipeline that automates training, validation, deployment, and monitoring to deliver reliable model updates continuously.
Adopted by: Google (TFX), Uber (Michelangelo), Airbnb (Bighead), Netflix (Metaflow); all large ML orgs converge on pipeline automation.
Impact: Teams with automated ML pipelines deploy models 46x more frequently (Google DORA ML 2022); Sculley et al. found ML systems accrue technical debt 10x faster than software systems without pipeline discipline.
Why best: Manual ML workflows do not scale; data dependencies, model decay, and experiment tracking become unmanageable without automation.
Sources: Sculley et al. NIPS 2015; Google "Practitioners Guide to MLOps" (2021); Huyen "Designing Machine Learning Systems" O'Reilly (2022)
Define the ML problem formally — State: input features, prediction target, success metric (AUC, RMSE, business KPI), and serving latency/throughput requirements. Ambiguous problem statements produce unmeasurable models. Get stakeholder sign-off before writing code.
Design data ingestion and validation — Automate data collection from source systems. Implement schema validation (Great Expectations, TFX Data Validation) to catch data drift, missing features, and distribution shifts at ingestion time. Fail the pipeline on critical validation errors rather than silently training on corrupt data.
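As a rough illustration of failing fast at ingestion time, the sketch below hand-rolls a schema check in plain Python. It is not the Great Expectations or TFX Data Validation API; `ColumnSpec`, `find_violations`, `validate_or_fail`, and `EXAMPLE_SCHEMA` are hypothetical names introduced here. It covers only missing features, nulls, types, and value ranges, and does not attempt distribution-shift detection.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


class DataValidationError(Exception):
    """Raised when a batch fails a critical schema check."""


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def find_violations(rows: List[Dict[str, Any]], schema: List[ColumnSpec]) -> List[str]:
    """Return a human-readable list of schema violations (empty if clean)."""
    violations = []
    for index, row in enumerate(rows):
        for spec in schema:
            if spec.name not in row:
                violations.append(f"row {index}: missing feature '{spec.name}'")
                continue
            value = row[spec.name]
            if value is None:
                if not spec.nullable:
                    violations.append(f"row {index}: '{spec.name}' is null")
                continue
            if not isinstance(value, spec.dtype):
                violations.append(
                    f"row {index}: '{spec.name}' has type {type(value).__name__}"
                )
                continue
            if spec.min_value is not None and value < spec.min_value:
                violations.append(f"row {index}: '{spec.name}'={value} below minimum")
            if spec.max_value is not None and value > spec.max_value:
                violations.append(f"row {index}: '{spec.name}'={value} above maximum")
    return violations


def validate_or_fail(rows: List[Dict[str, Any]], schema: List[ColumnSpec]) -> None:
    """Stop the pipeline instead of silently training on corrupt data."""
    violations = find_violations(rows, schema)
    if violations:
        raise DataValidationError("; ".join(violations))


# Hypothetical schema, for illustration only.
EXAMPLE_SCHEMA = [
    ColumnSpec("age", int, min_value=0, max_value=130),
    ColumnSpec("country", str),
]
```

Calling `validate_or_fail(batch, EXAMPLE_SCHEMA)` as the first pipeline step means a bad batch raises before any training job is scheduled.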
Build a feature store or feature engineering pipeline — Centralize feature computation to prevent train-serve skew (the #1 source of silent model degradation). Features computed differently in training vs serving produce models that perform worse in production than offline. Use point-in-time joins to prevent data leakage.
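To make the point-in-time join concrete, here is a minimal sketch using only the standard library. For each labeled event it picks the latest feature value recorded at or before the event time, so nothing from the future leaks into the training row. The function name `point_in_time_join` and the tuple layouts are assumptions made for this example; a real feature store would do the same lookup at scale.

```python
from bisect import bisect_right
from typing import Any, Dict, List, Optional, Tuple

# (entity_id, event_time, label)
LabelRow = Tuple[str, int, int]
# entity_id -> list of (feature_time, feature_value), sorted by feature_time
FeatureHistory = Dict[str, List[Tuple[int, float]]]


def point_in_time_join(
    labels: List[LabelRow], feature_history: FeatureHistory
) -> List[Tuple[str, int, Optional[float], int]]:
    """Attach to each label the newest feature value known at event_time."""
    joined = []
    for entity_id, event_time, label in labels:
        history = feature_history.get(entity_id, [])
        times = [t for t, _ in history]
        # Index of the first entry strictly after event_time.
        idx = bisect_right(times, event_time)
        # No entry at or before event_time means the feature was unknown then.
        value: Any = history[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, event_time, value, label))
    return joined
```

Using `bisect_right` treats a feature written at exactly the event time as available; switching to `bisect_left` would require it to be strictly earlier, which is the safer choice when timestamps are coarse.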
Implement experiment tracking — Log every training run: code version (git SHA), dataset version, hyperparameters, and metrics. Use MLflow, Weights & Biases, or Vertex AI Experiments. Never make architecture decisions from runs you cannot reproduce.
Automate training and hyperparameter tuning — Parameterize training scripts; never hardcode hyperparameters. Define a training compute budget and use Bayesian optimization or successive halving (Optuna, Ray Tune) rather than grid search. Reproducible training requires pinned library versions and fixed random seeds.
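The idea behind successive halving can be sketched in a few lines: score every configuration on a small budget, keep the best fraction, multiply the budget, and repeat. The code below is a simplified standalone version, not the Optuna or Ray Tune implementation; `successive_halving`, `toy_evaluate`, and the candidate learning rates are hypothetical and stand in for a real training function.

```python
from typing import Callable, Dict, List

Config = Dict[str, float]


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_budget: int = 1,
    eta: int = 2,
) -> Config:
    """Return the surviving config; higher evaluate() scores are better."""
    if not configs:
        raise ValueError("need at least one config")
    if eta < 2:
        raise ValueError("eta must be at least 2")
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every survivor at the current budget, best first.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the top 1/eta and give them eta times more budget next round.
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_evaluate(config: Config, budget: int) -> float:
    """Stand-in for 'train for `budget` epochs and return validation score'."""
    return -abs(config["lr"] - 0.1) * (1 + 1 / budget)


# Hypothetical search space, for illustration only.
CANDIDATES = [{"lr": 0.001}, {"lr": 0.01}, {"lr": 0.1}, {"lr": 1.0}]
```

With four candidates and `eta=2` the loop runs two rounds (4 to 2 to 1), so most of the compute budget goes to the configurations that looked promising early.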
Implement model validation gates — Before promotion to staging: compare new model against current production model on a held-out evaluation set. Gate on: metric threshold (e.g., AUC ≥ 0.85), regression tests (known failure cases), and latency budget (p99 inference < 100 ms). Fail the pipeline if any gate fails.
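One way to express these gates is a pure function that returns the list of failed checks, so the pipeline can log the reasons and stop when the list is non-empty. This is a sketch under assumed names (`CandidateReport`, `failed_gates`); the default thresholds mirror the example values in the step above, and the comparison against the production model's AUC is one plausible reading of "compare against current production".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateReport:
    auc: float                 # candidate AUC on the held-out evaluation set
    production_auc: float      # current production model on the same set
    regression_failures: int   # known failure cases the candidate still gets wrong
    p99_latency_ms: float      # measured p99 inference latency


def failed_gates(
    report: CandidateReport, min_auc: float = 0.85, max_p99_ms: float = 100.0
) -> List[str]:
    """Return the names of failed gates; an empty list means promote."""
    failures = []
    if report.auc < min_auc:
        failures.append("metric_threshold")
    if report.auc < report.production_auc:
        failures.append("worse_than_production")
    if report.regression_failures > 0:
        failures.append("regression_tests")
    if report.p99_latency_ms >= max_p99_ms:
        failures.append("latency_budget")
    return failures
```

A pipeline step could then raise or exit non-zero whenever `failed_gates(report)` is non-empty, which keeps the promotion decision auditable.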
Design model registry and versioning — Store trained model artifacts with metadata in a model registry (MLflow Registry, Vertex AI Model Registry). Each registered model version links to: training data version, code version, evaluation metrics. Never deploy a model that isn't registered.
Implement staged rollout — Deploy via shadow mode (log predictions without serving), canary (5% of traffic), then full rollout. Use feature flags to enable rollback in < 5 minutes. Automated rollback triggers if online metrics (prediction distribution, downstream business metric) degrade.
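A canary split is often implemented by hashing a stable request key into buckets, so the same user consistently hits the same model version. The sketch below assumes that approach; `in_canary`, the `salt` value, and the 10,000-bucket resolution are illustrative choices, not part of any particular serving framework.

```python
import hashlib


def in_canary(request_key: str, canary_percent: float = 5.0, salt: str = "model-v2") -> bool:
    """Deterministically route roughly canary_percent of keys to the new model."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets, giving 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100
```

Setting `canary_percent` to 0 acts as an instant rollback switch, and changing `salt` per rollout reshuffles which users land in the canary group.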
Monitor model performance in production — Track: prediction distribution (statistical drift from training distribution), feature distribution, downstream business metrics, and data pipeline freshness. Set alerts on distribution shift (KL divergence threshold). Retrain triggers automatically when drift exceeds threshold.
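A drift alert based on KL divergence can be sketched by binning a feature (or the prediction score) in both the training sample and a live window, then comparing the two histograms. Everything named here (`bin_fractions`, `kl_divergence`, `drift_alert`, the smoothing constant, and the 0.1 threshold) is an assumption for illustration; the direction KL(live || training) is also a choice, and a suitable threshold has to be calibrated per feature.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Histogram `values` into len(edges)-1 bins and return smoothed fractions."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Out-of-range values are clamped into the first or last bin.
        idx = min(max(bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    # Additive smoothing keeps empty bins from producing log(0) or division by zero.
    smoothed = [c / len(values) + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_alert(train: Sequence[float], live: Sequence[float],
                edges: Sequence[float], threshold: float = 0.1) -> bool:
    """True when the live window has drifted past the threshold."""
    return kl_divergence(bin_fractions(live, edges), bin_fractions(train, edges)) > threshold
```

The same `drift_alert` result can feed the retraining trigger, so alerting and retraining share one definition of drift.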
Schedule retraining — Define retraining trigger: time-based (weekly), data-volume-based (every 1M new samples), or drift-based (monitoring alert). Automate the full retrain-validate-deploy cycle. Manual retraining is a bottleneck for models that decay quickly (ad click-through, recommendation).
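The three trigger types can be combined into one policy check that a scheduler evaluates periodically. The following is a minimal sketch; `RetrainPolicy` and `retrain_reasons` are hypothetical names, the weekly and 1M-sample defaults come from the examples above, and the drift threshold of 0.1 is an assumed placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class RetrainPolicy:
    max_age: timedelta = timedelta(days=7)   # time-based trigger
    max_new_samples: int = 1_000_000         # data-volume-based trigger
    drift_threshold: float = 0.1             # drift-based trigger (assumed value)


def retrain_reasons(
    policy: RetrainPolicy,
    last_trained: datetime,
    now: datetime,
    new_samples: int,
    drift_score: float,
) -> List[str]:
    """Return every trigger that fired; an empty list means no retrain yet."""
    reasons = []
    if now - last_trained >= policy.max_age:
        reasons.append("time")
    if new_samples >= policy.max_new_samples:
        reasons.append("data_volume")
    if drift_score > policy.drift_threshold:
        reasons.append("drift")
    return reasons
```

Returning the reasons rather than a bare boolean makes it easy to record why each retrain-validate-deploy cycle was started.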
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
Designs and implements production-ready ML pipelines using multi-agent MLOps orchestration for specified requirements. Covers data ingestion, quality, features, training, deployment, and monitoring.
Designs production-grade ML pipelines: experiment tracking (MLflow, W&B), orchestration (Kubeflow, Airflow), feature stores (Feast), model registries, and automated retraining. For ML pipeline building and MLOps.
Designs and implements production-grade ML pipeline infrastructure including experiment tracking, training orchestration, feature stores, model registries, and automated retraining workflows.