From harness-claude
Advises on ML pipeline management, experiment tracking hygiene, model serving patterns, and prompt evaluation frameworks. Audits reproducibility, model versioning, and deployment readiness across MLflow, Weights & Biases, SageMaker, and Vertex AI.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude
This skill uses the workspace's default tool permissions.
Assesses ML pipeline stage and applies patterns for data pipelines, model training, serving, MLOps, evaluation, and debugging with validations like schema checks, drift detection, and skew guards.
Guides MLOps workflows for ML model deployment: readiness checklists, serving infrastructure (FastAPI, SageMaker, Triton), inference optimization, versioning, A/B testing, drift detection, retraining, and monitoring.
Builds ML pipelines, tracks experiments, and manages model registries with MLflow, Kubeflow, Airflow, SageMaker, and Azure ML. Automates training, deployment, and monitoring for MLOps infrastructure.
Advise on ML pipeline management, experiment tracking hygiene, model serving patterns, and prompt evaluation frameworks. Audits reproducibility, model versioning, and deployment readiness across MLflow, Weights and Biases, SageMaker, and Vertex AI.
Resolve project root. Use provided path or cwd.
Detect ML frameworks and tools. Scan for:
- mlflow/, mlruns/, wandb/, mlflow.log_param, wandb.init, wandb.log
- torch, tensorflow, sklearn, xgboost, transformers, langchain, openai
- Dockerfile with model references, serve.py, predict.py, app.py with /predict routes, BentoML, TorchServe, TensorFlow Serving configs
- evals/, prompts/, eval_config.yaml, files with evaluate, benchmark, metrics
- *.ipynb files, notebooks/ directory

Inventory model artifacts. Locate and catalog:
- *.pt, *.pth, *.h5, *.pkl, *.onnx, *.safetensors
- config.json, model_config.yaml, hyperparameter files
- prompts/, *.prompt, template strings with {variable} interpolation
- evals/, test_data/, golden sets, benchmark datasets

Detect model registry usage. Check for:
- mlflow.register_model, model stages (Staging, Production, Archived)
- wandb.Artifact, model versioning
- sagemaker.register_model_step
- push_to_hub, from_pretrained with custom models

Map the ML lifecycle. Identify which stages are present (data pipeline, training, evaluation, serving, monitoring, retraining).
Report detection summary:
ML Stack Detection:
Frameworks: PyTorch 2.1, Hugging Face Transformers 4.36
Tracking: MLflow 2.10 (local tracking server)
Serving: FastAPI + TorchServe
Models: 3 fine-tuned transformers, 1 XGBoost classifier
Prompts: 12 templates in prompts/ (LangChain format)
Evaluation: 2 eval configs, 1 golden dataset
Registry: MLflow (2 models registered, 1 in Production stage)
Missing stages: monitoring, automated retraining
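A minimal sketch of the artifact-inventory step above, using pathlib globbing over the project root. The extension list, dictionary fields, and output format are illustrative assumptions, not the skill's actual implementation:

```python
from pathlib import Path

# Extensions that typically indicate serialized model artifacts.
MODEL_EXTENSIONS = {".pt", ".pth", ".h5", ".pkl", ".onnx", ".safetensors"}

def inventory_model_artifacts(root: str) -> list[dict]:
    """Walk the project and catalog files that look like model artifacts."""
    artifacts = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in MODEL_EXTENSIONS:
            artifacts.append({
                "path": str(path.relative_to(root)),
                "size_mb": round(path.stat().st_size / 1_048_576, 2),
                "format": path.suffix.lstrip("."),
            })
    return sorted(artifacts, key=lambda a: a["path"])

# Example usage: print one line per artifact for the detection summary.
for item in inventory_model_artifacts("."):
    print(f"{item['path']} ({item['size_mb']} MB, {item['format']})")
```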
Check experiment tracking hygiene. Evaluate whether every training script logs parameters and metrics (mlflow.autolog() or manual log_param/log_metric).
Check reproducibility. Verify that random seeds are set (torch.manual_seed, np.random.seed, random.seed) and that dependencies are pinned (requirements.txt with versions, poetry.lock, pip freeze).
Check model serving patterns. Evaluate model loading at startup, input validation, and health endpoints (/health, /ready).
Check prompt management (for LLM applications). Evaluate whether templates use parameter interpolation ({variable}, not string concatenation) and are versioned.
Check evaluation coverage. Evaluate whether each model has a golden-set evaluation and metrics beyond accuracy.
Classify findings by severity (errors, warnings, informational).
Recommend experiment tracking setup. Based on the detected framework:
- MLflow: mlflow.set_tracking_uri() and mlflow.autolog() setup
- Weights & Biases: wandb.init(project=...) and wandb.config setup

Recommend model registry workflow. Design a versioning and promotion flow:
Training -> Candidate (auto-registered)
-> Evaluation gate (metrics threshold)
-> Staging (shadow deployment)
-> Production (canary rollout)
-> Archived (previous version)
Adapt to the project's scale: small projects may skip staging/canary.
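A minimal sketch of this promotion flow with the MLflow client, assuming a registered model name and an accuracy-threshold evaluation gate. The model name, metric key, threshold, and run URI placeholder are illustrative assumptions:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "sentiment-classifier"  # illustrative registry name

# 1. Training run auto-registers a candidate version.
candidate = mlflow.register_model("runs:/<run_id>/model", MODEL_NAME)

# 2. Evaluation gate: promote only if the golden-set metric clears a threshold.
run = client.get_run(candidate.run_id)
if run.data.metrics.get("golden_set_accuracy", 0.0) >= 0.90:
    # 3. Staging for shadow deployment.
    client.transition_model_version_stage(
        name=MODEL_NAME, version=candidate.version, stage="Staging"
    )
    # 4. After a healthy canary, promote to Production and archive the old version.
    client.transition_model_version_stage(
        name=MODEL_NAME, version=candidate.version,
        stage="Production", archive_existing_versions=True,
    )
```

Small projects that skip staging/canary can collapse this to a single Candidate-to-Production transition behind the evaluation gate.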
Recommend evaluation framework based on the model type.
Recommend prompt management patterns. For LLM applications, store prompt templates in a versioned prompts/ directory.
Recommend monitoring and retraining triggers, such as drift detection thresholds that schedule retraining (see the sketch below).
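One possible shape for such a trigger: a minimal sketch comparing a live feature window against the training baseline with a two-sample KS test. The feature names, data sources, and threshold are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # below this, treat the feature distribution as drifted

def detect_drift(baseline: dict[str, np.ndarray],
                 live: dict[str, np.ndarray]) -> list[str]:
    """Return the features whose live distribution diverges from training."""
    drifted = []
    for feature, baseline_values in baseline.items():
        result = ks_2samp(baseline_values, live[feature])
        if result.pvalue < DRIFT_P_VALUE:
            drifted.append(feature)
    return drifted

# Example trigger: if any feature drifts, schedule retraining.
baseline = {"age": np.random.normal(40, 10, 5000)}  # training-time snapshot
live = {"age": np.random.normal(47, 12, 1000)}      # recent inference traffic
if detect_drift(baseline, live):
    print("Drift detected -- enqueue retraining job and alert on-call.")
```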
Provide implementation templates. Generate starter code for experiment tracking wrappers, serving boilerplate, and evaluation harness templates.
Check the deployment checklist and verify each item.
Check for common deployment pitfalls (model binaries in git, per-request model loading, missing input validation).
Validate prompt safety for LLM applications (template interpolation only, no raw user input concatenation).
Output ML readiness report:
ML Ops Report: [READY/NEEDS_ATTENTION/NOT_READY]
Stack: PyTorch + MLflow + FastAPI
Models: 3 detected, 1 registered in Production
Experiment tracking: 85% coverage (2 training scripts missing logging)
Reproducibility: PARTIAL (seeds set, packages not pinned)
Evaluation: 1/3 models have golden set evaluation
Serving: health check present, input validation missing
ERRORS:
[ML-ERR-001] src/serve.py:12
Model loaded inside request handler -- move to startup event
[ML-ERR-002] training/train_classifier.py
No experiment tracking -- results are not reproducible
WARNINGS:
[ML-WARN-001] requirements.txt
Package versions not pinned (torch, transformers)
[ML-WARN-002] evals/
Only accuracy metric tracked -- add precision, recall, F1
RECOMMENDATIONS:
1. Add mlflow.autolog() to train_classifier.py
2. Pin package versions in requirements.txt
3. Move model loading to FastAPI lifespan event
4. Add input validation schema to /predict endpoint
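A minimal sketch of recommendations 3 and 4 above, assuming a PyTorch model file and FastAPI's lifespan hook. The model path, feature shape, and label logic are illustrative assumptions:

```python
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from pydantic import BaseModel

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load once at startup instead of inside the request handler (ML-ERR-001).
    # Assumes the full module was serialized with torch.save; adjust for state_dicts.
    models["classifier"] = torch.load("models/classifier.pt",
                                      map_location="cpu", weights_only=False)
    models["classifier"].eval()
    yield
    models.clear()

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    # Input validation schema for /predict; add length/range constraints as needed.
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    with torch.no_grad():
        logits = models["classifier"](torch.tensor([request.features]))
    return {"prediction": int(logits.argmax(dim=1))}

@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": "classifier" in models}
```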
Verify report accuracy. Cross-check each finding against the source files before presenting the report.
- harness skill run harness-ml-ops -- Primary command for ML operations auditing.
- harness validate -- Run after applying recommendations to verify project health.
- Glob -- Used to locate model artifacts, experiment configs, notebooks, prompt templates, and evaluation datasets.
- Grep -- Used to find experiment logging calls, model loading patterns, and serving endpoint definitions.
- Read -- Used to read training scripts, serving code, evaluation configs, and model metadata.
- Write -- Used to generate experiment tracking wrappers, serving boilerplate, and evaluation harness templates.
- Bash -- Used to check MLflow tracking server status, validate model registry entries, and run lightweight eval checks.
- emit_interaction -- Used to present the readiness report and confirm recommendations before generating implementation templates.

Phase 1: DETECT
Frameworks: PyTorch 2.1, Hugging Face Transformers 4.36
Tracking: MLflow 2.10 (local, 47 runs logged)
Models: 2 fine-tuned BERT models in mlruns/
Notebooks: 3 in notebooks/ (exploration, training, evaluation)
Phase 2: ANALYZE
[ML-WARN-001] notebooks/training.ipynb
Cells executed out of order (cell 7 before cell 5) -- not reproducible
[ML-WARN-002] training/finetune.py
Random seed set for torch but not for numpy or python random
[ML-INFO-001] MLflow runs missing GPU type metadata
Phase 3: DESIGN
Recommended: Add np.random.seed() and random.seed() alongside torch.manual_seed()
Recommended: Add mlflow.log_param("gpu_type", torch.cuda.get_device_name())
Generated: training/experiment_wrapper.py (standardized logging)
Phase 4: VALIDATE
Deployment readiness: NEEDS_ATTENTION
Model registered: YES (bert-sentiment-v2 in Production)
Evaluation: golden set present with 500 examples
Missing: automated regression test comparing v2 vs v1
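A minimal sketch of what a generated experiment_wrapper.py might contain, assuming MLflow and PyTorch. The tracking URI, experiment name, and seed value are illustrative assumptions, not the skill's actual output:

```python
import random

import mlflow
import numpy as np
import torch

def start_tracked_run(experiment: str, seed: int = 42) -> None:
    """Standardize seeding and MLflow logging for every training run."""
    # Seed all three RNG sources, not just torch (ML-WARN-002).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    mlflow.set_tracking_uri("http://localhost:5000")  # illustrative local server
    mlflow.set_experiment(experiment)
    mlflow.autolog()

    mlflow.start_run()
    mlflow.log_param("seed", seed)
    if torch.cuda.is_available():
        # Record hardware context (ML-INFO-001).
        mlflow.log_param("gpu_type", torch.cuda.get_device_name())

# Usage in a training script:
# start_tracked_run("bert-sentiment")
# ... training loop ...
# mlflow.end_run()
```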
Phase 1: DETECT
Frameworks: LangChain 0.1, OpenAI API (GPT-4)
Prompts: 8 templates in src/prompts/ (hardcoded as Python strings)
Evaluation: none detected
Serving: FastAPI with /chat and /summarize endpoints
Phase 2: ANALYZE
[ML-ERR-001] src/prompts/summarize.py
Prompt template uses string concatenation with user input -- injection risk
[ML-ERR-002] src/api/chat.py
No token limit enforcement -- single request could consume entire budget
[ML-WARN-001] No evaluation framework -- model changes deployed without quality check
[ML-WARN-002] No prompt versioning -- changes to prompts are not tracked
Phase 3: DESIGN
Recommended: Move prompts to YAML files with version tags
Recommended: Implement promptfoo or custom eval harness with golden QA pairs
Recommended: Add token budget middleware (max 4096 tokens per request)
Recommended: Use LangChain PromptTemplate with input validation
Generated: evals/eval_config.yaml (promptfoo configuration)
Generated: src/middleware/token_budget.py (request token limiter)
Phase 4: VALIDATE
Deployment readiness: NOT_READY (2 errors, 2 warnings)
Critical: prompt injection risk and unbounded token usage must be fixed
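A minimal sketch of the versioned-prompt and token-budget recommendations from this example, assuming YAML prompt files and tiktoken for token counting. The file layout, expected keys, model name, and limit are illustrative assumptions:

```python
import tiktoken
import yaml

MAX_REQUEST_TOKENS = 4096
ENCODING = tiktoken.encoding_for_model("gpt-4")

def load_prompt(path: str) -> dict:
    """Load a versioned prompt template, e.g. prompts/summarize.yaml."""
    with open(path) as f:
        return yaml.safe_load(f)  # expects keys: version, template, variables

def render_prompt(prompt: dict, **values: str) -> str:
    # Interpolate named variables only -- never concatenate raw user input.
    missing = set(prompt["variables"]) - values.keys()
    if missing:
        raise ValueError(f"Missing prompt variables: {missing}")
    return prompt["template"].format(**values)

def enforce_token_budget(text: str) -> str:
    tokens = ENCODING.encode(text)
    if len(tokens) > MAX_REQUEST_TOKENS:
        raise ValueError(f"Request of {len(tokens)} tokens exceeds budget")
    return text

# Example usage:
# prompt = load_prompt("prompts/summarize.yaml")
# rendered = render_prompt(prompt, document=user_document)
# enforce_token_budget(rendered)
```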
Phase 1: DETECT
Frameworks: scikit-learn 1.4, XGBoost 2.0
Tracking: Weights and Biases (23 runs, 3 sweeps)
Models: 1 XGBoost classifier (model.pkl in models/)
Serving: Flask app with /predict endpoint
Phase 2: ANALYZE
[ML-ERR-001] models/model.pkl committed to git (12MB)
Should be in W&B Artifacts or external storage
[ML-ERR-002] app.py:15
pickle.load(open("models/model.pkl")) on every request
[ML-WARN-001] training/train.py
Only accuracy logged -- imbalanced dataset needs precision/recall
[ML-INFO-001] W&B sweeps well-configured, good hyperparameter search
Phase 3: DESIGN
Recommended: Store model in W&B Artifacts, download at startup
Recommended: Load model once in Flask app factory, not per-request
Recommended: Add classification_report metrics to training
Generated: .gitignore addition for *.pkl
Generated: app.py refactor with model singleton
Phase 4: VALIDATE
Deployment readiness: NOT_READY (2 errors)
Critical: model in git and per-request loading must be fixed
After fixes: projected NEEDS_ATTENTION (missing precision/recall metrics)
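A minimal sketch of the fixes recommended in this example: download the model from W&B Artifacts and load it once in a Flask app factory rather than per request. The project, artifact, and file names are illustrative assumptions:

```python
import pickle
from pathlib import Path

import wandb
from flask import Flask, jsonify, request

def load_model_from_wandb() -> object:
    """Fetch the classifier from W&B Artifacts instead of reading it from git."""
    api = wandb.Api()
    artifact = api.artifact("my-team/churn/xgb-classifier:latest")  # illustrative
    artifact_dir = Path(artifact.download())
    with open(artifact_dir / "model.pkl", "rb") as f:
        return pickle.load(f)

def create_app() -> Flask:
    app = Flask(__name__)
    model = load_model_from_wandb()  # loaded once, not per request (ML-ERR-002)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]
        return jsonify({"prediction": int(model.predict([features])[0])})

    return app
```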
| Rationalization | Reality |
|---|---|
| "We re-trained with more data but the architecture is the same — the previous evaluation still applies." | Evaluation results are bound to a specific model artifact, not to the architecture. A re-trained model with different weights can have dramatically different failure modes even if accuracy appears similar. Every model version that goes to production must be evaluated against the golden set, not inherited from its predecessor. |
| "The model file is only 8MB — committing it to git is more convenient than setting up an artifact store." | Model files in git corrupt repository history, explode clone times for all contributors, and cannot be versioned alongside experiment metadata. Convenience now creates permanent technical debt. The artifact store setup is a one-time cost; git pollution is permanent. |
| "Loading the model inside the request handler is simpler — the model is small enough that latency won't be noticeable." | Per-request model loading adds I/O and deserialization on every inference call, holds no persistent state across requests, and collapses under any meaningful concurrency. "Small enough" is a guess without measurement. Models must be loaded at startup and held in memory. |
| "We can add experiment tracking after we get the model working — right now we just need to iterate quickly." | Experiment tracking is hardest to add retroactively because you cannot reconstruct the conditions of runs you did not log. The runs being executed without tracking right now are the ones producing the model that may go to production. Log them now or accept that the model is not reproducible. |
| "The prompt template is short enough to read in context — version controlling it adds unnecessary process." | Prompts embedded in application code change silently when developers edit them, have no history of what changed and why, and cannot be evaluated independently. A prompt is a model artifact. It requires the same versioning, evaluation, and promotion discipline as model weights. |
Model binaries (.pkl, .pt, .h5, .onnx) belong in artifact stores, not in version control. If detected, flag as an error with a migration path.
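A minimal sketch of one such migration path, assuming MLflow as the artifact store. The run name and file path are illustrative assumptions; the accompanying .gitignore change is shown as a comment:

```python
import mlflow

# Push the binary to the tracking server's artifact store instead of git.
# Afterwards: git rm --cached models/model.pkl  &&  echo "*.pkl" >> .gitignore
with mlflow.start_run(run_name="migrate-model-to-artifact-store"):
    mlflow.log_artifact("models/model.pkl", artifact_path="models")
```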