Master ML experiment tracking - MLflow, W&B, Neptune, versioning, reproducibility
Automate ML experiment tracking with MLflow and Weights & Biases. Claude will log parameters, metrics, and artifacts during training runs, and manage model versions through staging to production.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
```
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Files: assets/config.yaml, assets/schema.json, references/GUIDE.md, references/PATTERNS.md, scripts/validate.py

Learn: Master ML experiment tracking for reproducibility and collaboration.
| Attribute | Value |
|---|---|
| Bonded Agent | 02-experiment-tracking |
| Difficulty | Intermediate |
| Duration | 30 hours |
| Prerequisites | mlops-basics |
Platform Comparison:
| Feature | MLflow | W&B | Neptune |
|---|---|---|---|
| Self-hosted | ✅ | ❌ | ❌ |
| Free tier | ✅ | ✅ | ✅ |
| Real-time | ❌ | ✅ | ✅ |
| Git integration | ⚠️ | ✅ | ✅ |
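For quick orientation, a minimal sketch of initializing each platform; the project names, tracking URI, and workspace here are placeholders, and W&B/Neptune additionally expect an API key in the environment.

```python
import mlflow
import wandb
import neptune

# MLflow: point at a self-hosted tracking server (placeholder URI)
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("demo")

# W&B: hosted by default; reads WANDB_API_KEY from the environment
wandb.init(project="demo")

# Neptune: hosted; reads NEPTUNE_API_TOKEN from the environment
run = neptune.init_run(project="my-workspace/demo")
```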
Setup Exercises:
What to Log:
```python
import mlflow
import mlflow.pytorch

# Complete logging example (model and the training step come from your own code)
with mlflow.start_run():
    # 1. Parameters (hyperparameters, configs)
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "model_type": "transformer"
    })

    # 2. Metrics (per-step and final)
    for epoch in range(10):
        train_loss, val_loss = train_one_epoch(model)  # placeholder training step
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss
        }, step=epoch)

    # 3. Artifacts (models, plots, configs)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.pytorch.log_model(model, "model")

    # 4. Tags (for filtering)
    mlflow.set_tags({
        "experiment_type": "baseline",
        "dataset_version": "v2.1"
    })
```
Registry Workflow:
```
┌──────────────────────────────────────────────────────────────────┐
│                        MODEL REGISTRY FLOW                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Train → Log Model → Register → Staging → Production → Archive   │
│                          │          │           │                │
│                          ▼          ▼           ▼                │
│                      Version 1   Validate    Deploy              │
│                      Version 2   A/B Test    Monitor             │
│                      Version N   Approve     Rollback            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
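The same flow expressed against MLflow's registry API. This is a hedged sketch: the model name, `run_id`, and version numbers are illustrative, and while newer MLflow releases favor aliases over stages, `transition_model_version_stage` matches the stage-based flow in the diagram.

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model logged during training (run_id comes from that run)
result = mlflow.register_model(f"runs:/{run_id}/model", "demo-model")

# Staging: validate / A/B test before promotion
client.transition_model_version_stage(
    name="demo-model", version=result.version, stage="Staging"
)

# Production: deploy and monitor after approval
client.transition_model_version_stage(
    name="demo-model", version=result.version, stage="Production"
)

# Archive superseded versions to keep rollback targets explicit
client.transition_model_version_stage(
    name="demo-model", version=1, stage="Archived"
)
```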
Exercises:
Naming Conventions:
```
experiments/
├── {project_name}/
│   ├── {experiment_type}_{date}/
│   │   ├── run_{config_hash}/
```
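One way to realize this scheme is to derive `run_{config_hash}` from a stable hash of the run config. The sketch below is an assumption about how to do that; the project and config values are illustrative.

```python
import hashlib
import json
from datetime import date

def run_name(project: str, experiment_type: str, config: dict) -> str:
    # Stable hash: serialize the config with sorted keys before hashing
    config_hash = hashlib.md5(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:8]
    return f"{project}/{experiment_type}_{date.today():%Y%m%d}/run_{config_hash}"

print(run_name("churn", "baseline", {"lr": 0.001, "batch_size": 32}))
# e.g. churn/baseline_20241215/run_<hash>
```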
Reproducibility Checklist:
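A minimal sketch of checklist items that can be captured in code, namely seed pinning and dependency freezing; the seed value and file name are placeholders, and an active MLflow run is assumed.

```python
import random
import subprocess

import mlflow
import numpy as np
import torch

SEED = 42  # placeholder; pin every RNG that affects training
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
mlflow.log_param("seed", SEED)  # assumes an active run

# Freeze the exact dependency set alongside the run
frozen = subprocess.check_output(["pip", "freeze"]).decode()
with open("requirements_frozen.txt", "w") as f:
    f.write(frozen)
mlflow.log_artifact("requirements_frozen.txt")
```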
```python
# templates/experiment_tracker.py
import subprocess
from datetime import datetime

import mlflow
import mlflow.pytorch


class ProductionExperimentTracker:
    """Production-ready experiment tracking wrapper."""

    def __init__(self, experiment_name: str, tracking_uri: str):
        mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment(experiment_name)
        self.run = None

    def start_run(self, run_name: str = None):
        """Start a new tracked run."""
        self.run = mlflow.start_run(run_name=run_name)
        # Auto-log environment info
        self._log_environment()
        return self

    def _log_environment(self):
        """Capture reproducibility information."""
        # Git info
        try:
            git_hash = subprocess.check_output(
                ["git", "rev-parse", "HEAD"]
            ).decode().strip()
            mlflow.set_tag("git_commit", git_hash)
        except (subprocess.CalledProcessError, FileNotFoundError):
            pass  # not a git repo, or git not installed
        # Timestamp
        mlflow.set_tag("run_timestamp", datetime.now().isoformat())

    def log_config(self, config: dict):
        """Log configuration as parameters."""
        # Flatten nested config so every leaf becomes a parameter
        flat_config = self._flatten_dict(config)
        mlflow.log_params(flat_config)

    def log_metrics(self, metrics: dict, step: int = None):
        """Log metrics with optional step."""
        mlflow.log_metrics(metrics, step=step)

    def log_model(self, model, artifact_path: str = "model"):
        """Log a PyTorch model artifact."""
        mlflow.pytorch.log_model(model, artifact_path)

    def end_run(self):
        """End the current run."""
        if self.run:
            mlflow.end_run()
            self.run = None

    def _flatten_dict(self, d: dict, parent_key: str = '') -> dict:
        """Flatten nested dictionary, joining keys with dots."""
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(self._flatten_dict(v, new_key).items())
            else:
                items.append((new_key, v))
        return dict(items)
```
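A usage sketch for the wrapper above; the experiment name, tracking URI, and values are placeholders.

```python
tracker = ProductionExperimentTracker("churn-model", "http://localhost:5000")
tracker.start_run(run_name="baseline")
tracker.log_config({"optimizer": {"lr": 0.001}, "batch_size": 32})
tracker.log_metrics({"val_loss": 0.42}, step=0)
tracker.end_run()
```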
Troubleshooting:
| Issue | Cause | Solution |
|---|---|---|
| Runs not syncing | Network issue | Check connectivity, use offline mode |
| Large artifacts fail | Size limit | Use cloud storage for large files |
| Duplicate run names | No uniqueness | Add timestamp or hash to names |
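For the "runs not syncing" row, W&B's offline mode is one concrete fallback: setting `WANDB_MODE=offline` and later running the `wandb sync` CLI are real mechanisms, while the project name and paths here are placeholders.

```python
import os

os.environ["WANDB_MODE"] = "offline"  # buffer runs locally instead of syncing

import wandb

run = wandb.init(project="demo")
run.log({"loss": 0.1})
run.finish()

# Later, with connectivity restored, upload the buffered runs:
#   wandb sync wandb/offline-run-*
```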
Version History:
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade with templates |
| 1.0.0 | 2024-11 | Initial release |
This skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
This skill should be used when the user asks to "create a hook", "add a PreToolUse/PostToolUse/Stop hook", "validate tool use", "implement prompt-based hooks", "use ${CLAUDE_PLUGIN_ROOT}", "set up event-driven automation", "block dangerous commands", or mentions hook events (PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart, SessionEnd, UserPromptSubmit, PreCompact, Notification). Provides comprehensive guidance for creating and implementing Claude Code plugin hooks with focus on advanced prompt-based hooks API.