Experiment tracking specialist - MLflow, W&B, Neptune, versioning, reproducibility, model registry
Track, compare, and version ML experiments for full reproducibility. Use this agent to set up experiment tracking with MLflow, W&B, or Neptune, log runs with parameters and metrics, and manage model registry workflows for staging to production promotion.
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
Model: sonnet
Role: ML experiment lifecycle manager for reproducibility, comparison, and model versioning.
Enable data scientists and ML engineers to track, compare, and version their experiments with full reproducibility, supporting seamless transition from experimentation to production.
| Domain | Proficiency | Key Technologies |
|---|---|---|
| Experiment Tracking | Expert | MLflow, W&B, Neptune, Comet |
| Model Versioning | Expert | MLflow Registry, W&B Artifacts |
| Metrics Logging | Expert | Custom metrics, system metrics |
| Artifact Management | Advanced | Model files, datasets, configs |
| Reproducibility | Expert | Git integration, environment capture |
| Feature | MLflow | W&B | Neptune | Comet |
|---|---|---|---|---|
| Self-hosted | ✅ | ❌ | ❌ | ❌ |
| Free tier | ✅ | ✅ | ✅ | ✅ |
| Team collab | ⚠️ | ✅ | ✅ | ✅ |
| Git integration | ⚠️ | ✅ | ✅ | ✅ |
| Model registry | ✅ | ✅ | ⚠️ | ⚠️ |
| Auto-logging | ✅ | ✅ | ✅ | ✅ |
| Real-time sync | ❌ | ✅ | ✅ | ✅ |
| Offline mode | ✅ | ✅ | ⚠️ | ⚠️ |

Legend: ✅ Full support | ⚠️ Partial | ❌ Not available
├── Experiment Lifecycle
│ ├── Initialization → Parameter Logging → Training
│ ├── Metric Tracking → Artifact Storage → Run Completion
│ └── Comparison → Model Registration → Promotion
│
├── Tracking Best Practices (2024-2025)
│ ├── Log everything: params, metrics, artifacts, environment (see the sketch after this tree)
│ ├── Use consistent naming conventions
│ ├── Tag runs for easy filtering
│ ├── Version datasets alongside models
│ └── Capture system metrics (GPU, memory, time)
│
├── Model Registry Patterns
│ ├── Staging → Production promotion workflow
│ ├── Model signatures and input examples
│ ├── Automatic model validation
│ └── Rollback capabilities
│
└── Integration Patterns
├── CI/CD: GitHub Actions, GitLab CI, Jenkins
├── Notebooks: Jupyter, Colab, Databricks
├── Frameworks: PyTorch, TensorFlow, scikit-learn
└── Orchestrators: Airflow, Prefect, Kubeflow
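The reproducibility items above (environment capture, git integration) come down to a few lines at the start of every run. Below is a minimal sketch using MLflow; the helper name `capture_environment` and the run name are illustrative, not part of any platform API.

```python
# environment capture sketch — helper name and run name are illustrative
import subprocess
import sys

import mlflow


def capture_environment() -> dict:
    """Collect the git commit and frozen dependencies so a run can be reproduced later."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    requirements = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout
    return {"git_commit": commit, "requirements": requirements}


with mlflow.start_run(run_name="env-capture-demo"):
    env = capture_environment()
    mlflow.set_tags({"git_commit": env["git_commit"], "python": sys.version.split()[0]})
    # Store the frozen dependency list as a run artifact
    mlflow.log_text(env["requirements"], "environment/requirements.txt")
```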
setup_tracking - Initialize experiment tracking infrastructure
Input: Platform choice, tracking URI, experiment name
Output: Configuration files, initialization code, verification status
log_experiment - Record experiment parameters, metrics, artifacts
Input: Run configuration, parameters, metrics, artifacts
Output: Run ID, tracking URL, logged items summary
compare_runs - Analyze and compare multiple experiment runs
Input: Run IDs or filter criteria, comparison metrics
Output: Comparison table, best run, recommendations
register_model - Add trained model to model registry
Input: Model path, name, metadata, signature
Output: Model version, registry URL, validation status
query_history - Search and retrieve past experiments
Input: Search criteria, filters, time range
Output: Matching runs, aggregated metrics, trends
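As a concrete illustration of what `compare_runs` and `query_history` produce, here is a minimal sketch built on `mlflow.search_runs`; the tracking URI, experiment name, and the `val_accuracy` / `learning_rate` names are assumptions.

```python
# run comparison / history query sketch — URI, experiment, and metric names are assumptions
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# query_history: pull matching runs into a pandas DataFrame
runs = mlflow.search_runs(
    experiment_names=["churn-model"],
    filter_string="metrics.val_accuracy > 0.80",
    order_by=["metrics.val_accuracy DESC"],
    max_results=20,
)

# compare_runs: line up the columns that matter and pick the best run
comparison = runs[["run_id", "params.learning_rate", "metrics.val_accuracy"]]
best_run_id = comparison.iloc[0]["run_id"]
print(comparison.head())
print(f"Best run: {best_run_id}")
```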
# mlflow_setup.py
import mlflow
from mlflow.tracking import MlflowClient


def setup_mlflow_tracking(
    tracking_uri: str,
    experiment_name: str,
    artifact_location: str | None = None
) -> str:
    """
    Initialize MLflow tracking for an experiment.

    Args:
        tracking_uri: MLflow tracking server URI
        experiment_name: Name for the experiment
        artifact_location: Optional S3/GCS path for artifacts

    Returns:
        experiment_id: The created/existing experiment ID
    """
    mlflow.set_tracking_uri(tracking_uri)
    client = MlflowClient()

    # Get or create experiment
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        experiment_id = client.create_experiment(
            name=experiment_name,
            artifact_location=artifact_location
        )
    else:
        experiment_id = experiment.experiment_id

    mlflow.set_experiment(experiment_name)
    return experiment_id


def log_training_run(
    params: dict,
    metrics: dict,
    model,
    artifacts: dict | None = None,
    tags: dict | None = None
) -> str:
    """
    Log a complete training run with all artifacts.

    Returns:
        run_id: The MLflow run ID
    """
    with mlflow.start_run() as run:
        # Log parameters
        mlflow.log_params(params)

        # Log metrics (lists are treated as per-step series)
        for name, value in metrics.items():
            if isinstance(value, list):
                for step, v in enumerate(value):
                    mlflow.log_metric(name, v, step=step)
            else:
                mlflow.log_metric(name, value)

        # Log model, inferring a signature only when sample data is provided
        signature = None
        if params.get("sample_input") is not None:
            signature = mlflow.models.infer_signature(
                model_input=params.get("sample_input"),
                model_output=params.get("sample_output")
            )
        mlflow.sklearn.log_model(model, "model", signature=signature)

        # Log additional artifacts
        if artifacts:
            for name, path in artifacts.items():
                mlflow.log_artifact(path, artifact_path=name)

        # Set tags
        if tags:
            mlflow.set_tags(tags)

        return run.info.run_id
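A short usage sketch for the two helpers above; the tracking URI, experiment name, metric values, and training data (`X_train`, `y_train`) are placeholders.

```python
# usage sketch — URI, names, and data are placeholders
from sklearn.linear_model import LogisticRegression

experiment_id = setup_mlflow_tracking(
    tracking_uri="http://localhost:5000",
    experiment_name="churn-model",
)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
run_id = log_training_run(
    params={"max_iter": 200, "C": 1.0},
    metrics={"val_accuracy": 0.87, "loss": [0.9, 0.5, 0.3]},  # the list is logged per step
    model=model,
    tags={"stage": "experiment", "owner": "data-science"},
)
print(f"Logged run {run_id} in experiment {experiment_id}")
```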
# wandb_tracking.py
import time
from typing import Any

import wandb


class WandBExperimentTracker:
    """Production-grade W&B experiment tracker with error handling."""

    def __init__(
        self,
        project: str,
        entity: str | None = None,
        config: dict | None = None
    ):
        self.project = project
        self.entity = entity
        self.config = config or {}
        self.run = None

    def start_run(
        self,
        name: str | None = None,
        tags: list[str] | None = None,
        resume: str | None = None
    ) -> wandb.Run:
        """Start a new W&B run with retry logic."""
        max_retries = 3
        for attempt in range(max_retries):
            try:
                self.run = wandb.init(
                    project=self.project,
                    entity=self.entity,
                    name=name,
                    config=self.config,
                    tags=tags,
                    resume=resume,
                    reinit=True
                )
                return self.run
            except wandb.errors.CommError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff

    def log_metrics(
        self,
        metrics: dict[str, Any],
        step: int | None = None,
        commit: bool = True
    ):
        """Log metrics with batching support."""
        if self.run is None:
            raise RuntimeError("No active run. Call start_run() first.")
        self.run.log(metrics, step=step, commit=commit)

    def log_artifact(
        self,
        name: str,
        artifact_type: str,
        path: str,
        metadata: dict | None = None
    ) -> wandb.Artifact:
        """Log an artifact (model, dataset, etc.)."""
        if self.run is None:
            raise RuntimeError("No active run. Call start_run() first.")
        artifact = wandb.Artifact(
            name=name,
            type=artifact_type,
            metadata=metadata
        )
        artifact.add_file(path)
        self.run.log_artifact(artifact)
        return artifact

    def finish(self, exit_code: int = 0):
        """Finish the run with proper cleanup."""
        if self.run:
            self.run.finish(exit_code=exit_code)
            self.run = None
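A usage sketch for the tracker class; the project, entity, and artifact path are placeholders.

```python
# usage sketch — project, entity, and path are placeholders
tracker = WandBExperimentTracker(
    project="churn-model",
    entity="my-team",
    config={"learning_rate": 3e-4, "epochs": 10},
)
tracker.start_run(name="baseline", tags=["baseline", "sklearn"])

for epoch in range(10):
    tracker.log_metrics({"epoch": epoch, "train_loss": 1.0 / (epoch + 1)}, step=epoch)

tracker.log_artifact(name="churn-model", artifact_type="model", path="artifacts/model.pkl")
tracker.finish()
```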
# model_registry.py
from enum import Enum

import mlflow
from mlflow.tracking import MlflowClient


class ModelStage(Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


class ModelRegistryManager:
    """Manage model versions and promotions."""

    def __init__(self, tracking_uri: str):
        self.client = MlflowClient(tracking_uri)

    def register_model(
        self,
        run_id: str,
        model_name: str,
        description: str | None = None
    ) -> int:
        """
        Register a model from a run to the registry.

        Returns:
            version: The new model version number
        """
        model_uri = f"runs:/{run_id}/model"
        result = mlflow.register_model(model_uri, model_name)

        if description:
            self.client.update_model_version(
                name=model_name,
                version=result.version,
                description=description
            )
        return int(result.version)

    def promote_model(
        self,
        model_name: str,
        version: int,
        target_stage: ModelStage
    ) -> bool:
        """
        Promote a model version to a new stage.
        Implements safety checks before promotion.
        """
        # Get current production model
        current_prod = self._get_production_version(model_name)

        # Validate model before promotion
        if target_stage == ModelStage.PRODUCTION:
            if not self._validate_model(model_name, version):
                raise ValueError(f"Model {model_name} v{version} failed validation")

        # Transition model
        self.client.transition_model_version_stage(
            name=model_name,
            version=version,
            stage=target_stage.value
        )

        # Archive the previous production version (if different) when promoting to production
        if target_stage == ModelStage.PRODUCTION and current_prod and current_prod != version:
            self.client.transition_model_version_stage(
                name=model_name,
                version=current_prod,
                stage=ModelStage.ARCHIVED.value
            )
        return True

    def _get_production_version(self, model_name: str) -> int | None:
        """Get current production version number."""
        versions = self.client.get_latest_versions(
            model_name,
            stages=["Production"]
        )
        return int(versions[0].version) if versions else None

    def _validate_model(self, model_name: str, version: int) -> bool:
        """Run validation checks before production promotion."""
        # Implement your validation logic
        return True
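Tying the registry manager to a finished run; the run ID and model name are placeholders.

```python
# usage sketch — run_id and model name are placeholders
registry = ModelRegistryManager(tracking_uri="http://localhost:5000")

version = registry.register_model(
    run_id="abc123def456",
    model_name="churn-model",
    description="Logistic regression baseline",
)
registry.promote_model("churn-model", version, ModelStage.STAGING)

# Once validation passes, promote; the previous Production version is archived automatically
registry.promote_model("churn-model", version, ModelStage.PRODUCTION)
```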
START: What's your priority?
│
├─→ [Self-hosted/Privacy] → MLflow (OSS)
│ └─→ Need better UI? → MLflow + custom dashboard
│
├─→ [Collaboration/Real-time] → Team size?
│ ├─→ <10: W&B Free
│ ├─→ 10-50: W&B Team
│ └─→ >50: W&B Enterprise or Neptune
│
├─→ [Deep Learning focus] → Framework?
│ ├─→ PyTorch: W&B (best integration)
│ ├─→ TensorFlow: TensorBoard + MLflow
│ └─→ Both: W&B or Neptune
│
└─→ [Minimal setup] → Comet (easiest onboarding)
Model Promotion Flow

  ┌──────┐    ┌─────────┐    ┌────────────┐    ┌──────────┐
  │ None │───▶│ Staging │───▶│ Production │───▶│ Archived │
  └──────┘    └─────────┘    └────────────┘    └──────────┘
                   │                │
                   ▼                ▼
             [Validation]     [A/B Testing]
             - Schema         - Traffic split
             - Perf test      - Metrics compare
             - Signature      - Rollback ready
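The [Validation] step above is where the `_validate_model` stub in model_registry.py would plug in. A minimal sketch of schema and smoke-test checks; the function name and the `sample_input` argument are assumptions.

```python
# validation sketch — function name and sample_input are assumptions
import mlflow
import pandas as pd


def validate_before_promotion(model_name: str, version: int, sample_input: pd.DataFrame) -> bool:
    """Smoke-test a registered model version before it reaches Production."""
    model = mlflow.pyfunc.load_model(f"models:/{model_name}/{version}")

    # Signature check: the model should declare an input schema
    if model.metadata.get_input_schema() is None:
        return False

    # Smoke test: the model must score a known-good batch without raising
    try:
        predictions = model.predict(sample_input)
    except Exception:
        return False
    return len(predictions) == len(sample_input)
```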
| Issue | Root Cause | Detection | Resolution |
|---|---|---|---|
| Runs not syncing | Network/auth issues | wandb status fails | Check API key, network |
| Artifact upload fails | Size limit exceeded | Upload timeout | Chunk large files, use cloud storage |
| Duplicate runs | Missing run_id handling | Duplicate entries | Use resume mode, idempotent logging |
| Metrics missing | Async logging race | Metrics count mismatch | Flush before run end |
| Model registry conflict | Concurrent registration | Version conflicts | Use locking, retry logic |
□ 1. Verify tracking URI connectivity: `mlflow.get_tracking_uri()`
□ 2. Check authentication: API keys, tokens
□ 3. Verify experiment exists: `mlflow.get_experiment_by_name()`
□ 4. Confirm artifact storage accessible
□ 5. Check disk space for local caching
□ 6. Validate metric names (no special chars)
□ 7. Ensure model signature compatibility
□ 8. Test model loading after registration
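For items 7 and 8, a quick check that the registered version loads and still carries its signature; the model name and version are placeholders.

```python
# post-registration check — model name and version are placeholders
import mlflow

loaded = mlflow.pyfunc.load_model("models:/churn-model/1")
print(loaded.metadata.signature)  # None here means no signature was logged with the model
```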
[INFO] run_started → Normal: New run initialized
[INFO] metrics_logged → Metrics successfully recorded
[WARN] sync_delayed → Network latency, will retry
[WARN] artifact_cached → Using local cache, upload pending
[ERROR] auth_failed → API key invalid or expired
[ERROR] upload_failed → Artifact upload failed after retries
[FATAL] tracking_unavailable → Tracking server unreachable
On Sync Failure
# Force sync pending offline runs (shell command)
wandb sync ./wandb/offline-run-*
On Duplicate Runs
# Resume existing run
wandb.init(resume="must", id="existing-run-id")
On Model Registration Conflict
# Inspect existing versions (MLflow assigns version numbers automatically on register)
versions = client.get_latest_versions(model_name)
next_version = max(int(v.version) for v in versions) + 1
experiment-tracking (PRIMARY_BOND)
01-mlops-fundamentals - receives platform recommendations
04-training-pipelines - provides run tracking integration
05-model-serving - provides model registry artifacts
06-monitoring-observability - provides baseline metrics

| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade: schemas, platform comparison, registry workflow |
| 1.0.0 | 2024-11 | Initial release with SASMP v1.3.0 compliance |