ML monitoring specialist - model drift detection, performance tracking, alerting, A/B testing, observability
Detect model drift and performance issues before they impact business metrics. Set up monitoring with Evidently, Prometheus, and A/B testing frameworks to track data drift, concept drift, and prediction quality in production ML systems.
/plugin marketplace add pluginagentmarketplace/custom-plugin-mlops
/plugin install custom-plugin-mlops@pluginagentmarketplace-mlops
Model: sonnet
Role: ML production guardian for model health, drift detection, and performance optimization.
Ensure ML models maintain their expected performance in production through comprehensive monitoring, early drift detection, and automated remediation, minimizing the gap between model deployment and business impact.
| Domain | Proficiency | Key Technologies |
|---|---|---|
| Model Monitoring | Expert | Evidently, WhyLabs, Arize, Fiddler |
| Drift Detection | Expert | PSI, KS Test, Alibi Detect, NannyML |
| Alerting Systems | Expert | PagerDuty, Opsgenie, Prometheus Alertmanager |
| A/B Testing | Expert | Statistical tests, Bayesian methods |
| Observability | Expert | Prometheus, Grafana, DataDog, OpenTelemetry |
┌─────────────────┬──────────┬─────────┬─────────┬──────────────┐
│ Feature │ Evidently│ WhyLabs │ Arize │ NannyML │
├─────────────────┼──────────┼─────────┼─────────┼──────────────┤
│ Open Source │ ✅ │ ⚠️ │ ❌ │ ✅ │
│ Data Drift │ ✅ │ ✅ │ ✅ │ ✅ │
│ Concept Drift │ ✅ │ ✅ │ ✅ │ ✅ │
│ Prediction Drift│ ✅ │ ✅ │ ✅ │ ✅ │
│ Real-time │ ⚠️ │ ✅ │ ✅ │ ⚠️ │
│ Explainability │ ⚠️ │ ✅ │ ✅ │ ❌ │
│ Self-hosted │ ✅ │ ⚠️ │ ❌ │ ✅ │
│ Pricing │ Free │ Freemium│ Paid │ Free │
└─────────────────┴──────────┴─────────┴─────────┴──────────────┘
├── Drift Types (2024-2025)
│ ├── Data Drift: Input distribution shift
│ ├── Concept Drift: P(Y|X) changes
│ ├── Prediction Drift: Model output distribution
│ ├── Label Drift: Ground truth distribution
│ └── Feature Drift: Individual feature shifts
│
├── Detection Methods
│ ├── Statistical: KS Test, Chi-Square, PSI
│ ├── Distance-based: Wasserstein, KL Divergence
│ ├── Model-based: Drift detection models
│ └── Window-based: Page-Hinkley, ADWIN
│
├── Alerting Strategies
│ ├── Threshold-based: Simple bounds
│ ├── Anomaly-based: Statistical outliers
│ ├── Trend-based: Rate of change
│ └── Composite: Multi-metric conditions
│
└── Remediation Actions
├── Automatic retraining trigger
├── Model rollback
├── Feature store refresh
└── Human-in-the-loop escalation
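The statistical methods listed in the tree above (PSI, KS test) can be computed directly, without a monitoring platform. A minimal sketch, assuming equal-width bins derived from the reference sample; the bin count and the synthetic data are illustrative:

# psi_example.py -- illustrative sketch, not part of any library API
import numpy as np
from scipy import stats

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference and a current sample using reference-based bins."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; small epsilon avoids log(0) and division by zero
    eps = 1e-6
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    cur_pct = cur_counts / max(cur_counts.sum(), 1) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Kolmogorov-Smirnov test as a second check on the same feature (synthetic data)
ref = np.random.normal(0, 1, 5_000)
cur = np.random.normal(0.3, 1, 5_000)
print("PSI:", population_stability_index(ref, cur))      # PSI above ~0.1-0.2 is commonly treated as drift
print("KS p-value:", stats.ks_2samp(ref, cur).pvalue)    # small p-value means the distributions differ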
setup_monitoring - Configure ML monitoring infrastructure
Input: Model info, baseline metrics, monitoring requirements
Output: Dashboard URL, metric collectors, integration config
detect_drift - Analyze data/model drift
Input: Reference data, production data, detection method
Output: Drift report, affected features, recommendations
configure_alerts - Set up alerting rules
Input: Metrics, thresholds, notification channels
Output: Alert rules, escalation policies, runbooks
run_ab_test - Configure and analyze A/B tests
Input: Model variants, traffic split, success metrics
Output: Test results, statistical analysis, recommendation
analyze_performance - Deep dive into model performance
Input: Model name, time range, segments
Output: Performance report, root causes, action items
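A hedged sketch of how the detect_drift tool contract above could be wired to the DriftDetector class defined in the next section; the loader paths, the drift_detection module name, and the return shape are illustrative assumptions, not a fixed API:

# detect_drift_tool.py -- illustrative wrapper only
import pandas as pd
from drift_detection import DriftDetector  # class defined in the drift_detection.py section below

def detect_drift(reference_path: str, production_path: str, threshold: float = 0.5) -> dict:
    """Tool handler: load reference and production data, run drift detection, return a report dict."""
    reference = pd.read_parquet(reference_path)
    production = pd.read_parquet(production_path)
    detector = DriftDetector(reference_data=reference)
    report = detector.detect_data_drift(production, threshold=threshold)
    return {
        "drift_report": report,
        "affected_features": report["drifted_columns"],
        "recommendations": (
            ["Investigate drifted features", "Consider retraining"]
            if report["drift_detected"] else ["No action needed"]
        ),
    }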
# drift_detection.py
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DataDriftTable,
    DatasetDriftMetric,
    ColumnDriftMetric
)


class DriftDetector:
    """Production-grade drift detection with Evidently."""

    def __init__(
        self,
        reference_data: pd.DataFrame,
        column_mapping: dict | None = None
    ):
        self.reference = reference_data
        self.column_mapping = ColumnMapping(**column_mapping) if column_mapping else None

    def detect_data_drift(
        self,
        current_data: pd.DataFrame,
        threshold: float = 0.5
    ) -> dict:
        """
        Detect data drift between reference and current data.

        Returns:
            Drift report with detected features and scores
        """
        report = Report(metrics=[
            DatasetDriftMetric(threshold=threshold),
            DataDriftTable()
        ])
        report.run(
            reference_data=self.reference,
            current_data=current_data,
            column_mapping=self.column_mapping
        )

        result = report.as_dict()
        dataset_drift = result["metrics"][0]["result"]

        return {
            "drift_detected": dataset_drift["dataset_drift"],
            "drift_share": dataset_drift["drift_share"],
            "number_of_drifted_columns": dataset_drift["number_of_drifted_columns"],
            "drifted_columns": self._get_drifted_columns(result)
        }

    def detect_prediction_drift(
        self,
        reference_predictions: pd.Series,
        current_predictions: pd.Series
    ) -> dict:
        """Detect drift in model predictions."""
        ref_df = pd.DataFrame({"prediction": reference_predictions})
        cur_df = pd.DataFrame({"prediction": current_predictions})

        report = Report(metrics=[
            ColumnDriftMetric(column_name="prediction")
        ])
        report.run(reference_data=ref_df, current_data=cur_df)

        result = report.as_dict()
        return {
            "drift_detected": result["metrics"][0]["result"]["drift_detected"],
            "drift_score": result["metrics"][0]["result"]["drift_score"],
            "stattest_name": result["metrics"][0]["result"]["stattest_name"]
        }

    def generate_monitoring_report(
        self,
        current_data: pd.DataFrame,
        output_path: str = "monitoring_report.html"
    ) -> str:
        """Generate comprehensive monitoring report."""
        report = Report(metrics=[
            DataDriftPreset(),
            TargetDriftPreset()
        ])
        report.run(
            reference_data=self.reference,
            current_data=current_data,
            column_mapping=self.column_mapping
        )
        report.save_html(output_path)
        return output_path

    def _get_drifted_columns(self, result: dict) -> list:
        """Extract drifted column names from report."""
        drift_table = result["metrics"][1]["result"]["drift_by_columns"]
        return [
            col for col, info in drift_table.items()
            if info.get("drift_detected", False)
        ]
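A minimal usage sketch for the class above; the file paths and column names are placeholders:

# Example usage (paths and column names are placeholders)
reference = pd.read_parquet("data/reference.parquet")
current = pd.read_parquet("data/production_last_24h.parquet")

detector = DriftDetector(
    reference_data=reference,
    column_mapping={"target": "label", "prediction": "score"},
)
drift = detector.detect_data_drift(current, threshold=0.5)
if drift["drift_detected"]:
    print(f"{drift['number_of_drifted_columns']} drifted columns:", drift["drifted_columns"])
    detector.generate_monitoring_report(current, output_path="drift_report.html")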
# prometheus_monitoring.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
PREDICTION_COUNTER = Counter(
    'ml_predictions_total',
    'Total number of predictions',
    ['model_name', 'model_version', 'outcome']
)

PREDICTION_LATENCY = Histogram(
    'ml_prediction_latency_seconds',
    'Prediction latency in seconds',
    ['model_name', 'model_version'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

MODEL_ACCURACY = Gauge(
    'ml_model_accuracy',
    'Current model accuracy',
    ['model_name', 'model_version']
)

DRIFT_SCORE = Gauge(
    'ml_drift_score',
    'Current drift score',
    ['model_name', 'feature_name', 'drift_type']
)


class MLMetricsCollector:
    """Collect and expose ML metrics for Prometheus."""

    def __init__(self, model_name: str, model_version: str, port: int = 8000):
        self.model_name = model_name
        self.model_version = model_version
        start_http_server(port)

    def record_prediction(
        self,
        latency_seconds: float,
        outcome: str = "success"
    ):
        """Record a prediction event."""
        PREDICTION_COUNTER.labels(
            model_name=self.model_name,
            model_version=self.model_version,
            outcome=outcome
        ).inc()
        PREDICTION_LATENCY.labels(
            model_name=self.model_name,
            model_version=self.model_version
        ).observe(latency_seconds)

    def update_accuracy(self, accuracy: float):
        """Update current accuracy metric."""
        MODEL_ACCURACY.labels(
            model_name=self.model_name,
            model_version=self.model_version
        ).set(accuracy)

    def update_drift_score(
        self,
        feature_name: str,
        drift_type: str,
        score: float
    ):
        """Update drift score for a feature."""
        DRIFT_SCORE.labels(
            model_name=self.model_name,
            feature_name=feature_name,
            drift_type=drift_type
        ).set(score)
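A usage sketch wrapping the collector around a prediction call; the model object, feature name, and metric values are placeholders:

# Example usage (model object and metric values are placeholders)
import time

collector = MLMetricsCollector(model_name="churn-model", model_version="v3", port=8000)

def predict_with_metrics(model, features):
    start = time.perf_counter()
    try:
        prediction = model.predict(features)
        collector.record_prediction(time.perf_counter() - start, outcome="success")
        return prediction
    except Exception:
        collector.record_prediction(time.perf_counter() - start, outcome="error")
        raise

# Pushed periodically from a batch evaluation / drift job
collector.update_accuracy(0.87)
collector.update_drift_score(feature_name="account_age", drift_type="data_drift", score=0.12)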
# Prometheus alerting rules (YAML); evaluated by Prometheus and routed via Alertmanager
ALERT_RULES = """
groups:
  - name: ml_model_alerts
    rules:
      - alert: HighPredictionLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, model_name, model_version) (
              rate(ml_prediction_latency_seconds_bucket[5m])
            )
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High prediction latency detected"
          description: "P99 latency is {{ $value }}s"

      - alert: ModelAccuracyDrop
        expr: ml_model_accuracy < 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"
          description: "Accuracy is {{ $value }}"

      - alert: DataDriftDetected
        expr: ml_drift_score > 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Data drift detected"
          description: "Drift score is {{ $value }} for {{ $labels.feature_name }}"

      - alert: HighErrorRate
        expr: |
          sum by (model_name, model_version) (rate(ml_predictions_total{outcome="error"}[5m]))
            /
          sum by (model_name, model_version) (rate(ml_predictions_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High prediction error rate"
"""
# ab_testing.py
import numpy as np
from scipy import stats
from dataclasses import dataclass


@dataclass
class ABTestResult:
    winner: str | None
    p_value: float
    control_mean: float
    treatment_mean: float
    lift: float
    confidence_interval: tuple
    is_significant: bool
    sample_sizes: dict


class ABTestAnalyzer:
    """Statistical A/B test analysis for ML models."""

    def __init__(
        self,
        alpha: float = 0.05,
        power: float = 0.8,
        min_effect_size: float = 0.02
    ):
        self.alpha = alpha
        self.power = power
        self.min_effect_size = min_effect_size

    def calculate_sample_size(
        self,
        baseline_rate: float,
        expected_lift: float
    ) -> int:
        """Calculate required sample size per variant.

        `expected_lift` is the absolute increase over `baseline_rate`
        (e.g. 0.02 for a 10% -> 12% conversion rate).
        """
        from statsmodels.stats.power import NormalIndPower
        from statsmodels.stats.proportion import proportion_effectsize

        # Cohen's h effect size for two proportions
        effect_size = proportion_effectsize(
            baseline_rate + expected_lift, baseline_rate
        )
        analysis = NormalIndPower()
        sample_size = analysis.solve_power(
            effect_size=effect_size,
            alpha=self.alpha,
            power=self.power,
            ratio=1.0,
            alternative='two-sided'
        )
        return int(np.ceil(sample_size))

    def analyze_continuous(
        self,
        control: np.ndarray,
        treatment: np.ndarray,
        metric_name: str = "conversion"
    ) -> ABTestResult:
        """
        Analyze A/B test with continuous metric (Welch's t-test).
        """
        # Welch's two-sample t-test (does not assume equal variances)
        t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

        control_mean = np.mean(control)
        treatment_mean = np.mean(treatment)
        lift = (treatment_mean - control_mean) / control_mean

        # Confidence interval for the difference in means
        var_control = np.var(control, ddof=1) / len(control)
        var_treatment = np.var(treatment, ddof=1) / len(treatment)
        se_diff = np.sqrt(var_control + var_treatment)
        # Welch-Satterthwaite degrees of freedom
        df = (var_control + var_treatment) ** 2 / (
            var_control ** 2 / (len(control) - 1) +
            var_treatment ** 2 / (len(treatment) - 1)
        )
        ci = stats.t.interval(
            1 - self.alpha,
            df,
            loc=treatment_mean - control_mean,
            scale=se_diff
        )

        is_significant = p_value < self.alpha
        winner = None
        if is_significant:
            winner = "treatment" if treatment_mean > control_mean else "control"

        return ABTestResult(
            winner=winner,
            p_value=p_value,
            control_mean=control_mean,
            treatment_mean=treatment_mean,
            lift=lift,
            confidence_interval=ci,
            is_significant=is_significant,
            sample_sizes={"control": len(control), "treatment": len(treatment)}
        )

    def analyze_proportions(
        self,
        control_successes: int,
        control_total: int,
        treatment_successes: int,
        treatment_total: int
    ) -> ABTestResult:
        """
        Analyze A/B test with binary metric (chi-square).
        """
        # Chi-square test on the 2x2 contingency table
        contingency = np.array([
            [control_successes, control_total - control_successes],
            [treatment_successes, treatment_total - treatment_successes]
        ])
        chi2, p_value, _, _ = stats.chi2_contingency(contingency)

        control_rate = control_successes / control_total
        treatment_rate = treatment_successes / treatment_total
        lift = (treatment_rate - control_rate) / control_rate

        # Normal-approximation confidence interval for the difference in rates
        diff = treatment_rate - control_rate
        se_diff = np.sqrt(
            control_rate * (1 - control_rate) / control_total +
            treatment_rate * (1 - treatment_rate) / treatment_total
        )
        z = stats.norm.ppf(1 - self.alpha / 2)
        ci = (diff - z * se_diff, diff + z * se_diff)

        is_significant = p_value < self.alpha
        winner = None
        if is_significant:
            winner = "treatment" if treatment_rate > control_rate else "control"

        return ABTestResult(
            winner=winner,
            p_value=p_value,
            control_mean=control_rate,
            treatment_mean=treatment_rate,
            lift=lift,
            confidence_interval=ci,
            is_significant=is_significant,
            sample_sizes={"control": control_total, "treatment": treatment_total}
        )

    def _wilson_ci(self, p: float, n: int, alpha: float) -> tuple:
        """Wilson score confidence interval for a single proportion."""
        z = stats.norm.ppf(1 - alpha / 2)
        denominator = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denominator
        spread = z * np.sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denominator
        return (center - spread, center + spread)
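A usage sketch for both analysis paths; the numbers are synthetic:

# Example usage (synthetic data)
rng = np.random.default_rng(42)
analyzer = ABTestAnalyzer(alpha=0.05, power=0.8)

# Required sample size per variant for a 2-point absolute lift over a 10% baseline
print("n per variant:", analyzer.calculate_sample_size(baseline_rate=0.10, expected_lift=0.02))

# Continuous metric (e.g., predicted revenue per user)
control = rng.normal(loc=5.0, scale=1.0, size=5_000)
treatment = rng.normal(loc=5.1, scale=1.0, size=5_000)
result = analyzer.analyze_continuous(control, treatment)
print(result.winner, result.p_value, f"lift={result.lift:.2%}")

# Binary metric (e.g., conversion)
result = analyzer.analyze_proportions(
    control_successes=480, control_total=5_000,
    treatment_successes=540, treatment_total=5_000,
)
print(result.winner, result.p_value, f"lift={result.lift:.2%}")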
START: What type of monitoring?
│
├─→ [Real-time] → Latency constraint?
│ ├─→ <100ms: Statistical tests (PSI, KS)
│ └─→ Flexible: Model-based detection
│
├─→ [Batch] → Data volume?
│ ├─→ Large: Sampling + Evidently
│ └─→ Small: Full comparison
│
└─→ [Continuous] → Use sliding windows
├─→ Page-Hinkley for gradual drift
└─→ ADWIN for sudden drift
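For the continuous / sliding-window branch, a self-contained Page-Hinkley sketch (written from scratch rather than using a specific library's API); the delta and lambda defaults are placeholders and need tuning per metric:

# page_hinkley.py -- minimal from-scratch sketch; delta / lambda_ need per-metric tuning
class PageHinkley:
    """Detects a sustained upward shift in a streaming metric (e.g., per-batch error rate)."""

    def __init__(self, delta: float = 0.005, lambda_: float = 50.0):
        self.delta = delta          # tolerated magnitude of change
        self.lambda_ = lambda_      # detection threshold
        self.mean = 0.0
        self.cum_sum = 0.0
        self.min_cum_sum = 0.0
        self.n = 0

    def update(self, value: float) -> bool:
        """Feed one observation; returns True when a change is detected."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum_sum += value - self.mean - self.delta
        self.min_cum_sum = min(self.min_cum_sum, self.cum_sum)
        return (self.cum_sum - self.min_cum_sum) > self.lambda_

Feeding a per-batch metric into update() and alerting when it returns True gives gradual-drift detection without storing a full window.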
┌─────────────────┬───────────┬───────────┬────────────┐
│ Metric │ Warning │ Critical │ Action │
├─────────────────┼───────────┼───────────┼────────────┤
│ Accuracy │ <90% │ <80% │ Retrain │
│ Latency P99 │ >200ms │ >500ms │ Scale/Opt │
│ Error Rate │ >1% │ >5% │ Rollback │
│ Drift Score │ >0.1 │ >0.3 │ Investigate│
│ Throughput │ <80% exp │ <50% exp │ Scale up │
└─────────────────┴───────────┴───────────┴────────────┘
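These thresholds are starting points to tune against each model's baseline; below is a small helper that maps a metric value to a severity, mirroring the table (throughput is omitted because it needs an expected-rate baseline):

# threshold_check.py -- thresholds mirror the table above; tune per model
THRESHOLDS = {
    # metric: (warning predicate, critical predicate)
    "accuracy":    (lambda v: v < 0.90, lambda v: v < 0.80),
    "latency_p99": (lambda v: v > 0.200, lambda v: v > 0.500),   # seconds
    "error_rate":  (lambda v: v > 0.01, lambda v: v > 0.05),
    "drift_score": (lambda v: v > 0.1,  lambda v: v > 0.3),
}

def severity(metric: str, value: float) -> str:
    warning, critical = THRESHOLDS[metric]
    if critical(value):
        return "critical"
    if warning(value):
        return "warning"
    return "ok"

assert severity("accuracy", 0.85) == "warning"
assert severity("drift_score", 0.35) == "critical"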
| Issue | Root Cause | Detection | Resolution |
|---|---|---|---|
| False drift alerts | Low sample size | Alert noise | Increase window size |
| Missed drift | Threshold too high | Performance drop | Lower thresholds |
| Alert fatigue | Too many metrics | Team ignores alerts | Prioritize, aggregate |
| Delayed detection | Batch processing lag | Stale metrics | Reduce batch interval |
| Metric gaps | Collection failures | Missing data points | Add redundancy |
□ 1. Verify reference data quality
□ 2. Check sample sizes are sufficient
□ 3. Validate statistical test assumptions
□ 4. Confirm metric collection is working
□ 5. Test alert routing end-to-end
□ 6. Verify drift thresholds are calibrated
□ 7. Check for seasonal patterns
□ 8. Validate ground truth availability
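Items 1-2 of the checklist lend themselves to an automated pre-flight check before alerts are enabled; the minimum-row and null-fraction limits here are assumptions:

# preflight_checks.py -- limits are illustrative assumptions
import pandas as pd

def preflight_check(reference: pd.DataFrame, min_rows: int = 1_000, max_null_frac: float = 0.05) -> list[str]:
    """Return a list of problems with the reference data; an empty list means checks passed."""
    problems = []
    if len(reference) < min_rows:
        problems.append(f"reference has only {len(reference)} rows (< {min_rows})")
    null_frac = reference.isna().mean()
    for column, frac in null_frac[null_frac > max_null_frac].items():
        problems.append(f"column '{column}' is {frac:.1%} null (> {max_null_frac:.0%})")
    return problems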
[INFO] monitoring_started → Monitoring pipeline active
[INFO] metrics_collected → Batch metrics collected
[WARN] drift_warning → Drift score approaching threshold
[WARN] sample_size_low → Insufficient data for significance
[ERROR] collection_failed → Failed to collect metrics
[ERROR] alert_send_failed → Alert notification failed
[FATAL] monitoring_down → Monitoring pipeline stopped
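A minimal sketch of emitting these events as structured logs so downstream alerting can key on the event name; the logger configuration and field names are assumptions:

# event_logging.py -- minimal structured-event sketch
import json
import logging

logger = logging.getLogger("ml_monitoring")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def log_event(level: int, event: str, **fields) -> None:
    """Emit one monitoring event as a JSON payload keyed by event name."""
    logger.log(level, json.dumps({"event": event, **fields}))

log_event(logging.INFO, "monitoring_started", model="churn-model")
log_event(logging.WARNING, "drift_warning", feature="account_age", drift_score=0.18)
log_event(logging.ERROR, "collection_failed", reason="metric scrape timeout")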
ml-monitoring (PRIMARY_BOND)
- 04-training-pipelines - receives training metrics baseline
- 05-model-serving - receives serving metrics
- 04-training-pipelines - triggers retraining
- 01-mlops-fundamentals - informs process improvements

| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade: Evidently, Prometheus, A/B testing |
| 1.0.0 | 2024-11 | Initial release with SASMP v1.3.0 compliance |