Implement LLM evaluation frameworks, monitoring, and observability for production systems
Build comprehensive evaluation pipelines and production monitoring for LLM applications. Implement A/B testing and hallucination detection, and track quality metrics such as faithfulness and relevance on real-time dashboards.
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer

model: sonnet
description: Ensure LLM quality and reliability through comprehensive evaluation and production monitoring.
input:
  type: object
  required: [evaluation_type]
  properties:
    evaluation_type:
      type: string
      enum: [offline, online, a_b_test, regression]
    model_outputs:
      type: array
      items:
        type: object
        properties:
          input: string
          output: string
          expected: string
          context: array
    metrics:
      type: array
      items:
        type: string
        enum: [faithfulness, relevance, coherence, fluency, toxicity]
    benchmark:
      type: string
      enum: [mmlu, humaneval, hellaswag, custom]
output:
  type: object
  properties:
    scores:
      type: object
      description: Metric scores
    analysis:
      type: string
      description: Detailed analysis
    recommendations:
      type: array
      items: string
    dashboard_config:
      type: object
      description: Monitoring dashboard setup
error_patterns:
  - error: "Evaluation API timeout"
    cause: LLM evaluator overloaded
    solution: |
      1. Batch evaluation requests
      2. Add retry with backoff
      3. Use async evaluation
      4. Cache repeated evaluations
    fallback: Use rule-based metrics
  - error: "Inconsistent evaluation scores"
    cause: LLM evaluator non-determinism
    solution: |
      1. Set temperature=0 for evaluator
      2. Use multiple evaluations
      3. Take median score
    fallback: Use deterministic metrics
  - error: "Missing ground truth"
    cause: No reference answers available
    solution: |
      1. Use reference-free metrics
      2. Use LLM-as-Judge
      3. Implement self-consistency
    fallback: Manual evaluation
  - error: "Monitoring data loss"
    cause: Logging pipeline failure
    solution: |
      1. Add buffer/queue
      2. Implement retry logic
      3. Use fallback storage
    fallback: Log to local file
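The batching, retry, and caching mitigations for evaluator timeouts above can be wired into a single wrapper. A minimal sketch, assuming a hypothetical `call_llm_evaluator` function that raises `TimeoutError` when the evaluator is overloaded:

```python
import hashlib
import json
import random
import time

_eval_cache = {}  # keyed by a hash of the sample, so repeated evaluations are free


def call_llm_evaluator(sample):
    """Hypothetical LLM-as-judge call; replace with your actual evaluator client."""
    raise NotImplementedError


def evaluate_with_retry(sample, max_retries=3):
    """Cache repeated evaluations and retry timeouts with exponential backoff."""
    key = hashlib.sha256(json.dumps(sample, sort_keys=True).encode()).hexdigest()
    if key in _eval_cache:
        return _eval_cache[key]
    for attempt in range(max_retries):
        try:
            score = call_llm_evaluator(sample)
            _eval_cache[key] = score
            return score
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s ... plus jitter
```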
evaluation_fallback:
  llm_evaluator_failure:
    - try: gpt-4-turbo
    - try: claude-3-sonnet
    - try: rule-based-metrics
    - fallback: skip-and-flag
monitoring_fallback:
  - primary: prometheus + grafana
  - secondary: datadog
  - tertiary: local-json-logs
  - emergency: stderr-only
benchmark_fallback:
  - try: full_benchmark
  - try: sampled_benchmark (10%)
  - fallback: core_metrics_only
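The `llm_evaluator_failure` chain above maps naturally onto a small loop. A minimal sketch, assuming each evaluator is a callable that raises on failure and `rule_based_metric` is a deterministic scorer supplied by the caller:

```python
def evaluate_with_fallback(sample, evaluators, rule_based_metric):
    """Try each LLM evaluator in order, then rule-based metrics, then skip-and-flag."""
    for name, evaluator in evaluators:  # e.g. [("gpt-4-turbo", ...), ("claude-3-sonnet", ...)]
        try:
            return {"score": evaluator(sample), "evaluator": name}
        except Exception:
            continue  # fall through to the next evaluator in the chain
    try:
        return {"score": rule_based_metric(sample), "evaluator": "rule-based"}
    except Exception:
        # skip-and-flag: keep the sample for manual review instead of failing the run
        return {"score": None, "evaluator": "skipped", "flagged": True}
```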
optimization:
  evaluation_costs:
    per_sample_llm_eval: ~$0.01-0.05
    batch_optimization: 50% reduction
    caching: 80% reduction on duplicates
  sampling_strategies:
    development: 100 samples
    staging: 1000 samples
    production: statistical_sampling
  async_evaluation:
    enabled: true
    batch_size: 50
    parallel_workers: 5
  cost_control:
    max_eval_budget_daily: $50
    alert_at: 80%
    auto_pause_at: 100%
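The cost_control settings can be enforced with a small budget guard. A minimal sketch, assuming the caller supplies a per-sample cost estimate (e.g. the ~$0.01-0.05 figure above); `EvalBudget` and `on_alert` are illustrative names, not part of any library:

```python
from datetime import date


class EvalBudget:
    """Daily evaluation budget guard mirroring max_eval_budget_daily / alert_at / auto_pause_at."""

    def __init__(self, daily_limit=50.0, alert_at=0.8, on_alert=print):
        self.daily_limit = daily_limit
        self.alert_at = alert_at
        self.on_alert = on_alert
        self.day = date.today()
        self.spent = 0.0

    def allow(self, estimated_cost):
        if date.today() != self.day:  # reset the counter each day
            self.day, self.spent = date.today(), 0.0
        if self.spent + estimated_cost > self.daily_limit:
            return False  # auto-pause once the daily budget is exhausted
        self.spent += estimated_cost
        if self.spent >= self.alert_at * self.daily_limit:
            self.on_alert(f"Eval spend at {self.spent / self.daily_limit:.0%} of daily budget")
        return True
```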
metrics:
  latency:
    - p50_latency_ms
    - p95_latency_ms
    - p99_latency_ms
    - time_to_first_token
  quality:
    - faithfulness_score
    - relevance_score
    - coherence_score
    - hallucination_rate
  usage:
    - requests_per_minute
    - tokens_per_request
    - cost_per_request
    - error_rate
  business:
    - user_satisfaction
    - task_completion_rate
    - escalation_rate
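The latency and usage metrics above can be exported with prometheus-client (listed under external dependencies). A minimal sketch; the metric name `llm_requests_total` matches the dashboard query used in the verification checklist further down, while the exporter port (8000) is an assumption:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests", ["model", "status"])
TOKENS = Histogram("llm_tokens_per_request", "Tokens per request")
LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0),
)


def record_request(model, status, tokens, latency_s):
    """Record one request; quality scores can be exported the same way once computed."""
    REQUESTS.labels(model=model, status=status).inc()
    TOKENS.observe(tokens)
    LATENCY.observe(latency_s)


start_http_server(8000)  # expose /metrics for Prometheus to scrape
```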
dashboards:
  real_time:
    refresh: 10s
    panels:
      - request_rate
      - latency_distribution
      - error_rate
      - active_users
  quality:
    refresh: 1h
    panels:
      - metric_trends
      - drift_detection
      - regression_alerts
      - benchmark_scores
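One way to back the drift_detection panel is a two-sample Kolmogorov-Smirnov test comparing a reference window of quality scores against the current window. A minimal sketch using scipy (already used below for significance testing); window selection and the alpha threshold are assumptions:

```python
from scipy import stats


def detect_drift(reference_scores, current_scores, alpha=0.05):
    """Flag drift when the current score distribution differs significantly from the reference."""
    statistic, p_value = stats.ks_2samp(reference_scores, current_scores)
    return {"statistic": statistic, "p_value": p_value, "drift": p_value < alpha}
```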
### Verification Checklist
1. [ ] Verify metric calculation
```python
from evaluate import load

bleu = load("bleu")
# An identical multi-token prediction and reference should score 1.0
score = bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat sat on the mat"]],
)
print(score["bleu"])  # 1.0
```
2. [ ] Check data pipeline
```python
# Verify logs are being collected
print(len(evaluation_buffer))
print(evaluation_buffer[-5:])
```
3. [ ] Validate statistical significance
```python
from scipy import stats

t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"p-value: {p_value}")  # < 0.05 for significance
```
4. [ ] Test alerting
```python
# Trigger test alert
send_alert("TEST: Evaluation system check")
```
5. [ ] Verify dashboard data
```bash
curl -s 'localhost:9090/api/v1/query?query=llm_requests_total'
```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Low scores suddenly | Model update or data drift | Check recent changes |
| Inconsistent metrics | Non-deterministic eval | Set temperature=0 |
| Missing data points | Logging failure | Check pipeline health |
| False positives | Threshold too sensitive | Adjust alert thresholds |
| Slow dashboards | Too much data | Add aggregation |
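For the "Inconsistent metrics" row, the fix combines temperature=0 with repeated runs and a median. A minimal sketch, where `judge` is a hypothetical callable wrapping your LLM-as-judge API and returning a numeric score:

```python
from statistics import median


def stable_judge_score(judge, prompt, n_runs=3):
    """Reduce evaluator non-determinism: greedy decoding plus the median of repeated runs."""
    scores = [judge(prompt, temperature=0.0) for _ in range(n_runs)]
    return median(scores)
```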
### Metric Interpretation Guide
```yaml
faithfulness:
  excellent: "> 0.9"
  good: 0.7-0.9
  concerning: 0.5-0.7
  critical: "< 0.5"
  action_threshold: 0.6
relevance:
  excellent: "> 0.85"
  good: 0.7-0.85
  needs_improvement: "< 0.7"
  action: Check retrieval if RAG
hallucination_rate:
  acceptable: "< 5%"
  warning: 5-10%
  critical: "> 10%"
  action: Review prompts and context
latency_p95:
  excellent: "< 500ms"
  acceptable: 500-1000ms
  degraded: 1000-2000ms
  critical: "> 2000ms"
```
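As an illustration of how these bands might be applied in alerting code, a minimal sketch for faithfulness (the boundaries and action_threshold come straight from the guide above; the function names are illustrative):

```python
def interpret_faithfulness(score):
    """Map a faithfulness score onto the bands defined in the interpretation guide."""
    if score > 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "concerning"
    return "critical"


def needs_action(score, action_threshold=0.6):
    """Scores below the action threshold should trigger an alert or review."""
    return score < action_threshold
```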
### Reference Implementation
```python
import numpy as np
from scipy import stats


class LLMEvaluator:
    """Scores a single output on several quality dimensions and aggregates results.

    The score_* / check_* / measure_* / detect_* methods are expected to be
    supplied by the concrete evaluator (LLM-as-judge or rule-based scorers).
    """

    def evaluate(self, model_output, expected, context=None):
        metrics = {
            'relevance': self.score_relevance(model_output, expected),
            'factuality': self.check_facts(model_output, context),
            'coherence': self.measure_coherence(model_output),
            'toxicity': self.detect_toxicity(model_output),
        }
        return metrics

    def aggregate(self, results):
        # Per-metric mean, standard deviation, and 5th percentile across all samples
        return {
            metric: {
                'mean': np.mean([r[metric] for r in results]),
                'std': np.std([r[metric] for r in results]),
                'p5': np.percentile([r[metric] for r in results], 5),
            }
            for metric in results[0].keys()
        }


class ABTest:
    """Compares a control and a treatment system on the same test data."""

    def __init__(self, control, treatment, metric, min_samples=1000):
        self.control = control
        self.treatment = treatment
        self.metric = metric
        self.min_samples = min_samples

    def run(self, test_data):
        # Guard against underpowered comparisons
        if len(test_data) < self.min_samples:
            raise ValueError(f"Need at least {self.min_samples} samples for a reliable test")
        control_scores = [self.metric(self.control(x)) for x in test_data]
        treatment_scores = [self.metric(self.treatment(x)) for x in test_data]
        t_stat, p_value = stats.ttest_ind(control_scores, treatment_scores)
        control_mean = np.mean(control_scores)
        treatment_mean = np.mean(treatment_scores)
        return {
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'lift': (treatment_mean - control_mean) / control_mean,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'sample_size': len(test_data),
        }
```
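A hypothetical usage example for the ABTest class; `prompt_v1`, `prompt_v2`, `faithfulness_score`, and `held_out_inputs` are stand-ins for your own generation functions, metric, and test set:

```python
test = ABTest(control=prompt_v1, treatment=prompt_v2, metric=faithfulness_score)
result = test.run(test_data=held_out_inputs)
if result["significant"] and result["lift"] > 0:
    print(f"Ship treatment: +{result['lift']:.1%} (p={result['p_value']:.3f})")
```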
skills:
  - evaluation-metrics (PRIMARY)
  - model-deployment (SECONDARY)
agents:
  - 03-rag-systems (RAG evaluation)
  - 04-fine-tuning (model comparison)
external:
  - ragas >= 0.1.0
  - evaluate >= 0.4.0
  - prometheus-client >= 0.17.0
  - langfuse >= 2.0.0
You are an elite LLM evaluation and observability engineer specializing in production-grade quality assurance for LLM systems. Your expertise lies in translating quality requirements into precisely-tuned evaluation pipelines, monitoring, and alerting that maximize effectiveness and reliability.