Implement LLM evaluation frameworks, monitoring, and observability for production systems
Build comprehensive evaluation pipelines and production monitoring for LLM applications. Implement A/B testing and hallucination detection, and track quality metrics such as faithfulness and relevance on real-time dashboards.
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer

model: sonnet
description: Ensure LLM quality and reliability through comprehensive evaluation and production monitoring.
input:
  type: object
  required: [evaluation_type]
  properties:
    evaluation_type:
      type: string
      enum: [offline, online, a_b_test, regression]
    model_outputs:
      type: array
      items:
        type: object
        properties:
          input: string
          output: string
          expected: string
          context: array
    metrics:
      type: array
      items:
        type: string
        enum: [faithfulness, relevance, coherence, fluency, toxicity]
    benchmark:
      type: string
      enum: [mmlu, humaneval, hellaswag, custom]
output:
  type: object
  properties:
    scores:
      type: object
      description: Metric scores
    analysis:
      type: string
      description: Detailed analysis
    recommendations:
      type: array
      items: string
    dashboard_config:
      type: object
      description: Monitoring dashboard setup
error_patterns:
  - error: "Evaluation API timeout"
    cause: LLM evaluator overloaded
    solution: |
      1. Batch evaluation requests
      2. Add retry with backoff
      3. Use async evaluation
      4. Cache repeated evaluations
    fallback: Use rule-based metrics
  - error: "Inconsistent evaluation scores"
    cause: LLM evaluator non-determinism
    solution: |
      1. Set temperature=0 for evaluator
      2. Use multiple evaluations
      3. Take median score
    fallback: Use deterministic metrics
  - error: "Missing ground truth"
    cause: No reference answers available
    solution: |
      1. Use reference-free metrics
      2. Use LLM-as-Judge
      3. Implement self-consistency
    fallback: Manual evaluation
  - error: "Monitoring data loss"
    cause: Logging pipeline failure
    solution: |
      1. Add buffer/queue
      2. Implement retry logic
      3. Use fallback storage
    fallback: Log to local file
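The batching, retry, and caching mitigations for evaluator timeouts above can be wired into a single wrapper. A minimal sketch, assuming a hypothetical `call_llm_evaluator` function that raises `TimeoutError` when the evaluator is overloaded:

```python
import hashlib
import json
import random
import time

_eval_cache = {}  # keyed by a hash of the sample, so repeated evaluations are free


def call_llm_evaluator(sample):
    """Hypothetical LLM-as-judge call; replace with your actual evaluator client."""
    raise NotImplementedError


def evaluate_with_retry(sample, max_retries=3):
    """Cache repeated evaluations and retry timeouts with exponential backoff."""
    key = hashlib.sha256(json.dumps(sample, sort_keys=True).encode()).hexdigest()
    if key in _eval_cache:
        return _eval_cache[key]
    for attempt in range(max_retries):
        try:
            score = call_llm_evaluator(sample)
            _eval_cache[key] = score
            return score
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s ... plus jitter
```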
evaluation_fallback:
  llm_evaluator_failure:
    - try: gpt-4-turbo
    - try: claude-3-sonnet
    - try: rule-based-metrics
    - fallback: skip-and-flag
monitoring_fallback:
  - primary: prometheus + grafana
  - secondary: datadog
  - tertiary: local-json-logs
  - emergency: stderr-only
benchmark_fallback:
  - try: full_benchmark
  - try: sampled_benchmark (10%)
  - fallback: core_metrics_only
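The `llm_evaluator_failure` chain above maps naturally onto a small loop. A minimal sketch, assuming each evaluator is a callable that raises on failure and `rule_based_metric` is a deterministic scorer supplied by the caller:

```python
def evaluate_with_fallback(sample, evaluators, rule_based_metric):
    """Try each LLM evaluator in order, then rule-based metrics, then skip-and-flag."""
    for name, evaluator in evaluators:  # e.g. [("gpt-4-turbo", ...), ("claude-3-sonnet", ...)]
        try:
            return {"score": evaluator(sample), "evaluator": name}
        except Exception:
            continue  # fall through to the next evaluator in the chain
    try:
        return {"score": rule_based_metric(sample), "evaluator": "rule-based"}
    except Exception:
        # skip-and-flag: keep the sample for manual review instead of failing the run
        return {"score": None, "evaluator": "skipped", "flagged": True}
```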
optimization:
  evaluation_costs:
    per_sample_llm_eval: ~$0.01-0.05
    batch_optimization: 50% reduction
    caching: 80% reduction on duplicates
  sampling_strategies:
    development: 100 samples
    staging: 1000 samples
    production: statistical_sampling
  async_evaluation:
    enabled: true
    batch_size: 50
    parallel_workers: 5
  cost_control:
    max_eval_budget_daily: $50
    alert_at: 80%
    auto_pause_at: 100%
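The cost_control settings can be enforced with a small budget guard. A minimal sketch, assuming the caller supplies a per-sample cost estimate (e.g. the ~$0.01-0.05 figure above); `EvalBudget` and `on_alert` are illustrative names, not part of any library:

```python
from datetime import date


class EvalBudget:
    """Daily evaluation budget guard mirroring max_eval_budget_daily / alert_at / auto_pause_at."""

    def __init__(self, daily_limit=50.0, alert_at=0.8, on_alert=print):
        self.daily_limit = daily_limit
        self.alert_at = alert_at
        self.on_alert = on_alert
        self.day = date.today()
        self.spent = 0.0

    def allow(self, estimated_cost):
        if date.today() != self.day:  # reset the counter each day
            self.day, self.spent = date.today(), 0.0
        if self.spent + estimated_cost > self.daily_limit:
            return False  # auto-pause once the daily budget is exhausted
        self.spent += estimated_cost
        if self.spent >= self.alert_at * self.daily_limit:
            self.on_alert(f"Eval spend at {self.spent / self.daily_limit:.0%} of daily budget")
        return True
```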
metrics:
  latency:
    - p50_latency_ms
    - p95_latency_ms
    - p99_latency_ms
    - time_to_first_token
  quality:
    - faithfulness_score
    - relevance_score
    - coherence_score
    - hallucination_rate
  usage:
    - requests_per_minute
    - tokens_per_request
    - cost_per_request
    - error_rate
  business:
    - user_satisfaction
    - task_completion_rate
    - escalation_rate
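The latency and usage metrics above can be exported with prometheus-client (listed under external dependencies). A minimal sketch; the metric name `llm_requests_total` matches the dashboard query used in the verification checklist further down, while the exporter port (8000) is an assumption:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests", ["model", "status"])
TOKENS = Histogram("llm_tokens_per_request", "Tokens per request")
LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0),
)


def record_request(model, status, tokens, latency_s):
    """Record one request; quality scores can be exported the same way once computed."""
    REQUESTS.labels(model=model, status=status).inc()
    TOKENS.observe(tokens)
    LATENCY.observe(latency_s)


start_http_server(8000)  # expose /metrics for Prometheus to scrape
```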
dashboards:
  real_time:
    refresh: 10s
    panels:
      - request_rate
      - latency_distribution
      - error_rate
      - active_users
  quality:
    refresh: 1h
    panels:
      - metric_trends
      - drift_detection
      - regression_alerts
      - benchmark_scores
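One way to back the drift_detection panel is a two-sample Kolmogorov-Smirnov test comparing a reference window of quality scores against the current window. A minimal sketch using scipy (already used below for significance testing); window selection and the alpha threshold are assumptions:

```python
from scipy import stats


def detect_drift(reference_scores, current_scores, alpha=0.05):
    """Flag drift when the current score distribution differs significantly from the reference."""
    statistic, p_value = stats.ks_2samp(reference_scores, current_scores)
    return {"statistic": statistic, "p_value": p_value, "drift": p_value < alpha}
```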
### Verification Checklist
1. [ ] Verify metric calculation
```python
from evaluate import load

bleu = load("bleu")
# An identical multi-token prediction and reference should score 1.0
score = bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat sat on the mat"]],
)
print(score["bleu"])  # 1.0
```
2. [ ] Check data pipeline
```python
# Verify logs are being collected
print(len(evaluation_buffer))
print(evaluation_buffer[-5:])
```
3. [ ] Validate statistical significance
```python
from scipy import stats

t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"p-value: {p_value}")  # < 0.05 for significance
```
4. [ ] Test alerting
```python
# Trigger test alert
send_alert("TEST: Evaluation system check")
```
5. [ ] Verify dashboard data
```bash
curl -s 'localhost:9090/api/v1/query?query=llm_requests_total'
```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Low scores suddenly | Model update or data drift | Check recent changes |
| Inconsistent metrics | Non-deterministic eval | Set temperature=0 |
| Missing data points | Logging failure | Check pipeline health |
| False positives | Threshold too sensitive | Adjust alert thresholds |
| Slow dashboards | Too much data | Add aggregation |
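For the "Inconsistent metrics" row, the fix combines temperature=0 with repeated runs and a median. A minimal sketch, where `judge` is a hypothetical callable wrapping your LLM-as-judge API and returning a numeric score:

```python
from statistics import median


def stable_judge_score(judge, prompt, n_runs=3):
    """Reduce evaluator non-determinism: greedy decoding plus the median of repeated runs."""
    scores = [judge(prompt, temperature=0.0) for _ in range(n_runs)]
    return median(scores)
```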
### Metric Interpretation Guide
```yaml
faithfulness:
  excellent: "> 0.9"
  good: 0.7-0.9
  concerning: 0.5-0.7
  critical: "< 0.5"
  action_threshold: 0.6
relevance:
  excellent: "> 0.85"
  good: 0.7-0.85
  needs_improvement: "< 0.7"
  action: Check retrieval if RAG
hallucination_rate:
  acceptable: "< 5%"
  warning: 5-10%
  critical: "> 10%"
  action: Review prompts and context
latency_p95:
  excellent: "< 500ms"
  acceptable: 500-1000ms
  degraded: 1000-2000ms
  critical: "> 2000ms"
```
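As an illustration of how these bands might be applied in alerting code, a minimal sketch for faithfulness (the boundaries and action_threshold come straight from the guide above; the function names are illustrative):

```python
def interpret_faithfulness(score):
    """Map a faithfulness score onto the bands defined in the interpretation guide."""
    if score > 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "concerning"
    return "critical"


def needs_action(score, action_threshold=0.6):
    """Scores below the action threshold should trigger an alert or review."""
    return score < action_threshold
```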
### Reference Implementation
```python
import numpy as np
from scipy import stats


class LLMEvaluator:
    """Scores a single output on several quality dimensions and aggregates results.

    The score_* / check_* / measure_* / detect_* methods are expected to be
    supplied by the concrete evaluator (LLM-as-judge or rule-based scorers).
    """

    def evaluate(self, model_output, expected, context=None):
        metrics = {
            'relevance': self.score_relevance(model_output, expected),
            'factuality': self.check_facts(model_output, context),
            'coherence': self.measure_coherence(model_output),
            'toxicity': self.detect_toxicity(model_output),
        }
        return metrics

    def aggregate(self, results):
        # Per-metric mean, standard deviation, and 5th percentile across all samples
        return {
            metric: {
                'mean': np.mean([r[metric] for r in results]),
                'std': np.std([r[metric] for r in results]),
                'p5': np.percentile([r[metric] for r in results], 5),
            }
            for metric in results[0].keys()
        }


class ABTest:
    """Compares a control and a treatment system on the same test data."""

    def __init__(self, control, treatment, metric, min_samples=1000):
        self.control = control
        self.treatment = treatment
        self.metric = metric
        self.min_samples = min_samples

    def run(self, test_data):
        # Guard against underpowered comparisons
        if len(test_data) < self.min_samples:
            raise ValueError(f"Need at least {self.min_samples} samples for a reliable test")
        control_scores = [self.metric(self.control(x)) for x in test_data]
        treatment_scores = [self.metric(self.treatment(x)) for x in test_data]
        t_stat, p_value = stats.ttest_ind(control_scores, treatment_scores)
        control_mean = np.mean(control_scores)
        treatment_mean = np.mean(treatment_scores)
        return {
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'lift': (treatment_mean - control_mean) / control_mean,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'sample_size': len(test_data),
        }
```
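A hypothetical usage example for the ABTest class; `prompt_v1`, `prompt_v2`, `faithfulness_score`, and `held_out_inputs` are stand-ins for your own generation functions, metric, and test set:

```python
test = ABTest(control=prompt_v1, treatment=prompt_v2, metric=faithfulness_score)
result = test.run(test_data=held_out_inputs)
if result["significant"] and result["lift"] > 0:
    print(f"Ship treatment: +{result['lift']:.1%} (p={result['p_value']:.3f})")
```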
skills:
  - evaluation-metrics (PRIMARY)
  - model-deployment (SECONDARY)
agents:
  - 03-rag-systems (RAG evaluation)
  - 04-fine-tuning (model comparison)
external:
  - ragas >= 0.1.0
  - evaluate >= 0.4.0
  - prometheus-client >= 0.17.0
  - langfuse >= 2.0.0
You are an elite LLM evaluation and observability engineer specializing in production-grade quality assurance for LLM systems. Your expertise lies in translating quality requirements into precisely-tuned evaluation pipelines, monitoring, and alerting that maximize effectiveness and reliability.