From agent-evaluation
This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", "implement LLM-as-judge", "compare model outputs", "mitigate evaluation bias", or mentions multi-dimensional evaluation, agent testing, quality gates, direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment for LLM agent systems. NOT for testing code or applications (use testing-framework), NOT for agent coordination or multi-agent design (use multi-agent-patterns).
npx claudepluginhub viktorbezdek/skillstack --plugin agent-evaluation
This skill uses the workspace's default tool permissions.
Evaluation of agent systems requires fundamentally different approaches than traditional software testing or standard language model benchmarking. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback.
This skill is the definitive resource for agent evaluation. It covers foundational concepts (rubric design, test sets, outcome-focused assessment), advanced techniques (LLM-as-judge, pairwise comparison, bias mitigation), and production-grade pipeline design. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
Activate this skill when designing evaluation rubrics, building test frameworks for agents, implementing LLM-as-judge scoring, comparing model outputs, mitigating evaluation bias, or setting up evaluation pipelines and quality gates for agent systems.
Non-Determinism and Multiple Valid Paths Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context. The solution is outcome-focused evaluation that judges whether agents achieve the right outcomes while following reasonable processes.
Context-Dependent Failures Agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates. Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.
Composite Quality Dimensions Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa. Evaluation rubrics must capture multiple dimensions with appropriate weighting for the use case.
Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | ~80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
This finding has significant implications for evaluation design: track and report token usage and tool-call counts alongside quality scores, and control or normalize resource budgets when comparing agents, since these factors dominate performance variance.
Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
Pairwise Comparison: An LLM compares two responses and selects the better one.
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.
End-State Evaluation: For agents that mutate persistent state, end-state evaluation focuses on whether the final state matches expectations rather than how the agent got there.
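As a concrete illustration, here is a minimal end-state check, assuming the agent's persistent effects can be captured as a dictionary; the function and parameter names (end_state_matches, ignore_keys) are illustrative, not part of any existing framework.

```python
# Compare the state the agent produced against the expected final state,
# ignoring fields that legitimately vary between runs (timestamps, IDs).
def end_state_matches(final_state: dict, expected_state: dict,
                      ignore_keys: set = frozenset()) -> dict:
    mismatches = {}
    for key, expected_value in expected_state.items():
        if key in ignore_keys:
            continue
        actual_value = final_state.get(key)
        if actual_value != expected_value:
            mismatches[key] = {"expected": expected_value, "actual": actual_value}
    return {"passed": not mismatches, "mismatches": mismatches}

# Example: a ticket-filing agent is judged only on the record it leaves behind,
# not on which tools or how many steps it used to get there.
result = end_state_matches(
    final_state={"status": "closed", "assignee": "alice", "updated_at": "2025-01-02"},
    expected_state={"status": "closed", "assignee": "alice"},
    ignore_keys={"updated_at"},
)
```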
Is there an objective ground truth?
+-- Yes --> Direct Scoring
| Examples: factual accuracy, instruction following, format compliance
|
+-- No --> Is it a preference or quality judgment?
+-- Yes --> Pairwise Comparison
| Examples: tone, style, persuasiveness, creativity
|
+-- No --> Consider reference-based evaluation
Examples: summarization (compare to source), translation (compare to reference)
Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
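To make the metric choices concrete, the sketch below compares a judge's ordinal scores against human labels using SciPy and scikit-learn; the score lists are made-up sample data.

```python
# Compare an LLM judge against human labels on the same items (ordinal 1-5 scale).
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 3, 4, 2, 5, 1, 3, 4]   # ratings from human annotators
judge_scores = [4, 3, 5, 2, 5, 2, 3, 4]   # same items rated by the LLM judge

rho, _ = spearmanr(human_scores, judge_scores)    # rank correlation
tau, _ = kendalltau(human_scores, judge_scores)   # alternative rank correlation
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")

print(f"Spearman rho={rho:.2f}  Kendall tau={tau:.2f}  weighted kappa={kappa:.2f}")
```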
Effective rubrics cover key quality dimensions, such as factual accuracy, completeness, coherence, and tool efficiency, with descriptive levels for each score.
Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate weighted overall scores. Determine passing threshold based on use case requirements.
Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards. Generic rubrics produce generic (less useful) evaluations.
LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.
Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
Prompt Structure for Direct Scoring:
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each criterion: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
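A minimal sketch of assembling this prompt and parsing the judge's structured output; call_judge is a placeholder for whatever model client you use, and the exact JSON shape is an assumption rather than a fixed contract.

```python
import json

# Placeholder for the model call; swap in your provider's client here.
def call_judge(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client")

def direct_score(original_prompt: str, response: str, criteria: list, max_score: int = 5) -> dict:
    criteria_text = "\n".join(
        f"- {c['name']} (weight {c['weight']}): {c['description']}" for c in criteria
    )
    judge_prompt = (
        "You are an expert evaluator assessing response quality.\n\n"
        f"## Original Prompt\n{original_prompt}\n\n"
        f"## Response to Evaluate\n{response}\n\n"
        f"## Criteria\n{criteria_text}\n\n"
        "## Instructions\n"
        "For each criterion, cite evidence and justify your reasoning first, "
        f"then score it on a 1-{max_score} scale.\n\n"
        "## Output Format\nRespond with JSON: "
        '{"scores": {criterion: int}, "justifications": {criterion: str}}'
    )
    return json.loads(call_judge(judge_prompt))
```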
Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol: evaluate each pair twice with the responses in swapped positions, map the second-pass verdict back to the original labels, and accept the result only when both passes agree; on disagreement, record a tie or sharply reduce confidence.
Prompt Structure for Pairwise Comparison:
You are an expert evaluator comparing two AI responses.
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
## Original Prompt
{prompt}
## Response A
{response_a}
## Response B
{response_b}
## Comparison Criteria
{criteria list}
## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level
## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
Confidence Calibration: Confidence scores should reflect position consistency. When both passes pick the same winner, average their confidences; when they disagree, treat the comparison as a tie or cap the confidence well below that of a consistent result.
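A sketch of the two-pass, position-swapped comparison with this confidence calibration; judge_pair is a placeholder for a single pairwise LLM call, and the specific tie/confidence handling on disagreement is one reasonable choice, not a fixed rule.

```python
# Placeholder: runs one pairwise comparison and returns
# {"winner": "A" | "B" | "tie", "confidence": float}.
def judge_pair(prompt: str, first: str, second: str) -> dict:
    raise NotImplementedError("wire up your LLM judge")

def compare_with_position_swap(prompt: str, response_a: str, response_b: str) -> dict:
    pass1 = judge_pair(prompt, response_a, response_b)       # A shown first
    pass2_raw = judge_pair(prompt, response_b, response_a)   # B shown first
    # Map the swapped pass back to the original labels.
    swap = {"A": "B", "B": "A", "tie": "tie"}
    pass2 = {"winner": swap[pass2_raw["winner"]], "confidence": pass2_raw["confidence"]}

    consistent = pass1["winner"] == pass2["winner"]
    if consistent:
        winner = pass1["winner"]
        confidence = (pass1["confidence"] + pass2["confidence"]) / 2
    else:
        winner = "tie"   # disagreement across passes suggests position bias
        confidence = min(pass1["confidence"], pass2["confidence"]) * 0.5
    return {
        "winner": winner,
        "confidence": confidence,
        "positionConsistency": {
            "consistent": consistent,
            "firstPassWinner": pass1["winner"],
            "secondPassWinner": pass2["winner"],
        },
    }
```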
Human evaluation catches what automation misses. Humans notice hallucinated answers on unusual queries, system failures, and subtle biases that automated checks overlook.
Effective human evaluation covers edge cases, samples systematically, tracks patterns, and provides contextual understanding. Use human evaluation to validate and calibrate automated systems.
Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects.
Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.
Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.
Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits.
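One way to run such a sweep, sketched under the assumption of a run_agent harness hook and the evaluate_agent_response rubric scorer shown later in this document; both names are illustrative.

```python
# run_agent is a placeholder for your agent harness; evaluate_agent_response
# is the rubric-scoring sketch that appears later in this document.
def run_agent(query: str, max_context_tokens: int) -> str:
    raise NotImplementedError("wire up your agent harness")

def find_context_cliff(test_set, context_limits, pass_threshold=0.7):
    pass_rates = {}
    for limit in context_limits:
        results = [
            evaluate_agent_response(
                run_agent(case["input"], max_context_tokens=limit),
                case.get("expected"),
            )["passed"]
            for case in test_set
        ]
        pass_rates[limit] = sum(results) / len(results)
    # The safe operating limit is the largest context size that still meets the threshold.
    safe = [limit for limit, rate in pass_rates.items() if rate >= pass_threshold]
    return {"pass_rates": pass_rates, "safe_limit": max(safe, default=None)}
```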
Production evaluation systems require multiple layers:
+-------------------------------------------------+
| Evaluation Pipeline |
+-------------------------------------------------+
| |
| Input: Response + Prompt + Context |
| | |
| v |
| +---------------------+ |
| | Criteria Loader | <-- Rubrics, weights |
| +----------+----------+ |
| | |
| v |
| +---------------------+ |
| | Primary Scorer | <-- Direct or Pairwise |
| +----------+----------+ |
| | |
| v |
| +---------------------+ |
| | Bias Mitigation | <-- Position swap, etc.|
| +----------+----------+ |
| | |
| v |
| +---------------------+ |
| | Confidence Scoring | <-- Calibration |
| +----------+----------+ |
| | |
| v |
| Output: Scores + Justifications + Confidence |
| |
+-------------------------------------------------+
Build evaluation pipelines that run automatically on agent changes. Track results over time. Compare versions to identify improvements or regressions.
Track evaluation metrics in production by randomly sampling interactions and evaluating them. Set alerts for quality drops. Maintain dashboards for trend analysis.
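A minimal sampling-and-alerting sketch, assuming interactions arrive as an iterable and score_interaction returns a 0.0-1.0 quality score; the sample rate and alert threshold are illustrative values.

```python
import random

# Sample a fraction of production interactions, score them, and flag quality drops.
# sample_rate, alert_threshold, and score_interaction are illustrative names.
def monitor_quality(interactions, score_interaction, sample_rate=0.05, alert_threshold=0.7):
    sampled_scores = [
        score_interaction(interaction)
        for interaction in interactions
        if random.random() < sample_rate
    ]
    if not sampled_scores:
        return {"sampled": 0, "alert": False}
    mean_score = sum(sampled_scores) / len(sampled_scores)
    return {
        "sampled": len(sampled_scores),
        "mean_score": mean_score,
        "alert": mean_score < alert_threshold,   # trigger a quality-drop alert
    }
```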
For high-volume evaluation:
Panel of LLMs (PoLL): Use multiple models as judges, aggregate votes
Hierarchical evaluation: Fast cheap model for screening, expensive model for edge cases
Human-in-the-loop: Automated evaluation for clear cases, human review for low-confidence
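A sketch of PoLL-style aggregation by majority vote, assuming each judge is a callable that returns "A", "B", or "tie"; the judge callables themselves are placeholders for cheaper models wired up elsewhere.

```python
from collections import Counter

# Panel of LLMs (PoLL): ask several judge models and take a majority vote.
def panel_vote(prompt: str, response_a: str, response_b: str, judges: list) -> dict:
    verdicts = [judge(prompt, response_a, response_b) for judge in judges]
    counts = Counter(verdicts)
    winner, votes = counts.most_common(1)[0]
    return {
        "winner": winner,
        "agreement": votes / len(verdicts),   # fraction of judges that agreed
        "verdicts": verdicts,
    }
```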
Common anti-patterns to avoid:
Scoring without justification
Single-pass pairwise comparison
Overloaded criteria
Missing edge case guidance
Ignoring confidence calibration
Overfitting to specific paths
Single-metric obsession
Neglecting context effects
Skipping human evaluation
A minimal rubric-based scoring sketch (load_rubric, assess_dimension, and weighted_average are placeholder helpers):

def evaluate_agent_response(response, expected):
    rubric = load_rubric()  # per-dimension config, each entry carrying a "weight"
    scores = {}
    for dimension, config in rubric.items():
        scores[dimension] = assess_dimension(response, expected, dimension)
    weights = {dim: cfg["weight"] for dim, cfg in rubric.items()}
    overall = weighted_average(scores, weights)  # weighted 0.0-1.0 aggregate
    return {"passed": overall >= 0.7, "scores": scores, "overall": overall}
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
test_set = [
{
"name": "simple_lookup",
"input": "What is the capital of France?",
"expected": {"type": "fact", "answer": "Paris"},
"complexity": "simple",
"description": "Single tool call, factual lookup"
},
{
"name": "medium_query",
"input": "Compare the revenue of Apple and Microsoft last quarter",
"complexity": "medium",
"description": "Multiple tool calls, comparison logic"
},
{
"name": "multi_step_reasoning",
"input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
"complexity": "complex",
"description": "Many tool calls, aggregation, analysis"
},
{
"name": "research_synthesis",
"input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
"complexity": "very_complex",
"description": "Extended interaction, deep reasoning, synthesis"
}
]
Direct scoring example:
Input:
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
Output:
{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
"improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}
Pairwise comparison example (with position swap):
Input:
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
First Pass (A first):
{ "winner": "B", "confidence": 0.8 }
Second Pass (B first):
{ "winner": "A", "confidence": 0.6 }
(Note: in the swapped pass, the label "A" refers to the original Response B, which was shown first)
Mapped Second Pass:
{ "winner": "B", "confidence": 0.6 }
Final Result:
{
"winner": "B",
"confidence": 0.7,
"positionConsistency": {
"consistent": true,
"firstPassWinner": "B",
"secondPassWinner": "B"
}
}
Rubric generation example:
Input:
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
Output (abbreviated):
{
"levels": [
{
"score": 1,
"label": "Poor",
"description": "Code is difficult to understand without significant effort",
"characteristics": [
"No meaningful variable or function names",
"No comments or documentation",
"Deeply nested or convoluted logic"
]
},
{
"score": 3,
"label": "Adequate",
"description": "Code is understandable with some effort",
"characteristics": [
"Most variables have meaningful names",
"Basic comments present for complex sections",
"Logic is followable but could be cleaner"
]
},
{
"score": 5,
"label": "Excellent",
"description": "Code is immediately clear and maintainable",
"characteristics": [
"All names are descriptive and consistent",
"Comprehensive documentation",
"Clean, modular structure"
]
}
],
"edgeCases": [
{
"situation": "Code is well-structured but uses domain-specific abbreviations",
"guidance": "Score based on readability for domain experts, not general audience"
}
]
}
This skill connects to other skills as a cross-cutting concern; for testing code or applications use testing-framework, and for agent coordination or multi-agent design use multi-agent-patterns.
External research: Zheng et al. (2023), "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"; the BrowseComp benchmark for browsing agents.
Created: 2024-12-20 Last Updated: 2026-03-14 Authors: Agent Skills for Context Engineering Contributors, Muratcan Koylan Version: 2.0.0