Evaluation harness for testing AI agent prompts and outputs against ground truth. Activates when designing eval scenarios, running prompt evaluations, interpreting results, or iterating on agent behavior. Covers scenario design, expectation definition, ground truth generation, regression testing, and convergence loops.
A framework for systematically evaluating AI agent prompts and outputs. Instead of manual spot-checking, this skill defines a repeatable process for measuring agent quality against ground truth.
An eval scenario is a structured test case for an AI agent. It includes:
```yaml
scenario:
  id: unique-identifier
  category: single-turn | multi-turn | tool-use | reasoning | creative | edge-case
  prompt: "The input to the agent"
  context:
    platform: optional platform or domain context
    history: optional prior conversation turns
  expectations:
    required_elements: [list of things the output must contain]
    forbidden_elements: [list of things the output must not contain]
    quality_description: "Human-readable description of what good looks like"
    expects_tool_calls: true | false
    expected_tools: [list of tool names if applicable]
  description: "Why this scenario exists and what it tests"
```
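As a sketch, a scenario following this schema could be loaded and validated in Python. The `Scenario` dataclass and the specific validation rules here are illustrative, not part of the harness:

```python
from dataclasses import dataclass, field

VALID_CATEGORIES = {
    "single-turn", "multi-turn", "tool-use",
    "reasoning", "creative", "edge-case",
}

@dataclass
class Scenario:
    id: str
    category: str
    prompt: str
    description: str
    context: dict = field(default_factory=dict)
    expectations: dict = field(default_factory=dict)

def load_scenario(raw: dict) -> Scenario:
    """Validate a raw scenario dict and return a Scenario."""
    for key in ("id", "category", "prompt", "description"):
        if key not in raw:
            raise ValueError(f"scenario missing required field: {key}")
    if raw["category"] not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {raw['category']}")
    return Scenario(
        id=raw["id"],
        category=raw["category"],
        prompt=raw["prompt"],
        description=raw["description"],
        context=raw.get("context", {}),
        expectations=raw.get("expectations", {}),
    )
```

Failing fast on malformed scenarios keeps bad test cases from silently passing as vacuous checks.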
| Category | What It Tests | Example |
|---|---|---|
| Single-turn | Basic prompt-response quality | "Summarize this data" |
| Multi-turn | Context retention across turns | Initial prompt + follow-up |
| Tool-use | Correct tool selection and invocation | "Fetch my recent orders" |
| Reasoning | Multi-step logic and analysis | "Compare performance across channels" |
| Creative | Open-ended generation quality | "Write a strategy document" |
| Edge case | Boundary and failure handling | Empty data, invalid inputs |
VAGUE (bad):
- "Response should be helpful"
- "Agent should handle errors"
- "Output should be formatted well"
PRECISE (good):
- "Response must contain at least 3 item names from the mock data"
- "When the API returns a 401 error, the agent must suggest re-authenticating"
- "Output must be a markdown table with columns: Name, Status, Value"
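Precise expectations are mechanically checkable. A minimal checker for required and forbidden elements might look like this (plain substring matching is a simplification; real checks may use regexes or structural parsing):

```python
def check_expectations(output: str, expectations: dict) -> list[str]:
    """Return a list of failure messages; an empty list means pass."""
    failures = []
    for item in expectations.get("required_elements", []):
        if item not in output:
            failures.append(f"missing required element: {item!r}")
    for item in expectations.get("forbidden_elements", []):
        if item in output:
            failures.append(f"contains forbidden element: {item!r}")
    return failures
```

Returning all failures at once, rather than stopping at the first, makes diagnosis faster when a scenario breaks in several ways.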
Ground truth is a verified-correct reference output for a given scenario. It represents what a correct agent response looks like.
A verification pass should check:
| Criterion | What It Checks |
|---|---|
| Correctness | Does the response accurately answer the prompt? |
| Tool usage | Were appropriate tools called with correct parameters? |
| Completeness | Is all necessary information included? |
| Actionability | Are recommendations specific and implementable? |
| Data boundary | Does the response only use available data (no hallucination)? |
| Presentation | Is the output well-formatted and readable? |
| Mode | What It Does | When to Use |
|---|---|---|
| Generate | Run agent, verify output, save as ground truth | After changing prompts or tools |
| Regression | Run agent, compare to existing ground truth | Before releases, after refactors |
| Full | Generate new ground truth, then run regression | Comprehensive validation |
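The generate and regression modes could be sketched as follows. The file layout and exact-match comparison are assumptions; production harnesses typically use structural or semantic comparison for LLM output rather than string equality:

```python
import json
from pathlib import Path

def generate_ground_truth(scenario_id: str, verified_output: str,
                          truth_dir: Path) -> None:
    """Generate mode: save a verified-correct output as ground truth."""
    (truth_dir / f"{scenario_id}.json").write_text(
        json.dumps({"output": verified_output}))

def regression_check(scenario_id: str, output: str, truth_dir: Path) -> bool:
    """Regression mode: compare a fresh output to stored ground truth."""
    path = truth_dir / f"{scenario_id}.json"
    if not path.exists():
        return False  # no ground truth yet: run generate mode first
    return json.loads(path.read_text())["output"] == output
```

Full mode is then just generate for new scenarios followed by regression across the existing suite.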
1. Load scenario definition
2. Set up context (mock data, platform state, conversation history)
3. Run the agent with the scenario prompt
4. Capture: final output, tool calls, intermediate reasoning, timing
5. Verify against expectations (structural, content, quality, behavioral)
6. Compare to ground truth (if regression mode)
7. Report: pass/fail per criterion, overall verdict, timing SLAs
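The seven steps above can be sketched as a single runner. The `agent` and `verify` callables are assumptions standing in for your actual agent runner and expectation checker:

```python
import time

def run_scenario(scenario, agent, verify, ground_truth=None):
    """Execute one eval scenario end to end (steps 1-7 above).
    `agent` runs the prompt in context and returns a dict with the final
    output and tool-call trace; `verify` returns a list of failure messages."""
    start = time.monotonic()
    result = agent(scenario.prompt, scenario.context)         # steps 2-3
    duration = time.monotonic() - start                       # step 4
    failures = verify(result["output"], scenario.expectations)  # step 5
    regressed = (ground_truth is not None
                 and result["output"] != ground_truth)        # step 6
    return {                                                  # step 7
        "scenario": scenario.id,
        "passed": not failures and not regressed,
        "failures": failures,
        "duration_s": duration,
        "tool_calls": result.get("tool_calls", []),
    }
```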
For running multiple scenarios:
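As a sketch, a batch runner could aggregate per-scenario results; `run_one` here is an assumed callable that executes a single scenario and returns a result dict with a boolean `passed` key:

```python
def run_suite(scenarios, run_one):
    """Run every scenario and summarize pass/fail counts."""
    results = [run_one(s) for s in scenarios]
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "failing": [r for r in results if not r["passed"]],
    }
```

Keeping the failing results, not just the count, feeds directly into the diagnosis step below.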
Define timing SLAs for agent performance:
| Metric | What It Measures |
|---|---|
| Time to first token | How quickly the agent starts responding |
| Total duration | End-to-end time for complete response |
| Tool call duration | Time spent in tool execution |
| Plan completion time | Time to complete multi-step plans |
Set thresholds based on your application's requirements. A response that is correct but takes 5 minutes may still be a failure.
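A minimal SLA check might compare measured timings against per-metric thresholds; the metric names here are illustrative, not a fixed schema:

```python
def check_slas(timings: dict, thresholds: dict) -> list[str]:
    """Return a violation message for every metric over its threshold.
    Metrics absent from `timings` are skipped rather than failed."""
    violations = []
    for metric, limit in thresholds.items():
        measured = timings.get(metric)
        if measured is not None and measured > limit:
            violations.append(f"{metric}: {measured:.2f}s > {limit:.2f}s limit")
    return violations
```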
For each failed scenario, determine:
| Pattern | Likely Cause | Fix |
|---|---|---|
| Wrong tools called | Routing or skill loading mismatch | Check tool registration, skill files |
| Correct tools, wrong output | Prompt instructions unclear | Clarify prompt, add examples |
| Hallucinated data | Agent generating data instead of using tools | Add explicit instruction: "only use data from tool results" |
| Incomplete response | Prompt too vague about requirements | Add specific output requirements |
| Format wrong | Missing format instructions | Add output format specification with examples |
| SLA violation | Slow tools or too many tool calls | Optimize tool implementation, reduce unnecessary calls |
Track pass rates, failure categories, and timing trends over time to catch slow regressions that no single run reveals.
1. Run eval suite
2. Identify failing scenarios
3. Diagnose root cause (prompt, tool, routing, data)
4. Make targeted fix
5. Re-run ONLY the failing scenarios
6. If they pass, run the FULL suite (catch regressions)
7. If full suite passes, commit
8. If new failures appear, repeat from step 2
For prompts that need many refinement cycles, use an automated loop:
Loop until convergence or max iterations:
1. Run eval scenario
2. If pass: done
3. If fail: read rejection reasoning
4. Adjust prompt based on reasoning
5. Re-run
Set a maximum iteration count (10 is a reasonable default). If the prompt does not converge after max iterations, the problem may be architectural, not prompt-level.
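The loop above can be sketched as follows; `run_eval` and `adjust_prompt` are assumed callables (the latter might itself call an LLM to rewrite the prompt from the rejection reasoning):

```python
def converge(run_eval, adjust_prompt, prompt, max_iterations=10):
    """Automated refinement loop (steps 1-5 above).
    `run_eval` returns (passed, reasoning); `adjust_prompt` rewrites the
    prompt using the rejection reasoning. Raises if no convergence."""
    for i in range(max_iterations):
        passed, reasoning = run_eval(prompt)
        if passed:
            return prompt, i + 1  # converged prompt and iterations used
        prompt = adjust_prompt(prompt, reasoning)
    raise RuntimeError(
        f"no convergence after {max_iterations} iterations; "
        "the problem may be architectural, not prompt-level"
    )
```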
A prompt change is converged when the previously failing scenarios pass and a full-suite run introduces no new failures.
When adding a new capability to the agent: write the eval scenario first, confirm it fails, implement the capability, and iterate until the scenario passes. This is TDD applied to AI agent behavior.
Remove scenarios when the capability they test has been intentionally removed, or when they duplicate coverage provided by another scenario.
Never retire a scenario because it is failing. Fix the agent or update expectations.
Periodically audit scenario coverage: every category, every tool, and every previously observed failure mode should have at least one scenario.