From agentv-dev
Writes, edits, reviews, and validates AgentV EVAL.yaml files for agent skill evaluations. Adds test cases, configures graders, converts from evals.json or chat transcripts.
npx claudepluginhub entityprocess/agentv --plugin agentv-dev
This skill uses the workspace's default tool permissions.
Comprehensive docs: https://agentv.dev
AgentV evaluations measure execution quality — whether your agent or skill produces correct output when invoked.
For trigger quality (whether the right skill is triggered for the right prompts), see the Evaluation Types guide. Do not use execution eval configs (EVAL.yaml, evals.json) for trigger evaluation — these are distinct concerns requiring different tooling and methodologies.
If the project already has an Agent Skills evals.json file, use it as a starting point instead of writing YAML from scratch:
# Convert evals.json to AgentV EVAL YAML
agentv convert evals.json
# Run directly without converting (all commands accept evals.json)
agentv eval evals.json
The converter maps prompt → input, expected_output → expected_output, assertions → assertions (llm-grader), and resolves files[] paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code graders, rubrics, required gates).
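For example, a hypothetical evals.json entry such as {"prompt": "What is the capital of France?", "expected_output": "Paris", "assertions": ["Mentions Paris"]} would map to roughly this YAML (illustrative only; the converter's real output, ids, and TODO comments will differ):

tests:
  - id: capital-of-france                      # illustrative id
    criteria: Mentions Paris                   # assumption: derived from the assertion text
    input: "What is the capital of France?"    # from prompt
    expected_output: "Paris"                   # from expected_output
    assertions:                                # from assertions (llm-grader)
      - type: llm-grader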
After converting, enhance the YAML with AgentV-specific capabilities shown below.
Convert a chat conversation into eval test cases without starting from scratch.
Input formats:
Markdown conversation:
User: How do I reset my password?
Assistant: Go to Settings > Security > Reset Password...
JSON messages:
[{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "Go to Settings > Security > Reset Password..."}]
Select exchanges that make good test cases:
Skip: greetings, one-word acknowledgments, repeated exchanges
Multi-turn format (when context from prior turns matters):
tests:
  - id: multi-turn-context
    criteria: "Agent remembers prior context"
    input:
      - role: user
        content: "My name is Alice"
      - role: assistant
        content: "Nice to meet you, Alice!"
      - role: user
        content: "What's my name?"
    expected_output: "Your name is Alice."
    assertions:
      - type: rubrics
        criteria:
          - Correctly recalls the user's name from earlier in the conversation
Guidelines: preserve exact wording in expected_output; aim for 5–15 tests per transcript; pick exchanges that test different capabilities.
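For example, the password-reset exchange above could become a single-turn case (id and criteria wording are illustrative):

tests:
  - id: password-reset
    criteria: Directs the user to Settings > Security > Reset Password
    input: "How do I reset my password?"
    expected_output: "Go to Settings > Security > Reset Password..."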
description: Example eval
execution:
  target: default
tests:
  - id: greeting
    criteria: Friendly greeting
    input: "Say hello"
    expected_output: "Hello! How can I help you?"
    assertions:
      - type: rubrics
        criteria:
          - Greeting is friendly and warm
          - Offers to help
Required: tests (array or string path)
Optional: name, description, version, author, tags, license, requires, execution, suite, workspace, assertions, input
Test fields:
| Field | Required | Description |
|---|---|---|
| id | yes | Unique identifier |
| criteria | yes | What the response should accomplish |
| input | yes | Input to the agent |
| expected_output | no | Gold-standard reference answer |
| assertions | no | Graders: deterministic checks, rubrics, and LLM/code graders |
| rubrics | no | Deprecated — use assertions: [{type: rubrics, criteria: [...]}] instead |
| execution | no | Per-case execution overrides |
| workspace | no | Per-case workspace config (overrides suite-level) |
| metadata | no | Arbitrary key-value pairs passed to setup/teardown scripts |
| conversation_id | no | Thread grouping |
Shorthand aliases:
- input (string) expands to [{role: "user", content: "..."}]
- expected_output (string/object) expands to [{role: "assistant", content: ...}]
- input / expected_output take precedence when both present

Message format: {role, content} where role is system, user, assistant, or tool
Content types: inline text, {type: "file", value: "./path.md"}
File paths: relative from eval file dir, or absolute with / prefix from repo root
File handling by provider type: LLM providers receive file content inlined in XML tags. Agent providers receive a preread block with file:// URIs and must read files themselves. See Coding Agents > Prompt format.
JSONL format: One test per line as JSON. Optional .yaml sidecar for shared defaults. See examples/features/basic-jsonl/.
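A minimal sketch of such a file, using the test fields documented above (ids and content are placeholders):

{"id": "greeting", "criteria": "Friendly greeting", "input": "Say hello"}
{"id": "status", "criteria": "Returns valid JSON", "input": "Get status", "assertions": [{"type": "is-json"}]}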
Environment variables: All string fields support ${{ VAR }} interpolation. Missing vars resolve to empty string. Works in eval files, external case files, and workspace configs. .env files are loaded automatically.
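For example, assuming PRODUCT_NAME is set in the environment or a .env file:

tests:
  - id: product-summary
    criteria: Describes the product accurately
    input: "Describe ${{ PRODUCT_NAME }} in one sentence"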
When name is present, the suite is parsed as a metadata-bearing eval:
name: export-screening # required, lowercase/hyphens, max 64 chars
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
  agentv: ">=0.30.0"
Prepend shared input messages to every test (like suite-level assertions). Avoids repeating the same prompt file in each test:
input:
  - role: user
    content:
      - type: file
        value: ./system-prompt.md

tests: ./cases.yaml
# cases.yaml — each test only needs its own query
# - id: test-1
# criteria: ...
# input: "User question here"
Effective input: [...suite input, ...test input]. Skipped when execution.skip_defaults: true.
Accepts same formats as test input (string or message array).
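An individual test can opt out of the shared messages via execution.skip_defaults (a sketch; the test content is illustrative):

tests:
  - id: no-shared-prompt
    criteria: Answers without the shared system prompt
    input: "Standalone question"
    execution:
      skip_defaults: true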
Point tests to an external file instead of inlining:
name: my-eval
description: My evaluation suite
tests: ./cases.yaml # relative to eval file dir
The external file can be YAML (array of test objects) or JSONL.
assertions defines graders at the suite level or per-test level. It is the canonical field for all graders:
# Suite-level (appended to every test)
assertions:
  - type: is-json
    required: true
  - type: contains
    value: "status"

tests:
  - id: test-1
    criteria: Returns JSON
    input: Get status
    # Per-test assertions (runs before suite-level)
    assertions:
      - type: equals
        value: '{"status": "ok"}'
How criteria and assertions interact

criteria is a data field — it describes what the response should accomplish. It is not a grader. How it gets evaluated depends on whether assertions is present:
| Scenario | What happens | Warning? |
|---|---|---|
criteria + no assertions | Implicit llm-grader runs automatically against criteria | No |
criteria + assertions with only deterministic graders (contains, regex, etc.) | Only declared graders run. criteria is not evaluated. | Yes — warns that no grader will consume criteria |
criteria + assertions with a grader (llm-grader, code-grader, rubrics) | Declared graders run. Graders receive criteria as input. | No |
The simplest path. criteria is automatically evaluated by the default llm-grader:
tests:
  - id: simple-eval
    criteria: Assistant correctly explains the bug and proposes a fix
    input: "Debug this function..."
    # No assertions → default llm-grader evaluates against criteria
When assertions is defined, only the declared graders run. If you want an LLM grader alongside deterministic checks, declare it explicitly:
tests:
  - id: mixed-eval
    criteria: Response is helpful and mentions the fix
    input: "Debug this function..."
    assertions:
      - type: llm-grader    # must be explicit when assertions is present
      - type: contains
        value: "fix"
Common mistake: defining criteria with only deterministic graders. The criteria will be ignored and a warning is emitted:
tests:
  - id: bad-example
    criteria: Gives a thoughtful answer    # ⚠ NOT evaluated — no grader in assertions
    input: "What is 2+2?"
    assertions:
      - type: contains
        value: "4"
# Warning: criteria is defined but no grader in assertions will evaluate it.
Any grader can be marked required to enforce a minimum score:
assertions:
  - type: contains
    value: "DENIED"
    required: true     # must score >= 0.8 (default)
  - type: rubrics
    required: 0.6      # must score >= 0.6 (custom threshold)
    criteria:
      - id: accuracy
        outcome: Identifies the denied party
        weight: 5.0
If a required grader scores below its threshold, the overall verdict is forced to fail.
Run scripts before/after each test. Define at suite level or override per case:
workspace:
  template: ./workspace-templates/my-project
  setup:
    command: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
  teardown:
    command: ["bun", "run", "teardown.ts"]

tests:
  - id: case-1
    input: Fix the bug
    criteria: Bug is fixed
    metadata:
      repo: sympy/sympy
      base_commit: "abc123"
    workspace:
      setup:
        command: ["python", "custom-setup.py"]    # overrides suite-level
Lifecycle: template copy → repo clone → setup → git baseline → agent → file changes → teardown → repo reset → cleanup
Merge: Case-level fields replace suite-level fields.
Commands receive stdin JSON: {workspace_path, test_id, eval_run_id, case_input, case_metadata}
Setup failure: aborts case. Teardown failure: non-fatal (warning).
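For reference, the stdin payload a setup or teardown command reads has this shape (keys from the list above; values are illustrative):

{
  "workspace_path": "/tmp/agentv/workspaces/case-1",
  "test_id": "case-1",
  "eval_run_id": "2026-03-14T10-32-00",
  "case_input": [{ "role": "user", "content": "Fix the bug" }],
  "case_metadata": { "repo": "sympy/sympy", "base_commit": "abc123" }
}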
Clone repos into workspace automatically. For shared repo workspaces, pooling is the default:
workspace:
  repos:
    - path: ./repo
      source:
        type: git
        url: https://github.com/org/repo.git
      checkout:
        ref: main
        ancestor: 1           # parent commit
      clone:
        depth: 10
      hooks:
        after_each:
          reset: fast         # none | fast | strict
  isolation: shared           # shared | per_test
  mode: pooled                # pooled | temp | static
  hooks:
    enabled: true             # set false to skip all hooks
- source.type: git (URL) or local (path)
- checkout.resolve: remote (ls-remote) or local
- clone.depth: shallow clone depth
- clone.filter: partial clone filter (e.g., blob:none)
- clone.sparse: sparse checkout paths array
- mode: pooled (default for shared repos), temp, or static
- path: workspace path used when mode: static; when empty/missing the workspace is auto-materialised (template copied + repos cloned); populated dirs are reused as-is
- hooks.enabled: boolean (default true); set false to skip all lifecycle hooks
- Reset after each test defaults to fast (git clean -fd); use --workspace-clean full for strict reset (git clean -fdx)
- Inspect and prune pooled workspaces with agentv workspace list and agentv workspace clean

See https://agentv.dev/targets/configuration/#repository-lifecycle
Configure via assertions array. Multiple graders produce a weighted average score.
- name: format_check
  type: code-grader
  command: [uv, run, validate.py]
  cwd: ./scripts    # optional working directory
  target: {}        # optional: enable LLM target proxy (max_calls: 50)
Contract: stdin JSON -> stdout JSON {score, assertions: [{text, passed, evidence?}], reasoning}
Input includes: question, criteria, answer, reference_answer, output, trace, token_usage, cost_usd, duration_ms, start_time, end_time, file_changes, workspace_path, config
When workspace_template is configured, workspace_path is the absolute path to the workspace dir (also available as AGENTV_WORKSPACE_PATH env var). Use this for functional grading (e.g., running npm test in the workspace).
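A grader's stdout should follow the contract above, for example (values illustrative):

{
  "score": 0.5,
  "assertions": [
    { "text": "Output is valid JSON", "passed": true },
    { "text": "Workspace tests pass", "passed": false, "evidence": "2 of 4 tests failed" }
  ],
  "reasoning": "Format is correct but the fix does not pass the test suite"
}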
See docs at https://agentv.dev/evaluators/code-graders/
- name: quality
  type: llm-grader
  prompt: ./prompts/eval.md     # markdown template or command config
  target: grader_gpt_5_mini     # optional: override the grader target for this grader
  model: gpt-5-chat             # optional model override
  config:                       # passed to prompt templates as context.config
    strictness: high
Variables: {{question}}, {{criteria}}, {{answer}}, {{reference_answer}}, {{input}}, {{expected_output}}, {{output}}, {{file_changes}}
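A markdown prompt template (the prompt: file referenced above) might use these variables like so (a sketch; the grading instructions are placeholders):

Question:
{{question}}

Criteria:
{{criteria}}

Reference answer:
{{reference_answer}}

Candidate answer:
{{answer}}

Score the candidate answer against the criteria and the reference answer.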
- Markdown templates use {{variable}} syntax
- Code templates use definePromptTemplate(fn) from @agentv/eval, which receives a context object with all variables + config
- Use target: to run different llm-grader graders against different named LLM targets in the same eval (useful for grader panels / ensembles)

- name: gate
  type: composite
  assertions:
    - name: safety
      type: llm-grader
      prompt: ./safety.md
    - name: quality
      type: llm-grader
  aggregator:
    type: weighted_average
    weights: { safety: 0.3, quality: 0.7 }
Aggregator types: weighted_average, all_or_nothing, minimum, maximum, safety_gate
safety_gate: fails immediately if the named gate grader scores below threshold (default 1.0)

- name: tool_check
  type: tool-trajectory
  mode: any_order                      # any_order | in_order | exact
  minimums:                            # for any_order
    knowledgeSearch: 2
  expected:                            # for in_order/exact
    - tool: knowledgeSearch
      args: { query: "search term" }   # partial deep equality match
    - tool: documentRetrieve
      args: any                        # any arguments accepted
      max_duration_ms: 5000            # per-tool latency assertion
    - tool: summarize                  # omit args to skip argument checking
- name: fields
  type: field-accuracy
  match_type: exact                  # exact | date | numeric_tolerance
  numeric_tolerance: 0.01            # for numeric_tolerance match_type
  aggregation: weighted_average      # weighted_average | all_or_nothing
Compares output fields against expected_output fields.
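A sketch of field-accuracy in context, assuming expected_output is provided as a structured object whose fields are compared against the parsed output:

tests:
  - id: invoice-extraction
    criteria: Extracts invoice fields as JSON
    input: "Extract the invoice number and total from this invoice: ..."
    expected_output:
      invoice_number: "INV-001"
      total: 42.50
    assertions:
      - type: field-accuracy
        match_type: exact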
- name: speed
  type: latency
  max_ms: 5000

- name: budget
  type: cost
  max_usd: 0.10

- name: tokens
  type: token-usage
  max_total_tokens: 4000
- name: efficiency
  type: execution-metrics
  max_tool_calls: 10                # Maximum tool invocations
  max_llm_calls: 5                  # Maximum LLM calls (assistant messages)
  max_tokens: 5000                  # Maximum total tokens (input + output)
  max_cost_usd: 0.05                # Maximum cost in USD
  max_duration_ms: 30000            # Maximum execution duration
  target_exploration_ratio: 0.6     # Target ratio of read-only tool calls
  exploration_tolerance: 0.2        # Tolerance for ratio check (default: 0.2)
Declarative threshold-based checks on execution metrics. Only specified thresholds are checked.
Score is proportional: passed / total assertions. Missing data counts as a failed assertion.
- type: contains
  value: "DENIED"
  required: true
Binary check: does output contain the substring? Name auto-generated if omitted.
- type: regex
  value: "\\d{3}-\\d{2}-\\d{4}"
Binary check: does output match the regex pattern?
- type: equals
  value: "42"
Binary check: does output exactly equal the value (both trimmed)?
- type: is-json
  required: true
Binary check: is the output valid JSON?
- type: rubrics
  criteria:
    - id: accuracy
      outcome: Correctly identifies the denied party
      weight: 5.0
    - id: reasoning
      outcome: Provides clear reasoning
      weight: 3.0
LLM-judged structured evaluation with weighted criteria. Criteria items support id, outcome, weight, and required fields.
Top-level rubrics: field is deprecated. Use type: rubrics under assertions instead.
See references/rubric-evaluator.md for score-range mode and scoring formula.
Control how the runner handles execution errors (infrastructure failures, not quality failures):
execution:
  fail_on_error: false    # never halt (default)
  # fail_on_error: true   # halt on first execution error
When halted, remaining tests get executionStatus: 'execution_error' with failureReasonCode: 'error_threshold_exceeded'.
Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates.
execution:
  threshold: 0.8
CLI flag --threshold 0.8 overrides the YAML value. Must be a number between 0 and 1. Mean score is computed from quality results only (execution errors excluded).
The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as <failure>. When no threshold is set, JUnit defaults to 0.5.
# Run evaluation (requires API keys)
agentv eval <file.yaml> [--test-id <id>] [--target <name>] [--dry-run] [--threshold <0-1>]
# Run with OTLP JSON file (importable by OTel backends)
agentv eval <file.yaml> --otel-file traces/eval.otlp.json
# Run a single assertion in isolation (no API keys needed)
agentv eval assert <grader-name> --agent-output "..." --agent-input "..."
# Import agent transcripts for offline grading
agentv import claude --session-id <uuid>
# Re-run only execution errors from a previous run
agentv eval <file.yaml> --retry-errors .agentv/results/runs/<timestamp>/index.jsonl
# Validate eval file
agentv validate <file.yaml>
# Compare results — N-way matrix from a canonical run manifest
agentv compare .agentv/results/runs/<timestamp>/index.jsonl
agentv compare .agentv/results/runs/<timestamp>/index.jsonl --baseline <target> # CI regression gate
agentv compare .agentv/results/runs/<timestamp>/index.jsonl --baseline <target> --candidate <target> # pairwise
agentv compare .agentv/results/runs/<baseline-timestamp>/index.jsonl .agentv/results/runs/<candidate-timestamp>/index.jsonl
# Author assertions directly in the eval file
# Prefer simple assertions when they fit the criteria; use deterministic or LLM-based graders when needed
agentv validate <file.yaml>
Use @agentv/eval to build custom graders in TypeScript/JavaScript:
#!/usr/bin/env bun
import { defineAssertion } from '@agentv/eval';

export default defineAssertion(({ answer, trace }) => ({
  pass: answer.length > 0 && (trace?.eventCount ?? 0) <= 10,
  reasoning: 'Checks content exists and is efficient',
}));
Assertions support both pass: boolean and score: number (0-1). If only pass is given, score is 1 (pass) or 0 (fail).
#!/usr/bin/env bun
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ trace, answer }) => ({
  score: (trace?.eventCount ?? Infinity) <= 5 ? 1.0 : 0.5,
  assertions: [
    { text: 'Efficient tool usage', passed: (trace?.eventCount ?? 0) <= 5 },
  ],
}));
Both are used via type: code-grader in YAML with command: [bun, run, grader.ts].
Place assertion files in .agentv/assertions/ — they auto-register by filename:
.agentv/assertions/word-count.ts → type: word-count
.agentv/assertions/sentiment.ts → type: sentiment
No command: needed in YAML — just use type: <filename>.
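For example, a hypothetical word-count assertion at .agentv/assertions/word-count.ts, using the defineAssertion API shown above (the 50-word minimum is arbitrary):

#!/usr/bin/env bun
import { defineAssertion } from '@agentv/eval';

// Passes when the answer contains at least 50 words (arbitrary threshold).
export default defineAssertion(({ answer }) => {
  const words = answer.trim().split(/\s+/).filter(Boolean).length;
  return {
    pass: words >= 50,
    reasoning: `Answer contains ${words} words (minimum 50)`,
  };
});

Tests can then reference it with type: word-count like any built-in assertion.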
Use evaluate() from @agentv/core to run evals as a library:
import { evaluate } from '@agentv/core';

const { results, summary } = await evaluate({
  tests: [
    {
      id: 'greeting',
      input: 'Say hello',
      assertions: [{ type: 'contains', value: 'hello' }],
    },
  ],
  target: { provider: 'mock_agent' },
});

console.log(`${summary.passed}/${summary.total} passed`);
Supports inline tests (no YAML) or file-based via specFile.
Type-safe project configuration in agentv.config.ts:
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: { workers: 5, maxRetries: 2 },
  output: { format: 'jsonl', dir: './results' },
  limits: { maxCostUsd: 10.0 },
});
Auto-discovered from project root. Validated with Zod.
agentv create assertion <name> # → .agentv/assertions/<name>.ts
agentv create eval <name> # → evals/<name>.eval.yaml + .cases.jsonl
For a complete guide to iterating on skills using evaluations — writing scenarios, running baselines, comparing results, and improving — see the Skill Improvement Workflow guide.
After running evals, perform a human review before iterating. Create feedback.json in the results directory:
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Summary of observations",
  "per_case": [
    {
      "test_id": "test-id",
      "verdict": "acceptable | needs_improvement | incorrect | flaky",
      "notes": "Why this verdict",
      "evaluator_overrides": { "code-grader:name": "Override note" },
      "workspace_notes": "Workspace state observations"
    }
  ]
}
Use evaluator_overrides for workspace evaluations to annotate specific grader results (e.g., "code-grader was too strict"). Use workspace_notes for observations about workspace state.
Review workflow: run evals → inspect results (agentv trace show) → write feedback → tune prompts/graders → re-run.
Full guide: https://agentv.dev/guides/human-review/
references/eval-schema.json
references/config-schema.json