Benchmark and optimize AI agents with AgentV evaluations: run benchmarks across providers like Anthropic and OpenAI, write and lint eval YAML files, analyze traces for regressions, failure patterns, costs, and latency, and bootstrap CLI setup in your workspace.
npx claudepluginhub entityprocess/agentv --plugin agentv-dev
Run AgentV evaluations and optimize agents through eval-driven iteration. Triggers: run evals, benchmark agents, optimize prompts/skills against evals, compare agent outputs across providers, analyze eval results, offline evaluation of recorded sessions. Not for: writing/editing eval YAML without running (use agentv-eval-writer), analyzing existing traces/JSONL without re-running (use agentv-trace-analyst).
Use when reviewing eval YAML files for quality issues, linting eval files before committing, checking eval schema compliance, or when asked to "review these evals", "check eval quality", "lint eval files", or "validate eval structure". Do NOT use for writing evals (use agentv-eval-writer) or running evals (use agentv-bench).
Write, edit, review, and validate AgentV EVAL.yaml / .eval.yaml evaluation files. Use when asked to create new eval files, update or fix existing ones, add or remove test cases, configure graders (`llm-grader`, `code-grader`, `rubrics`), review whether an eval is correct or complete, convert between EVAL.yaml and evals.json using `agentv convert`, or generate eval test cases from chat transcripts (markdown conversation or JSON messages). Do NOT use for creating SKILL.md files, writing skill definitions, or running evals; running and benchmarking belong to agentv-bench.
Bootstrap AgentV in the current workspace after plugin-manager install. Ensures CLI availability, runs workspace init, and verifies setup artifacts.
Analyze AgentV evaluation traces and result JSONL files using `agentv trace` and `agentv compare` CLI commands. Use when asked to inspect AgentV eval results, find regressions between AgentV evaluation runs, identify failure patterns in AgentV trace data, analyze tool trajectories, or compute cost/latency/score statistics from AgentV result files. Do NOT use for benchmarking skill trigger accuracy, analyzing skill-creator eval performance, or measuring skill description quality — those tasks belong to the skill-creator skill.
Evaluate AI agents from the terminal. No server. No signup.
npm install -g agentv
agentv init
agentv eval evals/example.yaml
That's it. Results in seconds, not minutes.
AgentV runs evaluation cases against your AI agents and scores them with deterministic code graders + customizable LLM graders. Everything lives in Git — YAML eval files, markdown judge prompts, JSONL results.
# evals/math.yaml
description: Math problem solving
tests:
  - id: addition
    input: What is 15 + 27?
    expected_output: "42"
    assertions:
      - type: contains
        value: "42"
agentv eval evals/math.yaml
1. Install and initialize:
npm install -g agentv
agentv init
2. Configure targets in .agentv/targets.yaml — point to your agent or LLM provider.
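The targets schema isn't spelled out here, so treat the following as an illustrative sketch only: the `targets`, `name`, `provider`, and `model` keys are assumptions, and the real field names live in the docs at agentv.dev/docs.
# .agentv/targets.yaml (hypothetical sketch; check the docs for the real schema)
targets:
  - name: claude          # label you reference when running and comparing
    provider: anthropic   # provider/model fields here are assumptions
    model: <model-id>
  - name: gpt
    provider: openai
    model: <model-id>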
3. Create an eval in evals/:
description: Code generation quality
tests:
  - id: fizzbuzz
    criteria: Write a correct FizzBuzz implementation
    input: Write FizzBuzz in Python
    assertions:
      - type: contains
        value: "fizz"
      - type: code-grader
        command: ./validators/check_syntax.py
      - type: llm-grader
        prompt: ./graders/correctness.md
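The `llm-grader` prompt is a plain markdown file. The exact contract the judge must follow (score format, pass/fail wording) depends on AgentV's grader schema, so this sketch of ./graders/correctness.md is purely illustrative:
# Correctness judge (illustrative sketch)
Grade the candidate FizzBuzz implementation.
Pass only if it prints "Fizz" for multiples of 3, "Buzz" for multiples of 5,
and "FizzBuzz" for multiples of both, for the numbers 1 through 100.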
4. Run it:
agentv eval evals/my-eval.yaml
5. Compare results across targets:
agentv compare .agentv/results/runs/<timestamp>/index.jsonl
Output formats:
agentv eval evals/my-eval.yaml # JSONL (default)
agentv eval evals/my-eval.yaml -o report.html # HTML dashboard
agentv eval evals/my-eval.yaml -o results.xml # JUnit XML for CI
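Because JUnit XML is a standard CI format, a pipeline can consume it directly. A minimal GitHub Actions sketch, assuming only the commands shown above (the workflow layout is illustrative, and real runs would also need provider credentials as secrets):
# .github/workflows/evals.yml (illustrative sketch)
name: evals
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g agentv
      # Provider API keys (e.g. from repository secrets) omitted here
      - run: agentv eval evals/my-eval.yaml -o results.xml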
Use AgentV programmatically:
import { evaluate } from '@agentv/core';

const { results, summary } = await evaluate({
  tests: [
    {
      id: 'greeting',
      input: 'Say hello',
      assertions: [{ type: 'contains', value: 'Hello' }],
    },
  ],
});

console.log(`${summary.passed}/${summary.total} passed`);
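Since `summary` exposes `passed` and `total`, one way to gate a script on eval results (a sketch, not a documented pattern) is:
if (summary.passed < summary.total) {
  process.exit(1); // non-zero exit fails the build when any eval fails
}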
Full docs at agentv.dev/docs.
git clone https://github.com/EntityProcess/agentv.git
cd agentv
bun install && bun run build
bun test
See AGENTS.md for development guidelines.
License: MIT