# agent-knowledge
Evaluates Claude agent quality on library questions across three access modes: without BK, BK grep-only, and BK full. Answers are scored on accuracy, specificity, completeness, and source grounding, using predefined or custom queries.
Install:

```shell
npx claudepluginhub chris-xperimntl/agent-knowledge
```

This skill is limited to using the following tools:
Compare how well Claude answers library questions across three access levels:

- Without BK
- BK grep-only
- BK full
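A minimal sketch of how such a comparison might be aggregated. Only the dimension names (accuracy, specificity, completeness, source grounding) and mode names come from the description above; the 1-5 scale, the example scores, and the averaging are hypothetical, not the plugin's actual implementation.

```python
# Hypothetical aggregation: each answer is scored on four rubric
# dimensions, and the three access levels are compared on their means.

def average_score(scores: dict[str, int]) -> float:
    """Mean of the four rubric dimensions for one answer (assumed 1-5 scale)."""
    dimensions = ("accuracy", "specificity", "completeness", "source_grounding")
    return sum(scores[d] for d in dimensions) / len(dimensions)

# Illustrative scores only -- not measured results.
results = {
    "without_bk": {"accuracy": 3, "specificity": 2, "completeness": 3, "source_grounding": 1},
    "bk_grep_only": {"accuracy": 4, "specificity": 4, "completeness": 3, "source_grounding": 4},
    "bk_full": {"accuracy": 5, "specificity": 4, "completeness": 5, "source_grounding": 5},
}

for mode, scores in results.items():
    print(f"{mode}: {average_score(scores):.2f}")
```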
Parse $ARGUMENTS:
- `--predefined`: run all predefined queries
- `--predefined N`: run predefined query #N only

Then:

1. Execute with `{ command: "stores" }` to list stores. Abort if none.
2. Load queries from `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/queries/predefined.yaml`, or use an arbitrary query.
3. Fill the template under `$CLAUDE_PLUGIN_ROOT/evals/agent-quality/templates/` ({{QUESTION}}, {{STORES}}, {{STORE_PATHS}}).

Detailed procedures: references/procedures.md
Output format: references/output-format.md
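The template-filling step above can be sketched as simple placeholder substitution. The placeholder names ({{QUESTION}}, {{STORES}}, {{STORE_PATHS}}) come from the command description; the template text, values, and helper function are illustrative assumptions, not the plugin's actual code.

```python
# Hypothetical sketch: substitute {{NAME}} placeholders into an eval template.

def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each {{NAME}} placeholder with its value."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template

# Illustrative template and values only.
template = "Question: {{QUESTION}}\nStores: {{STORES}}\nPaths: {{STORE_PATHS}}"
prompt = fill_template(template, {
    "QUESTION": "How does the library handle retries?",
    "STORES": "docs-store",
    "STORE_PATHS": "/path/to/docs-store",
})
print(prompt)
```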