From ravn-ai-toolkit
Tests CLAUDE.md and agent definition files for behavioral compliance: auto-generates scenarios from rules, runs LLM-as-judge scoring, reports per-rule pass/fail scores. Optionally mutates failing rules for improvement.
Install: npx claudepluginhub ravnhq/ai-toolkit

How it works:
1. Reads a CLAUDE.md (or agent .md file)
2. Auto-generates test scenarios from its rules
3. Runs each scenario and scores the result with an LLM-as-judge
4. Reports per-rule pass/fail scores, optionally mutating failing rules for improvement
Related skills in ravn-ai-toolkit:
- Measures coding agent compliance with skills/rules/agents by generating specs/scenarios at 3 strictness levels, running agents, classifying tool calls, and reporting timelines with scores.
- Writes, edits, reviews, and validates AgentV EVAL.yaml files for agent skill evaluations. Adds test cases, configures graders, converts from evals.json or chat transcripts.
Always run scripts with uv run --script — never python, never python3, never a bare script name. The scripts declare their own dependencies via inline # /// script metadata; uv run --script resolves them automatically, so no pip install is ever required. Invoking with python or python3 will fail with import errors because the dependencies are not installed in the system environment.
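As a quick illustration (the dependency name shown in the metadata sketch is hypothetical, not taken from the actual scripts):

```bash
# Correct: uv parses the PEP 723 inline metadata block at the top of the script, e.g.
#   # /// script
#   # dependencies = ["pyyaml"]   # hypothetical dependency, for illustration only
#   # ///
# and provisions those packages in an ephemeral environment before execution.
uv run --script [SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]

# Wrong: the system interpreter has none of these packages installed,
# so this fails with import errors (e.g. ModuleNotFoundError).
python3 [SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
```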
This skill runs long operations (30s-5min per step). Always keep the user informed at each step.
Find the target file to test. Priority order:
0. If user passed --self, target is [SKILL_DIR]/SKILL.md — skip to confirmation below
1. If the user passed a path explicitly (e.g., /eval-agent-md ./CLAUDE.md), use that
2. ~/.claude/CLAUDE.md (user global)

Read the file and confirm with the user: "I found [filename] at [path] ([N] lines). Testing this file." Wait for user acknowledgment before proceeding to Step 2.
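A minimal resolution sketch, assuming a project-level ./CLAUDE.md is preferred over the user-global file (the full priority order may include more locations than shown here):

```bash
# Hypothetical fallback when the user gives no explicit path.
for f in ./CLAUDE.md "$HOME/.claude/CLAUDE.md"; do
  if [ -f "$f" ]; then
    echo "I found $(basename "$f") at $f ($(wc -l < "$f") lines). Testing this file."
    break
  fi
done
```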
Tell the user: "Generating test scenarios from [filename]... this calls claude -p --model sonnet and takes 30-60 seconds on average."
Before running, mention whether this is a warm or cold generation run: a warm run reuses the exact-input scenario cache and returns almost immediately, while a cold run makes a fresh claude -p generation call.
Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees progress lines in real time:
```bash
uv run --script [SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
# For SKILL.md files, add --skill for workflow-aware scenarios:
# uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --skill [TARGET_FILE]
# For self-testing (implies --skill):
# uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --self
# To also generate integration scenarios (multi-rule interaction tests):
# uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --holistic [TARGET_FILE]
```
The script auto-detects the repository name from git and saves to /tmp/eval-agent-md-<repo>-scenarios.yaml (e.g., /tmp/eval-agent-md-my-project-scenarios.yaml). Override with --repo-name NAME or -o PATH.
It also reuses an exact-input scenario cache by default; pass --no-scenario-cache to force fresh generation. --no-cache remains as a compatibility alias.
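A plausible sketch of where the default output lands; the repo-name detection mechanism inside the script is an assumption, only the /tmp naming pattern above is documented:

```bash
# Assumed derivation of the default output path from the git repository name.
repo=$(basename "$(git rev-parse --show-toplevel)")
echo "/tmp/eval-agent-md-${repo}-scenarios.yaml"

# Bypass the exact-input scenario cache and force fresh generation:
uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --no-scenario-cache [TARGET_FILE]
```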
After generation, read the output file and show the user a summary:
Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"
Validation gate: If the output file is missing or contains 0 scenarios, do not proceed. Tell the user: "Scenario generation produced no scenarios. Check that the target file has clearly structured rules (headings, numbered items, or labeled sections)." Then stop.
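One way to apply this gate mechanically; the scenario_id field name is inferred from the per-scenario progress output described below, so treat it as an assumption:

```bash
# Stop early if the scenarios file is missing or contains no scenario entries.
scenarios_file=/tmp/eval-agent-md-my-project-scenarios.yaml   # example path
if [ ! -f "$scenarios_file" ] || ! grep -q 'scenario_id:' "$scenarios_file"; then
  echo "Scenario generation produced no scenarios. Stopping."
fi
```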
Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls claude -p twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."
Also summarize the work budget before starting: [N] scenarios x [runs] run(s) x 2 claude -p calls each (subject + judge).
Tip: --effort low --runs 3 costs roughly the same as --effort high --runs 1 and gives majority-vote reliability — a practical default for regular compliance checks.
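For example, assuming cost scales roughly with effort level and run count, and that eval-behavioral.py accepts --effort directly (it is documented below as a slash-command option), these two runs land in a similar budget:

```bash
# Three cheap samples with majority vote (a practical default for routine checks)
uv run --script [SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] --runs 3 --effort low

# One expensive sample, no vote
uv run --script [SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] --runs 1 --effort high
```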
IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees per-scenario progress ([1/N] scenario_id... PASS/FAIL (Xs)) in real time:
```bash
uv run --script [SKILL_DIR]/scripts/eval-behavioral.py \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --claude-md [TARGET_FILE] \
  --runs 1 \
  --model sonnet
```
Options the user can control:
- --runs N — runs per scenario for majority vote (default: 1, recommend 3 for reliability)
- --model MODEL — model for the test subject (default: sonnet)
- --compare-models — run across haiku/sonnet/opus and show a comparison matrix
- --workers N — opt into higher concurrency than the safe default
- --no-judge-cache — force fresh judge verdicts instead of reusing exact-input cache entries
- --no-subject-cache — force fresh subject responses instead of exact-input cache reuse

Results now include multi-dimensional metrics: per-scenario response size (char count, word count) alongside timing and cache stats. This enables better A/B comparison during mutation testing.
Validation gate: If all scenarios return an error or null verdict (e.g., script crash, missing model), do not print a compliance report. Tell the user: "All scenarios failed to produce a verdict — the run may have crashed. Check the output above for errors before interpreting results." Then stop.
Print a compliance report:
## Compliance Report — [filename]
### Per-rule: 8/10 (80%)
| Scenario | Rule | Verdict | Evidence |
|----------|------|---------|----------|
| gate1_think | GATE-1 | PASS | Lists assumptions before code |
| ... | ... | ... | ... |
### Integration: 3/5 (60%) ← only shown with --holistic
| Scenario | Rules Tested | Verdict | Evidence |
|----------|--------------|---------|----------|
| integration_gate1_tdd | GATE-1, TDD | PASS | Assumptions before test before impl |
| ... | ... | ... | ... |
### Combined: 11/15 (73%) [per-rule: 8/10, integration: 3/5]
### Failing Rules
- [rule]: [what went wrong] — suggested fix: [brief suggestion]
If the user says "improve", "fix", or passed --improve:
Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."
For performance, explain that scoped mutation checks reuse the baseline already computed for the current content and re-evaluate only the mutated candidate on the targeted scenario before any full-suite validation.
IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees iteration progress in real time:
```bash
uv run --script [SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet
```
This is always dry-run by default. Show the user each suggested mutation and ask before applying.
The mutation loop includes three safety guardrails (disable with --no-boundary-check); among them is a boundary check that keeps mutations inside the targeted rule's section (delimited by --- markers).
When a mutation produces delta=0 (equal correctness), the --neutral-strategy flag controls the decision:
- revert (default) — discard neutral mutations
- keep — keep neutral mutations
- size — keep only if the mutated response is shorter (efficiency win)

Parse the user's /eval-agent-md invocation for these common options:
- [path] — target file (positional, e.g., /eval-agent-md ./CLAUDE.md)
- --improve — run mutation loop after testing
- --runs N — runs per scenario (default: 1, recommend 3 for reliability)
- --model MODEL — model for the test subject (default: sonnet)
- --self — test this skill's own SKILL.md (implies --skill)
- --skill / --agent — hint the target type for better scenario generation
- --holistic — also generate integration scenarios that test multiple rules interacting (priority ordering, conflict resolution, cumulative compliance)
- --coverage — report rule coverage after scenario generation (shows tested vs untested rules)
- --effort LEVEL — effort for subject calls: low / medium / high (default: high). Lower effort reduces cost and latency.
- --gen-effort LEVEL — effort for scenario generation: low / medium / high (default: medium). Use high for complex or densely-ruled files.
- --save-reference PATH — save scenarios to a stable reference directory for deterministic test suites

See references/script-reference.md for the full flag reference (caching, workers, compare-models, timeouts).
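When the user also passes --improve and prefers shorter responses, the follow-up mutation step could look like this (combining the mutate-loop command above with --neutral-strategy; whether the script accepts every flag in exactly this form is an assumption):

```bash
uv run --script [SKILL_DIR]/scripts/mutate-loop.py \
  --target [TARGET_FILE] \
  --scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
  --max-iterations 3 \
  --runs 3 \
  --model sonnet \
  --neutral-strategy size
```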
User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."
Expected behavior: Begin Step 1 immediately without asking for confirmation — locate the CLAUDE.md, confirm it with the user (filename, path, line count), then proceed through the full workflow: generate scenarios → run behavioral tests → report compliance score with per-rule pass/fail breakdown. Do not pause to ask permission or clarify intent before starting.
User: "Add a new linting rule to our ESLint config."
Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.
User: "Help me write a new CLAUDE.md rule that enforces conventional commits."
Expected behavior: Do not use this skill. The user is authoring rules, not testing whether existing rules are followed. Proceed directly without invoking the eval workflow.
User: "Test my CLAUDE.md and check if the rules hold even when Claude is being fast and lazy."
Expected behavior: Immediately run with --effort low --runs 3 — do not ask which file to use first, use the default file resolution (Step 1 priority order). Explain that low effort is a stricter bar for critical rules — if a rule fails at low effort, it means compliance relies on Claude being in careful mode, which is a fragility worth fixing.
Troubleshooting:
- generate-scenarios.py exits with non-zero status or produces empty output: check that the target file has clearly structured rules (headings, numbered items, or labeled sections) and rerun.
- A single run (--runs 1) is susceptible to LLM variance; the model may not follow rules consistently in a single sample. Use --runs 3 for majority-vote scoring to reduce noise.
- No such file or directory when running skill scripts: run chmod +x on the scripts in the scripts/ directory.

Bundled references:
- references/script-reference.md — all flags, caching strategy, performance notes
- references/scenario-format.md — YAML schema and field rules for manually reviewing or editing generated scenarios before running
- assets/report-template.md — structured compliance report format with a Next Steps checklist
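For hand-editing before a run, a generated entry might look roughly like the sketch below; the field names are assumptions based on the report columns and progress output, and references/scenario-format.md holds the real schema:

```bash
# Append a hand-written scenario to the generated file (illustrative fields only).
cat >> /tmp/eval-agent-md-my-project-scenarios.yaml <<'EOF'
- scenario_id: gate1_think_manual
  rule: GATE-1
  prompt: "Add a helper that parses ISO dates."
  expected: "Lists assumptions before writing any code."
EOF
```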