From skill-benchmark
Benchmarks Claude Code skills against baseline via isolated eval sessions, layered grading (deterministic checks + LLM-as-judge), behavioral analysis, and USE/DON'T USE reports.
npx claudepluginhub workersio/spec
Slash command
/skill-benchmark:skill-benchmark
You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to baseline (no skill).
You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to baseline (no skill).
The methodology follows industry best practices from Anthropic and OpenAI eval guidance.
This benchmark spawns nested claude -p sessions that require elevated privileges to operate in headless mode. The following security-sensitive flags are used, along with their mitigations:
| Flag / Technique | Why Required | Mitigation |
|---|---|---|
| --dangerously-skip-permissions | Headless sessions have no human to approve tool calls; without this flag they hang forever | Each session is restricted via --allowedTools to only Read, Write, Edit, Bash, Grep, Glob (plus Skill for with-skill mode). No network tools, no MCP access. |
| env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT | Claude Code blocks nested claude -p via these env vars. Must unset to spawn child sessions. | env -u only affects the child process environment. The parent session remains protected. |
| --append-system-prompt | Ensures the skill is always invoked in with-skill sessions for fair comparison. Without it, models often skip the skill. | Fixed template: only <skill_name> is substituted. No external or user-supplied content is injected. The exact prompt is shown during Step 1 confirmation. |
| runs_without_error commands | Task verification requires executing commands like python3 <file> to check if generated code works. | Commands are validated against an allowlist of safe executables and rejected if they contain shell metacharacters. shell=True is NOT used. |
Only run benchmarks with task files and skills from trusted sources. By proceeding past Step 1 confirmation, you acknowledge these security implications.
Follow these steps exactly:
The user can run this skill in two ways:
Option 1: Custom config — User creates a config.yml:
cp .claude/skills/skill-benchmark/config.example.yml .claude/skills/skill-benchmark/config.yml
# edit config.yml
/skill-benchmark
Option 2: Default run — No config needed:
/skill-benchmark
Check for config.yml — Look for it in order: (1) config.yml in the skill directory, (2) ~/.claude/skills/skill-benchmark/config.yml, (3) path passed as argument. If found, read and use those values. If not found, use built-in defaults:
runner_model: sonnet
judge_model: opus
task_count: 5
negative_controls: 1
difficulties: {easy: 2, medium: 2, hard: 1}
runs: 1
max_turns: 10
results_dir: ./skill-bench/results
Which skill to benchmark: If skill is set in config.yml, use that. Otherwise ask the user via AskUserQuestion. Search common locations:
.claude/skills/<name>/SKILL.md
~/.claude/skills/<name>/SKILL.md
Task set: Ask if they have a custom task set directory, or if you should auto-generate tasks based on the skill's domain.
Confirm settings — Show the user the final config (loaded or default). Include a security notice: "This benchmark will run headless claude -p sessions with --dangerously-skip-permissions and restricted tools (--allowedTools). Sessions execute in isolated sandbox directories. For with-skill sessions, the following system prompt will be appended: 'IMPORTANT: Before starting any work, you MUST first call the Skill tool with skill="<skill_name>" to load the relevant skill instructions.'" Ask the user to confirm before starting.
Set $RESULTS_DIR — Create the results directory with a skill-name and timestamp:
RESULTS_DIR="<results_dir>/<skill_name>-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$RESULTS_DIR"
All subsequent paths (tasks/, sandbox/, outputs/, grades/, report.md) go under $RESULTS_DIR. Do NOT put files directly in the base results_dir — always nest under the timestamped subdirectory.
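The setup steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names (find_config, merge_config, make_results_dir) are hypothetical, and YAML parsing is left out, so merge_config takes a dict that has already been parsed from config.yml.

```python
from datetime import datetime
from pathlib import Path

# Built-in defaults, copied from the list above.
DEFAULTS = {
    "runner_model": "sonnet",
    "judge_model": "opus",
    "task_count": 5,
    "negative_controls": 1,
    "difficulties": {"easy": 2, "medium": 2, "hard": 1},
    "runs": 1,
    "max_turns": 10,
    "results_dir": "./skill-bench/results",
}


def find_config(skill_dir, home, arg_path=None):
    """Return the first existing config.yml in the documented lookup order, else None."""
    candidates = [
        Path(skill_dir) / "config.yml",
        Path(home) / ".claude" / "skills" / "skill-benchmark" / "config.yml",
    ]
    if arg_path:
        candidates.append(Path(arg_path))
    for path in candidates:
        if path.is_file():
            return path
    return None


def merge_config(loaded):
    """Overlay values parsed from config.yml (a dict, or None) on the defaults."""
    merged = dict(DEFAULTS)
    merged.update(loaded or {})
    return merged


def make_results_dir(results_dir, skill_name, now=None):
    """Create <results_dir>/<skill_name>-<YYYYmmdd-HHMMSS> and return its path."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    path = Path(results_dir) / f"{skill_name}-{stamp}"
    path.mkdir(parents=True, exist_ok=True)
    return path
```

The timestamp format matches the date +%Y%m%d-%H%M%S call used for RESULTS_DIR, so both approaches should produce the same directory names.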
Read the target skill's SKILL.md file completely. Extract:
Write a brief analysis summary — you'll use this to generate relevant tasks.
If no custom task set was provided, auto-generate tasks. Design tasks following eval best practices:
Positive tasks — Tasks where the skill SHOULD help (the majority):
Negative control (1 task): A task OUTSIDE the skill's domain where the skill should NOT activate or help. This catches false positives — if the skill hurts performance on unrelated tasks, that's a red flag.
Write each task to $RESULTS_DIR/tasks/task-NN-<difficulty>.md:
# Task: <descriptive-name>
difficulty: easy|medium|hard
category: <domain>
type: positive|negative-control
## Prompt
<the exact prompt that will be sent to Claude via `claude -p`>
## Expected Outcome
<clear description of what a correct response looks like>
## Verification Checks
<deterministic checks to run BEFORE LLM grading>
- file_exists: <filename that should be created>
- file_contains: <pattern> in <filename> (or just <pattern> to search all files)
- syntax_valid: <language — run syntax checker>
- runs_without_error: <command to execute, e.g., "python3 <filename>">
## Grading Rubric
- Correctness: <specific criteria for correctness>
- Completeness: <what must be included for full marks>
- Quality: <quality expectations — best practices, clarity, etc.>
## Tags
<comma-separated tags for grouping>
Task design rules:
claude -p sessions
Verification Checks with concrete, deterministic things to test (file exists, code runs, output matches)
When the user provides a custom task set, validate each task file before execution:
Check that each file has the required sections (## Prompt, ## Expected Outcome, ## Verification Checks, ## Grading Rubric). Reject files missing required sections.
Run python3 scripts/run_checks.py --validate <task_file> for each task. This validates runs_without_error entries against the command allowlist and rejects commands with shell metacharacters.
WARNING: Never run benchmark tasks from untrusted sources. Task prompts are executed as code-generation instructions with tool access in the nested session.
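For readers who want to see what this kind of validation involves, here is a minimal sketch. It is not the code in scripts/run_checks.py: the allowlist contents, the metacharacter set, and the function names are all assumptions made for illustration.

```python
import re
import shlex

REQUIRED_SECTIONS = [
    "## Prompt",
    "## Expected Outcome",
    "## Verification Checks",
    "## Grading Rubric",
]

# Hypothetical allowlist; the real one is defined by scripts/run_checks.py.
ALLOWED_EXECUTABLES = {"python3", "node"}
# Hypothetical set of characters treated as shell metacharacters.
SHELL_METACHARACTERS = set(";&|<>`$(){}\n")


def missing_sections(task_text):
    """Return required section headings that do not appear as a line in the task file."""
    lines = {line.strip() for line in task_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s not in lines]


def command_is_safe(command):
    """Accept a command only if it has no metacharacters and an allowlisted executable."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    return bool(argv) and argv[0] in ALLOWED_EXECUTABLES


def validate_task(task_text):
    """Return a list of problems; an empty list means the task file passed."""
    problems = [f"missing section: {s}" for s in missing_sections(task_text)]
    pattern = r"^- runs_without_error:\s*(.+)$"
    for match in re.finditer(pattern, task_text, re.MULTILINE):
        command = match.group(1).strip().strip('"')
        if not command_is_safe(command):
            problems.append(f"unsafe command: {command}")
    return problems
```

Splitting the command with shlex and checking only the first token mirrors the idea of running commands as an argument list rather than through a shell.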
For each task, run TWO sessions using claude -p. Each session MUST run in its own isolated sandbox directory so they cannot interfere with each other.
If runs > 1 in config, run each task N times. Each run gets its own isolated sandbox and output directory. This accounts for non-determinism in LLM outputs.
Directory structure for multi-run (runs: 3):
sandbox/task-01/run-1/with-skill/
sandbox/task-01/run-1/baseline/
sandbox/task-01/run-2/with-skill/
sandbox/task-01/run-2/baseline/
sandbox/task-01/run-3/with-skill/
sandbox/task-01/run-3/baseline/
outputs/task-01/run-1/with-skill/
outputs/task-01/run-1/baseline/
outputs/task-01/run-2/with-skill/
outputs/task-01/run-2/baseline/
...
grades/task-01/run-1/with-skill-grade.json
grades/task-01/run-1/baseline-grade.json
...
Directory structure for single run (runs: 1, the default):
sandbox/task-01/with-skill/
sandbox/task-01/baseline/
outputs/task-01/with-skill/
outputs/task-01/baseline/
grades/task-01/with-skill-grade.json
grades/task-01/baseline-grade.json
When runs: 1, skip the run-N/ subdirectory level entirely for simpler output.
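The two layouts above differ only in the optional run-N/ level, which a small helper can capture. The sketch below is illustrative; session_paths and its parameter names are hypothetical and not part of the skill's scripts.

```python
from pathlib import Path


def session_paths(results_dir, task_id, mode, runs=1, run=1):
    """Return sandbox, output, and grade-file paths for one session.

    mode is "with-skill" or "baseline"; run is the 1-based run index and
    is ignored when runs == 1.
    """
    if mode not in ("with-skill", "baseline"):
        raise ValueError(f"unknown mode: {mode}")
    task = Path(task_id)
    if runs > 1:
        task = task / f"run-{run}"  # extra level only for multi-run
    root = Path(results_dir)
    return {
        "sandbox": root / "sandbox" / task / mode,
        "outputs": root / "outputs" / task / mode,
        "grade": root / "grades" / task / f"{mode}-grade.json",
    }
```

For example, session_paths(results_dir, "task-01", "baseline") yields the single-run paths, while passing runs=3, run=2 inserts run-2/ under each task directory.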
Aggregation for multi-run: After grading all runs, compute per-task:
Before running ANY sessions, create isolated working directories for EVERY session:
# For runs: 1 (default)
mkdir -p "$RESULTS_DIR/sandbox/task-NN/with-skill"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/baseline"
# For runs: 3 (multi-run)
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-1/with-skill"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-1/baseline"
mkdir -p "$RESULTS_DIR/sandbox/task-NN/run-2/with-skill"
# ... etc
Each claude -p call MUST cd into its own sandbox directory first. This prevents:
fibonacci.py to the same place)
CRITICAL: Claude Code sets CLAUDECODE=1 and CLAUDE_CODE_ENTRYPOINT=cli to prevent recursive claude -p invocations. These MUST be unset via env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT for the benchmark to spawn child sessions.
Why this is safe: env -u only affects the child process environment — the parent session's env vars are unchanged. The child session runs in an isolated sandbox directory with restricted tools (--allowedTools).
Every claude -p call MUST include --dangerously-skip-permissions — without it, headless sessions hang forever waiting for a human to approve tool use.
Session A — With Skill:
# For runs: 1 → sandbox/task-NN/with-skill, outputs/task-NN/with-skill
# For runs: 3 → sandbox/task-NN/run-R/with-skill, outputs/task-NN/run-R/with-skill
cd "$SANDBOX_DIR" && \
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT \
claude -p "<task_prompt>" \
--output-format stream-json \
--verbose \
--dangerously-skip-permissions \
--allowedTools "Skill(<skill_name>),Read,Edit,Bash,Grep,Glob,Write" \
--append-system-prompt "IMPORTANT: Before starting any work, you MUST first call the Skill tool with skill=\"<skill_name>\" to load the relevant skill instructions. Follow whatever instructions the skill provides throughout your work." \
--model <runner_model> \
--max-turns <max_turns> \
> "$OUTPUT_DIR/raw_stream.jsonl" 2>&1
Why --append-system-prompt? Without it, the skill is merely available as a tool — the model must choose to call it. For straightforward tasks, the model often skips the skill entirely and writes code directly. The appended system prompt ensures the skill is always invoked, making the benchmark a fair comparison of "with skill instructions" vs "without".
Session B — Baseline (no skill):
cd "$SANDBOX_DIR" && \
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT \
claude -p "<task_prompt>" \
--output-format stream-json \
--verbose \
--dangerously-skip-permissions \
--allowedTools "Read,Edit,Bash,Grep,Glob,Write" \
--disallowedTools "Skill" \
--model <runner_model> \
--max-turns <max_turns> \
> "$OUTPUT_DIR/raw_stream.jsonl" 2>&1
Why --disallowedTools "Skill"? The Skill tool is a built-in that --allowedTools alone does not restrict. Without explicitly disallowing it, the baseline model may still invoke the skill, contaminating the comparison.
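If the two sessions were launched from Python instead of a shell, the same commands could be assembled as argument lists. The sketch below uses only the flags shown in the commands above; the helper names (build_argv, child_env, run_session) are hypothetical, and it assumes the claude executable is on PATH.

```python
import os
import subprocess

BASE_TOOLS = "Read,Edit,Bash,Grep,Glob,Write"
SKILL_PROMPT = (
    "IMPORTANT: Before starting any work, you MUST first call the Skill tool "
    'with skill="{skill}" to load the relevant skill instructions. Follow '
    "whatever instructions the skill provides throughout your work."
)


def build_argv(prompt, mode, skill_name, model, max_turns):
    """Build the claude -p argument list for a with-skill or baseline session."""
    argv = ["claude", "-p", prompt, "--output-format", "stream-json",
            "--verbose", "--dangerously-skip-permissions"]
    if mode == "with-skill":
        argv += ["--allowedTools", f"Skill({skill_name}),{BASE_TOOLS}",
                 "--append-system-prompt", SKILL_PROMPT.format(skill=skill_name)]
    else:
        argv += ["--allowedTools", BASE_TOOLS, "--disallowedTools", "Skill"]
    argv += ["--model", model, "--max-turns", str(max_turns)]
    return argv


def child_env():
    """Copy of the environment without the variables that block nested sessions."""
    env = dict(os.environ)
    env.pop("CLAUDECODE", None)
    env.pop("CLAUDE_CODE_ENTRYPOINT", None)
    return env


def run_session(argv, sandbox_dir, output_dir):
    """Run one session with cwd=sandbox_dir; stdout and stderr go to raw_stream.jsonl."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "raw_stream.jsonl"), "wb") as out:
        completed = subprocess.run(argv, cwd=sandbox_dir, env=child_env(),
                                   stdout=out, stderr=subprocess.STDOUT)
    return completed.returncode
```

Passing cwd and env to subprocess.run plays the role of cd and env -u: the parent process keeps its own working directory and environment.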
Where $SANDBOX_DIR and $OUTPUT_DIR depend on run count:
runs: 1 → $RESULTS_DIR/sandbox/task-NN/<mode> and $RESULTS_DIR/outputs/task-NN/<mode>
runs: N → $RESULTS_DIR/sandbox/task-NN/run-R/<mode> and $RESULTS_DIR/outputs/task-NN/run-R/<mode>
IMPORTANT:
--dangerously-skip-permissions is REQUIRED: without it, claude -p hangs waiting for permission approval with no human to click "Allow".
Replace <skill_name> with the ACTUAL skill name from Step 1 (e.g., code-commenter). Do NOT leave Skill() empty; that means no skill is loaded and both sessions become identical.
cd into the sandbox BEFORE running claude -p. This is the isolation mechanism.
Use absolute paths for output redirection (> .../raw_stream.jsonl) since you're cd'ing.
Parse raw_stream.jsonl to produce THREE files in the output directory:
response.json: Extract the last type: "result" event from the JSONL stream.
transcript.json: All stream events collected into a JSON array.
meta.json: Session metadata extracted from response.json. Contains: session_id, model (from modelUsage keys), skill_name, mode, stop_reason, duration_ms, duration_api_ms, num_turns, total_cost_usd, and usage (input/output/cache tokens). The scripts/parse_stream.py script handles this extraction; run it with --help for the full field list.
If a session fails or times out, log the error in meta.json and mark it as a failed run (score: 0).
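As a rough illustration of the parsing step, the sketch below derives the response, transcript, and metadata from raw stream text. It is not the code in scripts/parse_stream.py: the function names and the "error" and "failed" keys are assumptions, and non-JSON lines are skipped on the assumption that 2>&1 may mix stderr output into the stream.

```python
import json

META_FIELDS = ("session_id", "stop_reason", "duration_ms", "duration_api_ms",
               "num_turns", "total_cost_usd", "usage")


def parse_stream(raw_text):
    """Split raw stream-json output into (response, transcript).

    response is the last event with type "result", or None if there is none.
    """
    transcript = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON event, e.g. stderr noise
        if isinstance(event, dict):
            transcript.append(event)
    results = [e for e in transcript if e.get("type") == "result"]
    return (results[-1] if results else None), transcript


def build_meta(response, skill_name, mode):
    """Collect the metadata fields listed above; a missing result marks a failed run."""
    if response is None:
        return {"skill_name": skill_name, "mode": mode,
                "error": "no result event in stream", "failed": True}
    meta = {key: response.get(key) for key in META_FIELDS}
    meta["model"] = list(response.get("modelUsage") or {})  # model names are the keys
    meta["skill_name"] = skill_name
    meta["mode"] = mode
    meta["failed"] = False
    return meta
```

Writing response.json, transcript.json, and meta.json is then a matter of calling json.dump on each value.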
Use a two-layer grading approach: deterministic checks first, then LLM-as-judge. This catches clear failures fast and uses the model for nuanced assessment.
For each session output, run the deterministic checks script:
# For runs: 1 → sandbox/task-NN/<mode>, grades/task-NN/<mode>-checks.json
# For runs: N → sandbox/task-NN/run-R/<mode>, grades/task-NN/run-R/<mode>-checks.json
python3 "scripts/run_checks.py" \
"$RESULTS_DIR/tasks/task-NN-<difficulty>.md" \
"$SANDBOX_DIR" \
"$GRADES_DIR/<mode>-checks.json"
This script reads the ## Verification Checks section from the task file and runs each check (file_exists, syntax_valid, runs_without_error, file_contains) in the sandbox directory.
Save results to $RESULTS_DIR/grades/task-NN/<mode>-checks.json:
{
"file_exists": true,
"syntax_valid": true,
"runs_without_error": true,
"file_contains": {"def add": true, "def subtract": true},
"all_passed": true
}
If deterministic checks fail (file missing, syntax error, runtime crash), the task gets a correctness ceiling of 50 regardless of LLM grading — the code doesn't work.
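The ceiling rule can be expressed as a small post-processing step on a grade. This is a sketch under assumptions: the function name is hypothetical, and it takes the grade and checks as dicts in the JSON shapes used by this benchmark.

```python
CORRECTNESS_CEILING = 50  # from the rule above


def apply_correctness_ceiling(grade, checks):
    """Return a copy of grade with correctness capped at 50 when any check failed."""
    capped = {**grade, "correctness": dict(grade["correctness"])}
    passed = bool(checks.get("all_passed"))
    capped["deterministic_checks_passed"] = passed
    if not passed:
        capped["correctness"]["score"] = min(
            capped["correctness"]["score"], CORRECTNESS_CEILING)
    return capped
```

Any weighted total would need to be recomputed after the cap is applied, since the correctness score may have changed.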
For each task, launch a grader subagent (use the Agent tool with subagent_type: "general-purpose" and model set to the judge model).
The grader prompt MUST include:
result field from response.json)
Also tell the grader to READ the actual files the session created in the sandbox directory ($RESULTS_DIR/sandbox/task-NN/<mode>/) to verify correctness. Don't just grade the text output; verify the code actually exists and is correct.
Grade EACH output independently (do not show the grader both outputs — this prevents comparison bias).
Grading criteria and default weights:
The grader MUST return a structured response. Instruct it to output JSON:
{
"deterministic_checks_passed": true|false,
"correctness": { "score": 0-100, "justification": "..." },
"completeness": { "score": 0-100, "justification": "..." },
"quality": { "score": 0-100, "justification": "..." },
"efficiency": { "score": 0-100, "justification": "..." },
"weighted_total": 0-100,
"summary": "..."
}
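Because the grader is an LLM, its JSON may need checking before it is saved. The sketch below validates the structure shown above and recomputes a weighted total. The function names are hypothetical, and the weights are passed in as a parameter rather than hard-coded, so any values used with it are the caller's choice.

```python
CRITERIA = ("correctness", "completeness", "quality", "efficiency")


def validate_grade(grade):
    """Return a list of schema problems in a grader response (empty if none)."""
    problems = []
    for name in CRITERIA:
        entry = grade.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing")
            continue
        score = entry.get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0 <= score <= 100:
            problems.append(f"{name}: score must be a number in 0-100")
        if not isinstance(entry.get("justification"), str):
            problems.append(f"{name}: justification must be a string")
    return problems


def weighted_total(grade, weights):
    """Weighted average of the four criterion scores; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in CRITERIA)
    weighted_sum = sum(grade[name]["score"] * weights[name] for name in CRITERIA)
    return weighted_sum / total_weight
```

Recomputing weighted_total locally, instead of trusting the grader's arithmetic, keeps the reported totals consistent with the per-criterion scores.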
Save grades to the corresponding grades directory:
runs: 1 → $RESULTS_DIR/grades/task-NN/with-skill-grade.json and baseline-grade.json
runs: N → $RESULTS_DIR/grades/task-NN/run-R/with-skill-grade.json and baseline-grade.json
You can run graders for different tasks/runs in parallel using background agents.
Before generating the report, analyze the transcript.json files for behavioral signals. This is critical — scores alone don't tell the full story.
For each session, run the analyze script:
# For runs: 1 → outputs/task-NN/<mode>/
# For runs: N → outputs/task-NN/run-R/<mode>/
python3 "scripts/analyze_transcript.py" \
"$OUTPUT_DIR/transcript.json" \
"$OUTPUT_DIR/behavior.json"
This extracts from transcript.json:
Output (behavior.json):
{
"tool_calls": {"Read": 2, "Write": 1, "Bash": 3, "Edit": 0},
"total_tool_calls": 6,
"thrashing_detected": false,
"errors_encountered": 0,
"errors_recovered": 0
}
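A simplified version of this analysis might look like the sketch below. It is not the code in scripts/analyze_transcript.py and rests on several assumptions: that tool calls appear as content blocks with type "tool_use" inside an event's message, that failed tool results carry is_error, and that "thrashing" means the same call repeated back-to-back (a hypothetical rule with a hypothetical threshold of 3). It does not compute errors_recovered.

```python
import json
from collections import Counter


def _blocks(event):
    """Content blocks of an event's message, or [] for other event shapes."""
    content = (event.get("message") or {}).get("content")
    if not isinstance(content, list):
        return []
    return [b for b in content if isinstance(b, dict)]


def analyze(transcript, repeat_threshold=3):
    """Count tool calls and errors, and flag identical consecutive calls."""
    calls = []  # (tool name, serialized input) in call order
    errors = 0
    for event in transcript:
        for block in _blocks(event):
            if block.get("type") == "tool_use":
                key = json.dumps(block.get("input"), sort_keys=True)
                calls.append((block.get("name"), key))
            elif block.get("type") == "tool_result" and block.get("is_error"):
                errors += 1
    # Hypothetical thrashing rule: longest run of identical consecutive calls.
    longest = streak = 0
    previous = None
    for call in calls:
        streak = streak + 1 if call == previous else 1
        longest = max(longest, streak)
        previous = call
    return {
        "tool_calls": dict(Counter(name for name, _ in calls)),
        "total_tool_calls": len(calls),
        "thrashing_detected": longest >= repeat_threshold,
        "errors_encountered": errors,
    }
```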
After all grading and analysis is complete, generate the final report.
Read all grade files, meta files, and behavior files, then compute:
total_cost_usd and usage fields
Write the report to $RESULTS_DIR/report.md using this format:
# Skill Benchmark Report: <skill-name>
Date: <YYYY-MM-DD HH:MM>
Runner Model: <model> | Judge Model: <model> | Tasks: <N> | Runs: <R>
## Verdict: <emoji> <VERDICT>
**Skill scores <X>% <higher/lower> than baseline on average.**
## Summary
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Avg Score | X% | Y% | +/-Z% |
| Correctness | X% | Y% | +/-Z% |
| Completeness | X% | Y% | +/-Z% |
| Quality | X% | Y% | +/-Z% |
| Efficiency | X% | Y% | +/-Z% |
## Deterministic Check Pass Rate
| Condition | Pass Rate |
|-----------|-----------|
| With Skill | X/N tasks (Y%) |
| Baseline | X/N tasks (Y%) |
## Per-Task Breakdown
| # | Task | Type | Difficulty | Skill | Baseline | Delta | Winner |
|---|------|------|-----------|-------|----------|-------|--------|
| 1 | ... | positive | easy | X% | Y% | +/-Z% | Skill/Baseline |
| N | ... | negative | - | X% | Y% | +/-Z% | ... |
## Negative Control Results
<How did the skill perform on out-of-domain tasks? If it hurt performance, flag this.>
## Where Skill Helps
- <identified patterns where skill outperformed baseline>
## Where Skill Hurts
- <identified patterns where baseline outperformed skill>
## Behavioral Analysis
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Avg Tool Calls | X | Y | +/-Z |
| Avg Turns | X | Y | +/-Z |
| Thrashing Detected | X/N | Y/N | |
| Avg Duration (s) | X | Y | +/-Z |
| Avg Cost | $X | $Y | +/-$Z |
| Total Cost | $X | $Y | +/-$Z |
## Recommendations
- <actionable suggestions based on the results>
- <suggestions for improving the skill if it underperformed>
- <flag if skill hurts negative control tasks>
Present the report to the user and tell them where the full results are saved.
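The averages, deltas, and per-task rows in the report template can be derived mechanically from the grade files. The sketch below shows one way to do that; the function names and input record shape are hypothetical, and the "Tie" label for equal scores is an assumption, since the template only names Skill and Baseline as winners.

```python
from statistics import mean


def summarize(grades):
    """Compute per-task deltas and overall averages.

    grades is a non-empty list of dicts with keys task, type, difficulty,
    with_skill, and baseline, where the last two are weighted totals (0-100).
    """
    rows = []
    for g in grades:
        delta = g["with_skill"] - g["baseline"]
        winner = "Skill" if delta > 0 else "Baseline" if delta < 0 else "Tie"
        rows.append({**g, "delta": delta, "winner": winner})
    skill_avg = mean(g["with_skill"] for g in grades)
    base_avg = mean(g["baseline"] for g in grades)
    return {"with_skill": skill_avg, "baseline": base_avg,
            "delta": skill_avg - base_avg, "rows": rows}


def render_rows(summary):
    """Render the body rows of the Per-Task Breakdown table."""
    lines = []
    for i, r in enumerate(summary["rows"], start=1):
        lines.append(
            f"| {i} | {r['task']} | {r['type']} | {r['difficulty']} | "
            f"{r['with_skill']:.0f}% | {r['baseline']:.0f}% | "
            f"{r['delta']:+.0f}% | {r['winner']} |")
    return "\n".join(lines)
```

Keeping the negative-control rows in the same list makes it easy to see when a skill's average gain hides a loss on out-of-domain tasks.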
scripts/parse_stream.py: Parse raw_stream.jsonl → response.json, transcript.json, meta.json
scripts/analyze_transcript.py: Analyze transcript.json → behavior.json (tool counts, thrashing, errors)
scripts/run_checks.py: Run deterministic verification checks from task file against sandbox
Run any script with --help for full usage details.
If a claude -p session fails: log the error, score it as 0, and continue with the remaining tasks.