Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs. baseline performance using pass rates, timing, and token metrics, in either a quick workflow or a 7-phase full pipeline.
Install:

```
npx claudepluginhub full-stack-biz/claude-skills-toolkit --plugin skills-toolkit
```
**Purpose:** Empirically validate Claude Code skills through evaluation-driven testing. Prove, with data, that a skill actually helps Claude rather than guessing.
Creates new Claude Code skills from scratch, edits and improves existing ones, runs evals, benchmarks performance with variance analysis, and optimizes descriptions for triggering accuracy.
Skills must be measured, not assumed. This pipeline provides systematic evidence: Does the skill improve Claude's performance? By how much? What should we improve next?
- Validate new skills — Test a newly created skill against baseline Claude performance without it.
- Benchmark improvements — Measure the impact of skill refinements across multiple iterations.
- Comparative analysis — Prove skill effectiveness with timing, token usage, and pass rates side by side.
- Iteration planning — Make data-driven decisions on what to improve next.
Empirical Evidence — Intuition ≠ proof. Collect actual performance data: pass rates, tokens, timing.
Parallel Testing — For each eval, test WITH skill AND baseline SIMULTANEOUSLY (2 agents per eval, in parallel). Eliminates confounding variables from different test runs.
Workspace Isolation — Each iteration lives in workspace/iteration-N/. Keeps history clean. Easy to compare iteration 1 → 2 → 3 performance.
Schema Consistency — All JSON outputs follow strict schemas (evals.json, grading.json, benchmark.json, timing.json). Enables reliable aggregation and comparison.
Phase 1: Setup → Identify skill + confirm what it does
Phase 2: Create Evals → Interview user → write test cases + assertions
Phase 3: Run Tests → Launch 2 agents per eval (with_skill + baseline) in parallel
Phase 4: Grade Results → Evaluate outputs against assertions
Phase 5: Aggregate → Run ruby script to compute benchmark.json
Phase 6: Review Summary → Show comparison table + improvement suggestions
Phase 7: Iterate → Update skill + run next iteration (or stop if satisfied)
All evaluation artifacts live in a centralized ./evals/ directory at project root:
```
./evals/                                   ← Project root, NOT inside skill directory
├── <skill-name-1>/
│   ├── evals.json
│   └── workspace/
│       ├── iteration-1/
│       │   ├── eval-1/, eval-2/, eval-3/  ← Per-eval directories
│       │   └── benchmark.json             ← Aggregated results
│       └── iteration-2/
│           └── ...
└── <skill-name-2>/
    └── ...
```
Why centralized? A single ./evals/ directory at the project root keeps evaluation artifacts out of each skill's own directory and makes it easy to compare results across skills and across iterations.
Goal: Identify the skill being tested. Confirm what it does. Prepare workspace.
Use AskUserQuestion to ask which skill to test:
question: "Which skill do you want to test?"
header: "Skill Selection"
options: [
{label: "skill-creator", description: "Testing skill-creator from skills/ directory"},
{label: "skill-refiner", description: "Testing skill-refiner from skills/ directory"},
{label: "Other skill", description: "Testing a skill not listed above"}
]
⏸️ Collect user input. Wait for response before proceeding.
Read the skill's SKILL.md to understand what the skill does, then show the user a summary:
Skill: <name>
Location: <path>
Purpose: <one-line summary from description>
Question: Which testing mode do you want?
Use AskUserQuestion with 2 options:
questions: [
{
question: "Which testing mode would you like?",
header: "Workflow Mode",
options: [
{
label: "Quick Workflow",
description: "Fast validation: only test WITH skill (no baseline), no timing metrics. Just pass/fail on assertions. Perfect for quick checks."
},
{
label: "Full Pipeline",
description: "Complete analysis: test WITH skill + baseline parallel, measure tokens/timing, aggregate & compare. For comprehensive benchmarking."
}
],
multiSelect: false
}
]
⏸️ Collect user input.
If Quick Workflow selected: still create evals in Phase 2, then follow the Quick Workflow path below instead of Phases 3–7.
If Full Pipeline selected: continue through Phases 2–7 in order.
Create directory structure in project root (standardized):
./evals/<skill-name>/
./evals/<skill-name>/workspace/iteration-1/
Example for skill-creator:
./evals/skill-creator/
./evals/skill-creator/workspace/iteration-1/
Log workspace path. All subsequent phases write to this directory (NOT inside the skill's directory).
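If you want to script this setup, here is a minimal sketch (the skill name is a hypothetical example; substitute the one confirmed in Phase 1):

```ruby
require "fileutils"

skill_name = "skill-creator"   # hypothetical example; use the skill confirmed above
workspace  = File.join("evals", skill_name, "workspace", "iteration-1")

FileUtils.mkdir_p(workspace)   # creates ./evals/<skill-name>/workspace/iteration-1/
puts "Workspace ready: #{workspace}"
```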
Goal: Build evaluation cases that measure skill effectiveness.
Ask 3 questions (use AskUserQuestion with free-form "Other" option):
Question 1: "What are 2-3 core scenarios this skill should handle?"
Example: "skill-creator should handle: creating a new skill from scratch,
converting a slash command, improving an existing skill"
Question 2: "For each scenario, what makes a GOOD response?"
Example: "Good responses: skill has clear name, description with trigger phrases,
correct SKILL.md structure, efficient references/"
Question 3: "What should FAIL (baseline without skill)?"
Example: "Baseline will miss best practices, create vague descriptions,
skip necessary structure"
⏸️ Collect responses. Store in memory.
Create evals.json with structure:
{
"skill_name": "<name>",
"skill_path": "<path/to/SKILL.md>",
"description": "<purpose summary>",
"evals": [
{
"id": 1,
"name": "<scenario name>",
"prompt": "<complete user prompt for this eval>",
"expected_output": "<description of what good output looks like>",
"files": []
},
{
"id": 2,
...
}
]
}
Write to ./evals/<skill-name>/evals.json (project root, standardized location).
For each eval (one directory per eval), create workspace/iteration-1/eval-N/eval_metadata.json:
{
"eval_id": N,
"skill_path": "<path/to/SKILL.md>",
"assertions": [
{
"text": "Output includes a clear skill name",
"type": "presence",
"target": "SKILL.md frontmatter"
},
{
"text": "Description has specific trigger phrases",
"type": "quality",
"target": "SKILL.md frontmatter description field"
},
{
"text": "References are organized with one-level nesting",
"type": "structure",
"target": "references/ directory"
}
]
}
Create directories: workspace/iteration-1/eval-1/, eval-2/, etc. (with_skill and baseline subdirs will be created in Phase 3).
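If it helps to bootstrap these directories mechanically, here is a minimal sketch that reads evals.json back and writes an eval_metadata.json skeleton per eval; the assertions themselves still need to be written by hand, and the path is a hypothetical example:

```ruby
require "json"
require "fileutils"

evals_dir = "./evals/skill-creator"   # hypothetical example path
evals     = JSON.parse(File.read(File.join(evals_dir, "evals.json")))

evals["evals"].each do |eval_case|
  eval_dir = File.join(evals_dir, "workspace", "iteration-1", "eval-#{eval_case["id"]}")
  FileUtils.mkdir_p(eval_dir)

  metadata = {
    "eval_id"    => eval_case["id"],
    "skill_path" => evals["skill_path"],
    "assertions" => []   # fill in the assertions for this scenario by hand
  }
  File.write(File.join(eval_dir, "eval_metadata.json"), JSON.pretty_generate(metadata))
end
```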
If user chose "Quick Workflow" in Step 1.2b, follow this streamlined path instead of Phases 3-7.
Goal: Test the skill itself. No baseline comparison, no timing metrics. Just: Does it work?
For EACH eval in evals.json, launch ONE agent (no parallel baseline):
Agent (WITH_SKILL_ONLY):
Type: general-purpose
Prompt: "
You are testing a Claude Code skill. Your job: HELP THE USER ACCOMPLISH THEIR TASK
using the skill provided below.
SKILL TO USE:
<read and include full SKILL.md content>
USER TASK:
<eval-N prompt from evals.json>
After completing the task, save all outputs to:
./evals/<skill-name>/workspace/iteration-1/eval-N/with_skill/outputs/
NO timing metrics needed for quick workflow. Just save your work."
⏸️ Wait for agent completion before proceeding to next eval.
For each eval, manually grade against assertions:
Create ./evals/<skill-name>/workspace/iteration-1/eval-N/with_skill/grading.json:
{
"eval_id": N,
"configuration": "with_skill",
"assertions_evaluated": [
{"text": "...", "passed": true/false, "evidence": "..."}
],
"summary": {
"assertions_total": X,
"assertions_passed": Y,
"pass_rate": Y/X
}
}
Display simple pass/fail summary (no benchmark.json):
QUICK VALIDATION RESULTS: <skill-name>
====================================
Eval 1: <Scenario> ✓ PASS (5/5 assertions)
Eval 2: <Scenario> ✓ PASS (4/5 assertions)
Eval 3: <Scenario> ✗ FAIL (2/5 assertions)
Summary: 11/15 assertions passed (73%)
Status: Ready to refine or deploy
Ask user:
question: "What would you like to do?"
header: "Next Steps"
options: [
{label: "Run full pipeline", description: "Move to complete benchmarking with baseline comparison"},
{label: "Refine skill", description: "Update skill based on failed assertions"},
{label: "Done", description: "Quick validation complete"}
]
Goal: Execute 2 agents per eval (WITH skill + BASELINE) in parallel. Capture outputs.
For EACH eval, launch 2 agents SIMULTANEOUSLY in one Agent tool call:
Agent 1 (WITH_SKILL):
Agent type: general-purpose
Prompt: "
You are testing a Claude Code skill. Your job: HELP THE USER ACCOMPLISH THEIR TASK
using the skill provided below.
SKILL TO USE:
<read and include full SKILL.md content>
USER TASK:
<eval-N prompt from evals.json>
After completing the task, save all outputs (code, files, notes) to:
./evals/<skill-name>/workspace/iteration-1/eval-N/with_skill/outputs/
Then create a file ./evals/<skill-name>/workspace/iteration-1/eval-N/with_skill/timing.json with:
{
\"total_tokens\": <count>,
\"duration_ms\": <milliseconds>,
\"model\": \"claude-opus-4-6\"
}
"
Agent 2 (BASELINE):
Agent type: general-purpose
Prompt: "
You are testing a Claude Code skill by providing a BASELINE. Your job: HELP THE USER
accomplish their task WITHOUT any special skill or methodology.
USER TASK:
<eval-N prompt from evals.json>
NO SKILLS AVAILABLE. Use standard Claude capabilities only.
After completing the task, save all outputs (code, files, notes) to:
./evals/<skill-name>/workspace/iteration-1/eval-N/baseline/outputs/
Then create a file ./evals/<skill-name>/workspace/iteration-1/eval-N/baseline/timing.json with:
{
\"total_tokens\": <count>,
\"duration_ms\": <milliseconds>,
\"model\": \"claude-opus-4-6\"
}
"
⏸️ Wait for both agents to complete before proceeding to Phase 4.
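Before grading, it can be worth a quick sanity check that every eval produced outputs and a timing.json for both configurations. A minimal sketch, assuming the directory layout above (the path is a hypothetical example):

```ruby
iteration_dir = "./evals/skill-creator/workspace/iteration-1"   # hypothetical example path

Dir.glob(File.join(iteration_dir, "eval-*")).sort.each do |eval_dir|
  %w[with_skill baseline].each do |config|
    outputs = Dir.glob(File.join(eval_dir, config, "outputs", "*"))
    timing  = File.join(eval_dir, config, "timing.json")
    status  = outputs.any? && File.exist?(timing) ? "ok" : "MISSING"
    puts "#{File.basename(eval_dir)} / #{config}: #{status}"
  end
end
```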
Goal: Evaluate each agent's outputs against assertions in eval_metadata.json.
For EACH eval:
- Read the with_skill/outputs/ files
- Read the baseline/outputs/ files
- Evaluate both against the assertions in eval-N/eval_metadata.json
Create workspace/iteration-1/eval-N/with_skill/grading.json:
{
"eval_id": N,
"configuration": "with_skill",
"assertions_evaluated": [
{
"text": "Output includes a clear skill name",
"passed": true,
"evidence": "SKILL.md line 2: name: skill-tester"
},
{
"text": "Description has specific trigger phrases",
"passed": true,
"evidence": "Description mentions: 'validating', 'benchmarks', 'comparing performance'"
}
],
"summary": {
"assertions_total": 3,
"assertions_passed": 3,
"pass_rate": 1.0
}
}
Create identical file for baseline/grading.json with its results.
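The summary block follows directly from the assertion results. A minimal sketch of that computation, assuming the grading.json structure above (the path is a hypothetical example; repeat for baseline/grading.json):

```ruby
require "json"

path    = "./evals/skill-creator/workspace/iteration-1/eval-1/with_skill/grading.json"  # hypothetical example
grading = JSON.parse(File.read(path))

total  = grading["assertions_evaluated"].size
passed = grading["assertions_evaluated"].count { |a| a["passed"] }

grading["summary"] = {
  "assertions_total"  => total,
  "assertions_passed" => passed,
  "pass_rate"         => total.zero? ? 0.0 : (passed.to_f / total).round(2)
}
File.write(path, JSON.pretty_generate(grading))
```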
Goal: Compute benchmark.json with summary stats (pass rates, tokens, timing).
Execute the aggregation script (see references/eval-schema.md for full script):
ruby skills/skill-tester/scripts/aggregate_benchmark.rb \
./evals/<skill-name>/workspace/iteration-1
Example:
ruby skills/skill-tester/scripts/aggregate_benchmark.rb \
./evals/skill-creator/workspace/iteration-1
This reads all grading.json and timing.json files from the standardized location, outputs benchmark.json:
{
"skill_name": "skill-tester",
"iteration": 1,
"timestamp": "2026-03-04T10:30:00Z",
"evals": [
{
"eval_id": 1,
"with_skill": {
"pass_rate": 1.0,
"assertions_passed": 3,
"assertions_total": 3,
"avg_tokens": 2500,
"avg_duration_ms": 8000
},
"baseline": {
"pass_rate": 0.67,
"assertions_passed": 2,
"assertions_total": 3,
"avg_tokens": 1800,
"avg_duration_ms": 5000
},
"delta": {
"pass_rate": 0.33,
"tokens": 700,
"duration_ms": 3000
}
}
],
"summary": {
"with_skill_avg_pass_rate": 0.95,
"baseline_avg_pass_rate": 0.70,
"improvement": 0.25,
"avg_tokens_with_skill": 2400,
"avg_tokens_baseline": 1900,
"token_cost": 500
}
}
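The authoritative implementation is scripts/aggregate_benchmark.rb (see the references at the end of this document). The sketch below is only a simplified illustration of what the aggregation computes from the per-eval grading.json and timing.json files, not the script itself:

```ruby
require "json"

iteration_dir = ARGV[0]   # e.g. ./evals/skill-creator/workspace/iteration-1

# Collect pass rate and timing for one configuration (with_skill or baseline) of one eval.
def load_config(eval_dir, config)
  grading = JSON.parse(File.read(File.join(eval_dir, config, "grading.json")))
  timing  = JSON.parse(File.read(File.join(eval_dir, config, "timing.json")))
  {
    "pass_rate"         => grading["summary"]["pass_rate"],
    "assertions_passed" => grading["summary"]["assertions_passed"],
    "assertions_total"  => grading["summary"]["assertions_total"],
    "avg_tokens"        => timing["total_tokens"],
    "avg_duration_ms"   => timing["duration_ms"]
  }
end

evals = Dir.glob(File.join(iteration_dir, "eval-*")).sort.map do |eval_dir|
  with_skill = load_config(eval_dir, "with_skill")
  baseline   = load_config(eval_dir, "baseline")
  {
    "eval_id"    => File.basename(eval_dir).split("-").last.to_i,
    "with_skill" => with_skill,
    "baseline"   => baseline,
    "delta" => {
      "pass_rate"   => (with_skill["pass_rate"] - baseline["pass_rate"]).round(2),
      "tokens"      => with_skill["avg_tokens"] - baseline["avg_tokens"],
      "duration_ms" => with_skill["avg_duration_ms"] - baseline["avg_duration_ms"]
    }
  }
end

avg = ->(values) { (values.sum.to_f / values.size).round(2) }
benchmark = {
  "skill_name" => File.basename(File.expand_path("../..", iteration_dir)),
  "iteration"  => File.basename(iteration_dir).split("-").last.to_i,
  "timestamp"  => Time.now.utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
  "evals"      => evals,
  "summary" => {
    "with_skill_avg_pass_rate" => avg.call(evals.map { |e| e["with_skill"]["pass_rate"] }),
    "baseline_avg_pass_rate"   => avg.call(evals.map { |e| e["baseline"]["pass_rate"] }),
    "improvement"              => avg.call(evals.map { |e| e["delta"]["pass_rate"] }),
    "avg_tokens_with_skill"    => avg.call(evals.map { |e| e["with_skill"]["avg_tokens"] }),
    "avg_tokens_baseline"      => avg.call(evals.map { |e| e["baseline"]["avg_tokens"] })
  }
}
File.write(File.join(iteration_dir, "benchmark.json"), JSON.pretty_generate(benchmark))
```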
Goal: Show user a clear comparison table + next steps.
Display results in human-readable format:
EVALUATION RESULTS: skill-tester iteration-1
==============================================
Eval 1: [Scenario Name]
WITH SKILL: 3/3 (100%) | 2500 tokens | 8s
BASELINE: 2/3 (67%) | 1800 tokens | 5s
DELTA: +33% better | +700 tokens | +3s
Eval 2: [Scenario Name]
WITH SKILL: 2/3 (67%) | 2100 tokens | 7s
BASELINE: 2/3 (67%) | 1900 tokens | 4s
DELTA: Even | +200 tokens | +3s
SUMMARY
-------
With Skill Avg: 96.5% pass rate | 2300 avg tokens
Baseline Avg: 67.0% pass rate | 1850 avg tokens
IMPROVEMENT: +29.5 percentage points
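The comparison table can be produced straight from benchmark.json. A minimal sketch, assuming the fields shown in Phase 5 (the path is a hypothetical example):

```ruby
require "json"

benchmark = JSON.parse(File.read("./evals/skill-creator/workspace/iteration-1/benchmark.json"))  # hypothetical example

benchmark["evals"].each do |e|
  ws, bl, d = e.values_at("with_skill", "baseline", "delta")
  puts "Eval #{e["eval_id"]}"
  puts "  WITH SKILL: #{(ws["pass_rate"] * 100).round}% | #{ws["avg_tokens"]} tokens | #{ws["avg_duration_ms"] / 1000}s"
  puts "  BASELINE:   #{(bl["pass_rate"] * 100).round}% | #{bl["avg_tokens"]} tokens | #{bl["avg_duration_ms"] / 1000}s"
  puts "  DELTA:      #{(d["pass_rate"] * 100).round} points | #{d["tokens"]} tokens | #{d["duration_ms"] / 1000}s"
end

summary = benchmark["summary"]
puts "IMPROVEMENT: +#{(summary["improvement"] * 100).round(1)} percentage points"
```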
Ask user (AskUserQuestion):
question: "What would you like to do?"
header: "Next Steps"
options: [
{
label: "Iterate (update skill, run next iteration)",
description: "Based on results, modify the skill and run workspace/iteration-2"
},
{
label: "Stop (satisfied with results)",
description: "Evaluation complete. Save benchmark data for reference"
},
{
label: "Refine evals (change test cases, rerun)",
description: "Update evals.json and run iteration again with same eval set"
}
]
⏸️ Collect user input.
Goal: If user chose "Iterate", update the skill and run next iteration.
Ask user (AskUserQuestion, free-form):
question: "What would you like to improve about the skill?"
header: "Improvement"
⏸️ Collect feedback.
Use skill-refiner or direct edits to improve the skill based on the feedback.
Create workspace/iteration-2/ directory. Repeat Phases 3–6 with updated skill.
Show delta between iteration-1 benchmark and iteration-2 benchmark:
ITERATION COMPARISON
====================
Iteration 1 pass rate: 67% → Iteration 2: 95% (+28 points)
Iteration 1 tokens: 1900 → Iteration 2: 2100 (+200, acceptable)
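A minimal sketch of that comparison, assuming both iterations already have a benchmark.json in the standard location (the paths are hypothetical examples):

```ruby
require "json"

base = "./evals/skill-creator/workspace"   # hypothetical example path
s1 = JSON.parse(File.read("#{base}/iteration-1/benchmark.json"))["summary"]
s2 = JSON.parse(File.read("#{base}/iteration-2/benchmark.json"))["summary"]

pass_delta  = ((s2["with_skill_avg_pass_rate"] - s1["with_skill_avg_pass_rate"]) * 100).round(1)
token_delta = s2["avg_tokens_with_skill"] - s1["avg_tokens_with_skill"]

puts "Pass rate: #{(s1["with_skill_avg_pass_rate"] * 100).round}% -> #{(s2["with_skill_avg_pass_rate"] * 100).round}% (#{"+" if pass_delta >= 0}#{pass_delta} points)"
puts "Tokens:    #{s1["avg_tokens_with_skill"]} -> #{s2["avg_tokens_with_skill"]} (#{"+" if token_delta >= 0}#{token_delta})"
```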
Loop back to Phase 6 (Step 6.2) to ask: iterate again or stop?
For full schemas, workflow details, and the aggregate_benchmark.rb script code, see the references/ and scripts/ files:
- references/eval-schema.md — JSON schemas for all eval files
- references/workflow.md — decision points and detailed workflow
- scripts/aggregate_benchmark.rb — Ruby aggregation script