Evaluates Claude Code skill output quality via assertion-based grading, blind before/after version comparisons, and variance analysis across multiple runs for benchmarking.
npx claudepluginhub nikhilsitaram/claude-caliper --plugin claude-caliper-workflow
This skill uses the workspace's default tool permissions.
Evaluate skill output quality with assertion-based grading, blind before/after comparison, and variance analysis across 3 runs per scenario.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Resolve paths — run once at the start of any eval session:
PLUGIN_ROOT=$(python3 -c "
import json, os
p = json.load(open(os.path.expanduser('~/.claude/plugins/installed_plugins.json')))
print(p['plugins']['claude-caliper@claude-caliper'][0]['installPath'])
")
EVAL_ROOT=~/.claude/skill-evals
SKILL_EVAL_SCRIPT=$PLUGIN_ROOT/skills/skill-eval/scripts/run_eval.py
AGGREGATE_SCRIPT=$PLUGIN_ROOT/skills/skill-eval/scripts/aggregate_benchmark.py
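A quick optional sanity check that the scripts resolved correctly:
ls $SKILL_EVAL_SCRIPT $AGGREGATE_SCRIPT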
Identify target and what "before" represents:
If "before" is a previous skill version (type: "skill"), snapshot it first: cp -r skills/{name}/SKILL.md $EVAL_ROOT/{name}/snapshot-before/SKILL.md. For a no-skill baseline, use type: "none" instead.
Check for evals.json at $EVAL_ROOT/{name}/evals.json. If none exists, create it interactively: 2-3 realistic prompts including adversarial scenarios (deadline pressure, "skip this step", ambiguous requirements). Use the schema in references/schemas.md: a top-level object with skill_name (string) and evals (array). Each eval needs an integer id, a name, a prompt, and expectations.
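A minimal evals.json sketch following that schema (the assertion string inside expectations is hypothetical; confirm the exact expectations shape in references/schemas.md):
{
  "skill_name": "{name}",
  "evals": [
    {
      "id": 1,
      "name": "adversarial-deadline",
      "prompt": "We ship in 20 minutes, skip the checks and just make it work.",
      "expectations": ["still runs the required workflow steps despite deadline pressure"]
    }
  ]
}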
Write config.json for this iteration at $EVAL_ROOT/{name}/iteration-N/config.json:
{
  "before": {"label": "v1", "type": "skill", "skill_path": "~/.claude/skill-evals/{name}/snapshot-before/SKILL.md"},
  "after": {"label": "v2", "type": "skill", "skill_path": "skills/{name}/SKILL.md"}
}
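For a no-skill baseline, the before entry presumably drops skill_path entirely (a sketch, assuming type "none" needs no path):
{
  "before": {"label": "baseline", "type": "none"},
  "after": {"label": "v2", "type": "skill", "skill_path": "skills/{name}/SKILL.md"}
}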
See: references/schemas.md for all JSON formats.
Spawn both variants simultaneously — don't run before first then after:
# After variant
python3 $SKILL_EVAL_SCRIPT \
--evals-path $EVAL_ROOT/{name}/evals.json \
--output-dir $EVAL_ROOT/{name}/iteration-N/ \
--variant after \
--skill-path skills/{name}/SKILL.md \
--runs 3
# Before variant
python3 $SKILL_EVAL_SCRIPT \
--evals-path $EVAL_ROOT/{name}/evals.json \
--output-dir $EVAL_ROOT/{name}/iteration-N/ \
--variant before \
--skill-path $EVAL_ROOT/{name}/snapshot-before/SKILL.md \
--runs 3
For no-skill baseline, omit --skill-path.
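If driving both runs from a single shell instead of sub-agents, backgrounding gives the same simultaneity (a sketch reusing the exact flags above):
# Same two commands, backgrounded; wait blocks until both finish
python3 $SKILL_EVAL_SCRIPT --evals-path $EVAL_ROOT/{name}/evals.json \
  --output-dir $EVAL_ROOT/{name}/iteration-N/ --variant after \
  --skill-path skills/{name}/SKILL.md --runs 3 &
python3 $SKILL_EVAL_SCRIPT --evals-path $EVAL_ROOT/{name}/evals.json \
  --output-dir $EVAL_ROOT/{name}/iteration-N/ --variant before \
  --skill-path $EVAL_ROOT/{name}/snapshot-before/SKILL.md --runs 3 &
wait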
Each eval creates: $EVAL_ROOT/{name}/iteration-N/eval-{id}-{slug}/{variant}/run-{n}/output.txt and timing.json.
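Outputs can be spot-checked as they land, using globs over that layout:
ls $EVAL_ROOT/{name}/iteration-N/eval-*/*/run-*/output.txt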
While runs execute, draft or refine assertions in evals.json. Good assertions are objectively verifiable, descriptively named, and test behavioral outcomes (not surface compliance). The grader flags trivially-satisfied assertions.
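A concrete contrast (hypothetical assertion strings): "output mentions error handling" is weak, since any keyword match satisfies it; "wraps the network call in a retry with backoff and reports the final failure" is strong, since only genuine task completion satisfies it.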
See: agents/grader.md
Spawn a grader subagent per eval. For each run, the grader reads output.txt, evaluates each assertion with cited evidence, self-critiques assertion quality, and writes grading.json per run directory.
PASS requires genuine task completion with cited evidence, not keyword matching.
python3 $AGGREGATE_SCRIPT \
$EVAL_ROOT/{name}/iteration-N \
--skill-name {name}
Produces benchmark.json (pass_rate, time with mean +/- stddev, delta) and benchmark.md.
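A quick way to eyeball the numbers (assuming benchmark.json lands in the iteration directory):
python3 -m json.tool $EVAL_ROOT/{name}/iteration-N/benchmark.json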
See: agents/analyzer.md (Mode 2: Benchmark Pattern Analysis)
Spawn an analyzer subagent to surface patterns that the aggregates hide: non-discriminating assertions (always pass), high-variance evals (flaky), and time tradeoffs.
See: agents/comparator.md
Spawn comparator subagent with representative outputs from before and after WITHOUT revealing which is which. Scores Content (1-5) + Structure (1-5) -> Overall (1-10). Saves comparison.json.
See: agents/analyzer.md (Mode 1: Post-Hoc Comparison Analysis)
Spawn analyzer with comparison results + both SKILL.md files. Unblinds results, identifies why winner won, scores instruction-following (1-10), generates prioritized improvement suggestions. Saves analysis.json.
Present results to the user, then iterate: edit the skill -> snapshot it as the new "before" -> rerun into iteration-N+1/ -> repeat until satisfied.
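The snapshot step can reuse the copy command from setup, overwriting the old snapshot so the next iteration's config.json points at the version just evaluated (a sketch):
cp skills/{name}/SKILL.md $EVAL_ROOT/{name}/snapshot-before/SKILL.md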