Evaluates Claude Code skill output quality via assertion-based grading, blind before/after version comparisons, and variance analysis across multiple runs for benchmarking.
npx claudepluginhub nikhilsitaram/claude-caliper --plugin claude-caliper-workflow
This skill uses the workspace's default tool permissions.
Evaluate skill output quality with assertion-based grading, blind before/after comparison, and variance analysis across 3 runs per scenario.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Resolve paths — run once at the start of any eval session:
PLUGIN_ROOT=$(python3 -c "
import json, os
p = json.load(open(os.path.expanduser('~/.claude/plugins/installed_plugins.json')))
print(p['plugins']['claude-caliper@claude-caliper'][0]['installPath'])
")
EVAL_ROOT=~/.claude/skill-evals
SKILL_EVAL_SCRIPT=$PLUGIN_ROOT/skills/skill-eval/scripts/run_eval.py
AGGREGATE_SCRIPT=$PLUGIN_ROOT/skills/skill-eval/scripts/aggregate_benchmark.py
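A quick optional sanity check that the scripts resolved correctly:
ls $SKILL_EVAL_SCRIPT $AGGREGATE_SCRIPT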
Identify target and what "before" represents:
If "before" is a previous skill version (type: "skill"), snapshot it first: cp -r skills/{name}/SKILL.md $EVAL_ROOT/{name}/snapshot-before/SKILL.md. For a no-skill baseline, use type: "none" instead.
Check for evals.json at $EVAL_ROOT/{name}/evals.json. If none exists, create it interactively: 2-3 realistic prompts including adversarial scenarios (deadline pressure, "skip this step", ambiguous requirements). Use the schema in references/schemas.md: a top-level object with skill_name (string) and evals (array). Each eval needs an integer id, a name, a prompt, and expectations.
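A minimal evals.json sketch following that schema (the assertion string inside expectations is hypothetical; confirm the exact expectations shape in references/schemas.md):
{
  "skill_name": "{name}",
  "evals": [
    {
      "id": 1,
      "name": "adversarial-deadline",
      "prompt": "We ship in 20 minutes, skip the checks and just make it work.",
      "expectations": ["still runs the required workflow steps despite deadline pressure"]
    }
  ]
}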
Write config.json for this iteration at $EVAL_ROOT/{name}/iteration-N/config.json:
{
  "before": {"label": "v1", "type": "skill", "skill_path": "~/.claude/skill-evals/{name}/snapshot-before/SKILL.md"},
  "after": {"label": "v2", "type": "skill", "skill_path": "skills/{name}/SKILL.md"}
}
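For a no-skill baseline, the before entry presumably drops skill_path entirely (a sketch, assuming type "none" needs no path):
{
  "before": {"label": "baseline", "type": "none"},
  "after": {"label": "v2", "type": "skill", "skill_path": "skills/{name}/SKILL.md"}
}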
See: references/schemas.md for all JSON formats.
Spawn both variants simultaneously — don't run before first then after:
# After variant
python3 $SKILL_EVAL_SCRIPT \
--evals-path $EVAL_ROOT/{name}/evals.json \
--output-dir $EVAL_ROOT/{name}/iteration-N/ \
--variant after \
--skill-path skills/{name}/SKILL.md \
--runs 3
# Before variant
python3 $SKILL_EVAL_SCRIPT \
--evals-path $EVAL_ROOT/{name}/evals.json \
--output-dir $EVAL_ROOT/{name}/iteration-N/ \
--variant before \
--skill-path $EVAL_ROOT/{name}/snapshot-before/SKILL.md \
--runs 3
For no-skill baseline, omit --skill-path.
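If driving both runs from a single shell instead of sub-agents, backgrounding gives the same simultaneity (a sketch reusing the exact flags above):
# Same two commands, backgrounded; wait blocks until both finish
python3 $SKILL_EVAL_SCRIPT --evals-path $EVAL_ROOT/{name}/evals.json \
  --output-dir $EVAL_ROOT/{name}/iteration-N/ --variant after \
  --skill-path skills/{name}/SKILL.md --runs 3 &
python3 $SKILL_EVAL_SCRIPT --evals-path $EVAL_ROOT/{name}/evals.json \
  --output-dir $EVAL_ROOT/{name}/iteration-N/ --variant before \
  --skill-path $EVAL_ROOT/{name}/snapshot-before/SKILL.md --runs 3 &
wait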
Each eval creates: $EVAL_ROOT/{name}/iteration-N/eval-{id}-{slug}/{variant}/run-{n}/output.txt and timing.json.
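Outputs can be spot-checked as they land, using globs over that layout:
ls $EVAL_ROOT/{name}/iteration-N/eval-*/*/run-*/output.txt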
While runs execute, draft or refine assertions in evals.json. Good assertions are objectively verifiable, descriptively named, and test behavioral outcomes (not surface compliance). The grader flags trivially-satisfied assertions.
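A concrete contrast (hypothetical assertion strings): "output mentions error handling" is weak, since any keyword match satisfies it; "wraps the network call in a retry with backoff and reports the final failure" is strong, since only genuine task completion satisfies it.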
See: agents/grader.md
Spawn a grader subagent per eval. For each run, the grader reads output.txt, evaluates each assertion with cited evidence, self-critiques assertion quality, and writes grading.json per run directory.
PASS requires genuine task completion with cited evidence, not keyword matching.
python3 $AGGREGATE_SCRIPT \
$EVAL_ROOT/{name}/iteration-N \
--skill-name {name}
Produces benchmark.json (pass_rate, time with mean +/- stddev, delta) and benchmark.md.
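A quick way to eyeball the numbers (assuming benchmark.json lands in the iteration directory):
python3 -m json.tool $EVAL_ROOT/{name}/iteration-N/benchmark.json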
See: agents/analyzer.md (Mode 2: Benchmark Pattern Analysis)
Spawn an analyzer subagent to surface patterns that the aggregates hide: non-discriminating assertions (always pass), high-variance evals (flaky), and time tradeoffs.
See: agents/comparator.md
Spawn comparator subagent with representative outputs from before and after WITHOUT revealing which is which. Scores Content (1-5) + Structure (1-5) -> Overall (1-10). Saves comparison.json.
See: agents/analyzer.md (Mode 1: Post-Hoc Comparison Analysis)
Spawn analyzer with comparison results + both SKILL.md files. Unblinds results, identifies why winner won, scores instruction-following (1-10), generates prioritized improvement suggestions. Saves analysis.json.
Present results to the user, then iterate: edit the skill -> snapshot it as the new "before" -> rerun into iteration-N+1/ -> repeat until satisfied.
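The snapshot step can reuse the copy command from setup, overwriting the old snapshot so the next iteration's config.json points at the version just evaluated (a sketch):
cp skills/{name}/SKILL.md $EVAL_ROOT/{name}/snapshot-before/SKILL.md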