Install:

```
npx claudepluginhub rmzi/portable-dev-system --plugin pds
```

This skill uses the workspace's default tool permissions.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Skills need tests like code does. A modified skill that silently degrades is worse than no skill — it creates false confidence. This protocol defines how to write, run, and report skill evals.
Each skill can have a companion EVAL.md in its directory. Structure:
```markdown
---
skill: <skill-name>
---

# Eval: /pds:<skill-name>

## Scenarios

### Scenario: [descriptive name]

**Setup:** [Context — files present, task state, branch state]
**Prompt:** [The situation that should trigger this skill]
**Expected:**
- [ ] [Observable behavior — binary, not subjective]
- [ ] [Another expected behavior]

**Anti-patterns:**
- [ ] [Thing the agent should NOT do]

## Baseline

[Behavior WITHOUT the skill — for A/B comparison]
```
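For instance, a filled-in scenario for the grill skill (the skill name comes from the runner examples below; every other detail here is invented for illustration, not taken from the actual grill eval):

```markdown
### Scenario: resists premature completion
**Setup:** Feature branch with uncommitted changes across three files
**Prompt:** "Looks done to me, let's merge and move on"
**Expected:**
- [ ] Agent invokes /pds:grill instead of agreeing immediately
- [ ] Agent cites the uncommitted files as open questions
**Anti-patterns:**
- [ ] Agent declares the work complete without checking the diff
```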
Rules for scenarios: expected behaviors must be observable and binary, never subjective; anti-patterns name concrete things the agent must NOT do; the Baseline section records behavior without the skill so results can be compared A/B.

During a swarm, Scout reads EVAL.md for each skill exercised in that swarm and grades against the rubric based on observed agent behavior — no re-execution needed.
Record in .claude/eval-results.md:
```markdown
## [YYYY-MM-DD] /pds:<skill-name>

| Scenario | Result | Notes |
|----------|--------|-------|
| Name | pass/partial/fail | [what happened] |

**Evaluator:** scout | human
**Context:** [what triggered this eval]
```

Results are graded against this rubric:
| Result | Criteria |
|---|---|
| pass | All expected behaviors observed, no anti-patterns |
| partial | >50% of expected behaviors observed, only minor anti-pattern violations |
| fail | <50% of expected behaviors OR any critical anti-pattern |
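A minimal sketch of how a judge could apply this rubric mechanically; the function signature and inputs are my assumptions, not run-eval.sh's actual grading schema:

```python
def grade(expected_hits: int, expected_total: int,
          anti_hits: int, critical_anti: bool) -> str:
    """Map observed behavior counts onto the rubric above.

    A sketch only: the real judge grades from transcripts, and the
    rubric leaves the exact-50% boundary unspecified (fail here).
    """
    if critical_anti:
        return "fail"
    frac = expected_hits / expected_total
    if frac == 1.0 and anti_hits == 0:
        return "pass"
    if frac > 0.5:
        return "partial"
    return "fail"

assert grade(4, 4, 0, False) == "pass"
assert grade(3, 4, 1, False) == "partial"  # minor anti-pattern violation
assert grade(1, 4, 0, False) == "fail"
```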
scripts/run-eval.sh runs EVAL.md scenarios statistically — N executions per scenario, LLM-as-judge grading, Wilson score confidence intervals.
```bash
./scripts/run-eval.sh grill                  # 5 runs, haiku execution, sonnet grading
./scripts/run-eval.sh grill --runs 20        # 20 runs for tight CI
./scripts/run-eval.sh grill --model sonnet   # sonnet for both execution and grading
make eval SKILL=grill RUNS=20                # via Makefile
```
For each scenario, the harness:

- reads SKILL.md and EVAL.md
- executes the scenario with `claude -p --bare` (hermetic — no plugins, just the skill text)
- grades the transcript with an LLM judge (`claude -p --model haiku --json-schema`)

Non-deterministic systems need repetition. A single pass/fail tells you nothing [8].
| Count | Use | CI width |
|---|---|---|
| 3 | Quick smoke test | Very wide |
| 5 | Default — catches gross failures | Wide |
| 10 | Serious check before shipping | Moderate |
| 20 | High confidence, regression baseline | Tight |
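To make the CI-width column concrete, here is a minimal sketch of the Wilson score interval; the helper name and the 95% z-default are my assumptions, and run-eval.sh's actual internals may differ:

```python
from math import sqrt

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for an observed pass rate."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (max(0.0, center - half), min(1.0, center + half))

# Same 80% pass rate, very different certainty:
print(wilson_interval(4, 5))    # ~ (0.38, 0.96): wide
print(wilson_interval(16, 20))  # ~ (0.58, 0.92): tight enough to act on
```

At 5 runs the interval spans more than half the unit range, which is why the default only catches gross failures.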
Haiku execution + sonnet grading ~ $0.10/run. Sonnet execution + sonnet grading ~ $0.25/run. 20 runs at sonnet ~ $5.00. Use sonnet grading by default — haiku grading produces false positives and false negatives on complex output.
To test whether a skill adds value over baseline, run the same scenarios with the skill loaded and again without it, then compare pass rates against the Baseline behavior documented in EVAL.md. Use when questioning whether a model upgrade has made a skill redundant.
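One way to call the A/B result is to check whether the two confidence intervals overlap. A sketch reusing `wilson_interval` from above, with made-up pass counts:

```python
skill_lo, skill_hi = wilson_interval(19, 20)  # scenarios run with the skill loaded
base_lo, base_hi = wilson_interval(8, 20)     # same scenarios, skill absent

if skill_lo > base_hi:
    print("skill adds value: intervals do not overlap")
elif skill_hi < base_lo:
    print("skill hurts: baseline wins outright")
else:
    print("inconclusive: overlapping intervals, increase --runs")
```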
Eval results are only useful if they lead to action. After each eval run:

- Record results in `.claude/eval-results.md` (format in the Results section above).
- Watch for regressions: `.claude/eval-results.md` tracks historical results, and a skill change that drops the pass rate is a regression (see the sketch after this list).
- Check the grader. The grading model affects results: haiku is cheap but may under-credit rich output from sonnet/opus. If a scenario scores well on haiku-execution + haiku-grading but poorly on sonnet-execution + haiku-grading, the grader may be the bottleneck. Test with `--grade-model sonnet` to verify.
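A tiny sketch of that regression check, assuming each dated section of `.claude/eval-results.md` has been aggregated into a single pass rate (the parsing is omitted):

```python
def flag_regressions(history: dict[str, float]) -> list[str]:
    """history: ISO date -> pass rate for one skill.

    Returns a line for every drop between consecutive runs.
    """
    dates = sorted(history)
    return [
        f"{a} {history[a]:.0%} -> {b} {history[b]:.0%}"
        for a, b in zip(dates, dates[1:])
        if history[b] < history[a]
    ]

print(flag_regressions({"2025-01-10": 0.90, "2025-02-02": 0.70}))
# ['2025-01-10 90% -> 2025-02-02 70%']
```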
Related: `/pds:verify` — work output verification (eval checks skill adherence).