Install:

```
npx claudepluginhub rmzi/portable-dev-system --plugin pds
```

This skill uses the workspace's default tool permissions.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Skills need tests like code does. A modified skill that silently degrades is worse than no skill — it creates false confidence. This protocol defines how to write, run, and report skill evals.
Each skill can have a companion EVAL.md in its directory. Structure:
```markdown
---
skill: <skill-name>
---

# Eval: /pds:<skill-name>

## Scenarios

### Scenario: [descriptive name]

**Setup:** [Context — files present, task state, branch state]
**Prompt:** [The situation that should trigger this skill]
**Expected:**
- [ ] [Observable behavior — binary, not subjective]
- [ ] [Another expected behavior]

**Anti-patterns:**
- [ ] [Thing the agent should NOT do]

## Baseline

[Behavior WITHOUT the skill — for A/B comparison]
```
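For instance, a filled-in scenario for the grill skill (the skill name comes from the runner examples below; every other detail here is invented for illustration, not taken from the actual grill eval):

```markdown
### Scenario: resists premature completion
**Setup:** Feature branch with uncommitted changes across three files
**Prompt:** "Looks done to me, let's merge and move on"
**Expected:**
- [ ] Agent invokes /pds:grill instead of agreeing immediately
- [ ] Agent cites the uncommitted files as open questions
**Anti-patterns:**
- [ ] Agent declares the work complete without checking the diff
```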
Rules for scenarios: expected behaviors must be observable and binary, never subjective; anti-patterns name concrete things the agent must NOT do; the Baseline section records behavior without the skill so results can be compared A/B.

During a swarm, Scout reads EVAL.md for each skill exercised in that swarm and grades against the rubric based on observed agent behavior — no re-execution needed.
Record in .claude/eval-results.md:
```markdown
## [YYYY-MM-DD] /pds:<skill-name>

| Scenario | Result | Notes |
|----------|--------|-------|
| Name | pass/partial/fail | [what happened] |

**Evaluator:** scout | human
**Context:** [what triggered this eval]
```

Results are graded against this rubric:
| Result | Criteria |
|---|---|
| pass | All expected behaviors observed, no anti-patterns |
| partial | >50% of expected behaviors observed, only minor anti-pattern violations |
| fail | <50% of expected behaviors OR any critical anti-pattern |
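A minimal sketch of how a judge could apply this rubric mechanically; the function signature and inputs are my assumptions, not run-eval.sh's actual grading schema:

```python
def grade(expected_hits: int, expected_total: int,
          anti_hits: int, critical_anti: bool) -> str:
    """Map observed behavior counts onto the rubric above.

    A sketch only: the real judge grades from transcripts, and the
    rubric leaves the exact-50% boundary unspecified (fail here).
    """
    if critical_anti:
        return "fail"
    frac = expected_hits / expected_total
    if frac == 1.0 and anti_hits == 0:
        return "pass"
    if frac > 0.5:
        return "partial"
    return "fail"

assert grade(4, 4, 0, False) == "pass"
assert grade(3, 4, 1, False) == "partial"  # minor anti-pattern violation
assert grade(1, 4, 0, False) == "fail"
```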
scripts/run-eval.sh runs EVAL.md scenarios statistically — N executions per scenario, LLM-as-judge grading, Wilson score confidence intervals.
```bash
./scripts/run-eval.sh grill                  # 5 runs, haiku execution, sonnet grading
./scripts/run-eval.sh grill --runs 20        # 20 runs for tight CI
./scripts/run-eval.sh grill --model sonnet   # sonnet for both execution and grading
make eval SKILL=grill RUNS=20                # via Makefile
```
For each scenario, the harness:

- reads SKILL.md and EVAL.md
- executes the scenario with `claude -p --bare` (hermetic — no plugins, just the skill text)
- grades the transcript with an LLM judge (`claude -p --model haiku --json-schema`)

Non-deterministic systems need repetition. A single pass/fail tells you nothing [8].
| Count | Use | CI width |
|---|---|---|
| 3 | Quick smoke test | Very wide |
| 5 | Default — catches gross failures | Wide |
| 10 | Serious check before shipping | Moderate |
| 20 | High confidence, regression baseline | Tight |
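To make the CI-width column concrete, here is a minimal sketch of the Wilson score interval; the helper name and the 95% z-default are my assumptions, and run-eval.sh's actual internals may differ:

```python
from math import sqrt

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for an observed pass rate."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return (max(0.0, center - half), min(1.0, center + half))

# Same 80% pass rate, very different certainty:
print(wilson_interval(4, 5))    # ~ (0.38, 0.96): wide
print(wilson_interval(16, 20))  # ~ (0.58, 0.92): tight enough to act on
```

At 5 runs the interval spans more than half the unit range, which is why the default only catches gross failures.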
Haiku execution + sonnet grading ~ $0.10/run. Sonnet execution + sonnet grading ~ $0.25/run. 20 runs at sonnet ~ $5.00. Use sonnet grading by default — haiku grading produces false positives and false negatives on complex output.
To test whether a skill adds value over baseline, run the same scenarios with the skill loaded and again without it, then compare pass rates against the Baseline behavior documented in EVAL.md. Use when questioning whether a model upgrade has made a skill redundant.
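One way to call the A/B result is to check whether the two confidence intervals overlap. A sketch reusing `wilson_interval` from above, with made-up pass counts:

```python
skill_lo, skill_hi = wilson_interval(19, 20)  # scenarios run with the skill loaded
base_lo, base_hi = wilson_interval(8, 20)     # same scenarios, skill absent

if skill_lo > base_hi:
    print("skill adds value: intervals do not overlap")
elif skill_hi < base_lo:
    print("skill hurts: baseline wins outright")
else:
    print("inconclusive: overlapping intervals, increase --runs")
```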
Eval results are only useful if they lead to action. After each eval run:

- Record results in `.claude/eval-results.md` (format in the Results section above).
- Watch for regressions: `.claude/eval-results.md` tracks historical results, and a skill change that drops the pass rate is a regression (see the sketch after this list).
- Check the grader. The grading model affects results: haiku is cheap but may under-credit rich output from sonnet/opus. If a scenario scores well on haiku-execution + haiku-grading but poorly on sonnet-execution + haiku-grading, the grader may be the bottleneck. Test with `--grade-model sonnet` to verify.
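A tiny sketch of that regression check, assuming each dated section of `.claude/eval-results.md` has been aggregated into a single pass rate (the parsing is omitted):

```python
def flag_regressions(history: dict[str, float]) -> list[str]:
    """history: ISO date -> pass rate for one skill.

    Returns a line for every drop between consecutive runs.
    """
    dates = sorted(history)
    return [
        f"{a} {history[a]:.0%} -> {b} {history[b]:.0%}"
        for a, b in zip(dates, dates[1:])
        if history[b] < history[a]
    ]

print(flag_regressions({"2025-01-10": 0.90, "2025-02-02": 0.70}))
# ['2025-01-10 90% -> 2025-02-02 70%']
```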
Related: `/pds:verify` — work output verification (eval checks skill adherence).