Help us improve
Share bugs, ideas, or general feedback.
From evaluate-plugin
Batch evaluates all skills in a plugin by running /evaluate:skill on those with eval cases, then aggregates a plugin-level quality report. Use for auditing plugin quality or pre-release checks.
npx claudepluginhub laurigates/claude-plugins --plugin evaluate-pluginHow this skill is triggered — by the user, by Claude, or both
Slash command
/evaluate-plugin:evaluate-plugin-batchopusThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Batch evaluate all skills in a plugin. Runs `/evaluate:skill` for each skill, then produces a plugin-level quality report.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Tests skills for correct agent behavior via EVAL.md scenarios after modifications, periodic reviews, or model upgrades. Supports manual, scout, and automated bash-script evals.
Share bugs, ideas, or general feedback.
Batch evaluate all skills in a plugin. Runs /evaluate:skill for each skill, then produces a plugin-level quality report.
| Use this skill when... | Use alternative when... |
|---|---|
| Auditing all skills in a plugin before release | Evaluating a single skill -> /evaluate:skill |
| Establishing quality baselines across a plugin | Viewing past results -> /evaluate:report |
| Checking overall plugin quality after refactoring | Need structural compliance -> plugin-compliance-check.sh |
find $1/skills -name "SKILL.md" -maxdepth 3find $1/skills -name "evals.json" -maxdepth 3Parse these from $ARGUMENTS:
| Parameter | Default | Description |
|---|---|---|
<plugin-name> | required | Name of the plugin to evaluate |
--create-missing-evals | false | Generate evals for skills that lack them |
--parallel N | 1 | Max concurrent skill evaluations |
Find all skills in the plugin:
<plugin-name>/skills/*/SKILL.md
List them and count the total.
For each skill, check if evals.json exists:
--create-missing-evals: include, will create evals during evaluationReport the breakdown:
Found N skills in <plugin-name>:
- M with eval cases
- K without eval cases (skipped | will create)
For each included skill, invoke /evaluate:skill via the SlashCommand tool:
SlashCommand: /evaluate:skill <plugin-name>/<skill-name> [--create-evals]
If --parallel N is set and N > 1, batch evaluations into groups of N. Otherwise, run sequentially.
Track progress with TodoWrite — mark each skill as it completes.
After all skill evaluations complete, read each skill's benchmark.json and aggregate:
bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin-name>
Write aggregated results to <plugin-name>/eval-results/plugin-benchmark.json.
Print a plugin-level summary table:
## Plugin Evaluation: <plugin-name>
| Skill | Evals | Pass Rate | Status |
|-------|-------|-----------|--------|
| skill-a | 4 | 100% | PASS |
| skill-b | 3 | 67% | PARTIAL |
| skill-c | 5 | 80% | PASS |
**Overall**: 82% pass rate across N eval cases
Rank skills by pass rate. Flag any below 50% as needing attention.
| Context | Command |
|---|---|
| List plugin skills | ls -d <plugin>/skills/*/SKILL.md |
| Check for evals | find <plugin>/skills -name evals.json |
| Count skills | ls -d <plugin>/skills/*/SKILL.md | wc -l |
| Aggregate results | bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin> |
| Flag | Description |
|---|---|
--create-missing-evals | Generate eval cases for skills without them |
--parallel N | Max concurrent evaluations (default: 1) |