Help us improve
Share bugs, ideas, or general feedback.
From evaluate-plugin
Views evaluation results and benchmark reports for Claude Code skills and plugins. Reviews past evals, compares benchmark runs, and tracks quality trends via tables.
npx claudepluginhub laurigates/claude-plugins --plugin evaluate-pluginHow this skill is triggered — by the user, by Claude, or both
Slash command
/evaluate-plugin:evaluate-reporthaikuThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
View evaluation results and benchmark reports for a skill or plugin.
Reviews evaluation results for Claude Code skills. Presents judge scores and skill outputs for human feedback, then proposes SKILL.md improvements based on user input.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Evaluates skill output quality via assertion-based grading, blind before/after comparison, and variance analysis across 3 runs per scenario. Use for benchmarking, comparing skill versions, or triggered by /skill-eval.
Share bugs, ideas, or general feedback.
View evaluation results and benchmark reports for a skill or plugin.
| Use this skill when... | Use alternative when... |
|---|---|
| Want to see results from a past evaluation | Need to run new evaluations -> /evaluate:skill |
| Comparing benchmark runs over time | Want improvement suggestions -> /evaluate:improve |
| Reviewing quality trends for a plugin | Need structural compliance -> plugin-compliance-check.sh |
Parse these from $ARGUMENTS:
| Parameter | Default | Description |
|---|---|---|
<target> | required | plugin/skill-name for skill or plugin-name for plugin |
--latest | true | Show most recent benchmark only |
--history | false | Show all benchmark iterations |
--compare | false | Compare current vs previous benchmark |
Determine if $ARGUMENTS refers to a skill or plugin:
/: treat as plugin-name/skill-name, look for <plugin>/skills/<skill>/eval-results/benchmark.json/: treat as plugin-name, look for <plugin>/eval-results/plugin-benchmark.jsonIf the target file does not exist, report that no evaluation results are available and suggest running /evaluate:skill or /evaluate:plugin.
Skill-level report (--latest or default):
Read benchmark.json and display:
## Evaluation Report: <plugin/skill-name>
**Last run**: <timestamp>
**Runs per eval**: N
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
### Per-Eval Results
| Eval ID | Description | Pass Rate | Failures |
|---------|-------------|-----------|----------|
| eval-001 | Basic usage | 100% | — |
| eval-002 | Edge case | 67% | assertion X |
Skill-level history (--history):
Read history.json and display improvement iterations:
## Evaluation History: <plugin/skill-name>
| Version | Date | Pass Rate | Changes |
|---------|------|-----------|---------|
| v3 (current) | 2026-03-04 | 85% | Improved step 3 instructions |
| v2 | 2026-03-01 | 72% | Added error handling |
| v1 | 2026-02-28 | 55% | Initial evaluation |
Skill-level comparison (--compare):
Read current and previous benchmarks, show delta:
## Comparison: <plugin/skill-name>
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| Pass Rate | 72% | 85% | +13% |
| Duration | 16s | 14s | -2s |
Improved evals: eval-002, eval-004
Regressed evals: none
Plugin-level report:
Read plugin-benchmark.json and display:
## Plugin Report: <plugin-name>
| Skill | Evals | Pass Rate | Status |
|-------|-------|-----------|--------|
| skill-a | 4 | 100% | PASS |
| skill-b | 3 | 67% | NEEDS WORK |
**Overall**: 82% pass rate
**Skills evaluated**: N / M total
| Context | Command |
|---|---|
| Find skill benchmark | cat <plugin>/skills/<skill>/eval-results/benchmark.json | jq .summary |
| Find plugin benchmark | cat <plugin>/eval-results/plugin-benchmark.json | jq .summary |
| Check for history | cat <plugin>/skills/<skill>/eval-results/history.json | jq '.iterations | length' |
| List available results | find <plugin> -name benchmark.json -o -name plugin-benchmark.json |
| Flag | Description |
|---|---|
--latest | Most recent benchmark (default) |
--history | Show all iterations |
--compare | Compare current vs previous |