From all-skills
Evaluates and benchmarks Agent Skills using static analysis and A/B testing. Measures activation accuracy, quality scorecards, and description optimization.
npx claudepluginhub vinnie357/claude-skills --plugin qa
How this skill is triggered: by the user, by Claude, or both
Slash command
/all-skills:claude-skills-benchmark
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Evaluate Agent Skills through static analysis and evaluation-driven methodology. Source: Anthropic's skill evaluation guidance.
Evaluates Claude Agent Skills quality via static analysis scorecard, A/B testing, and multi-model benchmarks. Use for measuring activation rates and optimizing descriptions.
Creates evals for skills and runs the benchmark harness to measure whether a skill improves model behavior. Use when testing, benchmarking, or evaluating a skill's quality.
Diagnoses and optimizes Agent Skills (SKILL.md) by scanning session transcripts for underused skills, wasted context, and CSO issues, then outputting a prioritized report.
Activate when:
/benchmark-skills command

Run these checks against every skill to produce a quality scorecard:
| Check | Pass Criteria |
|---|---|
| Description length | Non-empty, max 1024 chars |
| Description has "Use when" | Contains activation triggers |
| Description third person | No "I can", "You can" |
| Name kebab-case | Matches ^[a-z0-9]+(-[a-z0-9]+)*$ |
| Name max 64 chars | Length check |
| No reserved words | No "anthropic"/"claude" in name |
| SKILL.md max 500 lines | Line count |
| Has examples | Contains code blocks or example sections |
| Reference depth | No nested references (one level only) |
| Anti-fabrication present | Contains anti-fab rules or references core:anti-fabrication |
| Source documented | Skill appears in plugin's sources.md |
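A minimal sketch of how several of the checks above could be automated, assuming the skill's name, description, and SKILL.md body have already been loaded as strings. The function names, the result keys, and the heuristic for "Has examples" are illustrative assumptions, not part of the actual benchmark harness. Checks that need repository context (reference depth, anti-fabrication, sources.md) are left out.

```python
import re

# Pattern taken from the "Name kebab-case" row of the table.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
RESERVED_WORDS = ("anthropic", "claude")
# Word boundaries avoid false hits such as "API can".
FIRST_OR_SECOND_PERSON = re.compile(r"\b(i|you) can\b", re.IGNORECASE)
FENCE = "`" * 3  # Markdown code-fence marker


def run_static_checks(name: str, description: str, body: str) -> dict[str, bool]:
    """Return pass/fail results for a subset of the scorecard checks (hypothetical helper)."""
    return {
        "description_length": bool(description.strip()) and len(description) <= 1024,
        "description_use_when": "use when" in description.lower(),
        "description_third_person": FIRST_OR_SECOND_PERSON.search(description) is None,
        "name_kebab_case": KEBAB_CASE.fullmatch(name) is not None,
        "name_max_64": len(name) <= 64,
        "no_reserved_words": not any(w in name.lower() for w in RESERVED_WORDS),
        "skill_md_max_500_lines": len(body.splitlines()) <= 500,
        # Assumed heuristic: a fenced block or the word "example" counts.
        "has_examples": FENCE in body or "example" in body.lower(),
    }


def score(results: dict[str, bool]) -> str:
    """Format results as passed/total, mirroring the Score column of the scorecard."""
    return f"{sum(results.values())}/{len(results)}"


if __name__ == "__main__":
    results = run_static_checks(
        name="git-helper",
        description="Manages git workflows. Use when committing or rebasing.",
        body="# Git helper\n\nExample: stage files before committing.\n",
    )
    for check, passed in results.items():
        print(f"{check}: {'Pass' if passed else 'Fail'}")
    print("Score:", score(results))
```

Returning a dictionary keyed by check name keeps each row of the table individually reportable, so a failing check can be shown next to its pass criterion.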
Classify each skill for appropriate evaluation:
Capability Uplift: Enhances Claude's core abilities (coding, analysis, reasoning). Stable across model versions. Test by comparing base model performance with and without the skill.
Encoded Preference: Encodes user-specific workflows, formatting, and conventions. May need updates when models change. Test by verifying fidelity to the encoded workflow.
For each skill, create:
Compare skill performance using independent agents:
Test across model tiers to verify consistency:
| Model | Target Pass Rate |
|---|---|
| Haiku | 70%+ |
| Sonnet | 85%+ |
| Opus | 95%+ |
If Haiku fails but others pass, instructions may rely on implicit reasoning — make them more explicit.
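The tier targets can be checked mechanically once pass rates have been measured. The sketch below assumes pass rates are supplied as fractions between 0 and 1, keyed by lowercase tier name. The function names are hypothetical, and the thresholds are copied from the table above.

```python
# Target pass rates from the model-tier table.
TARGET_PASS_RATES = {"haiku": 0.70, "sonnet": 0.85, "opus": 0.95}


def evaluate_tiers(pass_rates: dict[str, float]) -> dict[str, bool]:
    """Return whether each tier meets its target; a missing tier counts as failing."""
    return {
        tier: pass_rates.get(tier, 0.0) >= target
        for tier, target in TARGET_PASS_RATES.items()
    }


def needs_more_explicit_instructions(pass_rates: dict[str, float]) -> bool:
    """True when only Haiku misses its target, the case the guidance singles out."""
    results = evaluate_tiers(pass_rates)
    return not results["haiku"] and results["sonnet"] and results["opus"]


if __name__ == "__main__":
    measured = {"haiku": 0.60, "sonnet": 0.90, "opus": 0.97}  # made-up sample values
    print(evaluate_tiers(measured))
    if needs_more_explicit_instructions(measured):
        print("Haiku is below target: consider making the instructions more explicit.")
```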
The description determines activation accuracy. Optimize for:
The benchmark produces a table per plugin:
| Skill | Plugin | Desc | Lines | Refs | Examples | Score |
|---|---|---|---|---|---|---|
| git | core | Pass | 120/500 | Pass | Pass | 9/11 |
| claude-skills | cl-code | Pass | 380/500 | Pass | Pass | 10/11 |
`mise test:skills-quality`: runs all static checks, produces scorecard
`/benchmark-skills`: full analysis with category classification and quality assessment
`templates/evaluation-checklist.md`

Follow the cycle: Test, Measure, Analyze, Refine, Verify.
Stop iterating when: