Evaluates Claude Agent Skills quality via static analysis scorecard, A/B testing, and multi-model benchmarks. Use for measuring activation rates and optimizing descriptions.
`npx claudepluginhub vinnie357/claude-skills --plugin allium`

This skill uses the workspace's default tool permissions.
Evaluate Agent Skills through static analysis and evaluation-driven methodology. Source: Anthropic's skill evaluation guidance.
Activate when:
The `/benchmark-skills` command runs these checks against every skill to produce a quality scorecard:
| Check | Pass Criteria |
|---|---|
| Description length | Non-empty, max 1024 chars |
| Description has "Use when" | Contains activation triggers |
| Description third person | No "I can", "You can" |
| Name kebab-case | Matches `^[a-z0-9]+(-[a-z0-9]+)*$` |
| Name max 64 chars | Length check |
| No reserved words | No "anthropic"/"claude" in name |
| SKILL.md max 500 lines | Line count |
| Has examples | Contains code blocks or example sections |
| Reference depth | No nested references (one level only) |
| Anti-fabrication present | Contains anti-fab rules or references core:anti-fabrication |
| Source documented | Skill appears in plugin's sources.md |
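The static checks above can be sketched as a small script. This is a minimal illustration, not the actual benchmark implementation: the function signature and the loose "Use when"/example heuristics are assumptions.

```python
import re


def check_skill(skill_md: str, name: str, description: str) -> dict:
    """Run a subset of the scorecard's static checks; returns check -> pass."""
    lines = skill_md.splitlines()
    return {
        # Non-empty, max 1024 chars
        "description_length": 0 < len(description) <= 1024,
        # Loose heuristic for activation triggers in the description
        "has_use_when": "use when" in description.lower()
                        or "use for" in description.lower(),
        # Third person: reject first/second-person phrasing
        "third_person": not re.search(r"\b(I can|You can)\b", description),
        "name_kebab_case": bool(re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name)),
        "name_max_64": len(name) <= 64,
        "no_reserved_words": not re.search(r"anthropic|claude", name),
        "max_500_lines": len(lines) <= 500,
        # Code fence or example heading counts as an example
        "has_examples": "```" in skill_md or "## Example" in skill_md,
    }
```

A score out of 11 then falls out of counting the passing checks (the remaining checks, such as reference depth and source documentation, need filesystem access and are omitted here).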
Classify each skill for appropriate evaluation:
Capability Uplift: Enhances Claude's core abilities (coding, analysis, reasoning). Stable across model versions. Test by comparing base model performance with and without the skill.
Encoded Preference: Encodes user-specific workflows, formatting, and conventions. May need updates when models change. Test by verifying fidelity to the encoded workflow.
For each skill, create:
Compare skill performance using independent agents:
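One way to sketch that A/B comparison: run each task twice, with and without the skill loaded, and grade both transcripts. `run_agent` and `grade` are hypothetical stand-ins for whatever agent runner and grader you use.

```python
def ab_test(tasks, skill_text, run_agent, grade):
    """Run each task with and without the skill; return per-task results
    and the net pass-rate uplift attributable to the skill."""
    results = []
    for task in tasks:
        baseline = run_agent(task, skills=[])            # no skill loaded
        with_skill = run_agent(task, skills=[skill_text])  # skill loaded
        results.append({
            "task": task,
            "baseline_pass": grade(task, baseline),
            "skill_pass": grade(task, with_skill),
        })
    uplift = (sum(r["skill_pass"] for r in results)
              - sum(r["baseline_pass"] for r in results)) / len(results)
    return results, uplift
```

For capability-uplift skills a positive `uplift` is the signal of interest; for encoded-preference skills the grader should instead score fidelity to the encoded workflow.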
Test across model tiers to verify consistency:
| Model | Target Pass Rate |
|---|---|
| Haiku | 70%+ |
| Sonnet | 85%+ |
| Opus | 95%+ |
If Haiku fails but others pass, instructions may rely on implicit reasoning — make them more explicit.
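Checking measured pass rates against those tier targets can be sketched as follows; the model names and the `results` format are assumptions for illustration.

```python
TARGETS = {"haiku": 0.70, "sonnet": 0.85, "opus": 0.95}


def tier_report(results: dict) -> dict:
    """results maps model name -> list of per-task pass/fail booleans.
    Returns model -> (pass rate, whether the tier target was met)."""
    report = {}
    for model, passes in results.items():
        rate = sum(passes) / len(passes)
        report[model] = (rate, rate >= TARGETS[model])
    return report
```

A Haiku entry that misses its target while Sonnet and Opus meet theirs is the signal to make the skill's instructions more explicit.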
The description determines activation accuracy. Optimize for:
The benchmark produces a table per plugin:
| Skill | Plugin | Desc | Lines | Refs | Examples | Score |
|---|---|---|---|---|---|---|
| git | core | Pass | 120/500 | Pass | Pass | 9/11 |
| claude-skills | cl-code | Pass | 380/500 | Pass | Pass | 10/11 |
- `mise test:skills-quality` — runs all static checks, produces scorecard
- `/benchmark-skills` — full analysis with category classification and quality assessment
- `templates/evaluation-checklist.md`

Follow the cycle: Test, Measure, Analyze, Refine, Verify.
Stop iterating when: