Evaluates Claude Agent Skills quality via static analysis scorecard, A/B testing, and multi-model benchmarks. Use for measuring activation rates and optimizing descriptions.
Install:

```
npx claudepluginhub vinnie357/claude-skills --plugin claude-code
```

This skill uses the workspace's default tool permissions.
Evaluate Agent Skills through static analysis and evaluation-driven methodology. Source: Anthropic's skill evaluation guidance.
Activate when the /benchmark-skills command is invoked.

Run these checks against every skill to produce a quality scorecard (a check-runner sketch follows the table):
| Check | Pass Criteria |
|---|---|
| Description length | Non-empty, max 1024 chars |
| Description has "Use when" | Contains activation triggers |
| Description third person | No "I can", "You can" |
| Name kebab-case | Matches ^[a-z0-9]+(-[a-z0-9]+)*$ |
| Name max 64 chars | Length check |
| No reserved words | No "anthropic"/"claude" in name |
| SKILL.md max 500 lines | Line count |
| Has examples | Contains code blocks or example sections |
| Reference depth | No nested references (one level only) |
| Anti-fabrication present | Contains anti-fab rules or references core:anti-fabrication |
| Source documented | Skill appears in plugin's sources.md |
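
A minimal sketch of the file-local checks in Python, assuming each skill lives in a SKILL.md whose YAML frontmatter carries `name` and `description` fields. The frontmatter scan here is deliberately naive; a real implementation would use a YAML parser:

```python
import re
from pathlib import Path

NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def check_skill(path: Path) -> dict[str, bool]:
    """Run the file-local scorecard checks against one SKILL.md.

    Reference depth, anti-fabrication, and sources.md checks need
    repo-wide context and are omitted from this sketch.
    """
    text = path.read_text()
    lines = text.splitlines()
    # Naive frontmatter scan for `name:` and `description:` values.
    name = desc = ""
    for line in lines[:20]:
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("description:"):
            desc = line.split(":", 1)[1].strip()
    return {
        "desc_length": 0 < len(desc) <= 1024,
        "desc_use_when": "use when" in desc.lower(),
        "desc_third_person": not re.search(r"\b(I can|You can)\b", desc),
        "name_kebab_case": bool(NAME_RE.fullmatch(name)),
        "name_max_64": len(name) <= 64,
        "no_reserved_words": not re.search(r"anthropic|claude", name, re.I),
        "max_500_lines": len(lines) <= 500,
        "has_examples": ("`" * 3) in text or "## Example" in text,
    }
```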
Classify each skill for appropriate evaluation:
- **Capability Uplift**: Enhances Claude's core abilities (coding, analysis, reasoning). Stable across model versions. Test by comparing base-model performance with and without the skill.
- **Encoded Preference**: Encodes user-specific workflows, formatting, and conventions. May need updates when models change. Test by verifying fidelity to the encoded workflow.
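
The category determines which test applies. A small dispatch sketch, where the evaluator callables are hypothetical stand-ins for the two test styles:

```python
from enum import Enum, auto
from typing import Callable

class Category(Enum):
    CAPABILITY_UPLIFT = auto()    # test with/without the skill
    ENCODED_PREFERENCE = auto()   # test fidelity to the encoded workflow

def evaluate(skill: str, category: Category,
             ab_test: Callable[[str], float],
             fidelity_test: Callable[[str], float]) -> float:
    """Route a skill to the evaluation style its category calls for."""
    if category is Category.CAPABILITY_UPLIFT:
        return ab_test(skill)
    return fidelity_test(skill)
```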
For each skill, create evaluation scenarios with explicit pass criteria (see templates/evaluation-checklist.md).
A/B test: compare skill performance using independent agents, one run with the skill loaded and one without (harness sketch below).
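
A minimal A/B harness sketch, assuming a hypothetical `run_trial(scenario, skills)` helper that executes one scenario in a fresh agent and returns True on success:

```python
from statistics import mean

def ab_test(scenarios, skill, run_trial, trials=5):
    """Compare pass rates for independent agents with and without the skill."""
    with_skill = [run_trial(s, skills=[skill])
                  for s in scenarios for _ in range(trials)]
    without = [run_trial(s, skills=[])
               for s in scenarios for _ in range(trials)]
    return {"with": mean(with_skill), "without": mean(without)}
```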
Test across model tiers to verify consistency (a threshold-gating sketch follows the table):
| Model | Target Pass Rate |
|---|---|
| Haiku | 70%+ |
| Sonnet | 85%+ |
| Opus | 95%+ |
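
These targets can gate a benchmark run. A sketch assuming per-model pass rates collected with the harness above:

```python
THRESHOLDS = {"haiku": 0.70, "sonnet": 0.85, "opus": 0.95}

def failing_tiers(pass_rates: dict[str, float]) -> list[str]:
    """Return the model tiers whose pass rate falls below target."""
    return [model for model, target in THRESHOLDS.items()
            if pass_rates.get(model, 0.0) < target]
```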
If Haiku fails but others pass, instructions may rely on implicit reasoning — make them more explicit.
The description determines activation accuracy. Optimize for concrete "Use when" triggers, third-person phrasing, and domain keywords a user would naturally include in a request.
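
For illustration, a hypothetical before/after pair; the rewritten form satisfies the third-person and "Use when" checks from the scorecard:

```python
# Hypothetical descriptions illustrating the activation-oriented rewrite.
BEFORE = "I can help you with spreadsheets."  # first person, no triggers
AFTER = ("Analyzes spreadsheet data and generates pivot-table reports. "
         "Use when cleaning, aggregating, or charting tabular data.")
```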
The benchmark produces a table per plugin; Score counts passed checks out of the 11 above (aggregation sketch after the table):

| Skill | Plugin | Desc | Lines | Refs | Examples | Score |
|---|---|---|---|---|---|---|
| git | core | Pass | 120/500 | Pass | Pass | 9/11 |
| claude-skills | cl-code | Pass | 380/500 | Pass | Pass | 10/11 |
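
Aggregating the Score column from whatever check booleans are available, such as the `check_skill` results above:

```python
def score(results: dict[str, bool]) -> str:
    """Format the scorecard total, e.g. '9/11'."""
    return f"{sum(results.values())}/{len(results)}"
```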
Commands and assets:

- `mise test:skills-quality`: runs all static checks and produces the scorecard
- `/benchmark-skills`: full analysis with category classification and quality assessment
- `templates/evaluation-checklist.md`: checklist template for per-skill evaluations

Follow the cycle: Test, Measure, Analyze, Refine, Verify.
Stop iterating when target pass rates hold across all model tiers and further description changes no longer improve activation accuracy.