Evaluates Claude Agent Skills quality via static analysis scorecard, A/B testing, and multi-model benchmarks. Use for measuring activation rates and optimizing descriptions.
Install:

```
npx claudepluginhub vinnie357/claude-skills --plugin claude-code
```

This skill uses the workspace's default tool permissions.
Evaluate Agent Skills through static analysis and evaluation-driven methodology. Source: Anthropic's skill evaluation guidance.
Activate when the /benchmark-skills command is invoked.

Run these checks against every skill to produce a quality scorecard (a check-runner sketch follows the table):
| Check | Pass Criteria |
|---|---|
| Description length | Non-empty, max 1024 chars |
| Description has "Use when" | Contains activation triggers |
| Description third person | No "I can", "You can" |
| Name kebab-case | Matches ^[a-z0-9]+(-[a-z0-9]+)*$ |
| Name max 64 chars | Length check |
| No reserved words | No "anthropic"/"claude" in name |
| SKILL.md max 500 lines | Line count |
| Has examples | Contains code blocks or example sections |
| Reference depth | No nested references (one level only) |
| Anti-fabrication present | Contains anti-fab rules or references core:anti-fabrication |
| Source documented | Skill appears in plugin's sources.md |
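
A minimal sketch of the file-local checks in Python, assuming each skill lives in a SKILL.md whose YAML frontmatter carries `name` and `description` fields. The frontmatter scan here is deliberately naive; a real implementation would use a YAML parser:

```python
import re
from pathlib import Path

NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def check_skill(path: Path) -> dict[str, bool]:
    """Run the file-local scorecard checks against one SKILL.md.

    Reference depth, anti-fabrication, and sources.md checks need
    repo-wide context and are omitted from this sketch.
    """
    text = path.read_text()
    lines = text.splitlines()
    # Naive frontmatter scan for `name:` and `description:` values.
    name = desc = ""
    for line in lines[:20]:
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("description:"):
            desc = line.split(":", 1)[1].strip()
    return {
        "desc_length": 0 < len(desc) <= 1024,
        "desc_use_when": "use when" in desc.lower(),
        "desc_third_person": not re.search(r"\b(I can|You can)\b", desc),
        "name_kebab_case": bool(NAME_RE.fullmatch(name)),
        "name_max_64": len(name) <= 64,
        "no_reserved_words": not re.search(r"anthropic|claude", name, re.I),
        "max_500_lines": len(lines) <= 500,
        "has_examples": ("`" * 3) in text or "## Example" in text,
    }
```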
Classify each skill for appropriate evaluation:
- **Capability Uplift**: Enhances Claude's core abilities (coding, analysis, reasoning). Stable across model versions. Test by comparing base-model performance with and without the skill.
- **Encoded Preference**: Encodes user-specific workflows, formatting, and conventions. May need updates when models change. Test by verifying fidelity to the encoded workflow.
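
The category determines which test applies. A small dispatch sketch, where the evaluator callables are hypothetical stand-ins for the two test styles:

```python
from enum import Enum, auto
from typing import Callable

class Category(Enum):
    CAPABILITY_UPLIFT = auto()    # test with/without the skill
    ENCODED_PREFERENCE = auto()   # test fidelity to the encoded workflow

def evaluate(skill: str, category: Category,
             ab_test: Callable[[str], float],
             fidelity_test: Callable[[str], float]) -> float:
    """Route a skill to the evaluation style its category calls for."""
    if category is Category.CAPABILITY_UPLIFT:
        return ab_test(skill)
    return fidelity_test(skill)
```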
For each skill, create evaluation scenarios with explicit pass criteria (see templates/evaluation-checklist.md).
A/B test: compare skill performance using independent agents, one run with the skill loaded and one without (harness sketch below).
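
A minimal A/B harness sketch, assuming a hypothetical `run_trial(scenario, skills)` helper that executes one scenario in a fresh agent and returns True on success:

```python
from statistics import mean

def ab_test(scenarios, skill, run_trial, trials=5):
    """Compare pass rates for independent agents with and without the skill."""
    with_skill = [run_trial(s, skills=[skill])
                  for s in scenarios for _ in range(trials)]
    without = [run_trial(s, skills=[])
               for s in scenarios for _ in range(trials)]
    return {"with": mean(with_skill), "without": mean(without)}
```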
Test across model tiers to verify consistency (a threshold-gating sketch follows the table):
| Model | Target Pass Rate |
|---|---|
| Haiku | 70%+ |
| Sonnet | 85%+ |
| Opus | 95%+ |
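
These targets can gate a benchmark run. A sketch assuming per-model pass rates collected with the harness above:

```python
THRESHOLDS = {"haiku": 0.70, "sonnet": 0.85, "opus": 0.95}

def failing_tiers(pass_rates: dict[str, float]) -> list[str]:
    """Return the model tiers whose pass rate falls below target."""
    return [model for model, target in THRESHOLDS.items()
            if pass_rates.get(model, 0.0) < target]
```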
If Haiku fails but others pass, instructions may rely on implicit reasoning — make them more explicit.
The description determines activation accuracy. Optimize for concrete "Use when" triggers, third-person phrasing, and domain keywords a user would naturally include in a request.
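
For illustration, a hypothetical before/after pair; the rewritten form satisfies the third-person and "Use when" checks from the scorecard:

```python
# Hypothetical descriptions illustrating the activation-oriented rewrite.
BEFORE = "I can help you with spreadsheets."  # first person, no triggers
AFTER = ("Analyzes spreadsheet data and generates pivot-table reports. "
         "Use when cleaning, aggregating, or charting tabular data.")
```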
The benchmark produces a table per plugin; Score counts passed checks out of the 11 above (aggregation sketch after the table):

| Skill | Plugin | Desc | Lines | Refs | Examples | Score |
|---|---|---|---|---|---|---|
| git | core | Pass | 120/500 | Pass | Pass | 9/11 |
| claude-skills | cl-code | Pass | 380/500 | Pass | Pass | 10/11 |
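
Aggregating the Score column from whatever check booleans are available, such as the `check_skill` results above:

```python
def score(results: dict[str, bool]) -> str:
    """Format the scorecard total, e.g. '9/11'."""
    return f"{sum(results.values())}/{len(results)}"
```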
Commands and assets:

- `mise test:skills-quality`: runs all static checks and produces the scorecard
- `/benchmark-skills`: full analysis with category classification and quality assessment
- `templates/evaluation-checklist.md`: checklist template for per-skill evaluations

Follow the cycle: Test, Measure, Analyze, Refine, Verify.
Stop iterating when target pass rates hold across all model tiers and further description changes no longer improve activation accuracy.