Skill

claude-skills-benchmark

Evaluates and benchmarks Agent Skills using static analysis and A/B testing. Measures activation accuracy, quality scorecards, and description optimization.

developer-tools

code-quality

npx claudepluginhub vinnie357/claude-skills --plugin qa

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/all-skills:claude-skills-benchmark

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Evaluate Agent Skills through static analysis and evaluation-driven methodology. Source: Anthropic's skill evaluation guidance.

SKILL.md

129 lines · ~1.2k tokens

Similar Skills

claude-skills-benchmark

Evaluates Claude Agent Skills quality via static analysis scorecard, A/B testing, and multi-model benchmarks. Use for measuring activation rates and optimizing descriptions.

claude-code

benchmark-skills

Creates evals for skills and runs the benchmark harness to measure whether a skill improves model behavior. Use when testing, benchmarking, or evaluating a skill's quality.

bopen-tools

skill-optimizer

40.4k

Diagnoses and optimizes Agent Skills (SKILL.md) by scanning session transcripts for underused skills, wasted context, and CSO issues, then outputting a prioritized report.

antigravity-awesome-skills

Stats

LanguageNushell

Stars19

Forks6

MaintenanceExcellent

Last CommitJun 10, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Skill Benchmarking

Evaluate Agent Skills through static analysis and evaluation-driven methodology. Source: Anthropic's skill evaluation guidance.

When to Use

Activate when:

Assessing skill quality across a plugin or marketplace
Measuring skill activation accuracy (false positives/negatives)
Comparing skill versions or skill-vs-no-skill performance
Running the /benchmark-skills command
Reviewing skill descriptions for optimization

Static Analysis Checks

Run these checks against every skill to produce a quality scorecard:

Check	Pass Criteria
Description length	Non-empty, max 1024 chars
Description has "Use when"	Contains activation triggers
Description third person	No "I can", "You can"
Name kebab-case	Matches `^[a-z0-9]+(-[a-z0-9]+)*$`
Name max 64 chars	Length check
No reserved words	No "anthropic"/"claude" in name
SKILL.md max 500 lines	Line count
Has examples	Contains code blocks or example sections
Reference depth	No nested references (one level only)
Anti-fabrication present	Contains anti-fab rules or references `core:anti-fabrication`
Source documented	Skill appears in plugin's `sources.md`

Skill Categories

Classify each skill for appropriate evaluation:

Capability Uplift: Enhances Claude's core abilities (coding, analysis, reasoning). Stable across model versions. Test by comparing base model performance with and without the skill.

Encoded Preference: Encodes user-specific workflows, formatting, and conventions. May need updates when models change. Test by verifying fidelity to the encoded workflow.

Evaluation Methodology

Writing Evals

For each skill, create:

5-10 representative prompts that should trigger the skill
3-5 out-of-scope prompts that should NOT trigger the skill
Expected behavior criteria for each prompt

A/B Testing

Compare skill performance using independent agents:

Agent A runs with the skill loaded
Agent B runs without the skill
A comparator judges outputs blindly
Track: pass rate, token usage, execution time

Multi-Model Testing

Test across model tiers to verify consistency:

Model	Target Pass Rate
Haiku	70%+
Sonnet	85%+
Opus	95%+

If Haiku fails but others pass, instructions may rely on implicit reasoning — make them more explicit.

Description Optimization

The description determines activation accuracy. Optimize for:

Reducing false positives: Too-broad descriptions waste context. Add domain-specific terms.
Reducing false negatives: Too-narrow descriptions miss valid prompts. Add synonyms and related terms.
Target: 90%+ true positive rate, <5% false positive rate

Scorecard Output

The benchmark produces a table per plugin:

Skill            | Plugin   | Desc | Lines   | Refs | Examples | Score
─────────────────┼──────────┼──────┼─────────┼──────┼──────────┼──────
git              | core     | Pass | 120/500 | Pass | Pass     | 9/11
claude-skills    | cl-code  | Pass | 380/500 | Pass | Pass     | 10/11

Running Benchmarks

Static analysis: mise test:skills-quality — runs all static checks, produces scorecard
Command: /benchmark-skills — full analysis with category classification and quality assessment
Manual evals: Use the evaluation checklist template at templates/evaluation-checklist.md

Iteration

Follow the cycle: Test, Measure, Analyze, Refine, Verify.

Stop iterating when:

Eval pass rates meet targets across model tiers
Description triggers accurately
Token usage is reasonable relative to benefit
No user-reported issues for 2+ model versions

Anti-Fabrication Requirements

Core Principles

Base all outputs on actual analysis using tool execution
Execute validation tools before making claims
Mark uncertain information as "requires analysis" or "needs validation"
Use precise, factual language without superlatives

Prohibited Language

Superlatives: Avoid "excellent", "comprehensive", "advanced", "optimal", "perfect"
Unsubstantiated Metrics: Never fabricate percentages, success rates, or performance numbers
Assumed Capabilities: Do not claim features exist without tool verification

Validation Requirements

File Claims: Use Read or Glob tools before claiming files exist
Test Results: Only report test outcomes after actual execution
Performance Claims: Base on actual measurement or analysis

claude-skills-benchmark

Popularity

Invocation

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

claude-skills-benchmark

Popularity

Invocation

Context Preview

SKILL.md

Skill Benchmarking

When to Use

Static Analysis Checks

Skill Categories

Evaluation Methodology

Writing Evals

A/B Testing

Multi-Model Testing

Description Optimization

Scorecard Output

Running Benchmarks

Iteration

Anti-Fabrication Requirements

Core Principles

Prohibited Language

Validation Requirements

Similar Skills

Help us improve

Skill Benchmarking

When to Use

Static Analysis Checks

Skill Categories

Evaluation Methodology

Writing Evals

A/B Testing

Multi-Model Testing

Description Optimization

Scorecard Output

Running Benchmarks

Iteration

Anti-Fabrication Requirements

Core Principles

Prohibited Language

Validation Requirements