This skill should be used when the user asks to "compare skill versions", "benchmark this skill", "what improved", "diff these skills", "measure skill improvement", or "before and after analysis". It compares two states of a skill — different versions, before/after improvement, or one skill against another — and produces a dimension-by-dimension delta report with quantified improvement. Use this skill whenever someone needs proof that an improvement actually improved something, or to compare a skill against a reference standard.
Quantified comparison between two skill states. Measures improvement dimension by dimension, identifies regressions, and produces evidence that changes actually made things better — or exposes that they didn't.
Without measurement, improvement is anecdotal. This skill turns "I think it's better" into "Clarity improved from 6 to 9, but density dropped from 8 to 7 due to added filler in section X."
```
/benchmark-skill /path/to/skill-v1 /path/to/skill-v2   # compare two directories
/benchmark-skill /path/to/skill --against-standard     # compare against gold standard
```
| Mode | Input | Output | Use When |
|---|---|---|---|
| Version comparison | Two skill directory paths | Delta report: dimension-by-dimension + net assessment | Measuring improvement after surgeon-skill or manual edits |
| Standard comparison | One skill path + --against-standard | Gap-to-standard report: how far from 10/10 | Evaluating a skill's absolute quality without a prior version |
For each skill state (A and B, or skill and standard):
| Attribute | State A | State B | Delta |
|---|---|---|---|
| Total files | | | +N/-N |
| Total lines | | | +N/-N |
| SKILL.md lines | | | +N/-N |
| Reference files | | | +N/-N |
| Evals count | | | +N/-N |
| Has agents/ | | | |
| Has scripts/ | | | |
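This inventory can be collected mechanically before any scoring. A minimal Python sketch — the subdirectory names (`references/`, `evals/`, `agents/`, `scripts/`) mirror the attribute rows in the table; the data shape and helper names are assumptions for illustration, not a fixed API:

```python
from pathlib import Path

def inventory(skill_dir):
    """Collect the structural attributes for one skill state."""
    root = Path(skill_dir)
    files = [p for p in root.rglob("*") if p.is_file()]

    def count_lines(path):
        try:
            return len(path.read_text(encoding="utf-8").splitlines())
        except (UnicodeDecodeError, OSError):
            return 0  # binary or unreadable files contribute no lines

    skill_md = root / "SKILL.md"
    return {
        "total_files": len(files),
        "total_lines": sum(count_lines(p) for p in files),
        "skill_md_lines": count_lines(skill_md) if skill_md.exists() else 0,
        "reference_files": len([p for p in files if p.parent.name == "references"]),
        "evals_count": len([p for p in files if p.parent.name == "evals"]),
        "has_agents": (root / "agents").is_dir(),
        "has_scripts": (root / "scripts").is_dir(),
    }

def delta(a, b):
    """Signed deltas for the numeric attributes (the +N/-N column)."""
    return {k: b[k] - a[k] for k in a
            if isinstance(a[k], int) and not isinstance(a[k], bool)}
```

Running `inventory` on both paths and `delta` on the two results fills every row of the table above.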
Apply the 10-criterion rubric from x-ray-skill's references/quality-rubric.md to both states independently. Produce parallel scorecards.
Scoring consistency: Use the same evidence standards for both states. If you give State A a 7 on clarity because of 2 ambiguous pronouns, State B with 1 ambiguous pronoun should score higher — don't use different scales.
| # | Criterion | State A | State B | Delta | Direction |
|---|---|---|---|---|---|
| 1 | Foundation | /10 | /10 | +/-N | Better/Worse/Same |
| 2 | Truthfulness | /10 | /10 | | |
| 3 | Quality | /10 | /10 | | |
| 4 | Density | /10 | /10 | | |
| 5 | Simplicity | /10 | /10 | | |
| 6 | Clarity | /10 | /10 | | |
| 7 | Precision | /10 | /10 | | |
| 8 | Depth | /10 | /10 | | |
| 9 | Coherence | /10 | /10 | | |
| 10 | Value | /10 | /10 | | |
| | Average | /10 | /10 | | |
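Once both scorecards exist, the Delta and Direction columns are pure arithmetic. A sketch, assuming each scorecard is a plain `{criterion: score}` dict keyed by the rubric names above (the data shape is an assumption):

```python
CRITERIA = ["Foundation", "Truthfulness", "Quality", "Density", "Simplicity",
            "Clarity", "Precision", "Depth", "Coherence", "Value"]

def compare_scorecards(a, b):
    """Row-by-row delta between two {criterion: score} scorecards.

    Returns the table rows plus the change in average score."""
    rows = []
    for c in CRITERIA:
        d = b[c] - a[c]
        direction = "Better" if d > 0 else "Worse" if d < 0 else "Same"
        rows.append((c, a[c], b[c], d, direction))
    avg_a = sum(a[c] for c in CRITERIA) / len(CRITERIA)
    avg_b = sum(b[c] for c in CRITERIA) / len(CRITERIA)
    return rows, round(avg_b - avg_a, 2)
```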
Apply the 13-point meta-validation gate to both:
| # | Checkpoint | State A | State B | Changed? |
|---|---|---|---|---|
| 1-13 | ... | PASS/FAIL | PASS/FAIL | Fixed/Regressed/Same |
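The Changed? column follows directly from the two PASS/FAIL columns. A sketch — the 13 checkpoint labels live in the external gate, so plain indices stand in here:

```python
def gate_changes(a, b):
    """Classify each checkpoint: Fixed (FAIL->PASS), Regressed (PASS->FAIL), or Same.

    a and b map checkpoint number -> True (PASS) / False (FAIL)."""
    out = {}
    for k in sorted(a):
        if a[k] == b[k]:
            out[k] = "Same"
        elif b[k]:
            out[k] = "Fixed"
        else:
            out[k] = "Regressed"
    return out
```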
A regression is any dimension where State B scores lower than State A, or any gate checkpoint that passed in A but fails in B.
Regressions are the most important finding. Improvements are expected (that's the point of making changes). Regressions are unexpected — they indicate unintended consequences.
For each regression, record the A→B drop, the root cause, its severity, and whether it reflects a deliberate trade-off.
For each dimension that improved by 2+ points, name the driver of the change and verify the improvement is genuine rather than score drift.
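Detection itself is mechanical, per the definition above: any dimension where B scores below A, or any checkpoint that flipped PASS→FAIL. A sketch under the same assumed data shapes as earlier:

```python
def find_regressions(scores_a, scores_b, gate_a, gate_b):
    """Return (dimension regressions, gate regressions).

    A dimension regresses when B scores lower than A; a gate checkpoint
    regresses when it passed in A but fails in B."""
    dims = [(c, scores_a[c], scores_b[c])
            for c in scores_a if scores_b[c] < scores_a[c]]
    gates = [k for k in gate_a if gate_a[k] and not gate_b[k]]
    return dims, gates
```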
- **IMPROVED:** More dimensions improved than regressed, average increased, no gate regressions
- **LATERAL:** Similar average, trade-offs balance (gained depth, lost density)
- **REGRESSED:** More dimensions regressed than improved, or gate regressions
- **TRANSFORMED:** Fundamental structural change — scores aren't directly comparable (e.g., single-file → multi-file)
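The four verdicts reduce to a small decision rule. A sketch — the TRANSFORMED trigger (a structural overhaul) has to be judged upstream, so it is passed in as a flag rather than inferred:

```python
def net_assessment(deltas, gate_regressions, avg_delta, transformed=False):
    """deltas: per-dimension B-A values; gate_regressions: count of PASS->FAIL flips."""
    if transformed:
        return "TRANSFORMED"   # scores aren't directly comparable
    improved = sum(1 for d in deltas if d > 0)
    regressed = sum(1 for d in deltas if d < 0)
    if gate_regressions > 0 or regressed > improved:
        return "REGRESSED"
    if improved > regressed and avg_delta > 0:
        return "IMPROVED"
    return "LATERAL"           # trade-offs roughly balance
```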
```markdown
# Benchmark Report: {skill-name}

**Compared:** {State A description} vs {State B description}
**Net Assessment:** {IMPROVED / LATERAL / REGRESSED / TRANSFORMED}

## Summary

| Metric | State A | State B | Delta |
|--------|---------|---------|-------|
| Average score | /10 | /10 | +N |
| Gate pass | /13 | /13 | +N |
| Dimensions improved | | | {count} |
| Dimensions regressed | | | {count} |
| Dimensions unchanged | | | {count} |

## Dimension-by-Dimension

| # | Criterion | A | B | Delta | Key Driver |
|---|-----------|---|---|-------|-----------|
| 1-10 | ... | /10 | /10 | +/-N | {what caused the change} |

## Gate Changes

| # | Checkpoint | A | B | Change |
|---|-----------|---|---|--------|
{only rows that changed}

## Regressions (if any)

| Dimension | A→B | Cause | Severity | Trade-off? |
|-----------|-----|-------|----------|-----------|
{each regression with root cause analysis}

## Top Improvements

| Dimension | A→B | Driver | Genuine? |
|-----------|-----|--------|----------|
{improvements of 2+ points}

## Recommendation

{specific next action based on the assessment}
```
| Failure | Signal | Recovery |
|---|---|---|
| States are identical | All deltas are zero | Report: "No changes detected between states." Verify paths are correct. |
| State A doesn't exist | Path invalid or SKILL.md missing | Ask user for correct path. Suggest git log to find prior versions. |
| Incomparable states | Fundamental restructure (single-file → 10-file) | Report as TRANSFORMED. Score both independently. Note that direct delta comparison is misleading for structural overhauls. |
| Score inconsistency | Same unchanged section gets different scores in A vs B | Re-score the unchanged section using State A's evidence. Consistency requires anchoring to the first assessment. |
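The first failure row (identical states) is cheap to catch before any scoring. A sketch using content hashes over both directory trees — the helper names are illustrative, not part of any fixed interface:

```python
import hashlib
from pathlib import Path

def tree_digest(skill_dir):
    """Stable digest over relative paths + file contents of a skill directory."""
    h = hashlib.sha256()
    root = Path(skill_dir)
    for p in sorted(root.rglob("*")):
        if p.is_file():
            h.update(str(p.relative_to(root)).encode())
            h.update(p.read_bytes())
    return h.hexdigest()

def states_identical(a, b):
    """True when both states have the same files with the same contents."""
    return tree_digest(a) == tree_digest(b)
```

If this returns True, report "No changes detected between states" and ask the user to verify the paths before spending effort on a full comparison.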
Bad:

```
State A: 7.2/10. State B: 8.1/10. Improvement: +0.9. Good job.
```

No dimension breakdown. No regression detection. No evidence.

Good:

```
Net: IMPROVED (+0.9 average, 0 regressions, 3 gate fixes)
Key improvements: Depth 5→8 (added failure modes + 4 edge cases),
Clarity 6→9 (added glossary, eliminated ambiguous pronouns).
Key trade-off: Density 9→8 (added content increased value but reduced compression).
Regressions: None.
Gate: 10/13 → 13/13 (fixed: Good vs Bad example, 2 missing edge cases).
```
Before delivering the benchmark report, verify that both states were scored against the same evidence standards, that every regression row has a root-cause entry, and that the net assessment matches the dimension counts.
| File | Content | Load When |
|---|---|---|
| `references/comparison-framework.md` | Scoring consistency protocols, delta classification rules, trade-off analysis framework, TRANSFORMED assessment criteria | Always — needed for rigorous comparison |
Author: Javier Montano | Last updated: March 18, 2026