# skill-reviewer
Evaluate a skill's design quality against best practices from Anthropic's skill-building guidance and real-world patterns from high-quality skills.
npx claudepluginhub omriariav/omri-cc-stuff --plugin skill-reviewer

This skill uses the workspace's default tool permissions.
| | /reflect | /skill-reviewer |
|---|---|---|
| Input | Current conversation history | Skill files on disk |
| Evaluates | What worked/failed during usage | Design, structure, documentation quality |
| When | After using a skill | Any time, even before first use |
| Output | Conversation learnings, SKILL.md patches | Scored evaluation with actionable recommendations |
/skill-reviewer publisher-lookup → Review by skill name (auto-discovers path)
/skill-reviewer ./skills/my-skill/ → Review by path
/skill-reviewer --fix publisher-lookup → Review + apply quick fixes automatically
/skill-reviewer --compare reflect taboolar → Side-by-side comparison of two skills
$ARGUMENTS = [--fix] <skill-path-or-name> OR --compare <left-skill> <right-skill>
Parse the argument before starting:
- If `--compare`, expect exactly two skill names following it. Run comparison mode (Phase 7).
- If `--fix`, the next token is the skill name. Run normal evaluation, then apply fixes (Phase 6).

Step 1 - Locate the skill. If the argument looks like a path (starts with `./`, `/`, or `~/`), use it directly. Otherwise search these paths in order, stopping at the first match:

- `~/.claude/skills/<name>/SKILL.md`
- `~/.claude/plugins/*/skills/<name>/SKILL.md`
- `~/Code/taboola-pm-skills/plugins/<name>/skills/<name>/SKILL.md`
- `.claude/skills/<name>/SKILL.md`
- `**/skills/<name>/SKILL.md` (exclude `.git/`, `node_modules/`, `.venv/`)

If no SKILL.md is found: Stop. Report "Skill not found: <name>. Searched [list paths checked]. Try providing the full path."
If multiple matches: List all candidates and ask "Which skill did you mean?" before proceeding.
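The search order above can be sketched as a small helper. This is a minimal sketch, not the skill's actual implementation; `find_skill` and the exclusion tuple are hypothetical names, and the pattern list mirrors Step 1:

```python
from pathlib import Path

SEARCH_PATTERNS = [
    "~/.claude/skills/{name}/SKILL.md",
    "~/.claude/plugins/*/skills/{name}/SKILL.md",
    "~/Code/taboola-pm-skills/plugins/{name}/skills/{name}/SKILL.md",
    ".claude/skills/{name}/SKILL.md",
    "**/skills/{name}/SKILL.md",
]
EXCLUDED = (".git", "node_modules", ".venv")

def find_skill(name: str) -> list[Path]:
    """Return candidate SKILL.md paths, stopping at the first pattern that matches."""
    for pattern in SEARCH_PATTERNS:
        p = Path(pattern.format(name=name)).expanduser()
        # Path.glob needs a relative pattern, so split absolute patterns at the anchor.
        root = Path(p.anchor) if p.is_absolute() else Path(".")
        rel = p.relative_to(p.anchor) if p.is_absolute() else p
        hits = [h for h in root.glob(str(rel))
                if not any(part in EXCLUDED for part in h.parts)]
        if hits:
            return hits  # caller disambiguates when len(hits) > 1
    return []
```

Returning the full hit list (rather than the first hit) is what lets the caller apply the "if multiple matches, ask which one" rule.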
Step 2 - Resolve symlinks. If the located path is a symlink, resolve it once to the real path. Work from the real path for all subsequent steps. Do not follow further symlinks inside the skill directory.
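Resolving once differs from `Path.resolve()`, which chases every link in the chain. A sketch of the resolve-once rule (hypothetical helper name):

```python
import os
from pathlib import Path

def resolve_once(path: Path) -> Path:
    """Resolve only the top-level symlink; never chase links inside the skill."""
    if path.is_symlink():
        target = Path(os.readlink(path))
        if not target.is_absolute():
            # Relative link targets resolve against the directory holding the link.
            target = path.parent / target
        return target
    return path
```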
Step 3 - Inventory the skill root. Check for these standard owned subfolders only:
- SKILL.md (required)
- LEARNINGS.md
- CLAUDE.md
- references/ - list files
- scripts/ - list files
- templates/ - list files
- data/ - list files
- examples/ - list files

Do NOT recurse into nested plugins/, .git/, node_modules/, vendor/, .venv/, or additional SKILL.md files beyond the root unless the main SKILL.md explicitly delegates to them.
Step 4 - Read SKILL.md fully. Extract frontmatter fields and note body structure (sections, line count).
Step 5 - Read supporting files referenced by SKILL.md or needed to score a dimension. Do not read files speculatively.
Step 6 - Count metrics: SKILL.md line count, number of files per subfolder.
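Step 6 is mechanical; a sketch, assuming the owned-subfolder names from Step 3 (`count_metrics` is a hypothetical helper):

```python
from pathlib import Path

OWNED = ["references", "scripts", "templates", "data", "examples"]

def count_metrics(skill_dir: Path) -> dict:
    """Line count of SKILL.md plus a file count per owned subfolder."""
    skill_md = skill_dir / "SKILL.md"
    metrics = {"skill_md_lines": len(skill_md.read_text().splitlines())}
    for sub in OWNED:
        folder = skill_dir / sub
        # Missing folders count as 0 rather than raising.
        metrics[sub] = sum(1 for f in folder.iterdir() if f.is_file()) if folder.is_dir() else 0
    return metrics
```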
Before scoring, check if this skill has been reviewed before:
python3 scripts/history.py --skill <skill-name>
If prior reviews exist, note the previous score and grade. A re-review should explain what changed since the last evaluation.
Read references/evaluation-framework.md for the 9 categories. Use the signal words table as hints only — confirm classification semantically from the surrounding text, not keyword presence alone.
A skill should fit ONE category cleanly. If evidence is mixed, mark confidence as low and explain why. Do not force a category.
Read references/evaluation-framework.md for the full rubric with scoring anchors.
For each dimension, use keyword patterns as hints — confirm semantically before assigning a score. When evidence is ambiguous, round down and note what would raise the score.
D1: Progressive Disclosure (0-3) - Does it use the file system for context engineering?
D2: Description Quality (0-3) - Is the frontmatter description trigger-focused?
D3: Gotchas Section (0-3) - Is there a section documenting known failures?
D4: Don't State the Obvious (0-3) - Is content non-obvious to Claude?
D5: Avoid Railroading (0-3) - Is there flexibility for Claude to adapt?
D6: Setup & Configuration (0-3) - Are user-specific values configurable?
Red flags: /Users/specific-name/ paths, hardcoded emails or IDs baked into the skill.
D7: Memory & Data Storage (0-3) - Does it retain useful state across runs?
D8: Scripts & Composable Code (0-3) - Does it provide reusable scripts?
D9: Frontmatter Quality (0-3) - Is the YAML frontmatter well-configured?
Red flags: Bash without subcommand scope, no argument-hint.
D10: Hooks Integration (0-2) - Does it use on-demand hooks?
score.py auto-detects the structural anti-patterns (persona-only, script-wrapper, monolithic, hardcoded-paths, no-error-handling, stale-references, overpermissioned unscoped Bash, no-output-format). Trust its output for those.
The following three require your semantic judgment — score.py does not detect them:
| Anti-Pattern | How to Detect (Claude only) |
|---|---|
| Redundant knowledge | Read the content — is this generic best-practice Claude already knows, or org-specific? |
| Category straddling | After classifying in Phase 2, does the skill pull equally hard in 2+ directions? |
| Overpermissioned (unused tools) | Do the listed allowed-tools all appear in the workflow? Score.py only catches unscoped Bash. |
Read references/report-template.md and fill it completely. Every section is required except "Comparison to Gold Standard" (see benchmarks.md for when to include it).
Score total: D1+D2+D3+D4+D5+D6+D7+D8+D9 (each 0-3) + D10 (0-2) = max 29
Grade: 25-29=A, 20-24=B, 15-19=C, 10-14=D, 0-9=F
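The grade bands reduce to a small lookup; a sketch (hypothetical function name):

```python
def grade(total: int) -> str:
    """Map a 0-29 total to a letter grade using the bands above."""
    for floor, letter in ((25, "A"), (20, "B"), (15, "C"), (10, "D")):
        if total >= floor:
            return letter
    return "F"
```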
Only auto-apply safe, non-destructive improvements. Show each change before applying.
Auto-apply:
- Add a missing argument-hint; tighten allowed-tools scope
- Insert a `## Common Mistakes` section stub (`<!-- TODO: document gotchas from real usage -->`)
- Create a references/ directory structure if SKILL.md > 150 lines and no subfolders exist
- Add LEARNINGS.md only if the skill produces outputs where cross-run retention makes sense (not for all skills)

Never auto-apply:
Run Phase 1-4 for each skill independently. Then produce a side-by-side table:
## Comparison: [left-skill] vs [right-skill]
| Dimension | [left] | [right] | Winner |
|-----------|--------|---------|--------|
| D1 | X | Y | ... |
...
| **Total** | X/29 | Y/29 | ... |
## What [left] does better
...
## What [right] does better
...
## What each could adopt from the other
...
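The dimension table skeleton above can be generated from two per-dimension score dicts. A sketch, assuming dicts keyed "D1".."D10" (the key shape is an assumption, not score.py's actual schema):

```python
def comparison_table(left: str, right: str, ls: dict, rs: dict) -> str:
    """Render the side-by-side dimension table in markdown."""
    rows = [f"| Dimension | [{left}] | [{right}] | Winner |",
            "|-----------|--------|--------|--------|"]
    for d in (f"D{i}" for i in range(1, 11)):
        winner = left if ls[d] > rs[d] else right if rs[d] > ls[d] else "tie"
        rows.append(f"| {d} | {ls[d]} | {rs[d]} | {winner} |")
    lt, rt = sum(ls.values()), sum(rs.values())
    winner = left if lt > rt else right if rt > lt else "tie"
    rows.append(f"| **Total** | {lt}/29 | {rt}/29 | {winner} |")
    return "\n".join(rows)
```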
Run this before Phase 3, not after. It gives D1, D2, D3, D6, D7, D8, D9, D10 plus anti-pattern detection in seconds:
python3 scripts/score.py <skill-directory> --json
Use --json so the output is machine-parseable. Feed the structural scores directly into the report table. Only D4 and D5 require your semantic judgment — add them to the structural total for the full score.
Adjust a structural score only if you have specific semantic evidence that contradicts it, and note the reason.
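Assuming score.py's --json output exposes per-dimension scores under keys like "D1".."D10" (the exact schema is an assumption), merging in the two semantic dimensions could look like:

```python
import json

def full_score(score_py_json: str, d4: int, d5: int) -> int:
    """Combine score.py's structural dimensions with the two semantic ones."""
    structural = json.loads(score_py_json)
    # Structural dims: D1-D3 and D6-D10 (D10 is 0-2, the rest 0-3).
    structural_total = sum(structural[f"D{i}"] for i in (1, 2, 3, 6, 7, 8, 9, 10))
    return structural_total + d4 + d5  # max 29
```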
Read config.json at startup. Key settings:
- history_log — path to history log (default: ~/.claude/plugin-data/skill-reviewer/history.log)
- default_verbose — show evidence in table output
- benchmarks_check — whether to attempt benchmark comparison (default: true)
- auto_fix_threshold — only offer --fix mode if score is below this (default: 15)

After generating a report, append a one-line entry using the history script:
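One way to honour those defaults when config.json is missing or partial (a sketch; key names as listed above, and the False default for default_verbose is an assumption since the document does not state it):

```python
import json
from pathlib import Path

DEFAULTS = {
    "history_log": "~/.claude/plugin-data/skill-reviewer/history.log",
    "default_verbose": False,  # assumption: doc does not state this default
    "benchmarks_check": True,
    "auto_fix_threshold": 15,
}

def load_config(path: str = "config.json") -> dict:
    """Start from the documented defaults; overlay whatever config.json provides."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.is_file():
        cfg.update(json.loads(p.read_text()))
    return cfg
```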
python3 scripts/history.py --append "YYYY-MM-DD | <skill-name> | <score>/29 | <grade> | <top-issue>"
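The entry format is pure string formatting; a sketch, independent of history.py itself (hypothetical function name):

```python
from datetime import date

def history_entry(skill: str, score: int, grade: str, top_issue: str) -> str:
    """Render one log line in the 'date | skill | score/29 | grade | issue' shape."""
    return f"{date.today():%Y-%m-%d} | {skill} | {score}/29 | {grade} | {top_issue}"
```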
To view past evaluations for a skill:
python3 scripts/history.py --skill <name>
This allows tracking improvement over time when a skill is reviewed repeatedly.
After every review session, check if anything new was discovered that isn't already in LEARNINGS.md:
for example, an edge case or scoring gap in score.py.

If yes, append it to LEARNINGS.md under the relevant section (Gotchas, Patterns That Work, or Version History). Keep entries specific: include the skill name, what happened, and the fix or insight.
This trigger is the mechanism that makes D3 improve over time — LEARNINGS.md feeds back into the gotchas section as the skill accumulates real failure data.
See examples/publisher-lookup-review.md for a complete example output to calibrate report quality.
- references/evaluation-framework.md — 10-dimension scoring rubric with anchors
- references/report-template.md — output format template
- references/benchmarks.md — gold standard skills and their known paths
- scripts/score.py — structural scoring script (D1-D3, D6-D10 + anti-patterns)
- scripts/history.py — view and append to the evaluation history log
- examples/publisher-lookup-review.md — complete example report for calibration
- config.json — configurable history path and behaviour settings
- LEARNINGS.md — captured gotchas and patterns from real usage

Symlink resolution: Resolve the symlink once, work from the real path. Do not chase further symlinks inside the skill. Some skills in taboola-pm-skills are symlinked 2-3 levels deep.
Nested plugins: Some repos have nested plugins/*/skills/*/plugins/ structures. Do not recurse into them — they are artifacts, not skill substructures.
Multiple matches: Glob searches will find the same skill in both the plugin source and its symlinked location. Prefer the real path (plugin source) over the symlink.
Stateless skills are not broken: A thin skill that wraps a script and has no LEARNINGS.md may score D7=0 correctly. Don't inflate the score or auto-add LEARNINGS.md unless retention genuinely helps.
Keyword classification is unreliable: "PR" appears in Runbooks and CI/CD skills. "template" appears in Scaffolding and Business Process skills. Always read the surrounding text before classifying.