Skill

eval-skill

Evaluates a single skill using static analysis and qualitative review to determine if it's worth keeping, well-built, or redundant.

code-quality

Stats

Actions

Evaluate Skill

Deep-evaluate a single skill using lint (deterministic rules) and qualitative review, both individually and in context of the full setup.

Hard Rules

Never give a verdict without running the checks. Read the actual file content and check all rubric categories before assigning a verdict.
Every category must be checked. Both the individual rubric AND the contextual analysis must be fully evaluated.
Read before you judge. Read the actual SKILL.md content (and reference files if they exist).
Don't manufacture problems. If the skill is good, say so. Only report real issues.
Always end with a short summary.
Record the exact start time and compute the exact duration at the end.

Step 1: Ask Output Preference

Before doing anything else, ask the user:

Where should i present the results?

Terminal - print the report here in the conversation

File - write a markdown report to a file (you'll choose the path)

Wait for their answer before proceeding.

Step 2: Select the Skill

Determine the skill path. If the user says a skill name, find it under skills/<name>/SKILL.md.

Step 3: Run Lint (Static Analysis)

Determine the setup context path (usually the current working directory).

uv run python skills/skill/scripts/run_skill_eval.py <skill-path> <context-path> recommended

If no context path, pass - as the second argument.

Read the JSON output. It contains diagnostics, token count, and contextual findings.

Step 4: Read Actual Files

Read the skill's actual content:

The SKILL.md file
All files in the skill's subdirectories (reference files). Check the COMBINED content.
The skill's guidelines.md (if it exists)

Also read for context (don't check these, they're context for evaluating the target skill): 4. All OTHER skill SKILL.md files in the workspace 5. CLAUDE.md 6. Hooks in .claude/settings.json

Step 5: Individual Rubric (Qualitative Review)

Read rubric/skills-rubric.md for the issue categories and what to flag.

Check the skill against all categories. For each issue found, cite specific evidence from the content.

Verdict: KEEP (no issues or minor only), REVIEW (multiple issues), REMOVE (fundamentally broken/redundant)

Step 6: Contextual Analysis

Read rubric/contextual-analysis.md and evaluate all 5 contextual dimensions.

Check redundancy against three sources:

Claude's default behavior (generic advice = redundant)
Other skills in the workspace (overlap = partially redundant)
CLAUDE.md content (duplication = wasted tokens)

Step 7: Produce the Report

Read report-format.md for the full report structure.

The report must include:

Lint results: each failed rule with WHY it failed
Qualitative review issues (by category)
Contextual analysis
+/!/x sections (good, improve, broken)
Final verdict with suggestions

At the very end:

Evaluated with: harness-eval v{version} (claude-code-plugin)
Duration: [X minutes Y seconds]

Get {version} by running: uv run python -c "import importlib.metadata; print(importlib.metadata.version('harness-eval'))"

If the user chose terminal: print the report in the conversation.

If the user chose file: write the report as markdown to the path they specified (or suggest skill-report.md in the current directory). Tell them the file path when done.

eval-skill

eval-skill

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Evaluate Skill

Hard Rules

Step 1: Ask Output Preference

Step 2: Select the Skill

Step 3: Run Lint (Static Analysis)

Step 4: Read Actual Files

Step 5: Individual Rubric (Qualitative Review)

Step 6: Contextual Analysis

Step 7: Produce the Report

Similar Skills

Similar Skills