# evaluate-plugin

Evaluates a skill's effectiveness by running behavioral test cases and grading the results against assertions. Use it to validate improvements, benchmark against baselines, or create eval cases.

Install:

```
npx claudepluginhub laurigates/claude-plugins --plugin evaluate-plugin
```
## When to Use This Skill
| Use this skill when... | Use alternative when... |
|---|---|
| Want to test if a skill produces correct results | Need structural validation -> `scripts/plugin-compliance-check.sh` |
| Validating skill improvements before merging | Want to file feedback about a session -> `/feedback:session` |
| Benchmarking a skill against a baseline | Need to check skill freshness -> `/health:audit` |
| Creating eval cases for a new skill | Want to review code quality -> `/code-review` |
```
find $1/skills -name "SKILL.md" -maxdepth 3
find $1/skills -name "evals.json" -maxdepth 3
```

## Parameters

Parse these from $ARGUMENTS:
| Parameter | Default | Description |
|---|---|---|
| `<plugin/skill-name>` | required | Path as `plugin-name/skill-name` |
| `--create-evals` | false | Generate eval cases if none exist |
| `--runs N` | 1 | Number of runs per eval case |
| `--baseline` | false | Also run without the skill for comparison |
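For illustration, a typical $ARGUMENTS string might look like the following; `my-plugin/my-skill` is a hypothetical skill path, not one that ships with this repository.

```bash
# Hypothetical invocation arguments: evaluate my-plugin/my-skill,
# generating evals if needed, with 3 runs per case plus a baseline pass.
ARGUMENTS="my-plugin/my-skill --create-evals --runs 3 --baseline"
```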
## Workflow

### Step 1: Locate the Skill

Parse $ARGUMENTS to extract `<plugin-name>` and `<skill-name>`. The skill file lives at:

```
<plugin-name>/skills/<skill-name>/SKILL.md
```

Read the SKILL.md to confirm it exists and to understand what the skill does.
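A minimal sketch of this step in shell, assuming $ARGUMENTS begins with the `plugin-name/skill-name` token:

```bash
# Split the first token of $ARGUMENTS into plugin and skill names.
spec="${ARGUMENTS%% *}"        # e.g. "my-plugin/my-skill"
plugin="${spec%%/*}"
skill="${spec#*/}"

skill_md="$plugin/skills/$skill/SKILL.md"
if [ ! -f "$skill_md" ]; then
  echo "No SKILL.md at $skill_md" >&2
  exit 1
fi
```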
### Step 2: Structural Validation

Run the compliance check to confirm the skill passes basic structural validation:

```
bash scripts/plugin-compliance-check.sh <plugin-name>
```

If structural issues are found, report them and stop: behavioral evaluation of a structurally broken skill is wasted effort.
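A hedged sketch of the gate, assuming the compliance script exits non-zero when it finds structural issues:

```bash
# Abort before behavioral evaluation if structural validation fails.
if ! bash scripts/plugin-compliance-check.sh "$plugin"; then
  echo "Structural issues found in $plugin; fix these before evaluating." >&2
  exit 1
fi
```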
### Step 3: Load or Create Eval Cases

Look for `<plugin-name>/skills/<skill-name>/evals.json`.

If the file exists: read it and validate it against the evals.json schema (see evaluate-plugin/references/schemas.md).

If the file does not exist AND --create-evals is set: analyze the SKILL.md and generate eval cases with these fields:

- `id`: Unique identifier (e.g., `eval-001`)
- `description`: What this test validates
- `prompt`: The user prompt to simulate
- `expectations`: List of assertion strings the output should satisfy
- `tags`: Categorization tags

Write the generated cases to `<plugin-name>/skills/<skill-name>/evals.json`.

If the file does not exist AND --create-evals is NOT set: report that no eval cases exist and suggest rerunning with --create-evals.
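To make the shape concrete, here is an invented eval case written via a heredoc. The exact top-level structure is defined by references/schemas.md, so treat the array wrapper as an assumption; the prompt and expectations are placeholders.

```bash
# Sketch only: one hypothetical eval case with the five fields above.
# The authoritative schema lives in evaluate-plugin/references/schemas.md.
cat > "$plugin/skills/$skill/evals.json" <<'EOF'
[
  {
    "id": "eval-001",
    "description": "Skill handles the basic happy-path prompt",
    "prompt": "Summarize the attached quarterly report in three bullets.",
    "expectations": [
      "Output contains exactly three bullet points",
      "Each bullet references a figure from the report"
    ],
    "tags": ["basic", "formatting"]
  }
]
EOF
```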
### Step 4: Run Eval Cases

For each eval case, for each run (up to --runs N):

1. Create a run directory at `<plugin-name>/skills/<skill-name>/eval-results/runs/<eval-id>-run-<N>/`.
2. Launch a subagent (Task, subagent_type: general-purpose) that executes the eval prompt with the skill content loaded.
3. Record run timing to timing.json and the full conversation to transcript.md.

### Step 5: Baseline Runs (Optional)

If --baseline is set, repeat Step 4 but without loading the skill content. This creates a comparison point to measure skill effectiveness. Use the same eval prompts and record results in a parallel baseline/ subdirectory.
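A rough sketch of the directory bookkeeping for Steps 4 and 5; `$eval_id`, `$runs`, and `$baseline` are assumed to come from the parsed evals.json and flags, not variables the skill actually defines.

```bash
# Create per-run directories for the skill pass and, if requested,
# a parallel baseline pass without the skill content loaded.
results="$plugin/skills/$skill/eval-results"
for run in $(seq 1 "$runs"); do
  mkdir -p "$results/runs/$eval_id-run-$run"
  if [ "$baseline" = true ]; then
    mkdir -p "$results/baseline/$eval_id-run-$run"
  fi
  # Each subagent run then writes timing.json and transcript.md here.
done
```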
### Step 6: Grade Each Run

For each run, delegate grading to the eval-grader agent via Task:

```
Task subagent_type: eval-grader
Prompt: Grade this eval run against the assertions.
  Eval case: <eval case from evals.json>
  Transcript: <path to transcript.md>
  Output artifacts: <list of created/modified files>
```

The grader produces a grading.json for each run.
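Once gradings exist, a quick way to eyeball them is to assume each grading.json exposes a top-level pass verdict; that field name is an assumption here, with the real schema in references/schemas.md.

```bash
# List each run's grading verdict, assuming a top-level "pass" boolean.
for g in "$plugin/skills/$skill"/eval-results/runs/*/grading.json; do
  printf '%s\t%s\n' "$g" "$(jq -r '.pass' "$g")"
done
```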
### Step 7: Aggregate Results

Compute aggregate statistics across all runs: per-eval and overall pass rates, run counts, and durations. If --baseline was used, also compute the deltas between skill runs and baseline runs for each metric.

Write the aggregated results to `<plugin-name>/skills/<skill-name>/eval-results/benchmark.json`.
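The bundled aggregate_benchmark.sh is the real aggregator; purely to illustrate the pass-rate arithmetic, a jq one-liner under the same `pass` assumption as above might look like this. The output field names are illustrative, not the script's actual schema.

```bash
# Slurp all gradings and compute run count plus overall pass rate.
jq -s '{runs: length,
        pass_rate: ((map(select(.pass)) | length) / length)}' \
  "$plugin/skills/$skill"/eval-results/runs/*/grading.json \
  > "$plugin/skills/$skill/eval-results/benchmark.json"
```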
### Step 8: Report

Print a summary table:

```markdown
## Evaluation Results: <plugin/skill-name>

| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
| Runs | 3 | 3 | — |

### Per-Eval Breakdown

| Eval | Description | Pass Rate | Status |
|------|-------------|-----------|--------|
| eval-001 | Basic usage | 100% | PASS |
| eval-002 | Edge case | 67% | PARTIAL |
| eval-003 | Boundary | 100% | PASS |
```
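As a final illustration, one row of that summary could be rendered from benchmark.json; the `pass_rate` field name follows the aggregation sketch above and is not a confirmed schema.

```bash
# Render the Pass Rate row from the (assumed) benchmark.json fields.
bench="$plugin/skills/$skill/eval-results/benchmark.json"
printf '| Pass Rate | %s%% |\n' "$(jq -r '.pass_rate * 100 | round' "$bench")"
```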
### Quick Reference

| Context | Command |
|---|---|
| Check skill exists | `ls <plugin>/skills/<skill>/SKILL.md` |
| Check evals exist | `ls <plugin>/skills/<skill>/evals.json` |
| Read evals | `cat <plugin>/skills/<skill>/evals.json \| jq .` |
| Create results dir | `mkdir -p <plugin>/skills/<skill>/eval-results/runs` |
| Write JSON | `jq -n '<expression>' > file.json` |
| Aggregate results | `bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin>` |
### Flags

| Flag | Description |
|---|---|
| `--create-evals` | Generate eval cases from SKILL.md analysis |
| `--runs N` | Number of runs per eval case (default: 1) |
| `--baseline` | Run without the skill for comparison |