Help us improve
Share bugs, ideas, or general feedback.
From skilltest
Runs four-layer tests on Claude Code plugin skills: structure validation, trigger accuracy, multi-turn sessions, and value comparisons using Python scripts like validate.py and run_trigger_eval.py.
npx claudepluginhub danielscholl/claude-sdlc --plugin skilltestHow this skill is triggered — by the user, by Claude, or both
Slash command
/skilltest:test-frameworkThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Test framework for validating Claude Code plugin skills across four layers:
Scaffolds pytest smoke tests and runs behavioral tests for Claude Code skills in Docker harness. Generates golden files, runs pytest, reports LLM verdicts and costs.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Audits Claude Code skills for structure compliance, triggering accuracy, instruction quality, and best practices. Scores 0-100 with prioritized improvement recommendations.
Share bugs, ideas, or general feedback.
Test framework for validating Claude Code plugin skills across four layers: structure, trigger accuracy, session behavior, and skill value.
Traditional testing verifies deterministic behavior. AI plugin skills are probabilistic — the same prompt can trigger different skills across runs, routing is inferred not explicit, quality degrades silently, and models improve over time (making skills redundant). This framework addresses all four failure modes.
| Layer | What it tests | Speed | Script |
|---|---|---|---|
| L1 Structure | Plugin spec compliance, naming, cross-refs | ~0.1s | validate.py |
| L2 Triggers | Skill description accuracy (precision/recall) | ~30s/query | run_trigger_eval.py |
| L3 Sessions | Multi-turn routing, context, boundaries | 2-3 min | session_test.py |
| L4 Value | Does the skill actually help? (with vs without) | 5+ min | compare_skill.py |
All scripts live in skills/test-framework/scripts/ and accept a --root
flag pointing to the plugin directory being tested (defaults to cwd).
# L1: Validate plugin structure
uv run skills/test-framework/scripts/validate.py --root .
# L1: Validate a specific skill
uv run skills/test-framework/scripts/validate.py --root . skills/my-skill/
# L2: Dry-run trigger eval (validate eval set structure)
uv run skills/test-framework/scripts/run_trigger_eval.py \
--eval-set tests/evals/triggers/my-skill.json \
--skill-path skills/my-skill \
--dry-run
# L2: Run trigger eval against claude
uv run skills/test-framework/scripts/run_trigger_eval.py \
--eval-set tests/evals/triggers/my-skill.json \
--skill-path skills/my-skill \
--runs-per-query 3
# L3: Run a session scenario
uv run skills/test-framework/scripts/session_test.py \
--scenario tests/evals/scenarios/my-workflow.json \
--verbose
# L4: Compare skill value (with vs without)
uv run skills/test-framework/scripts/compare_skill.py \
--skill my-skill \
--scenario tests/evals/scenarios/my-workflow.json \
--runs 3 --verbose
# All layers for one skill
uv run skills/test-framework/scripts/test_skill.py my-skill
# Test inventory
uv run skills/test-framework/scripts/test_skill.py --inventory
Validates .claude-plugin/plugin.json, agents, skills, commands, MCP config,
and cross-references.
What it checks:
.claude-plugin/plugin.json — required fields, semver, agent path references.mcp.json — server configs have command/url (optional)CLAUDE.md — exists with meaningful content (optional but recommended)uv run skills/test-framework/scripts/validate.py --root . --json
Tests whether a skill's description causes Claude to activate for the right prompts. See eval-schemas.md for JSON format.
Create tests/evals/triggers/{skill-name}.json:
{
"skill_name": "my-skill",
"evals": [
{"query": "realistic prompt that should trigger this skill", "should_trigger": true},
{"query": "near-miss prompt that should NOT trigger", "should_trigger": false}
]
}
Guidelines:
| Metric | Meaning |
|---|---|
| Precision | When the skill triggers, how often is it correct? |
| Recall | When the skill should trigger, how often does it? |
| Accuracy | Overall correct rate |
Low recall = description too narrow. Low precision = description too broad.
Tests multi-turn context, routing accuracy, and skill boundaries.
Create tests/evals/scenarios/{name}-workflow.json:
{
"name": "my-skill-workflow",
"description": "Test my-skill routing and context",
"ready_pattern": "❯|\\$|>",
"steps": [
{
"name": "basic-query",
"prompt": "what does my-skill handle?",
"timeout": 90,
"pause_after": 3,
"assertions": [
{"pattern": "expected-keyword", "type": "contains", "description": "Routes correctly"},
{"pattern": "wrong-skill-keyword", "type": "not_contains", "description": "Does NOT invoke wrong skill"}
]
}
]
}
Assertion types: contains, regex, not_contains
Safety: Prompts must be read-only. Action verbs (create, delete, push, deploy)
are blocked automatically when running with --dangerously-skip-permissions.
Measures whether a skill actually helps by running the same scenario with and without the skill loaded.
| Verdict | Delta | Action |
|---|---|---|
| VALUABLE | >+10% | Keep the skill |
| MARGINAL | +1-10% | Review if context cost is worth it |
| REDUNDANT | ~0% (both high) | Model already knows this — consider removing |
| INEFFECTIVE | ~0% (both low) | Rewrite — skill isn't helping |
| HARMFUL | Negative | Remove or rewrite — skill makes things worse |
Copy the Makefile to your plugin root or use ROOT to point at your plugin:
make test # L1 + L2 (fast)
make lint ROOT=/path/to/plugin # L1 only
make integration S=my-skill # L3 for one skill
make benchmark S=my-skill # L4 for one skill
make report ROOT=/path/to/plugin # Test inventory
make test-skill S=my-skill # All layers
skills/{name}/SKILL.md)validate.py to check structuretests/evals/triggers/{name}.json) — 8+ positive, 8+ negativetests/evals/scenarios/{name}-workflow.json)test_skill.py {name} to verify all layers