Help us improve
Share bugs, ideas, or general feedback.
From skill-maker
Generates and runs trigger tests, functional tests, and performance baselines for a skill given path to folder or SKILL.md.
npx claudepluginhub vcode-sh/vibe-tools --plugin skill-makerHow this command is triggered — by the user, by Claude, or both
Slash command
/skill-maker:test path to skill folder or SKILL.mdThe summary Claude sees in its command listing — used to decide when to auto-load this command
# Skill Testing Suite Generate comprehensive test cases for a skill and help the user verify it works correctly. Covers three test areas from the Anthropic testing methodology: triggering, functional, and performance. ## Input The user provides: `$ARGUMENTS` If `$ARGUMENTS` is a path, use it directly. If `$ARGUMENTS` is empty, ask: "Which skill should I test? Provide the path to the skill folder or SKILL.md file." ## Step 1: Read and Analyze the Skill Read SKILL.md and all supporting files. Extract: - Skill name and description - Trigger phrases from the description - Workflow steps f...
/test-skillTests Claude Code skills through RED/GREEN/REFACTOR TDD phases in fresh subagents for validation, checkpoint enforcement, and priming bias prevention. Supports [skill-path] [--phase red|green|refactor].
/skill-evalTests a skill with parallel with-skill and baseline eval runs across multiple prompts, grades against assertions, and iterates on improvements.
/skill-craftCreates skills through checklist-driven phases (intake, design, build, test, ship) or reviews skills/agent prompts with quality gates.
/test-frameworkRuns four-layer test framework for Claude Code plugin skills: validates structure, tests trigger accuracy, runs session scenarios, and compares skill value.
/testRuns TDD workflow: write failing tests first, implement to pass them, refactor. For bugs, applies Prove-It pattern to reproduce and fix.
/testRuns pytest tests for CLI harness on local path or GitHub repo, verifies CLI resolution, and updates TEST.md with results if all pass.
Share bugs, ideas, or general feedback.
Generate comprehensive test cases for a skill and help the user verify it works correctly. Covers three test areas from the Anthropic testing methodology: triggering, functional, and performance.
The user provides: $ARGUMENTS
If $ARGUMENTS is a path, use it directly.
If $ARGUMENTS is empty, ask: "Which skill should I test? Provide the path to the skill folder or SKILL.md file."
Read SKILL.md and all supporting files. Extract:
Goal: Ensure the skill loads at the right times and ONLY the right times.
Generate 3 categories of test queries:
Generate queries that SHOULD activate this skill:
TRIGGERING TESTS - Should Activate:
1. "[exact trigger phrase from description]"
2. "[paraphrased version]"
3. "[indirect request]"
4. "[question about domain]"
5. "[alternative wording]"
Generate queries that should NOT activate this skill:
TRIGGERING TESTS - Should NOT Activate:
1. "[unrelated topic]"
2. "[adjacent domain]"
3. "[keyword in different context]"
4. "[generic overlap]"
5. "[out of scope]"
Generate ambiguous queries where triggering is debatable:
TRIGGERING TESTS - Edge Cases (may or may not trigger):
1. "[ambiguous request]" - Expected: [trigger/no-trigger] because [reason]
2. "[multi-domain]" - Expected: [trigger/no-trigger] because [reason]
3. "[partial overlap]" - Expected: [trigger/no-trigger] because [reason]
Goal: Verify the skill produces correct outputs.
For each major workflow in the skill, generate a test case:
FUNCTIONAL TEST: [Workflow Name]
Given: [Input description and prerequisites]
When: User says "[trigger query]"
Then:
- [ ] [Expected output 1]
- [ ] [Expected output 2]
- [ ] [Expected behavior]
- [ ] No errors or failed tool calls
Generate 2-4 functional test cases covering:
Goal: Establish metrics for measuring skill effectiveness.
PERFORMANCE BASELINE
═══════════════════
Target metrics:
- Trigger accuracy: 90%+ of relevant queries should activate
- False positive rate: <10% of unrelated queries should activate
- Workflow completion: X tool calls expected
- User corrections needed: 0-1 per workflow
- Token estimate: ~[X] tokens per workflow
Comparison (without skill):
- User would need to: [describe manual steps]
- Estimated messages: [X]
- Estimated errors: [X]
- Estimated tokens: [X]
Format all tests as a complete test plan:
SKILL TEST PLAN: [skill-name]
══════════════════════════════
TRIGGERING TESTS (13-21 queries)
├── Should Trigger: [X] queries
├── Should NOT Trigger: [X] queries
└── Edge Cases: [X] queries
FUNCTIONAL TESTS ([X] test cases)
├── Happy path: [description]
├── Secondary: [description]
├── Edge case: [description]
└── Error case: [description]
PERFORMANCE BASELINE
├── Target trigger accuracy: 90%+
├── Expected tool calls: [X]
└── Expected token savings: [X]%
HOW TO RUN THESE TESTS:
1. Enable the skill in Claude.ai or Claude Code
2. For each triggering test, send the query and note if skill loaded
3. For each functional test, run the scenario and check all conditions
4. Track metrics against the performance baseline
5. If trigger accuracy <90%, run /skill-maker:improve with the results
Ask the user: "Would you like me to evaluate the triggering tests now? I can analyze the description and predict which queries would trigger the skill."
If yes, analyze each test query against the skill's description and predict:
Present results with specific description improvements to fix any RISK/FAIL cases.