# skill-forge

Improves existing Claude Code skills by fixing under- and over-triggering, refining instructions, adding sub-skills, and evolving architecture based on feedback.

Install:

```
npx claudepluginhub agricidaniel/skill-forge
```

This skill uses the workspace's default tool permissions.
## Diagnose the Problem

Ask the user or analyze logs to identify the problem category:
- **Category A: Triggering Issues** (the skill fires too rarely or too often)
- **Category B: Execution Issues** (the skill triggers but produces wrong or incomplete results)
- **Category C: Architecture Issues** (the skill has outgrown its current tier or structure)
- **Category D: Quality Issues** (output is inconsistent or hard to measure)
Common causes of under-triggering: a vague, generic `description` that never mentions the phrases users actually type.

Fix template:
```yaml
# Before (under-triggers)
description: Analyzes code quality

# After (specific triggers)
description: >
  Static code analysis and quality assessment. Checks code style,
  complexity, security vulnerabilities, and test coverage. Use when
  user says "code review", "code quality", "lint", "static analysis",
  "code smell", "code audit", or "check my code".
```
Fix template for over-triggering (narrow the scope and add negative triggers):
```yaml
# Before (over-triggers)
description: Processes documents for review

# After (specific + negative triggers)
description: >
  Processes PDF legal documents for contract clause extraction and
  compliance review. Use for legal contracts, NDAs, terms of service.
  Do NOT use for general document editing, formatting, or non-legal PDFs.
```
Use structured workspaces to track improvements across iterations:
```
eval-workspace/
  iteration-1/                # First version
    eval-0/with_skill/        # Eval results
    eval-0/baseline/
    benchmark.json            # Aggregated metrics
    benchmark.md              # Human-readable report
    feedback.json             # User feedback
  iteration-2/                # After first improvement
    eval-0/with_skill/
    eval-0/baseline/
    benchmark.json
    benchmark.md
    feedback.json
```
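To make the workspace concrete, here is a minimal sketch of how an iteration's `benchmark.json` could be aggregated from its eval directories. The `result.json` filename and the score fields are assumptions for illustration; skill-forge's actual schema may differ.

```python
import json
from pathlib import Path
from statistics import mean

def aggregate_benchmark(iteration_dir: Path) -> dict:
    """Roll per-eval results up into benchmark.json (hypothetical schema)."""
    rows = []
    for eval_dir in sorted(iteration_dir.glob("eval-*")):
        # "result.json" and its "score" field are assumed names,
        # not documented by skill-forge.
        with_skill = json.loads((eval_dir / "with_skill" / "result.json").read_text())
        baseline = json.loads((eval_dir / "baseline" / "result.json").read_text())
        rows.append({"eval": eval_dir.name,
                     "with_skill": with_skill["score"],
                     "baseline": baseline["score"]})
    benchmark = {
        "evals": rows,
        "mean_with_skill": mean(r["with_skill"] for r in rows) if rows else 0.0,
        "mean_baseline": mean(r["baseline"] for r in rows) if rows else 0.0,
    }
    (iteration_dir / "benchmark.json").write_text(json.dumps(benchmark, indent=2))
    return benchmark
```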
The iteration loop:

1. Run `/skill-forge eval <path>`, writing results into `iteration-<N+1>/`.
2. Run `/skill-forge benchmark <path>` with `--previous iteration-<N>/feedback.json`.

Stop iterating when benchmark scores stop improving between iterations.
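A plateau check for that stop condition might look like the sketch below; it reuses the `mean_with_skill` field from the hypothetical `benchmark.json` schema above.

```python
import json
from pathlib import Path

def should_stop(prev_iter: Path, curr_iter: Path, min_gain: float = 0.01) -> bool:
    """True when the latest iteration no longer beats the previous one."""
    prev = json.loads((prev_iter / "benchmark.json").read_text())
    curr = json.loads((curr_iter / "benchmark.json").read_text())
    return curr["mean_with_skill"] - prev["mean_with_skill"] < min_gain
```

For example, `should_stop(Path("eval-workspace/iteration-1"), Path("eval-workspace/iteration-2"))`.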
For quick fixes without the full eval pipeline (a test-harness sketch follows the list):

1. Apply the fix.
2. Test with the original failing case.
3. Test with 3 other cases (regression check).
4. If the fix works:
   - Update the directive/SKILL.md.
   - Document the learning in references or SKILL.md.
5. If the fix fails:
   - Diagnose why.
   - Try an alternative approach.
   - Repeat.
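Steps 2 and 3 can be scripted with a tiny harness like this sketch; `run_skill` is a hypothetical callable that invokes the skill on one input and returns its output.

```python
def regression_check(run_skill, failing_case: dict, known_good_cases: list[dict]) -> dict:
    """Re-run the original failing case plus known-good cases after a fix."""
    return {
        # run_skill is a hypothetical helper, not part of skill-forge
        "fix_works": run_skill(failing_case["input"]) == failing_case["expected"],
        "regressions": [
            case["name"]
            for case in known_good_cases
            if run_skill(case["input"]) != case["expected"]
        ],
    }
```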
For triggering issues (Category A), use the automated optimization loop:

```bash
python scripts/generate_eval_set.py <path>
python scripts/optimize_description.py <path> --eval-set evals.json
```
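Conceptually this is a propose-score-keep loop. The sketch below shows the shape of that loop, not the actual implementation of `optimize_description.py`; `propose_variants` and `run_eval` are hypothetical callables passed in by the caller.

```python
from typing import Callable, Iterable

def optimize_description(
    description: str,
    eval_set: list[dict],
    propose_variants: Callable[[str], Iterable[str]],  # hypothetical: mutates trigger phrases
    run_eval: Callable[[str, list[dict]], float],      # hypothetical: fraction of evals passed
    rounds: int = 5,
) -> str:
    """Keep the best-scoring description variant across a few rounds (sketch)."""
    best, best_score = description, run_eval(description, eval_set)
    for _ in range(rounds):
        for candidate in propose_variants(best):
            score = run_eval(candidate, eval_set)
            if score > best_score:
                best, best_score = candidate, score
    return best
```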
When a skill outgrows its tier:

- **Tier 1 -> Tier 2** (needs scripts): add a `scripts/` directory
- **Tier 2 -> Tier 3** (needs sub-skills): add `skills/{parent}-{child}/SKILL.md` and `references/`
- **Tier 3 -> Tier 4** (needs agents): add an `agents/` directory

After evolution:

- Bump `metadata.version` in frontmatter (if present)
- Re-validate: `python scripts/validate_skill.py <path>` (a sketch of what validation can check follows)
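As an illustration only, a minimal validation pass could check the frontmatter like this; the real `scripts/validate_skill.py` may check more (or different) things. Assumes PyYAML is installed.

```python
from pathlib import Path
import yaml  # assumes PyYAML is available

def validate_skill(skill_dir: Path) -> list[str]:
    """Return a list of frontmatter problems in SKILL.md (minimal sketch)."""
    text = (skill_dir / "SKILL.md").read_text()
    if not text.startswith("---"):
        return ["SKILL.md is missing YAML frontmatter"]
    frontmatter = yaml.safe_load(text.split("---", 2)[1]) or {}
    return [
        f"frontmatter missing required field: {field}"
        for field in ("name", "description")
        if not frontmatter.get(field)
    ]
```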
When a skill needs to adapt behavior by user type, add a detection section:

```markdown
## Industry Detection

Detect user type from context:
- **Type A**: [signals] -> [behavior]
- **Type B**: [signals] -> [behavior]
```
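A detection section like this can be backed by simple keyword signals. The sketch below is one way to score them; the signal lists are placeholders, not part of skill-forge.

```python
# Placeholder signal lists; a real skill would tailor these per user type.
USER_TYPE_SIGNALS: dict[str, list[str]] = {
    "type_a": ["contract", "nda", "clause", "compliance"],
    "type_b": ["lint", "refactor", "ci", "test coverage"],
}

def detect_user_type(context: str) -> str | None:
    """Return the user type whose signals appear most often, or None."""
    lowered = context.lower()
    scores = {
        user_type: sum(signal in lowered for signal in signals)
        for user_type, signals in USER_TYPE_SIGNALS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```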
When output quality is inconsistent, add quality gates:

```markdown
## Quality Gates

Before delivering output:
- [ ] [Check 1]
- [ ] [Check 2]
- [ ] [Check 3]
```
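One way to make such a checklist enforceable is to express each gate as a predicate over the output; the gates below are placeholders.

```python
from typing import Callable

# Placeholder gates; replace with checks that match your skill's output.
QUALITY_GATES: dict[str, Callable[[str], bool]] = {
    "output is non-empty": lambda out: bool(out.strip()),
    "no unresolved TODOs": lambda out: "TODO" not in out,
    "under 500 lines": lambda out: out.count("\n") < 500,
}

def failed_gates(output: str) -> list[str]:
    """Names of gates the output fails; deliver only when this is empty."""
    return [name for name, check in QUALITY_GATES.items() if not check(output)]
```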
When users need measurable output, add a scoring rubric:

```markdown
## Scoring (0-100)

| Category | Weight |
|----------|--------|
| Category A | 30% |
| Category B | 30% |
| Category C | 20% |
| Category D | 20% |
```
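The rubric rolls up into a single number as a weighted sum; a quick sketch, assuming each category is scored 0-100:

```python
# Weights from the table above; they must sum to 1.0.
WEIGHTS = {"Category A": 0.30, "Category B": 0.30, "Category C": 0.20, "Category D": 0.20}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted 0-100 rollup of per-category scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(weight * category_scores[cat] for cat, weight in WEIGHTS.items())

# Example:
# overall_score({"Category A": 80, "Category B": 70,
#                "Category C": 90, "Category D": 60})  # -> 75.0
```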