Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
`npx claudepluginhub agricidaniel/skill-forge`

This skill uses the workspace's default tool permissions.
Run structured evaluations against Claude Code skills to verify triggering, correctness, and quality using a multi-agent pipeline. skill-forge tests and benchmarks skills empirically via evaluation-driven development: behavioral test cases are executed with and without the skill, graded against assertions, and compared on pass rates, timing, and token metrics, either in a quick workflow or the full 7-phase pipeline. It also supports blind before/after comparisons between skill versions and variance analysis across multiple runs. Use it to validate improvements, benchmark against baselines, or create eval cases.
Accept eval definitions from `evals/evals.json` or a user-specified file.

Eval set JSON schema:
```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "evals": [
    {
      "eval_id": 0,
      "eval_name": "descriptive-name",
      "prompt": "The user's task prompt",
      "input_files": [],
      "assertions": [
        {
          "name": "output-has-score",
          "check": "Output contains a numeric score between 0-100",
          "weight": 1.0
        }
      ],
      "should_trigger": true
    }
  ]
}
```
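To make the expected shape concrete, here is a minimal sketch of loading an eval set and checking it against the fields above; the helper name and the validation approach are illustrative, not part of the skill:

```python
import json
from pathlib import Path

# Fields every eval entry is expected to carry, per the schema above.
REQUIRED_EVAL_FIELDS = {"eval_id", "eval_name", "prompt", "assertions", "should_trigger"}

def load_eval_set(path="evals/evals.json"):
    """Load an eval set file and do a light sanity check on its shape."""
    data = json.loads(Path(path).read_text())
    for field in ("skill_name", "skill_path", "evals"):
        if field not in data:
            raise ValueError(f"eval set missing top-level field: {field}")
    for eval_case in data["evals"]:
        missing = REQUIRED_EVAL_FIELDS - eval_case.keys()
        if missing:
            raise ValueError(f"eval {eval_case.get('eval_id')} missing fields: {missing}")
    return data
```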
If no eval set exists, generate one: run `python scripts/generate_eval_set.py <skill-path>` to produce a starter set.

Create the eval workspace outside the skill directory to avoid confusing eval artifacts with skill files. Use a sibling directory or a dedicated location:
```
eval-workspace/
  iteration-1/
    eval-0/
      eval_metadata.json   # Assertions and config for this eval
      with_skill/
        outputs/           # Skill execution outputs
        timing.json        # Token count + duration
        grading.json       # Assertion results + evidence
      baseline/
        outputs/
        timing.json
        grading.json
    eval-1/
      eval_metadata.json
      with_skill/
        outputs/
        timing.json
        grading.json
      baseline/
        outputs/
        timing.json
        grading.json
    benchmark.json         # Aggregated metrics
    benchmark.md           # Human-readable report
```
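A short sketch of preparing this layout up front, assuming directory names exactly as shown (the helper itself is illustrative):

```python
from pathlib import Path

def create_iteration_dirs(workspace, iteration, eval_ids):
    """Create eval-<ID>/{with_skill,baseline}/outputs/ for one iteration."""
    root = Path(workspace) / f"iteration-{iteration}"
    for eval_id in eval_ids:
        for run_type in ("with_skill", "baseline"):
            (root / f"eval-{eval_id}" / run_type / "outputs").mkdir(parents=True, exist_ok=True)
    return root
```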
For each eval directory, create eval_metadata.json from the eval set entry:
```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [...],
  "should_trigger": true
}
```
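A sketch of producing these files from the eval set entries; the field names follow the schemas above, while the function itself is hypothetical:

```python
import json
from pathlib import Path

def write_eval_metadata(iteration_root, eval_case):
    """Write the assertions and config the graders will need for one eval."""
    metadata = {
        "eval_id": eval_case["eval_id"],
        "eval_name": eval_case["eval_name"],
        "prompt": eval_case["prompt"],
        "assertions": eval_case["assertions"],
        "should_trigger": eval_case["should_trigger"],
    }
    out = Path(iteration_root) / f"eval-{eval_case['eval_id']}" / "eval_metadata.json"
    out.write_text(json.dumps(metadata, indent=2))
```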
For each eval in the set, spawn two parallel runs:
With-skill run (delegate to agents/skill-forge-executor.md):
```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>
```
Baseline run (delegate to agents/skill-forge-executor.md): same task and input files, but without the skill; save outputs to <workspace>/iteration-<N>/eval-<ID>/baseline/outputs/.
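The two runs are independent, so they can be dispatched concurrently. A sketch of that structure, where `run_executor` is a hypothetical stand-in for handing a prompt to the skill-forge-executor agent:

```python
from concurrent.futures import ThreadPoolExecutor

def run_eval_pair(eval_case, workspace, run_executor):
    """Launch the with-skill and baseline runs for one eval in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        with_skill = pool.submit(run_executor, eval_case, "with_skill", workspace)
        baseline = pool.submit(run_executor, eval_case, "baseline", workspace)
        return with_skill.result(), baseline.result()
```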
Save timing data to timing.json in each run directory:
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
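A sketch of producing that file, assuming the executor reports a token total and the caller captured a start time (both assumptions):

```python
import json
import time
from pathlib import Path

def write_timing(run_dir, total_tokens, started_at):
    """Record token count and duration in the format shown above."""
    duration_ms = int((time.monotonic() - started_at) * 1000)
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    (Path(run_dir) / "timing.json").write_text(json.dumps(timing, indent=2))
```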
Delegate to agents/skill-forge-grader.md for each completed run:
grading.json per run:

```json
{
  "eval_id": 0,
  "run_type": "with_skill",
  "assertions": [
    {
      "name": "output-has-score",
      "passed": true,
      "evidence": "Found score: 87/100 on line 14"
    }
  ],
  "pass_rate": 1.0
}
```
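The eval schema gives each assertion a weight, so one plausible reading is that pass_rate is a weighted average of passing assertions. The grader agent may compute it differently, so treat this as an assumption:

```python
def weighted_pass_rate(graded_assertions, assertion_weights):
    """Weighted fraction of assertions that passed.

    An assertion graded `passed: null` is treated here as not passing (an assumption).
    """
    total = sum(assertion_weights.get(a["name"], 1.0) for a in graded_assertions)
    passed = sum(assertion_weights.get(a["name"], 1.0) for a in graded_assertions if a["passed"] is True)
    return passed / total if total else 0.0
```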
Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>`.
This produces benchmark.json and benchmark.md, aggregating pass rates, timing, and token metrics for the with-skill and baseline runs.
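For intuition, a rough approximation of that aggregation step, reading each run's grading.json and timing.json; this is not the real scripts/aggregate_benchmark.py, just a sketch of the idea:

```python
import json
from pathlib import Path
from statistics import mean

def aggregate(iteration_root):
    """Average pass rate, duration, and tokens per run type across all evals."""
    rows = {"with_skill": [], "baseline": []}
    for eval_dir in sorted(Path(iteration_root).glob("eval-*")):
        for run_type in rows:
            grading = json.loads((eval_dir / run_type / "grading.json").read_text())
            timing = json.loads((eval_dir / run_type / "timing.json").read_text())
            rows[run_type].append(
                (grading["pass_rate"], timing["total_duration_seconds"], timing["total_tokens"])
            )
    return {
        run_type: {
            "avg_pass_rate": mean(r[0] for r in runs),
            "avg_seconds": mean(r[1] for r in runs),
            "avg_tokens": mean(r[2] for r in runs),
        }
        for run_type, runs in rows.items() if runs
    }
```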
Delegate to agents/skill-forge-analyzer.md to identify patterns and insights across the results and recommend improvements based on failures.
Generate a summary report:
```markdown
# Eval Report: [skill-name] — Iteration [N]

## Overall

| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | X% | Y% | +Z% |
| Avg Time | Xs | Ys | -Zs |
| Avg Tokens | X | Y | -Z |

## Per-Eval Results

| Eval | With Skill | Baseline | Status |
|------|-----------|----------|--------|
| eval-0 | PASS | FAIL | Improved |
| eval-1 | PASS | PASS | Maintained |

## Patterns & Insights

[From analyzer agent]

## Recommendations

[Specific improvements based on failures]
```
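As an illustration of filling in the Overall table, a small formatter over the aggregate metrics sketched earlier (the key names are assumptions carried over from that sketch):

```python
def overall_table(agg):
    """Render the Overall comparison table from with_skill/baseline aggregates."""
    ws, bl = agg["with_skill"], agg["baseline"]
    return "\n".join([
        "| Metric | With Skill | Baseline | Delta |",
        "|--------|-----------|----------|-------|",
        f"| Pass Rate | {ws['avg_pass_rate']:.0%} | {bl['avg_pass_rate']:.0%} | {ws['avg_pass_rate'] - bl['avg_pass_rate']:+.0%} |",
        f"| Avg Time | {ws['avg_seconds']:.1f}s | {bl['avg_seconds']:.1f}s | {ws['avg_seconds'] - bl['avg_seconds']:+.1f}s |",
        f"| Avg Tokens | {ws['avg_tokens']:.0f} | {bl['avg_tokens']:.0f} | {ws['avg_tokens'] - bl['avg_tokens']:+.0f} |",
    ])
```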
Save user feedback to feedback.json:
```json
{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "2026-03-06T12:00:00Z"
    }
  ],
  "status": "complete"
}
```
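A sketch of appending a review in this format; where feedback.json lives and what its initial status value should be are assumptions:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def add_feedback(workspace, run_id, feedback):
    """Append a review entry to feedback.json, creating the file if needed."""
    path = Path(workspace) / "feedback.json"
    data = json.loads(path.read_text()) if path.exists() else {"reviews": [], "status": "pending"}
    data["reviews"].append({
        "run_id": run_id,
        "feedback": feedback,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    path.write_text(json.dumps(data, indent=2))
```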
Pass feedback to /skill-forge evolve for the next iteration.
For rigorous A/B testing between skill versions, delegate to agents/skill-forge-comparator.md, pointing it at the run outputs in eval-<ID>/with_skill/outputs/ and eval-<ID>/baseline/outputs/ plus each eval's eval_metadata.json.

Error handling:
- If a run times out, record "timed_out": true in its timing.json.
- If a run fails, write error.txt in the run directory and continue with the remaining evals.
- If an assertion cannot be evaluated, grade it as "passed": null with evidence explaining why.

Before marking an eval run as complete, verify that outputs/, timing.json, and grading.json exist in the run directory.
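A sketch of the error-handling rules above applied to a single run; `execute` is a hypothetical stand-in for dispatching the run, and the timeout value is illustrative:

```python
import json
from pathlib import Path

def run_with_error_handling(run_dir, execute, timeout_s=600):
    """Record timeouts and failures without stopping the rest of the pipeline."""
    run_dir = Path(run_dir)
    try:
        execute(timeout=timeout_s)
    except TimeoutError:
        # Mark the run as timed out in timing.json, per the rules above.
        timing_path = run_dir / "timing.json"
        timing = json.loads(timing_path.read_text()) if timing_path.exists() else {}
        timing["timed_out"] = True
        timing_path.write_text(json.dumps(timing, indent=2))
    except Exception as err:
        # Any other failure: note it and let the remaining evals continue.
        (run_dir / "error.txt").write_text(str(err))
```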