Benchmarks Claude Code skill performance via multiple trials per eval, tracking pass rate, execution time, token usage, and variance. Aggregates to benchmark.json and generates version comparison reports. Use for 'benchmark skill' or performance tracking queries.
Install with `npx claudepluginhub agricidaniel/skill-forge`. This skill uses the workspace's default tool permissions.
Measure and compare skill performance across iterations with statistical rigor using multiple trials, variance analysis, and trend tracking.
Accept configuration as:
- evals/evals.json (from /skill-forge eval)

Benchmark config schema:
```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "eval_set_path": "./evals/evals.json",
  "trials_per_eval": 3,
  "baseline_type": "no_skill",
  "previous_benchmark": null,
  "thresholds": {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0
  }
}
```
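A minimal sketch of loading this config, assuming the schema above; the `load_benchmark_config` helper and its default-filling behavior are illustrative, not part of skill-forge itself:

```python
import json
from pathlib import Path

# Defaults mirror the threshold values shown in the schema example above.
DEFAULT_THRESHOLDS = {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0,
}

def load_benchmark_config(path: str) -> dict:
    """Load a benchmark config and fill in defaults for missing keys."""
    config = json.loads(Path(path).read_text())
    config.setdefault("trials_per_eval", 3)
    config.setdefault("baseline_type", "no_skill")
    config.setdefault("previous_benchmark", None)
    # Merge user thresholds over the defaults so partial configs still work.
    config["thresholds"] = {**DEFAULT_THRESHOLDS, **config.get("thresholds", {})}
    return config
```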
For each eval, run trials_per_eval times (default: 3) to get reliable metrics, saving timing.json and grading.json for each trial. Use agents/skill-forge-executor.md for parallel execution where possible. A minimal trial-runner sketch follows.
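In this sketch, `run_eval` is a hypothetical callable standing in for the sub-agent invocation; only the timing.json and grading.json file names come from the workflow above:

```python
import json
import time
from pathlib import Path

def run_trials(run_eval, eval_case: dict, trials: int, out_dir: Path) -> None:
    """Run one eval `trials` times, persisting per-trial artifacts."""
    for trial in range(trials):
        trial_dir = out_dir / f"trial-{trial}"
        trial_dir.mkdir(parents=True, exist_ok=True)
        start = time.monotonic()
        # Assumed return shape: {"passed": bool, "tokens": int}
        result = run_eval(eval_case)
        duration = time.monotonic() - start
        (trial_dir / "timing.json").write_text(
            json.dumps({"duration_seconds": duration, "tokens": result.get("tokens")})
        )
        (trial_dir / "grading.json").write_text(
            json.dumps({"passed": result.get("passed")})
        )
```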
Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>` to aggregate trial results.
Output benchmark.json schema:
```json
{
  "skill_name": "my-skill",
  "iteration": 1,
  "timestamp": "2026-03-06T12:00:00Z",
  "summary": {
    "total_evals": 10,
    "with_skill": {
      "pass_rate": 0.87,
      "pass_rate_std": 0.05,
      "avg_tokens": 45000,
      "avg_duration_seconds": 34.2
    },
    "baseline": {
      "pass_rate": 0.60,
      "pass_rate_std": 0.08,
      "avg_tokens": 62000,
      "avg_duration_seconds": 52.1
    },
    "improvement_ratio": 1.45,
    "token_savings_ratio": 0.73,
    "time_savings_ratio": 0.66
  },
  "per_eval": [
    {
      "eval_id": 0,
      "eval_name": "basic-trigger",
      "with_skill": {"pass_rate": 1.0, "avg_tokens": 30000, "avg_duration_seconds": 20.1},
      "baseline": {"pass_rate": 0.67, "avg_tokens": 50000, "avg_duration_seconds": 45.0},
      "trials": 3
    }
  ],
  "thresholds_met": {
    "min_pass_rate": true,
    "max_avg_tokens": true,
    "max_avg_duration_seconds": true,
    "min_improvement_ratio": true
  }
}
```
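The summary statistics can be derived from per-trial results as sketched below. The ratio definitions are inferred from the example values (0.87 / 0.60 = 1.45; 45000 / 62000 ≈ 0.73; 34.2 / 52.1 ≈ 0.66) and are assumptions, as is the use of population standard deviation:

```python
import statistics

def summarize(trial_passes: list[bool], tokens: list[int], durations: list[float]) -> dict:
    """Collapse per-trial results into the per-condition summary fields."""
    passes = [1.0 if p else 0.0 for p in trial_passes]
    return {
        "pass_rate": statistics.mean(passes),
        "pass_rate_std": statistics.pstdev(passes),
        "avg_tokens": statistics.mean(tokens),
        "avg_duration_seconds": statistics.mean(durations),
    }

def ratios(with_skill: dict, baseline: dict) -> dict:
    """Relative metrics; assumes non-zero baseline values in each dimension."""
    return {
        "improvement_ratio": with_skill["pass_rate"] / baseline["pass_rate"],
        "token_savings_ratio": with_skill["avg_tokens"] / baseline["avg_tokens"],
        "time_savings_ratio": with_skill["avg_duration_seconds"] / baseline["avg_duration_seconds"],
    }
```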
If previous_benchmark is provided or a prior iteration-<N-1> exists, compare the current benchmark.json against it and generate a version comparison report:

```markdown
# Benchmark Report: [skill-name]
## Iteration [N] vs [N-1]
### Summary
| Metric | Current | Previous | Delta | Threshold | Status |
|--------|---------|----------|-------|-----------|--------|
| Pass Rate | 87% | 78% | +9% | >= 80% | PASS |
| Avg Tokens | 45K | 52K | -13% | <= 100K | PASS |
| Avg Time | 34s | 41s | -17% | <= 120s | PASS |
| Improvement | 1.45x | 1.30x | +0.15x | >= 1.0x | PASS |
### Regressions (Action Required)
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-5 | PASS | FAIL | Output missing required section |
### Improvements
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-3 | FAIL | PASS | Error handling now works |
### Per-Eval Detail
[Full breakdown table]
### Variance Analysis
| Eval | Pass Rate | Std Dev | Trials | Reliability |
|------|-----------|---------|--------|-------------|
| eval-0 | 100% | 0.00 | 3 | High |
| eval-1 | 67% | 0.47 | 3 | Low (investigate) |
### Recommendations
[Based on regressions, low-reliability evals, and threshold failures]
```
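The Regressions and Improvements tables can be derived by diffing per_eval entries between the two benchmark files. A minimal sketch; the 0.5 pass-rate cutoff for calling a result PASS is an assumption:

```python
def diff_evals(current: dict, previous: dict, pass_cutoff: float = 0.5):
    """Return (regressions, improvements) as lists of eval names."""
    prev_by_name = {e["eval_name"]: e for e in previous["per_eval"]}
    regressions, improvements = [], []
    for cur in current["per_eval"]:
        prev = prev_by_name.get(cur["eval_name"])
        if prev is None:
            continue  # eval added this iteration; nothing to compare against
        was_pass = prev["with_skill"]["pass_rate"] >= pass_cutoff
        now_pass = cur["with_skill"]["pass_rate"] >= pass_cutoff
        if was_pass and not now_pass:
            regressions.append(cur["eval_name"])
        elif now_pass and not was_pass:
            improvements.append(cur["eval_name"])
    return regressions, improvements
```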
If any threshold fails, recommend /skill-forge evolve to address the issues. Record "trials_completed" vs "trials_requested" in per-eval results, and mark high-variance evals as "unreliable" in the report.
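A sketch of the threshold check and the "unreliable" flag; the 0.3 standard-deviation cutoff is an assumption chosen to match the example variance table (0.00 is High reliability, 0.47 is Low):

```python
def check_thresholds(summary: dict, thresholds: dict) -> dict:
    """Mirror of the thresholds_met block in benchmark.json."""
    ws = summary["with_skill"]
    return {
        "min_pass_rate": ws["pass_rate"] >= thresholds["min_pass_rate"],
        "max_avg_tokens": ws["avg_tokens"] <= thresholds["max_avg_tokens"],
        "max_avg_duration_seconds": ws["avg_duration_seconds"] <= thresholds["max_avg_duration_seconds"],
        "min_improvement_ratio": summary["improvement_ratio"] >= thresholds["min_improvement_ratio"],
    }

def unreliable_evals(pass_rate_stds: dict[str, float], cutoff: float = 0.3) -> list[str]:
    """Evals whose pass-rate spread exceeds the cutoff get flagged."""
    return [name for name, std in pass_rate_stds.items() if std > cutoff]
```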