Benchmarks Claude Code skill performance via multiple trials per eval, tracking pass rate, execution time, token usage, and variance. Aggregates to benchmark.json and generates version comparison reports. Use for 'benchmark skill' or performance tracking queries.
Install with `npx claudepluginhub agricidaniel/skill-forge`. This skill uses the workspace's default tool permissions.
Measure and compare skill performance across iterations with statistical rigor using multiple trials, variance analysis, and trend tracking.
Accept configuration as:
- evals/evals.json (from /skill-forge eval)

Benchmark config schema:
```json
{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "eval_set_path": "./evals/evals.json",
  "trials_per_eval": 3,
  "baseline_type": "no_skill",
  "previous_benchmark": null,
  "thresholds": {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0
  }
}
```
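A minimal sketch of loading this config, assuming the schema above; the `load_benchmark_config` helper and its default-filling behavior are illustrative, not part of skill-forge itself:

```python
import json
from pathlib import Path

# Defaults mirror the threshold values shown in the schema example above.
DEFAULT_THRESHOLDS = {
    "min_pass_rate": 0.8,
    "max_avg_tokens": 100000,
    "max_avg_duration_seconds": 120,
    "min_improvement_ratio": 1.0,
}

def load_benchmark_config(path: str) -> dict:
    """Load a benchmark config and fill in defaults for missing keys."""
    config = json.loads(Path(path).read_text())
    config.setdefault("trials_per_eval", 3)
    config.setdefault("baseline_type", "no_skill")
    config.setdefault("previous_benchmark", None)
    # Merge user thresholds over the defaults so partial configs still work.
    config["thresholds"] = {**DEFAULT_THRESHOLDS, **config.get("thresholds", {})}
    return config
```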
For each eval, run trials_per_eval times (default: 3) to get reliable metrics, saving timing.json and grading.json for each trial. Use agents/skill-forge-executor.md for parallel execution where possible. A minimal trial-runner sketch follows.
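In this sketch, `run_eval` is a hypothetical callable standing in for the sub-agent invocation; only the timing.json and grading.json file names come from the workflow above:

```python
import json
import time
from pathlib import Path

def run_trials(run_eval, eval_case: dict, trials: int, out_dir: Path) -> None:
    """Run one eval `trials` times, persisting per-trial artifacts."""
    for trial in range(trials):
        trial_dir = out_dir / f"trial-{trial}"
        trial_dir.mkdir(parents=True, exist_ok=True)
        start = time.monotonic()
        # Assumed return shape: {"passed": bool, "tokens": int}
        result = run_eval(eval_case)
        duration = time.monotonic() - start
        (trial_dir / "timing.json").write_text(
            json.dumps({"duration_seconds": duration, "tokens": result.get("tokens")})
        )
        (trial_dir / "grading.json").write_text(
            json.dumps({"passed": result.get("passed")})
        )
```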
Run `python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>` to aggregate trial results.
Output benchmark.json schema:
```json
{
  "skill_name": "my-skill",
  "iteration": 1,
  "timestamp": "2026-03-06T12:00:00Z",
  "summary": {
    "total_evals": 10,
    "with_skill": {
      "pass_rate": 0.87,
      "pass_rate_std": 0.05,
      "avg_tokens": 45000,
      "avg_duration_seconds": 34.2
    },
    "baseline": {
      "pass_rate": 0.60,
      "pass_rate_std": 0.08,
      "avg_tokens": 62000,
      "avg_duration_seconds": 52.1
    },
    "improvement_ratio": 1.45,
    "token_savings_ratio": 0.73,
    "time_savings_ratio": 0.66
  },
  "per_eval": [
    {
      "eval_id": 0,
      "eval_name": "basic-trigger",
      "with_skill": {"pass_rate": 1.0, "avg_tokens": 30000, "avg_duration_seconds": 20.1},
      "baseline": {"pass_rate": 0.67, "avg_tokens": 50000, "avg_duration_seconds": 45.0},
      "trials": 3
    }
  ],
  "thresholds_met": {
    "min_pass_rate": true,
    "max_avg_tokens": true,
    "max_avg_duration_seconds": true,
    "min_improvement_ratio": true
  }
}
```
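The summary statistics can be derived from per-trial results as sketched below. The ratio definitions are inferred from the example values (0.87 / 0.60 = 1.45; 45000 / 62000 ≈ 0.73; 34.2 / 52.1 ≈ 0.66) and are assumptions, as is the use of population standard deviation:

```python
import statistics

def summarize(trial_passes: list[bool], tokens: list[int], durations: list[float]) -> dict:
    """Collapse per-trial results into the per-condition summary fields."""
    passes = [1.0 if p else 0.0 for p in trial_passes]
    return {
        "pass_rate": statistics.mean(passes),
        "pass_rate_std": statistics.pstdev(passes),
        "avg_tokens": statistics.mean(tokens),
        "avg_duration_seconds": statistics.mean(durations),
    }

def ratios(with_skill: dict, baseline: dict) -> dict:
    """Relative metrics; assumes non-zero baseline values in each dimension."""
    return {
        "improvement_ratio": with_skill["pass_rate"] / baseline["pass_rate"],
        "token_savings_ratio": with_skill["avg_tokens"] / baseline["avg_tokens"],
        "time_savings_ratio": with_skill["avg_duration_seconds"] / baseline["avg_duration_seconds"],
    }
```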
If previous_benchmark is provided or a prior iteration-<N-1> exists, compare the current benchmark.json against it and generate a version comparison report:

```markdown
# Benchmark Report: [skill-name]
## Iteration [N] vs [N-1]
### Summary
| Metric | Current | Previous | Delta | Threshold | Status |
|--------|---------|----------|-------|-----------|--------|
| Pass Rate | 87% | 78% | +9% | >= 80% | PASS |
| Avg Tokens | 45K | 52K | -13% | <= 100K | PASS |
| Avg Time | 34s | 41s | -17% | <= 120s | PASS |
| Improvement | 1.45x | 1.30x | +0.15x | >= 1.0x | PASS |
### Regressions (Action Required)
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-5 | PASS | FAIL | Output missing required section |
### Improvements
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-3 | FAIL | PASS | Error handling now works |
### Per-Eval Detail
[Full breakdown table]
### Variance Analysis
| Eval | Pass Rate | Std Dev | Trials | Reliability |
|------|-----------|---------|--------|-------------|
| eval-0 | 100% | 0.00 | 3 | High |
| eval-1 | 67% | 0.47 | 3 | Low (investigate) |
### Recommendations
[Based on regressions, low-reliability evals, and threshold failures]
```
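The Regressions and Improvements tables can be derived by diffing per_eval entries between the two benchmark files. A minimal sketch; the 0.5 pass-rate cutoff for calling a result PASS is an assumption:

```python
def diff_evals(current: dict, previous: dict, pass_cutoff: float = 0.5):
    """Return (regressions, improvements) as lists of eval names."""
    prev_by_name = {e["eval_name"]: e for e in previous["per_eval"]}
    regressions, improvements = [], []
    for cur in current["per_eval"]:
        prev = prev_by_name.get(cur["eval_name"])
        if prev is None:
            continue  # eval added this iteration; nothing to compare against
        was_pass = prev["with_skill"]["pass_rate"] >= pass_cutoff
        now_pass = cur["with_skill"]["pass_rate"] >= pass_cutoff
        if was_pass and not now_pass:
            regressions.append(cur["eval_name"])
        elif now_pass and not was_pass:
            improvements.append(cur["eval_name"])
    return regressions, improvements
```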
If any threshold fails, recommend /skill-forge evolve to address the issues. Record "trials_completed" vs "trials_requested" in per-eval results, and mark high-variance evals as "unreliable" in the report.
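A sketch of the threshold check and the "unreliable" flag; the 0.3 standard-deviation cutoff is an assumption chosen to match the example variance table (0.00 is High reliability, 0.47 is Low):

```python
def check_thresholds(summary: dict, thresholds: dict) -> dict:
    """Mirror of the thresholds_met block in benchmark.json."""
    ws = summary["with_skill"]
    return {
        "min_pass_rate": ws["pass_rate"] >= thresholds["min_pass_rate"],
        "max_avg_tokens": ws["avg_tokens"] <= thresholds["max_avg_tokens"],
        "max_avg_duration_seconds": ws["avg_duration_seconds"] <= thresholds["max_avg_duration_seconds"],
        "min_improvement_ratio": summary["improvement_ratio"] >= thresholds["min_improvement_ratio"],
    }

def unreliable_evals(pass_rate_stds: dict[str, float], cutoff: float = 0.3) -> list[str]:
    """Evals whose pass-rate spread exceeds the cutoff get flagged."""
    return [name for name, std in pass_rate_stds.items() if std > cutoff]
```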