Help us improve
Share bugs, ideas, or general feedback.
From godmode
Benchmarks shell metric command N times across 2-3 git refs or repo states, checks variance, computes deltas vs baseline, outputs reproducible TSV table and summary. For honest code variant comparisons.
npx claudepluginhub arbazkhan971/godmodeHow this skill is triggered — by the user, by Claude, or both
Slash command
/godmode:benchThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- `/godmode:bench`, "benchmark", "compare variants", "A/B metric"
Creates and runs reliable benchmarks to measure code change impacts on performance, including latency, throughput. Supports Node.js (vitest, tinybench), Python (pytest-benchmark), frontend (Lighthouse CI), with warmup, stats.
Use when a backpressured loop needs to run benchmarks on a performance-sensitive project and decide whether a change is a regression, an improvement, or a wash — per-iteration sanity checks and the full pre-done run.
Designs structured benchmarks comparing algorithms, models, or implementations with metrics, test cases, hardware context, and tradeoff analysis to produce reproducible performance reports.
Share bugs, ideas, or general feedback.
/godmode:bench, "benchmark", "compare variants", "A/B metric"optimize for that. bench measures, never modifies source.Ask once, cache for the session:
metric_cmd — shell command printing ONE number to stdout (lower-is-better or higher-is-better, user declares direction).variants — list of 2-3 entries. Each variant is a record:
name — short label (e.g. main, terse_on, feature_branch)prep — EITHER a git ref (git checkout <ref>) OR an inline shell command that puts the repo into the desired state (e.g. export GODMODE_TERSE=1, git checkout feature-branch)teardown — optional shell command to undo prep (default: git checkout - for refs, unset VAR for env)baseline — name of the variant all deltas are computed against. Must match one variants[].name.N — runs per variant. Default 5. Minimum 3. Maximum 20.variance_threshold — stdev/mean ratio that flags a variant noisy. Default 0.05 (5%).metric_cmd must emit a single number; baseline must match a variant; N >= 3.HEAD sha, branch, dirty bit). Refuse to run if working tree is dirty.prep. Abort variant on non-zero exit, mark prep_failed.
b. Run metric_cmd N times. Collect numbers into an array.
c. Compute mean, median, stdev, cv = stdev/mean.
d. IF cv > variance_threshold: retry the FULL N-run block up to 3 times (variance recovery).
e. IF cv still > threshold after 3 retries: mark variant measurement_error, record best-effort numbers.
f. Run teardown. Fail loud if teardown leaves repo dirty.git rev-parse HEAD.delta_pct for each variant vs. baseline: (variant.median - baseline.median) / baseline.median * 100.mean = sum(runs) / N
median = sorted(runs)[N//2] # for even N, average of middle two
stdev = sqrt(sum((x - mean)^2) / (N - 1))
cv = stdev / mean
delta% = (variant.median - baseline.median) / baseline.median * 100
All computed with awk — no python, no bc, no external deps.
Print to stdout a markdown table:
| variant | runs | mean | median | stdev | delta% | status |
|------------|------|---------|---------|--------|---------|----------|
| main | 5 | 124.40 | 124.00 | 2.10 | 0.00 | baseline |
| terse_on | 5 | 98.60 | 99.00 | 1.80 | -20.16 | ok |
| feature | 5 | 131.20 | 130.00 | 12.40 | +4.84 | noisy* |
Append one row per variant to .godmode/bench-results.tsv:
timestamp run_id variant N mean median stdev cv delta_pct status metric_cmd git_ref
run_id is a UUID or date +%s — groups all variants from a single invocation.
Write a one-paragraph summary to .godmode/bench-summary.md, overwriting any prior version:
# Bench <run_id> — <timestamp>
Ran `<metric_cmd>` N=<N> times across <k> variants. Baseline: <baseline>.
Best: <winner> at <median> (<delta>% vs baseline). Worst: <loser> at <median> (<delta>%).
Noisy: <list or "none">. All variants measured from clean HEAD <sha>. Reproduce: re-run
`/godmode:bench` with identical variants file.
.godmode/bench-results.tsv and .godmode/bench-summary.md. Touching any source file = abort + fail loud.git status --porcelain must be empty.cv > threshold ALWAYS triggers retry, even if the delta looks conclusive.NaN and count it against N; never silently drop.run_id with message bench: <run_id> <k> variants.bash, awk, git, and standard Unix utilities. No python, no jq, no node.metric_cmd runs with set -e; set -o pipefail. Non-zero exit = failed run, not a zero measurement.FOR variant in variants:
runs = collect(N)
retries = 0
WHILE cv(runs) > threshold AND retries < 3:
runs = collect(N) # full fresh block, NOT append
retries += 1
IF cv(runs) > threshold:
status = "measurement_error"
ELSE:
status = "ok"
Each retry is a fresh N-run block; never mix runs from different retry attempts.
KEEP a variant's measurement if:
- N valid numeric samples collected
- cv <= variance_threshold (possibly after retries)
- prep and teardown both exited 0
- git state restored cleanly between variants
DISCARD a variant's measurement if:
- prep failed (record prep_failed, do NOT retry)
- >3 retries still noisy (record measurement_error, keep best-effort row, mark unreliable)
- metric_cmd produced NaN for >=N/2 runs
- teardown left repo dirty (abort entire run, unsafe to continue)
On DISCARD of a whole run: git reset --hard to the snapshotted starting sha. Summary notes
the abort reason. TSV still gets partial rows with status=aborted.
STOP when FIRST of:
- all_measured: every variant has status in {ok, measurement_error, prep_failed}
- variance_unrecoverable: >=ceil(k/2) variants are measurement_error (comparison is meaningless)
- budget_exhausted: total metric_cmd invocations > N * k * 4 (N runs * k variants * 4 retries)
- unsafe_state: teardown failed or HEAD drifted mid-run
On stop: always write summary.md and commit results.tsv, even on abort.
bench-results.tsv has exactly one row per variant in the run (k rows per run_id).delta_pct value (0.00 for the baseline row).measurement_error unless cv truly exceeded threshold after 3 retries.bench-summary.md names the winner, loser, baseline, and flags any noisy variants.git rev-parse HEAD after the run matches the snapshotted starting sha.| Failure | Action |
|---|---|
| Dirty working tree at start | Abort before first variant. Tell user to stash or commit. |
metric_cmd non-numeric | Pipe through tail -1 | awk '{print $NF}'. If still non-numeric, record NaN. |
prep fails for a variant | Mark prep_failed, skip runs, continue to next variant. |
| Variance unrecoverable after 3 retries | Mark measurement_error, keep row, emit warning in summary. |
| HEAD drift mid-run | Abort. git reset --hard <starting_sha>. Refuse to emit comparison. |
teardown leaves repo dirty | Abort entire run. Do not proceed to next variant. |
.godmode/bench-results.tsv is append-only. Header written on first create:
timestamp run_id variant N mean median stdev cv delta_pct status metric_cmd git_ref
status is one of: baseline, ok, measurement_error, prep_failed, aborted.
Never rewrite history. One run_id groups k rows. Commit every run.