Skill

running-benchmarks

Use when a backpressured loop needs to run benchmarks on a performance-sensitive project and decide whether a change is a regression, an improvement, or a wash — per-iteration sanity checks and the full pre-done run.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/backpressured:running-benchmarks

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

**You are the machine the producer ran so a perf regression wouldn't reach a human — or production — first.** Benchmarks are backpressure: they say "no" to a slow patch at the boundary where it was introduced, not three releases later when someone files a latency bug.

SKILL.md

77 lines · ~1.5k tokens

Stats

LanguageJavaScript

Stars22

MaintenanceExcellent

Last CommitMay 31, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Running Benchmarks

Overview

You are the machine the producer ran so a perf regression wouldn't reach a human — or production — first. Benchmarks are backpressure: they say "no" to a slow patch at the boundary where it was introduced, not three releases later when someone files a latency bug.

Core principle: a benchmark number is meaningless on its own. A single "after" run proves nothing. A verdict requires a baseline to compare against and a way to separate a real change from run-to-run noise. Everything below exists to enforce those two things, because the natural temptation is to run once and eyeball it — and that is worthless.

When to Use

A backpressured loop is on a performance-sensitive project and needs to confirm a patch didn't regress perf — a quick targeted check each iteration, and the full suite before leaving the loop.

Not for: projects with no benchmark suite or no perf sensitivity — skip benchmarking entirely rather than inventing it. This is not a correctness check; tests cover that.

Step 1 — Discover the suites and their time budgets

If the project has a BACKPRESSURE.md, check it first — it may name the suite(s) and exact commands to use. Otherwise find the commands yourself — package.json scripts, Makefile, a benchmarks//bench/ dir, CI config, README. Identify which suites exist and roughly how long each takes. Expect a spread: fast micro/targeted suites (seconds) and full end-to-end suites (minutes). Note what each one actually measures so you can map a suite to the code path you touched.

Step 2 — Pick the suite for the moment

Each iteration: run the fast, targeted suite that covers the path you changed. It must be cheap enough to run after every patch without rabbit-holing.
Before done: run the full suite once. Targeted-green is necessary, not sufficient — a change can help one path and regress another.

Don't run an 8-minute end-to-end suite to check a change that only touches one micro path; don't declare done off a micro suite alone.

Step 3 — Always measure baseline vs patched

Never judge from the patched numbers alone. Run the unchanged baseline (git stash, or check out the parent commit / a worktree) and your change back-to-back, on the same machine under the same conditions. The baseline is the only thing that makes the "after" number mean anything.

Quiet the machine first — close background load, don't run benchmarks mid-build or while other heavy processes compete. Uncontrolled machine state is the largest real source of noise, and "same conditions" means actually the same, not just same hardware.

Step 4 — Beat the noise with repetition

Benchmark output is noisy (often ±3–5% run to run). One run per side is not evidence.

Run each side multiple times (≥5 when the suite is cheap).
Compare the median (and the min for CPU-bound microbenchmarks — least contended, most stable), not a single sample.
Establish the noise band from the spread of the repeated runs (or a known per-suite figure). The band is your threshold for everything below.

Step 5 — The verdict: regression / improvement / wash

Verdict	Condition	Action
Wash (neutral)	Patched median within the noise band of baseline; distributions overlap	Passing. Move on.
Improvement	Patched consistently faster, gap clears the noise band	Passing. Note the win.
Regression	Patched consistently slower, gap clears the noise band, distributions don't overlap	FAIL. Fix it, or stop and name the blocker — never wave it through.
Ambiguous	Result straddles the band (e.g. ~noise ± a little)	Not a verdict yet. Add runs (≥10/side) and/or run the full suite to resolve before deciding.

If a suspicious result still won't separate from noise after more runs — a small true shift that the suite genuinely can't resolve — do not default it to a pass. Treat it as a blocker and name it (the change may need a less noisy benchmark, or a human call), exactly as you would any other unresolved check.

Step 6 — Record the evidence

"No regression" asserted with no numbers is not a passing check. Record baseline vs patched, run count, and the median/min you compared. Evidence before assertion — the same standard the rest of the loop holds you to.

Common rationalizations

Rationalization	Reality
"I ran it once, the number looks fine"	One run inside a ±5% noise band tells you nothing. Baseline + repeats or it doesn't count.
"It's a tiny bit slower, probably noise"	"Probably noise" is a measurement you didn't take. If it straddles the band, add runs until it resolves.
"The micro suite is green, we're done"	Micro-green can hide a full-suite regression. Run the full suite before done.
"Running the baseline too is double the work"	Without a baseline the patched number is uninterpretable. Half the work producing nothing is not a saving.
"I'll just trust the full suite once at the end"	A regression found at the end is a bisect. Run the targeted suite each iteration.

Red flags — STOP

Reporting a benchmark result with no baseline to compare against.
Declaring "no regression" with no recorded numbers.
Calling an ambiguous straddle-the-band result a pass instead of adding runs.
Skipping the full suite before done because a targeted suite was green.
Treating a real, beyond-noise slowdown as acceptable without fixing it or naming the blocker.

running-benchmarks

Popularity

Invocation

Context Preview

SKILL.md

running-benchmarks

Popularity

Invocation

Context Preview

SKILL.md

Running Benchmarks

Overview

When to Use

Step 1 — Discover the suites and their time budgets

Step 2 — Pick the suite for the moment

Step 3 — Always measure baseline vs patched

Step 4 — Beat the noise with repetition

Step 5 — The verdict: regression / improvement / wash

Step 6 — Record the evidence

Common rationalizations

Red flags — STOP

Similar Skills

Running Benchmarks

Overview

When to Use

Step 1 — Discover the suites and their time budgets

Step 2 — Pick the suite for the moment

Step 3 — Always measure baseline vs patched

Step 4 — Beat the noise with repetition

Step 5 — The verdict: regression / improvement / wash

Step 6 — Record the evidence

Common rationalizations

Red flags — STOP

Similar Skills