Help us improve
Share bugs, ideas, or general feedback.
From backpressured
Use when a backpressured loop needs to run benchmarks on a performance-sensitive project and decide whether a change is a regression, an improvement, or a wash — per-iteration sanity checks and the full pre-done run.
npx claudepluginhub lucasfcosta/backpressured --plugin backpressuredHow this skill is triggered — by the user, by Claude, or both
Slash command
/backpressured:running-benchmarksThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**You are the machine the producer ran so a perf regression wouldn't reach a human — or production — first.** Benchmarks are backpressure: they say "no" to a slow patch at the boundary where it was introduced, not three releases later when someone files a latency bug.
Creates and runs reliable benchmarks to measure code change impacts on performance, including latency, throughput. Supports Node.js (vitest, tinybench), Python (pytest-benchmark), frontend (Lighthouse CI), with warmup, stats.
Benchmarks shell metric command N times across 2-3 git refs or repo states, checks variance, computes deltas vs baseline, outputs reproducible TSV table and summary. For honest code variant comparisons.
Enforces sequential performance benchmarks with 60s min runs (30s binary search), 10s warmup, JSON metrics output, and anomaly reruns. For baselines and regressions.
Share bugs, ideas, or general feedback.
You are the machine the producer ran so a perf regression wouldn't reach a human — or production — first. Benchmarks are backpressure: they say "no" to a slow patch at the boundary where it was introduced, not three releases later when someone files a latency bug.
Core principle: a benchmark number is meaningless on its own. A single "after" run proves nothing. A verdict requires a baseline to compare against and a way to separate a real change from run-to-run noise. Everything below exists to enforce those two things, because the natural temptation is to run once and eyeball it — and that is worthless.
backpressured loop is on a performance-sensitive project and needs to confirm a patch didn't regress perf — a quick targeted check each iteration, and the full suite before leaving the loop.Not for: projects with no benchmark suite or no perf sensitivity — skip benchmarking entirely rather than inventing it. This is not a correctness check; tests cover that.
If the project has a BACKPRESSURE.md, check it first — it may name the suite(s) and exact commands to use. Otherwise find the commands yourself — package.json scripts, Makefile, a benchmarks//bench/ dir, CI config, README. Identify which suites exist and roughly how long each takes. Expect a spread: fast micro/targeted suites (seconds) and full end-to-end suites (minutes). Note what each one actually measures so you can map a suite to the code path you touched.
Don't run an 8-minute end-to-end suite to check a change that only touches one micro path; don't declare done off a micro suite alone.
Never judge from the patched numbers alone. Run the unchanged baseline (git stash, or check out the parent commit / a worktree) and your change back-to-back, on the same machine under the same conditions. The baseline is the only thing that makes the "after" number mean anything.
Quiet the machine first — close background load, don't run benchmarks mid-build or while other heavy processes compete. Uncontrolled machine state is the largest real source of noise, and "same conditions" means actually the same, not just same hardware.
Benchmark output is noisy (often ±3–5% run to run). One run per side is not evidence.
| Verdict | Condition | Action |
|---|---|---|
| Wash (neutral) | Patched median within the noise band of baseline; distributions overlap | Passing. Move on. |
| Improvement | Patched consistently faster, gap clears the noise band | Passing. Note the win. |
| Regression | Patched consistently slower, gap clears the noise band, distributions don't overlap | FAIL. Fix it, or stop and name the blocker — never wave it through. |
| Ambiguous | Result straddles the band (e.g. ~noise ± a little) | Not a verdict yet. Add runs (≥10/side) and/or run the full suite to resolve before deciding. |
If a suspicious result still won't separate from noise after more runs — a small true shift that the suite genuinely can't resolve — do not default it to a pass. Treat it as a blocker and name it (the change may need a less noisy benchmark, or a human call), exactly as you would any other unresolved check.
"No regression" asserted with no numbers is not a passing check. Record baseline vs patched, run count, and the median/min you compared. Evidence before assertion — the same standard the rest of the loop holds you to.
| Rationalization | Reality |
|---|---|
| "I ran it once, the number looks fine" | One run inside a ±5% noise band tells you nothing. Baseline + repeats or it doesn't count. |
| "It's a tiny bit slower, probably noise" | "Probably noise" is a measurement you didn't take. If it straddles the band, add runs until it resolves. |
| "The micro suite is green, we're done" | Micro-green can hide a full-suite regression. Run the full suite before done. |
| "Running the baseline too is double the work" | Without a baseline the patched number is uninterpretable. Half the work producing nothing is not a saving. |
| "I'll just trust the full suite once at the end" | A regression found at the end is a bisect. Run the targeted suite each iteration. |