Help us improve
Share bugs, ideas, or general feedback.
From builder-product
Use before running any experiment that will inform a product decision. All six elements — hypothesis, metric, sample size, duration, stopping rule, and decision rule — must be defined before the test starts. Blocks "we'll stop when we see something" completions.
npx claudepluginhub rbraga01/a-team --plugin builder-productHow this skill is triggered — by the user, by Claude, or both
Slash command
/builder-product:ab-test-designThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
```
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
Share bugs, ideas, or general feedback.
A TEST WITHOUT A STOPPING RULE RUNS UNTIL THE RESULT LOOKS GOOD.
"We'll stop when we see something significant" is peeking — it inflates false positive rates until the next meaningless spike ships to production.
Hypothesis + metric + sample size + duration + stopping rule + decision rule IS an experiment.
Trigger before:
One testable claim in the form: If [change], then [metric] will [increase/decrease] by [magnitude] because [mechanism].
If we [specific change to control],
then [primary metric] will [direction] by [minimum detectable effect],
because [causal mechanism].
What does NOT count:
The mechanism matters. A hypothesis without a mechanism cannot be learned from even if it's correct — you won't know why.
The single metric that defines success or failure for this test.
Secondary metrics (guardrails and diagnostics — max 5): watched but do not determine the decision.
Calculated before the test starts. Not estimated — calculated.
Parameters:
- Baseline conversion rate: X%
- Minimum Detectable Effect (MDE): Y% relative improvement
- Statistical power: 80% (standard) or 90% (high-stakes decisions)
- Significance level (α): 0.05 (two-tailed)
Required sample per variant: [use a power calculator or scipy.stats]
If the required sample size exceeds your daily traffic in a reasonable timeframe, either:
Do not reduce the sample size to make the test run faster. An underpowered test is not a test.
Minimum test duration = sample size / daily eligible users, rounded up to the nearest full week.
Rules:
If the calculated duration exceeds 6 weeks, the test is not feasible at current traffic. Either increase the MDE or find a higher-traffic surface.
When will you stop the test, and under what conditions?
Stopping rule:
- Fixed duration: stop after [N days], regardless of interim results
- Early stopping: stop early only if [p < 0.001 and sample size ≥ 75% of target]
(sequential testing / alpha spending — not raw significance peeking)
Interim checks: [none / weekly / at 50% and 100% of target sample size]
No peeking without a pre-specified alpha spending function. Every unplanned interim check at α=0.05 inflates the true false positive rate by approximately 20% per check.
What will you do with the result? Defined before the test starts.
If primary metric increases by ≥ MDE at p < 0.05 AND no guardrail breached: SHIP
If primary metric is flat or negative at full duration: NO SHIP
If primary metric increases but guardrail breached: INVESTIGATE — do not ship without resolution
If test ends with insufficient power (traffic less than expected): EXTEND or REDESIGN
"We'll discuss when we see the results" is not a decision rule. It reintroduces discretion at the point where data should be determining the outcome.
One sentence with all four elements: change, metric, direction + magnitude, mechanism.
Primary metric with baseline. Guardrails with tolerance thresholds.
Use actual baseline data and defined MDE. Do not use online calculators with default inputs — use your actual baseline rate.
Divide sample size by daily eligible traffic. Round up to full weeks. Verify it doesn't exceed 6 weeks.
Fixed duration with explicit early-stopping criteria if needed. No open-ended "we'll stop when significant."
All four outcomes (ship, no ship, guardrail breach, insufficient power) mapped to actions.
Store at product/experiments/<feature>-design-<date>.md before any traffic is split.
These thoughts mean the test is not designed — stop:
When ab-test-design is satisfied, state it like this:
Experiment designed.
File: product/experiments/<feature>-design-<date>.md ✓
Hypothesis: If [change], then [metric] will [direction] by [MDE] because [mechanism] ✓
Primary metric: [name] — baseline: X% (from [data source], last [N] days) ✓
Guardrails: <N metrics with tolerance thresholds listed>
Sample size: <N per variant> (power: 80%/90%, α: 0.05, MDE: Y%) ✓
Duration: <N days> = <sample size> ÷ <daily eligible traffic: M users/day> ✓
[Minimum 7 days / within 6-week maximum ✓]
Stopping rule: fixed at <N days>; early stop if p < 0.001 at ≥ 75% sample ✓
Decision rule:
Ship condition: ≥ MDE at p < 0.05, no guardrail breached ✓
No-ship condition: flat or negative at full duration ✓
Guardrail breach: investigate before decision ✓
Insufficient power: extend or redesign ✓
Sample size and duration must be calculated, not estimated. A decision rule must exist for all four outcomes.
Underpowered tests ship features based on noise. Peeked tests ship features based on chance peaks. Tests without decision rules ship features based on whoever argues most convincingly after the results come in. A properly designed experiment is the only way to distinguish a real 3% improvement from a random fluctuation at that magnitude — and 3% improvements compound into meaningful product quality over time.