From builder-growth
Use before running any growth experiment — pricing test, copy variant, onboarding flow, feature gate — that will inform a ship/no-ship decision. All six elements must be defined before the test starts. Blocks "we'll run it for a while and see" completions.
npx claudepluginhub rbraga01/a-team --plugin builder-growth
Slash command: /builder-growth:experiment-design
AN EXPERIMENT WITHOUT A STOPPING RULE IS NOT AN EXPERIMENT — IT IS A FEATURE WAITING FOR PERMISSION.
"We'll run it for a while and see" produces results that stop when the team wants them to stop — which is when they look good.
Hypothesis + metric + sample size + duration + stopping rule + decision rule IS an experiment.
Trigger before:
user-research-synthesis)
ab-test-design in builder-product (use that skill for product features; use this skill for growth surfaces)
Growth experiments share the same statistical requirements as product experiments. The difference is in the metrics and what "conversion" means.
One sentence with four parts: change, metric, direction + magnitude, mechanism.
If we [specific change to the control experience],
then [primary conversion metric] will [increase/decrease] by at least [MDE]%,
because [causal mechanism — why this change affects this metric].
The mechanism matters for learning. If the test result matches the hypothesis but the mechanism was wrong, you cannot predict whether the same change will work elsewhere.
The single metric that determines ship or no-ship.
Growth-specific metric types:
One primary metric per experiment. Multiple primary metrics require multiple experiments or Bonferroni correction — which you are not doing.
Calculate using your primary metric's baseline rate, the MDE, statistical power (80% standard; 90% for revenue decisions), and α = 0.05.
Baseline rate: X% (from last 30 days of data)
MDE: Y% relative (minimum improvement worth shipping)
Power: 80%
α: 0.05
Required sample per variant: N
If calculating manually: use the formula for two-proportion z-test. Otherwise use a validated power calculator with these exact inputs — not defaults.
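For the manual route, the calculation can be sketched as follows. This is a minimal sketch that assumes a two-sided two-proportion z-test with equal allocation and the pooled-variance form of the formula; the function name and the example inputs (5% baseline, 10% relative MDE) are illustrative assumptions, not values from this skill.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Per-variant sample size for a two-sided two-proportion z-test.

    baseline      -- control conversion rate, e.g. 0.05 for 5%
    relative_mde  -- minimum relative lift worth shipping, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and baseline * (1 + MDE) must be in (0, 1)")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled rate under equal allocation

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline, 10% relative MDE, 80% power, alpha 0.05.
n = sample_size_per_variant(0.05, 0.10)
print(f"Required sample per variant: {n}")
```

Raising power to 90% (the setting named above for revenue decisions) increases the required sample, which is worth checking against available traffic before committing to the design.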
Duration = required sample size ÷ daily eligible traffic, rounded up to complete weeks.
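The duration arithmetic can be sketched as below. It assumes "required sample size" means the total across all variants and that daily eligible traffic is the traffic entering the experiment; the 14-day floor is taken from the minimum shown in the output template, and the function name is hypothetical.

```python
import math


def duration_days(total_sample, daily_eligible_traffic, min_days=14):
    """Days to run: sample / traffic, rounded up to whole weeks, with a floor."""
    if daily_eligible_traffic <= 0:
        raise ValueError("daily_eligible_traffic must be positive")
    raw_days = math.ceil(total_sample / daily_eligible_traffic)
    full_weeks = math.ceil(raw_days / 7) * 7  # round up to complete weeks
    return max(full_weeks, min_days)


# Hypothetical inputs: 2 variants x 31,234 users, 3,000 eligible users per day.
days = duration_days(2 * 31234, 3000)
print(f"Run for {days} days")
```

If the result exceeds 56 days (8 weeks), the design needs a larger MDE or a higher-traffic surface rather than a longer test.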
Growth-specific rules:
When you stop, and what makes you stop early.
Fixed: stop after [N days], regardless of interim results.
Early stop (optional): stop if p < 0.001 AND sample ≥ 80% of target
— applies only with pre-specified sequential testing plan
Interim looks: [none / at 50% and 100% of target] — no open-ended checking
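The stopping rule above can be written down as a check that runs only at the pre-specified looks. This is a sketch, not a sequential testing procedure: the function and parameter names are hypothetical, and the early-stop branch is only valid when a sequential testing plan was specified before launch.

```python
def should_stop(days_elapsed, planned_days, p_value,
                sample_collected, target_sample, has_sequential_plan=False):
    """Return True if the pre-registered stopping rule says to stop now."""
    # Fixed rule: stop at the planned duration, regardless of interim results.
    if days_elapsed >= planned_days:
        return True
    # Early stop is only allowed with a pre-specified sequential plan.
    if not has_sequential_plan:
        return False
    # Conservative early-stop threshold: p < 0.001 AND at least 80% of target.
    return p_value < 0.001 and sample_collected >= 0.8 * target_sample
```

Note that a result of p = 0.03 at 60% of the sample returns False here; a promising interim number is not a reason to stop.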
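Peeking at results and stopping when p < 0.05 pushes the real false positive rate well above 0.05: approximately 0.14 with 5 equally spaced checks on accumulating data, and up to approximately 0.23 if the 5 checks were independent.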
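The size of the inflation depends on how correlated the checks are, and it can be verified by simulation. The sketch below assumes no true effect, equally sized batches of data between looks, and a two-sided z-test at 1.96; it compares the independent-checks bound with looks on accumulating data. All names are illustrative.

```python
import math
import random


def independent_looks_bound(alpha, looks):
    """False positive rate if every look were an independent test."""
    return 1 - (1 - alpha) ** looks


def simulate_peeking(looks=5, trials=20000, z_crit=1.96, seed=0):
    """Share of null experiments that hit |z| > z_crit at any interim look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Each batch contributes one standardized increment under the null.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(k)  # z-statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team "stops when it looks good"
    return false_positives / trials


print(f"Independent-checks bound: {independent_looks_bound(0.05, 5):.3f}")
print(f"Simulated, accumulating data: {simulate_peeking():.3f}")
```

Either way the rate is several times the nominal 0.05, which is why interim looks have to be fixed in advance and paired with a stricter threshold.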
What you do with each possible result, defined before the test:
Ship: primary metric ↑ ≥ MDE at p < 0.05, no guardrail breached
No-ship: flat or negative at full duration
Guardrail breached: investigate — do not ship without understanding the trade-off
Insufficient power: extend or redesign — do not interpret underpowered results
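Because every outcome is mapped in advance, the decision rule can be expressed as a pure function of the results. This is a sketch under assumptions: the order of precedence (guardrail first, then power, then the ship test) and the treatment of a significant lift below the MDE as no-ship are choices made here for illustration, and the names are hypothetical.

```python
def decide(observed_lift, mde, p_value, guardrail_breached, reached_target_sample):
    """Map a finished experiment to one of the four pre-registered outcomes.

    observed_lift and mde are relative lifts in the same units (0.10 = +10%).
    """
    if guardrail_breached:
        return "investigate"          # do not ship without understanding the trade-off
    if not reached_target_sample:
        return "extend-or-redesign"   # underpowered results are not interpreted
    if observed_lift >= mde and p_value < 0.05:
        return "ship"
    return "no-ship"                  # flat, negative, or below the MDE (assumption)


print(decide(observed_lift=0.12, mde=0.10, p_value=0.01,
             guardrail_breached=False, reached_target_sample=True))
```

Writing the rule this way makes it obvious when an outcome has no mapped action, which is the gap the pre-test definition is meant to close.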
"We'll look at the data together and decide" is not a decision rule.
One sentence, all four parts. Run it through funnel-analysis or positioning-audit first if the hypothesis is about a diagnosed leak or a messaging test.
Name it, verify the baseline (last 30 days), and confirm it can be measured within the test duration.
With actual baseline data. Document the inputs — baseline rate, MDE, power, α. If the result requires more traffic than available in 8 weeks, revise the MDE or find a higher-traffic surface.
Sample size ÷ daily traffic. Round up to full weeks. Check against retention measurement requirements.
Fixed duration. Optional early-stop criteria with conservative threshold.
All four outcomes mapped.
Store at growth/experiments/<experiment>-design-<date>.md before any traffic is split.
These thoughts mean the experiment is not designed — stop:
When experiment-design is satisfied, state it like this:
Experiment designed.
File: growth/experiments/<experiment>-design-<date>.md ✓
Hypothesis: If [change], then [metric] will [direction] by [MDE] because [mechanism] ✓
Primary metric: [name] — baseline: X% (from [source], last 30 days) ✓
Guardrails: <N metrics with tolerance thresholds>
Sample size: <N per variant> — inputs: baseline X%, MDE Y%, power 80%, α 0.05 ✓
Duration: <N days> = <sample> ÷ <daily traffic: M> ✓
[Minimum 14 days / within 8-week maximum ✓]
[Retention check: measurement point reached within test duration ✓]
Stopping rule: fixed at <N days>; early stop at p < 0.001 if ≥ 80% sample ✓
Decision rule:
Ship: ≥ MDE, p < 0.05, no guardrail breach ✓
No-ship: flat/negative at full duration ✓
Guardrail breach: investigate ✓
Insufficient power: extend or redesign ✓
Sample size must be calculated, not estimated. Duration must be computed, not guessed.
Growth experiments that stop when they look good produce a library of "successful" tests with no cumulative impact. The reason is false positives from peeking: each stopped-when-good result has a higher-than-nominal false positive rate, so many of the "wins" that shipped had no real effect. An experiment designed to the standard above produces a result you can act on, and one you can learn from whether the mechanism prediction turns out right or wrong.