From pm-copilot
Use this skill when the user asks to "design an A/B test", "how should I test this", "experiment design", "how do I run an experiment", "test this feature", "set up a split test", "how many users do I need", "statistical significance", "how do I know if this test worked", or wants to design a rigorous experiment to test a product hypothesis.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty
Slash command: /pm-copilot:ab-test-design
You are helping the user design a rigorous A/B test — one that produces trustworthy, actionable results rather than ambiguous data. Most A/B tests in practice are designed poorly and lead to incorrect conclusions. The goal is to fix that.
Framework: Statistical testing principles, Lenny Rachitsky's experimentation guide, Ronny Kohavi (Trustworthy Online Controlled Experiments).
Read memory/user-profile.md for product stage (pre-PMF products usually shouldn't be running A/B tests — qualitative research is more efficient) and analytics tool. Read context/company/analytics-baseline.md for baseline metrics needed for sample size calculation.
Pre-PMF warning: If the product is pre-PMF, flag: "A/B testing works best when you have enough traffic to detect effects and when the product direction is relatively stable. At this stage, qualitative user research often gives you more signal faster. Still want to proceed?"
For a well-designed A/B test, define each element:
Hypothesis: "If we [change], then [metric] will [increase/decrease] by [amount] because [reason]."
Primary metric (what the test will be judged on):
Guardrail metrics (what must not get worse):
Control vs. treatment:
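The elements above can be captured in a small structure so that none of them is skipped before launch. This is a minimal sketch, not part of the skill itself: the `ExperimentSpec` class, its field names, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentSpec:
    """Hypothetical container for the elements of an A/B test design."""
    change: str            # what the treatment does differently
    primary_metric: str    # the single metric the test is judged on
    direction: str         # "increase" or "decrease"
    expected_change: str   # e.g. "5% relative"
    reason: str            # why the change should move the metric
    control: str           # description of the control experience
    treatment: str         # description of the treatment experience
    guardrail_metrics: list[str] = field(default_factory=list)

    def hypothesis(self) -> str:
        # Fill in the hypothesis template: If we [change], then [metric]
        # will [increase/decrease] by [amount] because [reason].
        return (
            f"If we {self.change}, then {self.primary_metric} will "
            f"{self.direction} by {self.expected_change} because {self.reason}."
        )


# Illustrative values only; they do not come from any real experiment.
spec = ExperimentSpec(
    change="shorten the signup form",
    primary_metric="signup completion rate",
    direction="increase",
    expected_change="5% relative",
    reason="fewer fields reduce drop-off",
    control="current form",
    treatment="short form",
    guardrail_metrics=["support tickets per signup"],
)
print(spec.hypothesis())
```

Making every field except the guardrails required means an incomplete design fails when the spec is constructed rather than after the test has started.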
Calculate the required sample size before starting the test:
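Inputs needed:
Baseline metrics (from analytics-baseline.md or user input)
Rule of thumb: At 80% power and a 5% significance level, you need about 16 × variance ÷ (absolute difference)² users per variant. For a metric with a baseline rate near 50%, that is ~1,600 users per variant to detect a 10% relative improvement and ~6,400 per variant for a 5% relative improvement. Lower baseline rates need far more: a 10% relative improvement on a 5% baseline takes roughly 30,000 users per variant.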
If the user doesn't have an analytics tool that does this automatically, provide the calculation: use the approximate formula or direct them to an online calculator.
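As one possible version of the approximate formula, the sketch below uses the standard normal-approximation sample size for comparing two conversion rates. The text does not say which formula it has in mind, so this choice is an assumption, as are the function name `sample_size_per_variant`, its parameters, and the defaults of a two-sided 5% significance level and 80% power.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per variant for a two-sided test on a conversion rate.

    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 * p * (1 - p) / delta^2,
    where p is the baseline rate and delta = p * relative_mde is the
    absolute difference to detect.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if relative_mde <= 0:
        raise ValueError("relative_mde must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80
    delta = baseline_rate * relative_mde
    variance = baseline_rate * (1 - baseline_rate)
    n = 2 * (z_alpha + z_power) ** 2 * variance / delta ** 2
    return math.ceil(n)


# A 50% baseline and a 10% relative lift lands near the ~1,600 rule of thumb.
print(sample_size_per_variant(0.50, 0.10))  # roughly 1,570
# The same relative lift on a 5% baseline needs far more users.
print(sample_size_per_variant(0.05, 0.10))  # roughly 29,800
```

The required sample depends heavily on the baseline rate, so it is worth running the calculation with the product's real baseline rather than relying on a single rule-of-thumb number.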
Duration estimation: Total users needed (users per variant × number of variants) ÷ daily traffic to the tested area = test duration in days. Flag if this exceeds 4 weeks: tests longer than 4 weeks are at high risk of confounders.
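The duration arithmetic can be sketched as follows. The function names and the 28-day default are hypothetical, and the sketch assumes that "users needed" means the total across all variants and that the daily traffic figure counts every user entering the experiment.

```python
import math


def estimate_duration_days(users_per_variant, num_variants, daily_users):
    """Days needed to collect the full sample at the given daily traffic."""
    if daily_users <= 0:
        raise ValueError("daily_users must be positive")
    total_users = users_per_variant * num_variants
    return math.ceil(total_users / daily_users)


def exceeds_four_weeks(days, max_days=28):
    """True when the estimated run time is longer than the 4-week limit."""
    return days > max_days


# Illustrative numbers: 6,400 users per variant, 2 variants, 500 users per day.
days = estimate_duration_days(6400, 2, 500)
print(days, exceeds_four_weeks(days))  # 26 False
```

If the flag comes back true, the usual levers are accepting a larger detectable effect or testing on a higher-traffic surface.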
Before launching, verify:
Define before launch:
Produce: