From pm-copilot
Use this skill when the user asks to "design an A/B test", "how should I test this", "experiment design", "how do I run an experiment", "test this feature", "set up a split test", "how many users do I need", "statistical significance", "how do I know if this test worked", or wants to design a rigorous experiment to test a product hypothesis.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty

This skill uses the workspace's default tool permissions.
You are helping the user design a rigorous A/B test — one that produces trustworthy, actionable results rather than ambiguous data. Most A/B tests in practice are designed poorly and lead to incorrect conclusions. The goal is to fix that.
Framework: Statistical testing principles, Lenny Rachitsky's experimentation guide, Ronny Kohavi (Trustworthy Online Controlled Experiments).
Read memory/user-profile.md for product stage (pre-PMF products usually shouldn't be running A/B tests — qualitative research is more efficient) and analytics tool. Read context/company/analytics-baseline.md for baseline metrics needed for sample size calculation.
Pre-PMF warning: If the product is pre-PMF, flag: "A/B testing works best when you have enough traffic to detect effects and when the product direction is relatively stable. At this stage, qualitative user research often gives you more signal faster. Still want to proceed?"
For a well-designed A/B test, define each element:
- Hypothesis: "If we [change], then [metric] will [increase/decrease] by [amount] because [reason]."
- Primary metric (the single metric the test will be judged on)
- Guardrail metrics (what must not get worse)
- Control vs. treatment (what each group experiences; the two should differ only in the change being tested)
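As a hypothetical illustration (a checkout-flow test with invented numbers, not from the source), the filled-in elements might be captured as a simple record:

```python
# Hypothetical example of a completed test design (all values invented).
test_design = {
    "hypothesis": (
        "If we shorten the checkout form from 8 fields to 4, "
        "then checkout conversion will increase by 10% (relative) "
        "because fewer fields reduce friction."
    ),
    "primary_metric": "checkout conversion rate",
    "guardrail_metrics": ["average order value", "refund rate"],
    "control": "current 8-field checkout form",
    "treatment": "4-field checkout form",
}

# Every element must be defined before launch, not after.
assert all(test_design.values())
```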
Calculate the required sample size before starting the test:
Inputs needed:
- Baseline rate of the primary metric (from analytics-baseline.md or user input)
- Minimum detectable effect (the smallest relative change worth detecting)
- Significance level (typically 5%) and statistical power (typically 80%)

Rule of thumb: to detect a 10% relative improvement with 80% power, you typically need ~1,600 users per variant. For a 5% relative improvement, you need ~6,400 per variant.
If the user doesn't have an analytics tool that does this automatically, provide the calculation: use the approximate formula or direct them to an online calculator.
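A minimal sketch of the approximate calculation, using Kohavi's rule of thumb n ≈ 16·σ²/δ² for a two-sided test at 5% significance with 80% power (the function name and defaults here are illustrative, not from the source):

```python
def sample_size_per_variant(baseline_rate, relative_mde, power_factor=16):
    """Approximate users needed per variant for a conversion-rate test.

    power_factor=16 corresponds to ~80% power at alpha=0.05
    (Kohavi's n = 16 * sigma^2 / delta^2 rule of thumb).
    """
    delta = baseline_rate * relative_mde            # absolute effect size
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    return power_factor * variance / delta ** 2

# At a 50% baseline this reproduces the rules of thumb above:
print(round(sample_size_per_variant(0.50, 0.10)))  # -> 1600
print(round(sample_size_per_variant(0.50, 0.05)))  # -> 6400
```

Note that the required sample size depends heavily on the baseline rate: at a 5% baseline, the same 10% relative improvement needs roughly 30,000 users per variant.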
Duration estimation: total users needed (sample size × number of variants) ÷ daily traffic to the tested area = test duration in days. Flag if this exceeds 4 weeks; longer tests are at high risk of confounders (seasonality, holidays, concurrent launches).
Before launching, verify:
Define before launch:
Produce: