Designs A/B test plans with hypotheses, success metrics, sample sizes, durations, and interpretation guides for product features, UI, onboarding, and pricing experiments.
Install:

    npx claudepluginhub mohitagw15856/pm-claude-skills --plugin pm-delivery

This skill uses the workspace's default tool permissions.
Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.
Before running any test, confirm:
"We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."
Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.
Use this formula (provide the computed output, not the formula, to the user):

n per variant = (z_alpha/2 + z_beta)^2 × [p1(1 − p1) + p2(1 − p2)] ÷ (p2 − p1)^2

where p1 is the baseline rate, p2 = p1 × (1 + MDE), z_alpha/2 ≈ 1.96 at 95% confidence, and z_beta ≈ 0.84 at 80% power.
For common scenarios, provide pre-calculated estimates:
| Baseline Rate | MDE (Relative) | Required Sample per Variant |
|---|---|---|
| 5% | 20% | ~19,000 |
| 10% | 15% | ~14,000 |
| 20% | 10% | ~15,000 |
| 40% | 10% | ~9,500 |
| 60% | 5% | ~42,000 |
Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."
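The pre-calculated estimates above can be reproduced in code. Below is a minimal sketch using the standard normal-approximation sample-size formula for comparing two proportions at 95% confidence and 80% power; the function name is hypothetical, and its output can differ from the table above and from dedicated calculators, which may apply corrections.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-proportion z-test.

    baseline     -- control conversion rate (e.g. 0.05 for 5%)
    relative_mde -- minimum detectable effect as a relative lift (e.g. 0.20)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # treatment rate at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)

# Example: 5% baseline, 20% relative MDE
print(sample_size_per_variant(0.05, 0.20))
```

Treat these per-variant numbers as estimates too: the textbook approximation, the table, and online calculators each round and correct differently.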
Minimum: 2 full weeks (to capture weekly seasonality)
Maximum: 4 weeks (novelty effect distorts results beyond this)
Duration (days) = Required sample (all variants combined) ÷ (Daily traffic × % exposed)
Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).
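The duration arithmetic above, together with the 2-week floor and the 8-week flag, can be sketched as follows; names are illustrative, not part of the skill:

```python
import math

def estimated_duration_weeks(sample_per_variant: int, variants: int,
                             daily_traffic: int, pct_exposed: float) -> int:
    """Weeks needed to collect the required sample, floored at the
    2-week minimum that captures weekly seasonality."""
    daily_eligible = daily_traffic * pct_exposed
    days = (sample_per_variant * variants) / daily_eligible
    weeks = math.ceil(days / 7)
    if weeks > 8:
        raise ValueError(
            "Traffic too low to reach significance in under 8 weeks; "
            "consider a holdout test or qualitative research instead.")
    return max(2, weeks)
```

For example, 8,000 users per variant across two variants at 1,000 eligible users per day needs 16 days of traffic, so the function reports 3 weeks.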
Hypothesis:
[Filled hypothesis template]
Variants:
Primary Metric: [Metric name + how measured]
Guardrail Metrics: [Metrics that must not degrade]
Target Segment: [Who sees the test — % of traffic, user type]
Traffic Split: [50/50 recommended unless ramp-up needed]
Sample Size Required: ~[N] users per variant
Estimated Duration: [X] weeks (based on [Y] daily eligible users)
Significance Threshold: 95% confidence, 80% power
Exclusions: [Any user segments to exclude and why]
Rollback Trigger: If [guardrail metric] degrades by [X%], stop the test immediately.
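The rollback trigger can be expressed as a simple guardrail check; the function name and threshold here are illustrative, not part of the spec:

```python
def should_rollback(guardrail_baseline: float, guardrail_current: float,
                    max_degradation: float) -> bool:
    """True if the guardrail metric has degraded past the rollback threshold.

    max_degradation is relative, e.g. 0.05 means 'stop if it degrades by 5%'.
    """
    relative_drop = (guardrail_baseline - guardrail_current) / guardrail_baseline
    return relative_drop >= max_degradation
```

For example, with a 40% baseline and a 5% rollback threshold, a drop to 36% (a 10% relative degradation) would stop the test.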
Results Interpretation Guide: