From ai-business-skills
Guides setup of valid A/B tests for global marketing: hypothesis formulation, sample size calculation, statistical significance, multi-arm testing, primary vs secondary metrics. Covers Optimizely, VWO, Google Optimize alternatives, and built-in platform tests (Meta, Google).
npx claudepluginhub minhnv0807/ai-business-skills
How this skill is triggered (by the user, by Claude, or both)
Slash command
/ai-business-skills:19-ab-test-setup-global
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
> Run experiments that produce decisions, not noise. Most "A/B tests" in marketing are underpowered, peeked-at, and badly hypothesized — meaning the team learns nothing and ships the louder variant.
A valid A/B test answers one question: "Did this change cause a real improvement, or am I seeing noise?"
To answer it credibly you need four things:
If any one of these is missing, you don't have an A/B test — you have a coin flip with extra steps.
Common newbie mistake: running a test for 3 days, seeing variant B 40% higher, declaring victory, and shipping. Three days is too short to absorb day-of-week effects, and small samples produce wild swings. Variant B may revert (or reverse) by day 14.
Read .agents/product-marketing-context.md if it exists. Audience size, average traffic, and current conversion rate determine whether a test is even feasible.
Ask up to 4 questions:
The cardinal rule: change one thing per test. Change two things at once and you cannot attribute the result to either.
If you must test multiple changes, use a multivariate test (MVT) — but those need much more traffic (often 4×–8× a single A/B).
Format: "If we [change X], [metric Y] will increase by [Z%] because [reason]."
The "because" matters: if your hypothesis is wrong but the reasoning was sound, you've still learned something generalizable.
Don't stop early. Statistical tests need adequate data to distinguish signal from noise.
Run for whole weeks (7, 14, or 21 days), not 3 days and not 10 days. Different weekdays produce different audience behavior: Monday B2B traffic is not Saturday DTC traffic.
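A minimal sketch of how a planned duration could be rounded up to whole weeks. The function name and the example inputs (6,000 visitors per variant, 1,000 visitors per day) are hypothetical, chosen only to illustrate the rounding.

```python
import math

def planned_duration_days(n_per_variant: int, daily_visitors: int, variants: int = 2) -> int:
    """Days to run, rounded up to whole weeks so every weekday is sampled equally."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    total_needed = n_per_variant * variants
    raw_days = math.ceil(total_needed / daily_visitors)
    weeks = max(1, math.ceil(raw_days / 7))  # never plan for less than one full week
    return weeks * 7

# Hypothetical inputs: 6,000 visitors needed per variant, 1,000 visitors per day.
print(planned_duration_days(6_000, 1_000))  # 12 raw days round up to 14
```

Rounding up rather than to the nearest week errs on the side of collecting slightly more data than the minimum.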
Looking at results every hour and stopping when "B looks good" is the most common error in marketing experimentation. Early peeks combined with early stops dramatically inflate false positive rates.
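The inflation can be seen in a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so every "win" is a false positive) and compares stopping at the first significant peek against testing once at the end. All parameters (400 simulated tests, 20 looks, 100 visitors per arm per look, 10% conversion rate) are illustrative assumptions, not values from this guide.

```python
import math
import random

def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test with equal arm sizes; True if |z| exceeds z_crit."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit

def simulate_aa(n_tests=400, looks=20, batch=100, rate=0.10, seed=7):
    """Simulate A/A tests (no real difference) and count false positives."""
    rng = random.Random(seed)
    win_on_any_peek = 0
    win_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        peeked_win = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if z_significant(conv_a, conv_b, n):
                peeked_win = True  # a peeking team would have stopped here
        win_on_any_peek += peeked_win
        win_at_end += z_significant(conv_a, conv_b, n)
    return win_on_any_peek / n_tests, win_at_end / n_tests

peek_rate, final_rate = simulate_aa()
print(f"false positives when stopping at the first 'win': {peek_rate:.1%}")
print(f"false positives when testing once at the end:    {final_rate:.1%}")
```

The single end-of-test check should land near the nominal 5%, while the peek-and-stop rate comes out several times higher.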
Most marketing teams use 95% confidence (p-value < 0.05) as the bar.
For high-stakes tests (pricing, branding) consider 99% confidence (p < 0.01).
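For readers who want to see where the p-value comes from, here is a sketch of a two-sided two-proportion z-test using only the standard library. For a 2×2 table this is equivalent to a chi-squared test without continuity correction. The function name and the example counts (300 of 10,000 versus 360 of 10,000) are hypothetical.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both arms
    z = (rate_b - rate_a) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: A converts 300 of 10,000, B converts 360 of 10,000.
p = two_proportion_p_value(300, 10_000, 360, 10_000)
print(f"p-value = {p:.4f}")
print("passes the 95% bar:", p < 0.05)
print("passes the 99% bar:", p < 0.01)
```

With these counts the result clears the 95% bar but not the 99% bar, which is the situation where the choice of threshold changes the decision.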
Write down:
A documented test history prevents your team from re-testing things that already failed and from forgetting why you made past decisions.
Sample size per variant ≈ 16 × p × (1 − p) / MDE²
where:
p = baseline conversion rate (e.g. 0.03 = 3%)
MDE = minimum detectable effect, in absolute terms
(e.g. 0.006 = lift from 3% to 3.6%)
This gives the sample size per variant for 80% power, 95% confidence (two-sided), and a 50/50 split, which are sensible defaults for most marketing tests.
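The constant 16 is a rounded form of 2 × (z for confidence + z for power)². The sketch below shows that relationship and lets the confidence and power be changed, for example to the 99% level suggested for high-stakes tests. It assumes the same normal approximation as the rule of thumb; the function name and parameters are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p, mde, confidence=0.95, power=0.80):
    """Per-variant sample size for a two-sided test with a 50/50 split.

    Uses the normal approximation with variance p * (1 - p) in both arms,
    the same simplification the rule of thumb makes.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    multiplier = 2 * (z_alpha + z_beta) ** 2
    return math.ceil(multiplier * p * (1 - p) / mde ** 2)

default_multiplier = 2 * (NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)) ** 2
print(round(default_multiplier, 1))  # 15.7, which the rule of thumb rounds up to 16
print(sample_size_per_variant(0.03, 0.006))
print(sample_size_per_variant(0.03, 0.006, confidence=0.99))
```

Because 16 is slightly above the unrounded multiplier, the rule of thumb is mildly conservative at the default settings.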
Current conversion rate is 3%. You want to detect a 20% relative lift (from 3% to 3.6%).
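The formula gives roughly 12,900 visitors per variant, or about 25,900 in total, which is slow on a modest-traffic page. Either run it (if the change matters), test something with a bigger expected lift, or get more traffic on the test surface.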
Current open rate is 25%. You want to detect a 10% relative lift (to 27.5%). The formula gives 4,800 recipients per variant.
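Both worked examples can be checked with a few lines applying the rule-of-thumb formula directly. The helper name is hypothetical. It takes the relative lift and converts it to the absolute MDE the formula expects.

```python
def rule_of_thumb_n(p, relative_lift):
    """Per-variant sample size from 16 * p * (1 - p) / MDE^2."""
    mde = p * relative_lift  # convert relative lift to an absolute MDE
    return round(16 * p * (1 - p) / mde ** 2)

landing_page = rule_of_thumb_n(0.03, 0.20)  # 3% -> 3.6%
email_opens = rule_of_thumb_n(0.25, 0.10)   # 25% -> 27.5%
print(landing_page, email_opens)  # 12933 4800
```

The email test needs far fewer samples despite the smaller relative lift, because the higher baseline rate makes the absolute difference much larger.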
| Daily volume | Conv. rate | Time needed | Test feasibility |
|---|---|---|---|
| < 100 | any | 2+ months | Skip — focus on traffic first |
| 100–500 | 2–5% | 3–6 weeks | Yes, but be patient |
| 500–2K | 2–5% | 2–3 weeks | Yes — ideal range |
| 2K–10K | 2–5% | 1–2 weeks | Yes — rapid iteration |
| 10K+ | any | days | Yes — multi-arm tests possible |
If volume is below 100/day, an A/B test cannot reach a usable sample size in a reasonable time. Concentrate on increasing traffic before running experiments.
Beyond simple A vs B:
For most teams: stick to A/B until traffic exceeds ~10K/day on the test surface.
A long-standing copywriting rule of thumb says roughly 80% of visitors read the headline and only 20% read the body, so the headline usually offers the largest expected lift per unit of effort.
Variations to try:
Easy to change, often 5–25% lift potential.
Variations:
| Tool | Best for | Cost |
|---|---|---|
| Meta Ads built-in A/B test | Creative, audience, placement on Meta | Free |
| TikTok Ads Split Test | TikTok ad creative and audience tests | Free |
| Google Ads Experiments | Google Ads campaigns and ad copy | Free |
| Optimizely Web | Enterprise web experimentation, sequential testing | $$$ enterprise |
| VWO | Mid-market web A/B + heatmaps | $199+/mo |
| Convert.com | Privacy-first web testing | $99+/mo |
| PostHog | Product feature flags + experiments + analytics | Generous free tier |
| GrowthBook | Open-source A/B testing platform | Free / hosted plans |
| Statsig | Product experimentation with feature flags | Free tier |
| AB Tasty | Web experimentation + personalization | $$$ |
| Unbounce / Instapage | Built-in A/B for landing pages | $90+/mo |
| Custom (split URL) | Two pages, 50/50 redirect, GA4/Pixel attribution | Free |
Note: Google Optimize was sunset in September 2023. Migration paths: GA4 + a third-party platform (Optimizely, VWO, Convert) or PostHog/GrowthBook for product-led teams.
1. Build two versions of the page: /landing-a and /landing-b
2. Split traffic 50/50:
- Meta Ads: 2 ad sets, identical audience, different destination URLs
- Google Ads: 2 ads in the same ad group, identical targeting, different URLs (set ad rotation to "Do not optimize" so delivery stays roughly even)
- Email: list-split feature in your ESP
3. Track conversions per variant:
- Meta Pixel custom event with parameter: page_version = "A" / "B"
- GA4 event with custom dimension
- PostHog feature flag exposure event
4. Run for the planned duration. Don't peek mid-test.
5. Export raw counts. Run significance test (calculator below).
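For the custom split-URL option, where you control the redirect yourself instead of relying on an ad platform or ESP to split traffic, one common approach is deterministic hashing of a visitor identifier. This is a sketch under assumptions: the visitor ID (for example from a first-party cookie), the test name, and the function name are all hypothetical and not part of the steps above.

```python
import hashlib

VARIANT_URLS = {"A": "/landing-a", "B": "/landing-b"}

def assign_variant(visitor_id: str, test_name: str = "landing-test") -> str:
    """Deterministically map a visitor to "A" or "B" with a 50/50 split."""
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "A" if bucket < 50 else "B"

variant = assign_variant("visitor-123")
print(variant, VARIANT_URLS[variant])
```

Hashing rather than random assignment means a returning visitor always sees the same page, and including the test name in the hash keeps assignments independent across tests.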
Use a calculator. Recommended:
- evanmiller.org/ab-testing/chi-squared.html
- abtestguide.com/calc/
Inputs:
Outputs:
| p-value | Lift size | Decision |
|---|---|---|
| < 0.05 | > 5% | B wins — implement and document |
| < 0.05 | < 5% | Significant but small — weigh implementation cost |
| 0.05–0.10 | > 10% | Borderline — extend test if feasible |
| > 0.10 | any | No evidence — keep A or design a stronger test |
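The decision table can be encoded as a small function so every test is judged by the same rule. This is a sketch: the table does not cover a significantly negative lift, a lift of exactly 5%, or a borderline p-value with a lift of 10% or less, so the handling of those cases below is an assumption and is marked as such in the comments.

```python
def decide(p_value: float, relative_lift: float) -> str:
    """Map a p-value and B's relative lift over A (0.08 = +8%) to a decision."""
    if p_value < 0.05:
        if relative_lift <= 0:
            # Assumption: the table has no row for B being significantly worse.
            return "B loses: keep A and document"
        if relative_lift > 0.05:
            return "B wins: implement and document"
        # Assumption: a lift of exactly 5% is treated as "small".
        return "Significant but small: weigh implementation cost"
    if p_value <= 0.10 and relative_lift > 0.10:
        return "Borderline: extend test if feasible"
    # Assumption: borderline p-values with lift of 10% or less land here too.
    return "No evidence: keep A or design a stronger test"

print(decide(0.02, 0.12))
print(decide(0.08, 0.15))
print(decide(0.40, 0.30))
```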
# A/B Test: [test name]
Created: [YYYY-MM-DD]
Owner: [name]
## 1. Hypothesis
"If we [change X], [metric Y] will increase by [Z%] because [reason]."
## 2. Variants
- Variant A (Control): [current state description]
- Variant B (Challenger): [changed state description]
- Single change: [the one element that differs]
## 3. Metrics
- Primary: [e.g. conversion rate]
- Secondary (guardrails): [e.g. bounce rate, time on page, AOV]
## 4. Sample size & duration
- Baseline (p): [%]
- Minimum detectable effect (MDE): [%]
- Sample needed per variant: [N]
- Daily traffic to test surface: [N]
- Estimated days to complete: [N]
## 5. Setup
- Tool: [Optimizely / VWO / PostHog / Meta built-in / custom]
- Variant A URL or asset: [...]
- Variant B URL or asset: [...]
- Tracking events: [list]
- Split ratio: 50/50
## 6. Timeline
- Start: [date]
- End (planned): [date]
- Review meeting: [date]
## 7. Results (filled in after test ends)
| Variant | Visitors | Conversions | Rate | Lift vs A |
|---------|----------|-------------|------|-----------|
| A | | | | — |
| B | | | | +X% |
p-value: [x]
95% CI on lift: [lower%, upper%]
Significant (p < 0.05): [Yes / No]
## 8. Decision
[Ship B / Keep A / Inconclusive — extend or redesign]
## 9. Action
[Implement variant B globally / Roll back / Schedule next iteration]
## 10. Lessons
[What this test teaches that generalizes to future tests]