Skill

run-ab-test

Designs, executes, and interprets statistically valid A/B tests for webpages, emails, apps, or ads to validate conversion or engagement hypotheses.

automation

npx claudepluginhub jeffreytse/grimoire --plugin grimoire

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/grimoire:run-ab-test

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Design, execute, and interpret a statistically valid A/B test that produces actionable, trustworthy results for product and marketing decisions.

SKILL.md

55 lines · ~1.3k tokens

Similar Skills

ab-test-setup

Designs and implements A/B tests with statistical rigor: hypothesis framing, sample size calculation, and test type selection.

3 files

superpowers

a-b-test-design

1.3k

Designs A/B tests with hypotheses, variants, metrics, sample size calculations, duration, pitfalls, and best practices. For statistically validating product changes.

prototyping-testing

ab-testing

Guides planning, designing, and implementing A/B tests, split tests, multivariate experiments. Covers hypotheses, sample sizes, test types, statistical principles.

1 file

craft-workspace-webconsulting-skills

Stats

LanguageShell

Stars12

Forks1

MaintenanceExcellent

Last CommitJun 11, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Run A/B Test

Design, execute, and interpret a statistically valid A/B test that produces actionable, trustworthy results for product and marketing decisions.

Why This Is Best Practice

Adopted by: Google, Amazon, Netflix, Booking.com — all run thousands of A/B tests per year as the primary mechanism for product improvement Impact: Netflix reports testing drove $1B+ in annual subscriber retention improvement; Booking.com runs 1,000+ concurrent tests; Kohavi (2020) shows organizations with mature experimentation programs grow 2–4× faster than peers Why best: A/B testing is the gold standard for causal inference on live systems — without it, correlation is mistaken for causation, producing decisions that feel informed but reflect bias.

Sources: Kohavi, Tang & Xu "Trustworthy Online Controlled Experiments" (2020) Cambridge; Optimizely "Statistical Significance Calculator"; VWO "A/B Testing Guide" (2023)

Steps

Write the hypothesis — use this format: "Because [observation/insight], we believe [change] will cause [result] for [audience], which we'll measure by [metric]." A bad hypothesis cannot produce learning regardless of the outcome.
Define primary and secondary metrics — primary metric: the one metric that decides winner/loser; secondary metrics: guardrail metrics that must not deteriorate (e.g., conversion rate is primary; revenue per visitor is guardrail).
Calculate required sample size — use a power calculator: input baseline conversion rate, minimum detectable effect (MDE, typically 10–20%), significance level (α = 0.05), and power (β = 0.80); this gives required visitors per variant.
Estimate test duration — duration = required sample size per variant ÷ daily traffic to the test page; minimum 7 days (weekly seasonality), maximum 4–6 weeks (novelty effect fades); if duration >6 weeks, increase MDE or accept lower power.
Implement the test — use a testing platform (Optimizely, VWO, AB Tasty, Statsig, Eppo, or built-in platform tools); randomize at user level (not session level) to avoid re-assignment; verify traffic split (50/50 for standard A/B).
Run a pre-test check — run the experiment for 24h with both variants identical (A/A test); if results show significant difference, fix the randomization or data collection bug before proceeding.
Monitor for sample ratio mismatch (SRM) — check daily that variant traffic volumes match the intended split ±1%; SRM (e.g., variant A gets 60% of traffic instead of 50%) indicates a tracking bug that invalidates results.
Do not peek — avoid checking results before the planned end date; peeking and stopping early inflates false positive rates from 5% to 30%+ (Kohavi 2020); use sequential testing if early stopping is required.
Read results correctly — declare a winner only if: p < 0.05 (or 95% CI excludes zero) AND primary metric moved in the expected direction AND guardrail metrics are not significantly harmed.
Ship, learn, and document — if significant: ship the winner; if not significant: document the null result as a learning (null results are equally informative); add to the experiment log; plan the next iteration.

Rules

One primary metric per test — multiple primary metrics require multiple hypothesis tests and Bonferroni correction; keep it simple.
Never end a test early because the winner "looks obvious" — this is the most common source of invalid A/B test conclusions.
Sample ratio mismatch must be investigated before reading results — SRM makes all statistics meaningless; it must be resolved or the test must be restarted.
Minimum 7 days runtime for any web/app test — day-of-week effects can make early results reverse completely by day 7.
Null results must be documented — a test that shows no effect tells you what not to build; it is equally valuable to the organization as a positive result.

Common Mistakes

No pre-calculated sample size — ending the test when "it feels like enough data" produces wildly wrong conclusions.
Changing the test mid-run — modifying the variant, traffic percentage, or metric after the test starts invalidates all prior data.
Peeking and stopping early — the #1 source of false positives in A/B testing; p-values are not valid at arbitrary stopping points.
Testing too many variants — 4+ variants require much larger sample sizes and longer run times; 2-variant tests are faster and more reliable.
Ignoring guardrail metrics — a variant that improves CTR while halving revenue per session is not a winner.

When NOT to Use

Traffic too low to reach statistical significance within a reasonable timeframe
Changes that cannot be randomized at the user level (e.g., pricing changes visible to groups)
Qualitative questions (use user interviews, surveys, or usability tests instead)

run-ab-test

Popularity

Invocation

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

run-ab-test

Popularity

Invocation

Context Preview

SKILL.md

Run A/B Test

Why This Is Best Practice

Steps

Rules

Common Mistakes

When NOT to Use

Similar Skills

Help us improve

Run A/B Test

Why This Is Best Practice

Steps

Rules

Common Mistakes

When NOT to Use