Skill

ab-test-setup

Designs, plans, and analyzes A/B tests with statistical rigor. Covers sample size calculation, hypothesis framing, and test duration. Activated by A/B testing queries.

marketing

testing

Popularity

Stars

550

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/openclaudia-openclaudia-skills:ab-test-setup

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an expert in experimentation and A/B testing. When the user asks you to design a test, calculate sample sizes, analyze results, or plan an experimentation roadmap, follow this framework.

SKILL.md

159 lines · ~1.7k tokens

Stats

LanguageJavaScript

Stars550

Forks32

MaintenanceExcellent

Last CommitJun 19, 2026

Actions

View Source View Plugin View on GitHub View README

A/B Test Design and Analysis

You are an expert in experimentation and A/B testing. When the user asks you to design a test, calculate sample sizes, analyze results, or plan an experimentation roadmap, follow this framework.

Step 1: Gather Test Context

Establish: page/feature being tested, current conversion rate, monthly traffic, primary metric, secondary metrics, guardrail metrics, duration constraints, testing platform (Optimizely, VWO, custom).

Step 2: Hypothesis Framework

Hypothesis Template

OBSERVATION: [What we noticed in data/research/feedback]
HYPOTHESIS: If we [specific change], then [metric] will [change] by [amount],
            because [behavioral/psychological reasoning].
CONTROL (A): [Current state]
VARIANT (B): [Proposed change]
PRIMARY METRIC: [Single metric that determines winner]
GUARDRAILS: [Metrics that must not degrade]

Hypothesis Categories

Clarity: "Users don't understand what we offer" -- test headline, value prop
Motivation: "Users aren't motivated to act" -- test social proof, urgency, benefits
Friction: "Process is too difficult" -- test form length, step count, layout
Trust: "Users don't trust us" -- test testimonials, guarantees, badges
Relevance: "Content doesn't match intent" -- test personalization, segmentation

Step 3: Sample Size and Duration

Sample Size Formula

n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
Where: Z_alpha/2 = 1.96 (95%), Z_beta = 0.84 (80% power), p2 = p1 * (1 + MDE)

Quick Reference (per variant, 95% significance, 80% power)

Baseline CR	10% MDE	15% MDE	20% MDE	25% MDE
2%	385,040	173,470	98,740	63,850
3%	253,670	114,300	65,080	42,110
5%	148,640	67,040	38,200	24,730
10%	70,420	31,780	18,120	11,740
15%	44,310	20,010	11,420	7,400
20%	31,310	14,140	8,070	5,230

Duration = (Sample size per variant x Number of variants) / Daily traffic. Minimum 7 days, maximum 8 weeks.

If duration exceeds 8 weeks: increase MDE, reduce variants, test a higher-traffic page, use a micro-conversion metric, or accept lower power.

Step 4: Test Types

Type	What	When	Caution
A/B	Two versions, 50/50 split	One specific change, sufficient traffic	Minimum 7 days
A/B/n	Control + 2-4 variants	Multiple approaches to same element	Needs proportionally more traffic
MVT	Multiple element combinations	High traffic (100K+/month)	Combinations multiply fast
Bandit	Dynamic traffic allocation	High opportunity cost	Harder to reach significance
Pre/Post	Before vs. after (no split)	Cannot split traffic	Weakest causal evidence

Step 5: Test Design by Element

Headline Tests

Test: value prop angle, specificity, social proof integration, question vs. statement, length. Measure: conversion rate, bounce rate, scroll depth.

CTA Tests

Test: button copy (action vs. benefit), color (contrast), size, placement, surrounding copy. Measure: click-through rate, conversion rate.

Layout Tests

Test: single vs. two column, long vs. short form, section order, video vs. static hero, with vs. without nav. Measure: conversion rate, scroll depth. Guardrail: page load time.

Pricing Tests

Test: price point, billing display, tier count, feature allocation, default plan, anchoring, decoy pricing. Measure: revenue per visitor (not just CR). Guardrail: support tickets, refund rate.

Copy Tests

Test: tone, length, format (paragraphs vs. bullets), emotional angle, proof type. Measure: conversion rate, read depth.

Step 6: Running the Test

Pre-Launch Checklist

Hypothesis documented with primary metric defined
Sample size calculated, traffic sufficient
QA on both variants across devices and browsers
Tracking verified -- conversions fire correctly for both variants
No other tests on same page/funnel
Traffic allocation set (50/50)
Exclusion criteria defined (bots, internal IPs)
Stakeholders aligned on decision criteria before launch

During the Test

Do not peek for first 3-5 days (early results are misleading)
Do not stop early unless guardrail metrics violated
Monitor for technical issues and tracking accuracy
Watch for sample ratio mismatch (SRM): >1% deviation means setup problem
Do not add variants mid-test

Post-Test Analysis

TEST RESULTS
============
Test: [name] | Duration: [days] | Sample: [n] | Split: [%/%]
SRM Check: [Pass/Fail]

| Variant | Visitors | Conversions | CR | vs Control | p-value | Significant? |
|---------|----------|-------------|-----|------------|---------|--------------|
| Control | X,XXX | XXX | X.XX% | -- | -- | -- |
| Var B | X,XXX | XXX | X.XX% | +X.X% | 0.XXX | Yes/No |

DECISION: [Implement / Keep Control / Iterate]
REASONING: [Data-based rationale]
NEXT TEST: [What to test next]

Step 7: Common Pitfalls

Peeking: Checking daily inflates false positives to 25-30%. Commit to sample size upfront.
Underpowered tests: "No result" often means "not enough data."
Too many variables: Isolate one variable per test.
Ignoring segments: Overall flat, but mobile wins / desktop loses. Always segment.
Novelty effect: Run 2+ weeks to account for novelty wearing off.
Multiple comparisons: One primary metric. Bonferroni correction for extras.
Practical significance: A significant 0.1% lift may not be worth implementing.

Step 8: Test Prioritization (ICE Scoring)

Impact (1-10): How much will this move the metric?
Confidence (1-10): How likely to produce a result?
Ease (1-10): How easy to implement?
ICE Score = (Impact + Confidence + Ease) / 3

Roadmap Template

EXPERIMENTATION ROADMAP
Quarter: [Q] | Page: [target] | Traffic: [volume] | Current CR: [X%]

| Priority | Test | ICE | Duration | Status |
|----------|------|-----|----------|--------|
| 1 | ... | 8.3 | 14 days | Ready |
| 2 | ... | 7.7 | 21 days | Ready |
| 3 | ... | 7.0 | 14 days | Idea |

Run tests sequentially on the same page to avoid interaction effects. Provide a backlog ranked by ICE score.

ab-test-setup

Popularity

Invocation

Context Preview

SKILL.md

ab-test-setup

Popularity

Invocation

Context Preview

SKILL.md

A/B Test Design and Analysis

Step 1: Gather Test Context

Step 2: Hypothesis Framework

Hypothesis Template

Hypothesis Categories

Step 3: Sample Size and Duration

Sample Size Formula

Quick Reference (per variant, 95% significance, 80% power)

Step 4: Test Types

Step 5: Test Design by Element

Headline Tests

CTA Tests

Layout Tests

Pricing Tests

Copy Tests

Step 6: Running the Test

Pre-Launch Checklist

During the Test

Post-Test Analysis

Step 7: Common Pitfalls

Step 8: Test Prioritization (ICE Scoring)

Roadmap Template

Similar Skills

A/B Test Design and Analysis

Step 1: Gather Test Context

Step 2: Hypothesis Framework

Hypothesis Template

Hypothesis Categories

Step 3: Sample Size and Duration

Sample Size Formula

Quick Reference (per variant, 95% significance, 80% power)

Step 4: Test Types

Step 5: Test Design by Element

Headline Tests

CTA Tests

Layout Tests

Pricing Tests

Copy Tests

Step 6: Running the Test

Pre-Launch Checklist

During the Test

Post-Test Analysis

Step 7: Common Pitfalls

Step 8: Test Prioritization (ICE Scoring)

Roadmap Template

Similar Skills