Skill

A/B Test Analysis

From faos-analyst

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/faos-analyst:ab-test-analysis

SKILL.md

206 lines · ~1.7k tokens

Stats

Language: TeX
Parent stars: 18
Parent forks: 8
Maintenance: Good
Last commit: Apr 7, 2026

name: ab-test-analysis
description: Analyze A/B test results with statistical rigor (calculate significance, check guardrails, and make ship/extend/stop decisions). Use when evaluating experiment results or interpreting test data.
tags: [experimentation, ab-testing, analytics, data-driven]

A/B Test Analysis

Analyze experiment results with statistical rigor and produce a clear Ship / Investigate / Extend / Stop / Don't Ship recommendation.

This skill complements ab-test-setup (which handles experiment design). Use this skill when you have results to analyze.

Purpose

A/B test results are frequently misinterpreted: teams call tests too early, ignore guardrail metrics, or ship on directional trends without statistical significance. This skill enforces disciplined analysis.

When to Use

- An A/B test has completed its planned duration
- You have conversion data for control and variant groups
- Stakeholders are asking "did the test win?"
- You need to decide: ship, extend, or kill

When NOT to Use

- Designing or setting up an experiment (use ab-test-setup)
- The test hasn't reached minimum sample size yet
- You're analyzing observational data (not a controlled experiment)

Required Data (Ask If Missing)

| Field | Description |
| --- | --- |
| Primary metric | What the test is trying to improve (e.g., conversion rate) |
| Control group | Sample size (N) and conversions (C) for the control |
| Variant group | Sample size (N) and conversions (C) for the variant |
| Test duration | How long the test ran |
| Planned duration | How long it was designed to run |
| Guardrail metrics | Metrics that must not degrade (e.g., revenue, page load time) |
| MDE | Minimum Detectable Effect used in power calculation |
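The MDE ties the analysis back to the power calculation done at design time. As a rough cross-check of the planned N, the sketch below computes an approximate per-arm sample size for a two-sided two-proportion z-test. The function name `required_n_per_arm`, the defaults of alpha = 0.05 and 80% power, and the treatment of MDE as a relative lift are all assumptions of this example, not values taken from the skill.

```python
import math
from statistics import NormalDist


def required_n_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test.

    Assumes equal allocation and that the MDE is a relative lift
    (e.g. 0.10 means a 10% relative improvement over the baseline rate).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

For a 5% baseline and a 10% relative MDE this comes out at roughly 31,000 users per arm. If the actual N is far below the figure from the original power analysis, the test is underpowered.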

Analysis Process

Step 1: Validate the Setup

Before analyzing results, check:

- Sample size adequate? Compare actual N to planned N from the power analysis.
- Duration sufficient? The test must cover at least 1–2 full business cycles (e.g., weekday + weekend).
- SRM check? Sample Ratio Mismatch: for a 50/50 split, control and variant should have ~equal N (within 1%). A chi-square goodness-of-fit test against the planned allocation is a more rigorous check. If the split is skewed, the test is invalid.
- No novelty effects? If you can, compare early vs. late behavior. New UI elements tend to get more clicks initially.

If any check fails, the test results may be unreliable. Flag this before proceeding.
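The SRM check can be made more precise than an eyeball comparison of group sizes. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom against the planned allocation. The function name `srm_check` and the strict `alpha=0.001` threshold are assumptions of this example (a strict threshold is a common convention for SRM alerts); the skill itself only states the "within 1%" rule of thumb.

```python
import math


def srm_check(n_control, n_variant, expected_control_share=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) for Sample Ratio Mismatch."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_variant - expected_variant) ** 2 / expected_variant
    )
    # With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < alpha}
```

For example, `srm_check(10000, 10600)` flags a mismatch, while `srm_check(50000, 50400)` does not.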

Step 2: Calculate Core Statistics

For conversion rate tests:

Control conversion rate: p_c = C_control / N_control
Variant conversion rate: p_v = C_variant / N_variant
Relative lift: (p_v - p_c) / p_c × 100%

Pooled proportion: p = (C_control + C_variant) / (N_control + N_variant)
Standard error: SE = sqrt(p × (1-p) × (1/N_control + 1/N_variant))
Z-score: Z = (p_v - p_c) / SE
P-value: two-tailed from Z

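95% Confidence Interval: (p_v - p_c) ± 1.96 × SE_diff
where SE_diff = sqrt(p_c × (1-p_c) / N_control + p_v × (1-p_v) / N_variant)
(the interval uses the unpooled standard error; the pooled SE above applies only to the test under the null hypothesis)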
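The formulas above can be sketched in a few lines of standard-library Python. The function name `two_proportion_z_test` and the returned dictionary keys are illustrative choices. The z-score uses the pooled standard error as in the formulas, while the confidence interval here uses the unpooled standard error of the difference, which is a common choice for intervals and an assumption of this sketch.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(c_control, n_control, c_variant, n_variant, confidence=0.95):
    """Two-tailed pooled z-test for conversion rates, plus a CI on the absolute difference."""
    p_c = c_control / n_control
    p_v = c_variant / n_variant
    diff = p_v - p_c

    # Pooled standard error: valid under the null hypothesis of equal rates.
    pooled = (c_control + c_variant) / (n_control + n_variant)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval (assumption of this sketch).
    se_diff = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%

    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se_diff, diff + z_crit * se_diff),
    }
```

With 500 of 10,000 conversions in control and 570 of 10,000 in the variant, this gives a relative lift of 14%, a z-score of about 2.2, and a confidence interval on the absolute difference that excludes zero.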

Step 3: Assess Significance

| Criterion | Threshold | Status |
| --- | --- | --- |
| Statistical significance | p-value < 0.05 | Pass / Fail |
| Practical significance | Lift > MDE | Pass / Fail |
| Confidence interval | Does CI exclude 0? | Pass / Fail |

Both statistical AND practical significance are required to ship.
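The three criteria can be checked mechanically once the statistics are in hand. This is a minimal sketch: the function name `assess_significance` is hypothetical, and it assumes the lift and the MDE are both expressed as relative lifts in the same units (e.g. 0.10 for 10%).

```python
def assess_significance(p_value, relative_lift, mde, ci_low, ci_high, alpha=0.05):
    """Evaluate the Step 3 criteria and whether the result is eligible to ship."""
    checks = {
        "statistical": p_value < alpha,                 # p-value below the preset threshold
        "practical": relative_lift > mde,               # lift exceeds the planned MDE
        "ci_excludes_zero": ci_low > 0 or ci_high < 0,  # interval lies entirely on one side of 0
    }
    # Both statistical AND practical significance are required to ship.
    checks["ship_eligible"] = checks["statistical"] and checks["practical"]
    return checks
```

A result with p = 0.03 but a 4% lift against a 10% MDE passes the statistical check and fails the practical one, so it is not eligible to ship.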

Step 4: Check Guardrail Metrics

For each guardrail metric:

| Guardrail | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric name] | [value] | [value] | [+/- %] | OK / Warning / Degraded |

A guardrail is degraded if it shows a statistically significant negative change.
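For a guardrail that is itself a rate (e.g., checkout completion), the "statistically significant negative change" rule can be sketched with the same pooled z-test used for the primary metric. The function name `guardrail_status` and the `higher_is_better` flag are hypothetical. The skill does not define "Warning", so this sketch assumes it means a change in the wrong direction that is not statistically significant. Continuous guardrails such as revenue or page load time would need a different test (for example a t-test) and are not covered here.

```python
import math
from statistics import NormalDist


def guardrail_status(c_control, n_control, c_variant, n_variant,
                     higher_is_better=True, alpha=0.05):
    """Classify a rate-type guardrail as OK, Warning, or Degraded."""
    p_c = c_control / n_control
    p_v = c_variant / n_variant
    diff = p_v - p_c

    moved_wrong_way = diff < 0 if higher_is_better else diff > 0
    if not moved_wrong_way:
        return "OK"

    pooled = (c_control + c_variant) / (n_control + n_variant)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))

    # Degraded: significant change in the wrong direction.
    # Warning (assumption): wrong direction, but not statistically significant.
    return "Degraded" if p_value < alpha else "Warning"
```

Set `higher_is_better=False` for metrics where an increase is bad, such as a bounce rate.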

Step 5: Make the Decision

Use this decision matrix:

| Primary Metric | Guardrails | Recommendation |
| --- | --- | --- |
| Significant positive | All OK | Ship: roll out to 100% |
| Significant positive | Some degraded | Investigate: understand trade-off before deciding |
| Not significant, positive trend | All OK | Extend: run longer if sample size was insufficient |
| Not significant, flat | All OK | Stop: no effect detected, free up the experiment slot |
| Significant negative | Any | Don't Ship: revert and learn from the result |
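The matrix maps directly onto a small function. This sketch makes several assumptions that the matrix leaves open: the name `recommend` is hypothetical, `flat_band` is an illustrative cutoff separating "flat" from "positive trend", a non-significant result with degraded guardrails is routed to Investigate, and a non-significant negative trend is treated like a flat result. It also omits the "if sample size was insufficient" condition on Extend, which still needs a human check.

```python
def recommend(p_value, relative_lift, guardrails_ok, alpha=0.05, flat_band=0.005):
    """Map primary-metric and guardrail outcomes to a recommendation."""
    significant = p_value < alpha

    if significant and relative_lift < 0:
        return "Don't Ship"  # significant negative, regardless of guardrails
    if significant:
        return "Ship" if guardrails_ok else "Investigate"

    # Not significant from here on.
    if not guardrails_ok:
        return "Investigate"  # assumption: this case is not covered by the matrix
    if relative_lift > flat_band:
        return "Extend"       # positive trend, not yet significant
    return "Stop"             # flat (or slightly negative): no effect detected
```

For instance, `recommend(0.20, 0.03, True)` returns "Extend", and `recommend(0.01, 0.14, False)` returns "Investigate".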

Output Format

# A/B Test Results: [Test Name]

## Summary

| Field | Value |
| --- | --- |
| Test name | [name] |
| Hypothesis | [We believed X would cause Y] |
| Primary metric | [metric name] |
| Duration | [start] — [end] ([N] days) |
| Decision | **Ship / Investigate / Extend / Stop / Don't Ship** |

---

## Results

| Group | Sample Size | Conversions | Rate |
| --- | --- | --- | --- |
| Control | [N] | [C] | [rate]% |
| Variant | [N] | [C] | [rate]% |

**Relative lift:** [+/- X.X%]
**P-value:** [value]
**95% CI:** [[lower]%, [upper]%]
**Statistically significant:** Yes / No
**Practically significant:** Yes / No (MDE was [X]%)

---

## Guardrail Metrics

| Metric | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric] | [val] | [val] | [change] | OK / Warning / Degraded |

---

## Recommendation

**Decision: [Ship / Investigate / Extend / Stop / Don't Ship]**

**Rationale:** [2–3 sentences explaining the decision]

**Next steps:**
1. [action]
2. [action]

---

## Learnings

- [What we learned from this test, regardless of outcome]
- [How this informs future experiments]

Common Pitfalls

| Pitfall | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| Peeking at results daily | Inflates false positive rate | Wait for planned duration and sample size |
| Calling it at p=0.06 | "Almost significant" isn't significant | Set the threshold before the test, stick to it |
| Ignoring guardrails | Winning on one metric while losing on another | Always check guardrails before shipping |
| Post-hoc segmentation | Finding "it worked for mobile users!" after the fact is data mining | Pre-register segments or treat as hypothesis for next test |
| Running too many variants | Each variant needs full sample size | Limit to 1–2 variants per test |
| Not learning from losses | "It didn't work" is not a learning | Document WHY it didn't work and what to try next |
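The peeking pitfall is easy to demonstrate with a small A/A simulation, where both arms share the same true conversion rate and every "significant" result is a false positive. The sketch below compares stopping at the first significant look with testing once at the end. All names and parameters (`simulate_peeking`, 300 simulated tests, 10 looks, 200 users per arm per look, a 10% rate) are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(c_a, n_a, c_b, n_b):
    """Two-tailed p-value of the pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions) in both arms
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_tests=300, looks=10, users_per_look=200,
                     rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _test in range(n_tests):
        c_a = c_b = n = 0
        any_look_significant = False
        p_value = 1.0
        for _look in range(looks):
            # Both arms share the same true rate, so any "win" is a false positive.
            c_a += sum(rng.random() < rate for _ in range(users_per_look))
            c_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p_value = z_test_p_value(c_a, n, c_b, n)
            any_look_significant = any_look_significant or p_value < alpha
        peeking_hits += any_look_significant
        final_hits += p_value < alpha
    return peeking_hits / n_tests, final_hits / n_tests
```

Because the final look is one of the peeks, the peeking rate can never be lower than the final-look rate. In runs of this kind it typically lands well above the nominal alpha, which is why the table recommends waiting for the planned duration and sample size.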

Anti-Patterns

| Avoid | Why | Instead |
| --- | --- | --- |
| "Directional win" | Not a statistical standard | Require p < 0.05 and lift > MDE |
| Shipping without guardrail check | May degrade critical metrics | Always check before shipping |
| Ending early because it "looks good" | Sequential testing bias | Run to planned duration |
| Not documenting learnings | Same failed experiments get repeated | Maintain an experiment log |

References

Kohavi, R., Tang, D., & Xu, Y. Trustworthy Online Controlled Experiments (2020)
Evan Miller's A/B Test Calculator
Sample Size Calculator
