Help us improve
Share bugs, ideas, or general feedback.
npx claudepluginhub frank-luongt/faos-skills-marketplace --plugin faos-analystHow this skill is triggered — by the user, by Claude, or both
Slash command
/faos-analyst:ab-test-analysisThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
<!-- AUTO-GENERATED by export-plugins.py — DO NOT EDIT -->
Analyzes A/B test results for statistical significance, sample size validation, confidence intervals, lift, guardrails, and ship/extend/stop recommendations. Handles CSV/Excel data via Python scripts.
Analyzes A/B tests and experiments with statistical rigor: assesses power, significance, validity, segments; recommends ship/kill/extend.
Designs statistically rigorous A/B tests and interprets experiment results with ship/iterate/kill recommendations. Handles sample size calculation, success criteria, and risk flags.
Share bugs, ideas, or general feedback.
Analyze experiment results with statistical rigor and produce a clear Ship / Investigate / Extend / Stop recommendation.
This skill complements ab-test-setup (which handles experiment design). Use this skill when you have results to analyze.
Most A/B test interpretations are wrong — teams either call tests too early, ignore guardrail metrics, or ship on directional trends without statistical significance. This skill enforces disciplined analysis.
ab-test-setup)| Field | Description |
|---|---|
| Primary metric | What the test is trying to improve (e.g., conversion rate) |
| Control group | Sample size (N) and conversions (C) for the control |
| Variant group | Sample size (N) and conversions (C) for the variant |
| Test duration | How long the test ran |
| Planned duration | How long it was designed to run |
| Guardrail metrics | Metrics that must not degrade (e.g., revenue, page load time) |
| MDE | Minimum Detectable Effect used in power calculation |
Before analyzing results, check:
If any check fails, the test results may be unreliable. Flag this before proceeding.
For conversion rate tests:
Control conversion rate: p_c = C_control / N_control
Variant conversion rate: p_v = C_variant / N_variant
Relative lift: (p_v - p_c) / p_c × 100%
Pooled proportion: p = (C_control + C_variant) / (N_control + N_variant)
Standard error: SE = sqrt(p × (1-p) × (1/N_control + 1/N_variant))
Z-score: Z = (p_v - p_c) / SE
P-value: two-tailed from Z
95% Confidence Interval: (p_v - p_c) ± 1.96 × SE
| Criterion | Threshold | Status |
|---|---|---|
| Statistical significance | p-value < 0.05 | Pass / Fail |
| Practical significance | Lift > MDE | Pass / Fail |
| Confidence interval | Does CI exclude 0? | Pass / Fail |
Both statistical AND practical significance are required to ship.
For each guardrail metric:
| Guardrail | Control | Variant | Change | Status |
|---|---|---|---|---|
| [metric name] | [value] | [value] | [+/- %] | OK / Warning / Degraded |
A guardrail is degraded if it shows a statistically significant negative change.
Use this decision matrix:
| Primary Metric | Guardrails | Recommendation |
|---|---|---|
| Significant positive | All OK | Ship — roll out to 100% |
| Significant positive | Some degraded | Investigate — understand trade-off before deciding |
| Not significant, positive trend | All OK | Extend — run longer if sample size was insufficient |
| Not significant, flat | All OK | Stop — no effect detected, free up the experiment slot |
| Significant negative | Any | Don't Ship — revert and learn from the result |
# A/B Test Results: [Test Name]
## Summary
| Field | Value |
| --- | --- |
| Test name | [name] |
| Hypothesis | [We believed X would cause Y] |
| Primary metric | [metric name] |
| Duration | [start] — [end] ([N] days) |
| Decision | **Ship / Investigate / Extend / Stop / Don't Ship** |
---
## Results
| Group | Sample Size | Conversions | Rate |
| --- | --- | --- | --- |
| Control | [N] | [C] | [rate]% |
| Variant | [N] | [C] | [rate]% |
**Relative lift:** [+/- X.X%]
**P-value:** [value]
**95% CI:** [[lower]%, [upper]%]
**Statistically significant:** Yes / No
**Practically significant:** Yes / No (MDE was [X]%)
---
## Guardrail Metrics
| Metric | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric] | [val] | [val] | [change] | OK / Warning |
---
## Recommendation
**Decision: [Ship / Investigate / Extend / Stop / Don't Ship]**
**Rationale:** [2–3 sentences explaining the decision]
**Next steps:**
1. [action]
2. [action]
---
## Learnings
- [What we learned from this test, regardless of outcome]
- [How this informs future experiments]
| Pitfall | Why It's Wrong | Correct Approach |
|---|---|---|
| Peeking at results daily | Inflates false positive rate | Wait for planned duration and sample size |
| Calling it at p=0.06 | "Almost significant" isn't significant | Set the threshold before the test, stick to it |
| Ignoring guardrails | Winning on one metric while losing on another | Always check guardrails before shipping |
| Post-hoc segmentation | Finding "it worked for mobile users!" after the fact is data mining | Pre-register segments or treat as hypothesis for next test |
| Running too many variants | Each variant needs full sample size | Limit to 1–2 variants per test |
| Not learning from losses | "It didn't work" is not a learning | Document WHY it didn't work and what to try next |
| Avoid | Why | Instead |
|---|---|---|
| "Directional win" | Not a statistical standard | Require p < 0.05 and lift > MDE |
| Shipping without guardrail check | May degrade critical metrics | Always check before shipping |
| Ending early because it "looks good" | Sequential testing bias | Run to planned duration |
| Not documenting learnings | Same failed experiments get repeated | Maintain an experiment log |