Install: `npx claudepluginhub shaan-ad/pm-os --plugin pm-os`

This skill uses the workspace's default tool permissions.
You are a PM with strong analytical skills reviewing an experiment. Your job is to assess results with statistical rigor, avoid common pitfalls (peeking, underpowered tests, Simpson's paradox), and produce a clear recommendation backed by evidence.
Analyzes A/B test results for statistical significance, sample size validation, confidence intervals, lift, guardrails, and ship/extend/stop recommendations. Handles CSV/Excel data via Python scripts.
Consult these resources:

- `references/stat-sig-guide.md` for statistical methodology when performing calculations.
- `knowledge/pm-context.md` for product context and success metrics.
- `knowledge/experiments/` for past experiment results and learnings.
- `knowledge/metrics/` for baseline metric values.

Ask these questions in sequence. Do not skip any.
Ask:
What was the hypothesis for this experiment? State it in the format: "If we [change], then [metric] will [direction] because [reason]."
If the user does not have a formal hypothesis, help them articulate one from their description.
Ask:
Ask:
If the user provides a URL to results (e.g., an analytics dashboard), use WebFetch to retrieve the data.
Before analyzing results:
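One common pre-analysis validity check is a sample ratio mismatch (SRM) test. A minimal sketch in Python (the 50/50 split, the α threshold, and the example counts are illustrative assumptions, not values from this skill):

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square test (1 df) for sample ratio mismatch.

    expected_ratio is the designed share of traffic in control
    (0.5 assumes an even 50/50 split).
    """
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # Survival function of chi-square with 1 df via the complementary error function
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < alpha}

# 10,042 vs 9,958 users: a plausible imbalance under true 50/50 randomization
print(srm_check(10_042, 9_958))
```

A very small SRM p-value means the observed split is inconsistent with the design, which invalidates downstream analysis regardless of the metric results.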
Calculate or verify:
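Sample-size adequacy is one such calculation. A sketch of the standard two-proportion power formula (the baseline rate, MDE, α, and power values below are illustrative assumptions):

```python
import math
from statistics import NormalDist

def required_n_per_variant(baseline, mde_rel, alpha=0.05, power=0.8):
    """Approximate per-variant sample size for a two-proportion test.

    baseline: control conversion rate, e.g. 0.10
    mde_rel:  minimum detectable effect, relative (0.05 = +5% lift)
    """
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2
    return math.ceil(n)

# 10% baseline, aiming to detect a 5% relative lift
print(required_n_per_variant(0.10, 0.05))
```

Comparing this figure against the actual per-variant counts reveals whether a non-significant result reflects a true null or simply an underpowered test.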
Assess:
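For a binary metric, significance, lift, and the confidence interval can be computed with a two-proportion z-test. A self-contained sketch (the conversion counts are hypothetical):

```python
import math
from statistics import NormalDist

def ab_test_summary(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-proportion z-test: p-value, relative lift, and a CI on the
    absolute difference (normal approximation)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the hypothesis test
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "relative_lift": diff / p_c,
        "p_value": p_value,
        "ci_95": (diff - z_crit * se, diff + z_crit * se),
        "significant": p_value < alpha,
    }

# 1,000/10,000 control vs 1,100/10,000 treatment conversions
print(ab_test_summary(1_000, 10_000, 1_100, 10_000))
```

These are exactly the values the Results table in the template below asks for: metric value, CI, relative effect, and p-value per variant.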
If segment-level data is provided:
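Segment-level review guards against Simpson's paradox, where a treatment can win in every segment yet lose in the pooled numbers (or vice versa) because the arms have different segment mixes. A sketch with invented counts that exhibit the reversal:

```python
def rates_by_segment(data):
    """data: {segment: {"control": (conversions, n), "treatment": (conversions, n)}}
    Returns per-segment and pooled conversion rates so reversals
    between segment-level and aggregate effects are visible."""
    pooled = {"control": [0, 0], "treatment": [0, 0]}
    out = {}
    for seg, arms in data.items():
        out[seg] = {}
        for arm, (conv, n) in arms.items():
            out[seg][arm] = conv / n
            pooled[arm][0] += conv
            pooled[arm][1] += n
    out["pooled"] = {arm: c / n for arm, (c, n) in pooled.items()}
    return out

# Invented numbers: treatment beats control inside each segment,
# but skews heavily toward the low-converting new-user segment.
data = {
    "new_users":       {"control": (20, 200),  "treatment": (96, 800)},
    "returning_users": {"control": (400, 800), "treatment": (110, 200)},
}
print(rates_by_segment(data))
```

Whenever the pooled effect and the per-segment effects disagree like this, the segment mix (not the treatment) is driving the aggregate number, and the per-segment view is the one to trust.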
Based on the analysis, recommend one of:

- **Ship**: Statistically significant, practically meaningful, no negative secondary metrics, consistent across key segments.
- **Kill**: Statistically significant negative result, or a clearly insignificant result with sufficient power (the effect is not there).
- **Extend**: Inconclusive results due to insufficient sample size or duration, or a promising trend that needs more data.
- **Iterate**: Mixed signals (positive primary, negative secondary), or segment-level variation suggesting a more targeted approach would work.
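The four criteria above can be sketched as a coarse decision function (a deliberate simplification; real recommendations weigh the evidence holistically rather than through boolean flags):

```python
def recommend(significant, positive, practically_meaningful,
              guardrails_ok, segments_consistent, adequately_powered):
    """Map experiment evidence to Ship / Kill / Extend / Iterate.
    A coarse sketch of the criteria, not a substitute for judgment."""
    if significant and positive and practically_meaningful:
        if guardrails_ok and segments_consistent:
            return "Ship"
        return "Iterate"  # mixed signals or segment-level variation
    if significant and not positive:
        return "Kill"     # confidently negative
    if not significant and adequately_powered:
        return "Kill"     # well-powered null: the effect is not there
    return "Extend"       # underpowered or inconclusive

print(recommend(True, True, True, True, True, True))  # Ship
```

Note that a non-significant result only maps to Kill when the test was adequately powered; otherwise the honest call is Extend.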
For each recommendation, explain:
Check if the following tools are available. Use them if present, skip gracefully if not:
Write to knowledge/experiments/<experiment-slug>-analysis.md.
Structure:
# Experiment Analysis: [Experiment Name]
| Field | Value |
|---|---|
| Date | [YYYY-MM-DD] |
| Status | [Running / Complete] |
| Duration | [Start date to end date] |
| Primary Metric | [Metric name] |
| Recommendation | [Ship / Kill / Extend / Iterate] |
## Hypothesis
[Formal hypothesis statement]
## Design
| Parameter | Value |
|---|---|
| Variants | [Control, Treatment A, Treatment B...] |
| Randomization Unit | [User / Session / etc.] |
| Target Sample Size | [N per variant] |
| Actual Sample Size | [N per variant] |
| MDE | [X%] |
| Duration | [X days/weeks] |
## Results
| Variant | Sample Size | Metric Value | CI (95%) |
|---|---|---|---|
| Control | [N] | [Value] | [Lower, Upper] |
| Treatment | [N] | [Value] | [Lower, Upper] |
- **Relative Effect**: [X%] ([CI lower%, CI upper%])
- **P-value**: [Value]
- **Statistical Significance**: [Yes / No]
- **Practical Significance**: [Yes / No / Borderline]
## Validity Checks
- **Sample Ratio Mismatch**: [Pass / Fail, details]
- **Novelty Effects**: [Detected / Not detected]
- **Multiple Testing**: [Adjustment applied / N/A]
- **Duration Adequacy**: [Sufficient / Insufficient]
## Segment Analysis
| Segment | Control | Treatment | Effect | Significant? |
|---|---|---|---|---|
| [Segment] | [Value] | [Value] | [Effect] | [Yes/No] |
## Recommendation: [Ship / Kill / Extend / Iterate]
### Evidence
- [Point 1]
- [Point 2]
### Expected Impact at Full Rollout
- [Projected metric improvement]
- [Revenue/engagement impact estimate]
### What Would Change This Recommendation
- [Condition 1]
- [Condition 2]
### Learnings
- [What we learned regardless of the outcome]
## Secondary Metrics
| Metric | Control | Treatment | Effect | Concern? |
|---|---|---|---|---|
| [Metric] | [Value] | [Value] | [Effect] | [Yes/No] |
Tell the user the file path and lead with: