Designs A/B tests and product experiments with hypothesis structuring, sample size calculation, duration estimation, and result interpretation. Unified framework for both product and marketing experiments.
Designs statistically sound experiments so you make decisions based on data, not opinions. Especially critical when traffic is low and every experiment counts.
When this skill activates:

| Trigger | Behavior |
|---|---|
| Testing a product change | Full experiment design |
| "A/B test", "experiment" | Interactive experiment builder |
| Interpreting test results | Statistical analysis guide |
Experiment design document template:

# Experiment: [Name]
## Hypothesis
> If we [change X], then [metric Y] will [improve/decrease] by [Z%],
> because [rationale based on user behavior/data].
## Design
| Field | Value |
|-------|-------|
| **Type** | A/B / A/B/n / Multivariate |
| **Primary Metric** | [e.g., conversion rate] |
| **Guardrail Metrics** | [metrics that should NOT worsen] |
| **Baseline** | [current value of primary metric] |
| **MDE** | [Minimum Detectable Effect, e.g., 5%] |
| **Significance** | 95% (α = 0.05) |
| **Power** | 80% (β = 0.20) |
## Sample Size & Duration
| Variant | Traffic Split | Required Sample | Est. Duration |
|---------|--------------|----------------|---------------|
| Control (A) | 50% | [N] | [days] |
| Variant (B) | 50% | [N] | [days] |
## Variants
### Control (A)
[Current experience — no changes]
### Variant (B)
[Specific change being tested]
## Success Criteria
- **Win**: Primary metric improves by ≥ MDE AND guardrails stable
- **Lose**: Primary metric does not improve OR guardrails degrade
- **Inconclusive**: Not enough data (extend or abandon)
## Risks & Mitigations
- [Risk]: [Mitigation]
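The success criteria above can be expressed as a simple decision rule. This is a minimal sketch, not part of the skill itself; the function name and inputs (`primary_lift`, `mde`, the significance and guardrail flags) are hypothetical:

```python
def decide(primary_lift: float, primary_significant: bool,
           mde: float, guardrails_degraded: bool,
           enough_data: bool) -> str:
    """Map experiment outcomes to a Win / Lose / Inconclusive call.

    primary_lift: relative change in the primary metric (0.06 = +6%).
    mde: the minimum detectable effect the test was powered for.
    """
    if not enough_data:
        return "Inconclusive"  # extend the experiment or abandon it
    if primary_significant and primary_lift >= mde and not guardrails_degraded:
        return "Win"
    return "Lose"

# Hypothetical outcome: +6% lift, significant, guardrails stable, powered for 5% MDE
print(decide(primary_lift=0.06, primary_significant=True,
             mde=0.05, guardrails_degraded=False, enough_data=True))  # Win
```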
For conversion rate experiments (95% significance, 80% power):
| Baseline Rate | MDE 5% | MDE 10% | MDE 20% |
|---|---|---|---|
| 1% | 380K | 95K | 24K |
| 5% | 72K | 18K | 4.6K |
| 10% | 34K | 8.6K | 2.2K |
| 20% | 16K | 3.9K | 1K |
| 50% | 3.1K | 780 | 200 |
Sample sizes are per variant; total sample required = N × number of variants.
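A sample-size and duration estimate can be sketched with the standard unpooled two-proportion formula (two-sided α = 0.05, power 0.80). Note that different calculators use different variance approximations, so results can differ noticeably from the quick-reference table above; for a real experiment, use a dedicated power calculator. The baseline, MDE, and traffic figures in the example are hypothetical:

```python
import math

# z-scores for two-sided alpha = 0.05 and power = 0.80
Z_ALPHA = 1.959964  # z at alpha/2 = 0.025
Z_BETA = 0.841621   # z at beta = 0.20

def sample_size_per_variant(baseline: float, relative_mde: float) -> int:
    """Two-proportion sample size, unpooled variance approximation."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    delta = p2 - p1
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (Z_ALPHA + Z_BETA) ** 2 * variance / delta ** 2
    return math.ceil(n)

def duration_days(n_per_variant: int, variants: int, daily_traffic: int) -> int:
    """Days needed to collect the total sample at a given daily traffic."""
    return math.ceil(n_per_variant * variants / daily_traffic)

# Hypothetical inputs: 5% baseline conversion, 10% relative MDE, 2,000 users/day
n = sample_size_per_variant(baseline=0.05, relative_mde=0.10)
print(n, duration_days(n, variants=2, daily_traffic=2000))
```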
When you don't have enough users for traditional A/B testing:
| Strategy | When | How |
|---|---|---|
| Accept larger MDE | >20% effect expected | Reduces sample size |
| Sequential testing | Need early stopping | Bayesian approach |
| Before/after | No traffic for split | Compare time periods |
| Qualitative | <100 users | User interviews + usability tests |
| Fake door test | Testing demand | Measure clicks on non-existent feature |
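For the sequential-testing row, one common Bayesian read-out is the posterior probability that the variant beats control, which can be monitored as data arrives and works at small sample sizes. A minimal Monte Carlo sketch under Beta(1, 1) priors follows; the conversion counts in the example are made up, and any stopping threshold (e.g., 95%) is a judgment call, not a rule this skill prescribes:

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a binomial rate with a uniform prior is Beta(1+s, 1+f)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Hypothetical small-sample data: 48/400 vs 70/400 conversions
print(round(prob_b_beats_a(48, 400, 70, 400), 3))
```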
Results report template:

## Experiment Results: [Name]
| Metric | Control | Variant | Δ | p-value | Significant? |
|--------|---------|---------|---|---------|-------------|
| [Primary] | [N] | [N] | [+/-]% | [p] | Yes/No |
| [Guardrail] | [N] | [N] | [+/-]% | [p] | - |
## Decision: [Ship / Don't Ship / Iterate]
**Rationale**: [Why this decision based on data]
## Learnings
- [What we learned about user behavior]
- [Implications for future experiments]
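The p-value column in the results table can be filled in with a standard pooled two-proportion z-test. A minimal sketch, with made-up counts; for small samples or many variants, prefer an exact test or a correction for multiple comparisons:

```python
import math

def two_proportion_pvalue(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal: erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical results: 5.0% vs 5.9% conversion on 10,000 users per arm
p = two_proportion_pvalue(conv_a=500, n_a=10_000, conv_b=590, n_b=10_000)
print(round(p, 4))  # compare against alpha = 0.05
```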
Tool permissions:

| Tool | Purpose |
|---|---|
| Write | Generate experiment design documents |
| Read | Reference existing experiment history |
Will:
- [In-scope behaviors]
Will Not:
- [Out-of-scope behaviors]