A/B testing, experimentation, statistical analysis. Triggers on: /godmode:experiment, "A/B test", "split test", "statistical significance", "Optimizely", "Statsig", "GrowthBook".
From arbazkhan971/godmode via claudepluginhub. This skill uses the workspace's default tool permissions.
/godmode:experiment

# Detect experimentation SDK
grep -l "statsig\|optimizely\|growthbook\|launchdarkly" \
package.json pyproject.toml 2>/dev/null
# Check for existing experiment configs
find . -name "*experiment*" -o -name "*ab_test*" \
| grep -v node_modules | head -10
EXPERIMENT DISCOVERY:
Baseline metric: <name> = <current value>
Traffic: <daily users hitting the surface>
Platform: none | Statsig | Optimizely | GrowthBook
Risk: high (revenue) | medium (growth) | low (minor UX)
IF no platform: recommend Statsig (best free tier)
IF baseline unknown: measure for 7 days first
IF traffic < 1000/day: experiment may take months
EXPERIMENT DESIGN:
Name: <experiment-slug>
Hypothesis: If we <change>, then <metric> will
<improve> by <magnitude> because <reasoning>.
METRICS:
| Type | Metric | Baseline | Target |
|-----------|-------------|----------|--------|
| Primary | <OEC> | <val> | +<X>% |
| Guardrail | latency p95 | <val> | < +10% |
| Guardrail | error rate | <val> | < +0.1%|
| Guardrail | revenue | <val> | > -2% |
| Secondary | <metric> | <val> | +<X>% |
POWER ANALYSIS:
Alpha: 0.05 (5% false positive rate)
Power: 0.80 (80% detection probability)
Baseline rate: <current conversion — e.g., 3.2%>
MDE: <smallest meaningful change — e.g., 10% rel>
Variants: <2 for A/B, N for multivariate>
THRESHOLDS:
Minimum power: 0.80
Maximum alpha: 0.05
Minimum MDE: the smallest change the business would act on
Minimum duration: 7 days (avoid day-of-week bias)
IF sample size > 30 days of traffic: increase MDE
or increase traffic allocation
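The power analysis above can be sketched as a per-arm sample-size calculation for two proportions using the normal approximation (stdlib only; the 3.2% baseline and 10% relative MDE are the example values from the template, not measured data):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, rel_mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect a relative lift of rel_mde."""
    p1 = baseline
    p2 = baseline * (1 + rel_mde)                  # smallest effect worth detecting
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# 3.2% baseline, 10% relative MDE -> roughly 50k users per arm,
# which is why < 1000/day traffic makes experiments take months.
n = sample_size_per_arm(0.032, 0.10)
```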
DETERMINISTIC HASHING (recommended):
hash(user_id + experiment_id) % 10000
0-4999 = Control, 5000-9999 = Treatment
RULES:
Same user always gets same variant
No storage required — computed from hash
Works across client and server
Increasing % adds users, never flips existing ones
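A minimal sketch of the hashing scheme above. Note that Python's built-in `hash()` is salted per process, so a stable digest such as SHA-256 is needed to keep assignment sticky across runs and across client/server:

```python
import hashlib

BUCKETS = 10000

def bucket(user_id: str, experiment_id: str) -> int:
    # Deterministic across processes and languages; built-in hash()
    # is randomized per run, so use SHA-256 instead.
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def assign(user_id: str, experiment_id: str) -> str:
    # 0-4999 = control, 5000-9999 = treatment (50/50 split)
    return "control" if bucket(user_id, experiment_id) < 5000 else "treatment"
```

Because the bucket is a pure function of `(user_id, experiment_id)`, no assignment storage is needed, and widening a variant's bucket range only adds users without flipping existing ones.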
DECISION RULES (frequentist):
SHIP: p < 0.05 AND effect >= MDE
AND no guardrail regressions
KILL: p < 0.05 AND effect negative
WAIT: p >= 0.05 AND sample not reached
INCONCLUSIVE: sample reached AND p >= 0.05
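The p-value feeding these decision rules can be computed with a standard two-proportion z-test (a sketch, stdlib only; the counts below are illustrative):

```python
import math

def two_proportion_pvalue(conv_c: int, n_c: int,
                          conv_t: int, n_t: int) -> float:
    """Two-sided p-value for a difference in conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)      # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    return math.erfc(abs(z) / math.sqrt(2))       # 2 * (1 - Phi(|z|))
```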
WHEN to use Bayesian:
Need probability of being better (not just p-value)
Want continuous monitoring without peeking penalty
Business prefers "95% chance B is better" language
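The "95% chance B is better" number can be estimated with a Beta-Binomial Monte Carlo simulation (a sketch assuming uniform Beta(1,1) priors; platforms like Statsig use their own machinery):

```python
import random

def prob_treatment_better(conv_c: int, n_c: int, conv_t: int, n_t: int,
                          draws: int = 20000, seed: int = 0) -> float:
    """P(treatment rate > control rate) under Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        wins += p_t > p_c
    return wins / draws
```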
EXPERIMENT RESULTS:
Duration: <start> — <end>
Participants: <N> (Control: <N>, Treatment: <N>)
SRM CHECK:
Expected: 50/50, Actual: <actual>/<actual>
Chi-squared p: <value>
IF p < 0.01: SRM detected — STOP, do not interpret
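The SRM chi-squared check can be sketched as a one-degree-of-freedom goodness-of-fit test (stdlib only):

```python
import math

def srm_pvalue(control_n: int, treatment_n: int,
               expected_ratio: float = 0.5) -> float:
    """Chi-squared test (1 dof) for sample-ratio mismatch."""
    total = control_n + treatment_n
    exp_c = total * expected_ratio
    exp_t = total - exp_c
    chi2 = ((control_n - exp_c) ** 2 / exp_c
            + (treatment_n - exp_t) ** 2 / exp_t)
    return math.erfc(math.sqrt(chi2 / 2))  # chi2 survival function, 1 dof
```

A 5200/4800 split on 10,000 users already fails the p < 0.01 threshold, so even a 2% imbalance at this scale means the assignment or logging pipeline is broken.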
PRIMARY METRIC:
| Variant | Value | vs Control | p-value | Sig? |
|-----------|-------|-----------|---------|------|
| Control | <val> | — | — | — |
| Treatment | <val> | +<X>% | <p> | Y/N |
GUARDRAILS:
latency p95: <PASS|FAIL> (< +10% threshold)
error rate: <PASS|FAIL> (< +0.1% threshold)
revenue: <PASS|FAIL> (> -2% threshold)
LIFECYCLE:
IDEA → DESIGN → REVIEW → QUEUED → RAMPING →
LIVE → ANALYZING → DECIDED → CLEANUP
RAMP SCHEDULE:
5% → 25% → 50% → 100% (hold 7+ days each)
CLEANUP (within 2 weeks of decision):
Remove losing variant code
Delete feature flag
Archive experiment config
PRE-LAUNCH CHECKLIST:
| Check | Status |
|-----------------------------------|--------|
| Hypothesis specific & falsifiable | ? |
| Primary metric (OEC) defined | ? |
| Guardrails defined | ? |
| Sample size achievable | ? |
| Power >= 0.80 | ? |
| Assignment deterministic & sticky | ? |
| Exposure logged once per user | ? |
| Control unchanged | ? |
| Mutual exclusion for conflicts | ? |
Commit: "experiment: <name> — <N> variants, <metric>, <statistical method>"
Never ask to continue. Loop autonomously until done.
1. SDK: statsig, optimizely, growthbook, launchdarkly
2. Analytics: Amplitude, Mixpanel, Segment
3. Existing: experiment configs, assignment logic
Print: Experiment: {name} — {status}. Split: {control}%/{treatment}%. p-value: {p}. Lift: {lift}%. Verdict: {verdict}.
timestamp experiment metric p_value lift status
KEEP if: validated, implemented, metrics confirmed
DISCARD if: failed validation OR tests broke
OR guardrails tripped
STOP when ALL of:
- All 9 pre-launch checks pass
- Exposure logging fires once per user
- Sample size documented with MDE, alpha, power
- Guardrails defined and monitored
- Cleanup plan scheduled