Skill

experimentation

A/B testing and feature flag discipline: hypothesis design; MDE, power, and sample size calculation; SRM checks; guardrail metrics; ramps protected by sequential testing; a decision rubric (ship/kill/iterate); flag lifecycle (release/experiment/ops/permission); and stale flag cleanup.

npx claudepluginhub resultakak/argos --plugin argos

Tool Access

This skill uses the workspace's default tool permissions.


Last Commit: May 11, 2026


Experimentation

Shared Doctrine

agents/shared/severity-rubric.md and agents/shared/escalation-matrix.md are treated as loaded by default (agents/coordination.md §11). This skill's output must use the Critical / High / Medium / Low + evidence format; speculative Critical findings are forbidden. Findings outside this skill's ownership are delegated to the relevant agent; if a decision exceeds the authority threshold, user approval is required.

Philosophy

Hypothesis-first. Fix the metric, MDE, power, and duration before the experiment starts.
Guardrails are mandatory. Catch silent regressions while the primary metric is winning.
Pre-register. This guards against HARKing; commit docs/experiments/<id>.md.
Reversible. Every flag has a kill switch and an auto-rollback guardrail.
Cleanup. A stale flag is tech debt; clean up within 14 days of ship/kill.

When to Use

Measuring conversion impact before shipping a new feature
Pricing / UX decisions
ML model A/B tests
Onboarding flow changes
Flag inventory + cleanup sprint
A/A test (production randomization sanity check)

Workflow

1) Hypothesis design

**Hypothesis**: Changing the checkout button color (green → orange) increases conversion.
**Primary**: checkout_completion_rate (binary, baseline 12.3%)
**Secondary**: AOV, time_to_checkout
**Guardrail**: p99_latency (< 500ms), error_rate (< 0.5%), refund_rate (< 2%)
**MDE**: +2% relative (12.3% → 12.55%)
**Power**: 80%, alpha 5% two-sided
**Sample size**: ~271,000 / variant
**Duration**: 14 days (to cover weekly seasonality)
**Unit**: user_id (cross-device consistency)

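2) Sample size calculation

# binary metric
import math
from statistics import NormalDist

def sample_size_binary(p, mde, alpha=0.05, power=0.8):
    # n per variant, two-sided; p is the baseline rate, mde is ABSOLUTE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha 5%
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power 80%
    sigma2 = p * (1 - p)                           # Bernoulli variance at baseline
    n = (z_alpha + z_beta)**2 * 2 * sigma2 / mde**2
    return math.ceil(n)

# p=0.123, mde=0.0025 absolute → ~271,000 per variant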
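
The hypothesis block states the MDE as a relative lift, while the formula above takes an absolute difference, so the conversion is worth spelling out. The sketch below does that conversion and then checks how long enrollment would take. `days_needed` and the `daily_users=50_000` traffic figure are hypothetical, added only for illustration; the formula is the same normal-approximation one used above.

```python
import math
from statistics import NormalDist

def sample_size_binary(p, mde_abs, alpha=0.05, power=0.8):
    """Per-variant n for a two-sided test; mde_abs is an absolute difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / mde_abs ** 2)

def days_needed(n_per_variant, daily_users, variants=2):
    """Days to fill every arm if all daily_users are enrolled and split evenly."""
    return math.ceil(n_per_variant * variants / daily_users)

baseline = 0.123
mde_abs = baseline * 0.02                  # +2% relative is about 0.0025 absolute
n = sample_size_binary(baseline, mde_abs)
days = days_needed(n, daily_users=50_000)  # 50_000 is a made-up traffic figure
print(f"mde_abs={mde_abs:.5f}  n_per_variant={n:,}  days={days}")
```

A small relative MDE on a 12.3% baseline needs several hundred thousand users per variant. If the computed duration exceeds the planned one, the usual levers are a larger MDE, a longer run, or variance reduction. Exposure below 100% during a ramp lengthens enrollment further.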

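CUPED variance reduction: adjusting for a pre-experiment covariate (pre_period_metric) can cut the required sample size by roughly 30-50%, depending on how strongly the covariate correlates with the metric.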
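
A minimal sketch of the CUPED adjustment, assuming one pre-period value per user. The function name `cuped_adjust` and the synthetic data are illustrative, not part of any platform's API. In a real analysis theta would be estimated once on the pooled data from both variants and then applied to each arm.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# Synthetic data: the in-experiment metric correlates with the pre-period metric.
rng = random.Random(0)
pre_period_metric = [rng.gauss(100, 20) for _ in range(2000)]
metric = [0.8 * x + rng.gauss(0, 10) for x in pre_period_metric]

adjusted = cuped_adjust(metric, pre_period_metric)
reduction = 1 - variance(adjusted) / variance(metric)
print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean unchanged and shrinks the variance by the squared correlation between covariate and metric. The synthetic data here is strongly correlated on purpose, so the reduction comes out larger than is typical in production.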

3) Pre-register

# docs/experiments/2026-05-checkout-button-color.md

**Status**: Pre-registered 2026-05-09
**Owner**: @growth-team
**Author**: @ali
**Reviewers**: @sre @data @product

## Hypothesis
[the block above]

## Decision rule (pre-spec)
- Primary p < 0.05 + lift > +1% + guardrail OK → Ship
- Primary p < 0.05 + lift in [0, 1%] → Iterate
- Primary p >= 0.05 + CI spans [-MDE, +MDE] → Inconclusive
- Guardrail red (in either variant) → Kill immediately

## Subgroup analysis (pre-spec)
- iOS vs Android (device)
- New vs returning (segment)
- No cherry-picking: subgroup analyses that were not pre-specified are **not run**.

4) Choose tooling / platform

| Platform | Strength | Weakness |
| --- | --- | --- |
| GrowthBook | OSS, Bayesian + frequentist, SQL-native | Requires self-hosting |
| LaunchDarkly | Enterprise, built-in SRM dashboard | Cost |
| Statsig | Generous free tier, frequentist + Bayesian analysis | Vendor lock-in |
| Unleash | OSS, basic feature flags | Weak statistics engine |
| In-house | Full control | Only pays off at >50 experiments/year with a dedicated team |

Plugin preference: GrowthBook or LaunchDarkly.

5) Flag manifest

# .flags/checkout-button-color.yaml
key: checkout-button-color
type: experiment       # release | experiment | ops | permission | customer
created: 2026-05-09
owner: "@growth-team"
hypothesis: "docs/experiments/2026-05-checkout-button-color.md"
cleanup_deadline: 2026-06-15
kill_switch: true
default_value: control
variants:
  - { key: control,   weight: 50, description: "green button" }
  - { key: treatment, weight: 50, description: "orange button" }
targeting:
  rules:
    - { attribute: country, op: in, values: [TR, US, DE] }

CI lint: if the code references a flag, its manifest file must exist; otherwise the PR is rejected.
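
One way such a lint could look, as a sketch. It assumes flags are read through a call shaped like `isFlagEnabled("<key>")` (the name used in the cleanup step's `rg` command) and that manifests live at `.flags/<key>.yaml`. The helper names `referenced_flags` and `missing_manifests` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed call shape: isFlagEnabled("<key>") or isFlagEnabled('<key>')
FLAG_CALL = re.compile(r"""isFlagEnabled\(\s*["']([\w-]+)["']""")

def referenced_flags(src_root):
    """Collect every flag key referenced anywhere under src_root."""
    keys = set()
    for path in Path(src_root).rglob("*"):
        if path.is_file():
            keys.update(FLAG_CALL.findall(path.read_text(errors="ignore")))
    return keys

def missing_manifests(src_root, flags_dir):
    """Return referenced flag keys that have no <key>.yaml manifest."""
    flags_dir = Path(flags_dir)
    return sorted(k for k in referenced_flags(src_root)
                  if not (flags_dir / f"{k}.yaml").is_file())

# Demo on a throwaway tree: one flag has a manifest, one does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / ".flags").mkdir()
    (root / "src" / "checkout.ts").write_text(
        'if (isFlagEnabled("checkout-button-color")) {}\n'
        "if (isFlagEnabled('new-onboarding')) {}\n"
    )
    (root / ".flags" / "checkout-button-color.yaml").write_text(
        "key: checkout-button-color\n"
    )
    missing = missing_manifests(root / "src", root / ".flags")

print(missing)
```

In CI the job would call `missing_manifests("src", ".flags")` and fail the build when the list is non-empty. Flag keys built dynamically at runtime would slip past a regex like this.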

6) Ramp & abort

ramp:
  - { day: 1,  exposure_pct: 1,   guardrail_strict: true,  abort_p99_ms: 600 }
  - { day: 3,  exposure_pct: 5,   guardrail_strict: true,  abort_p99_ms: 550 }
  - { day: 5,  exposure_pct: 25,  guardrail_strict: false, abort_p99_ms: 550 }
  - { day: 7,  exposure_pct: 50,  guardrail_strict: false }
  - { day: 14, exposure_pct: 100, analysis: true }
abort_rules:
  - srm_chi_square_p_lt: 0.001
  - error_rate_gt: 0.5%
  - revenue_decline_gt: 1%
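
The abort rules above could be evaluated at each ramp step roughly as follows. This is a sketch, not a platform API: the `metrics` dictionary keys (`srm_p_value`, `error_rate`, `revenue_change`, `p99_ms`) are assumed names, and percentages from the YAML are written as fractions.

```python
# Thresholds mirror the abort_rules block; percentages become fractions.
DEFAULT_ABORT_RULES = {
    "srm_chi_square_p_lt": 0.001,
    "error_rate_gt": 0.005,
    "revenue_decline_gt": 0.01,
}

def abort_reasons(metrics, rules=DEFAULT_ABORT_RULES, abort_p99_ms=None):
    """Return the names of all abort rules the current metrics violate."""
    reasons = []
    if metrics["srm_p_value"] < rules["srm_chi_square_p_lt"]:
        reasons.append("srm")
    if metrics["error_rate"] > rules["error_rate_gt"]:
        reasons.append("error_rate")
    # revenue_change is treatment vs control; a decline is a negative number
    if -metrics["revenue_change"] > rules["revenue_decline_gt"]:
        reasons.append("revenue_decline")
    # abort_p99_ms comes from the current ramp step, when that step defines one
    if abort_p99_ms is not None and metrics["p99_ms"] > abort_p99_ms:
        reasons.append("p99_latency")
    return reasons

day3 = {"srm_p_value": 0.40, "error_rate": 0.002,
        "revenue_change": -0.003, "p99_ms": 580}
print(abort_reasons(day3, abort_p99_ms=550))
```

Any non-empty result would flip the kill switch. Returning every violated rule, rather than only the first, keeps the incident note complete.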

7) SRM check (daily)

-- daily exposure ratio
with daily as (
  select
    exposure_date::date as d,
    variant,
    count(*) as n
  from experiment_exposure
  where experiment_id = 'checkout-button-color'
    and exposure_date >= '2026-05-09'
  group by 1, 2
)
select
  d,
  sum(case when variant = 'control'   then n end) as control,
  sum(case when variant = 'treatment' then n end) as treatment,
  -- z-test assuming a 50/50 split; z squared equals the 1-df chi-square statistic
  (sum(case when variant = 'control'   then n end) -
   sum(case when variant = 'treatment' then n end))::float
   / sqrt(sum(n)) as z_score
from daily
group by d
order by d;
-- |z_score| > 3.29 → p < 0.001 → ABORT
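
The same check can be run outside SQL. The sketch below generalizes the z-score to an arbitrary expected split and converts it to a two-sided p-value using only the standard library. The function name `srm_check` and its parameters are illustrative.

```python
import math

def srm_check(control_n, treatment_n, expected_control_share=0.5, threshold=0.001):
    """Return (z, p_value, abort) for a sample ratio mismatch test."""
    total = control_n + treatment_n
    expected = total * expected_control_share
    sd = math.sqrt(total * expected_control_share * (1 - expected_control_share))
    z = (control_n - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value, p_value < threshold

z, p, abort = srm_check(50_000, 51_200)
print(f"z={z:.2f}  p={p:.5f}  abort={abort}")
```

For a 50/50 split this z reduces to (control - treatment) / sqrt(total), the expression in the query above. The counts should be distinct randomization units (user_id here), not raw exposure events.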

8) Analysis

# frequentist t-test (continuous metric)
from scipy import stats
t, p = stats.ttest_ind(treatment_aov, control_aov, equal_var=False)

# binary metric — proportion z-test
from statsmodels.stats.proportion import proportions_ztest
z, p = proportions_ztest(
    [treatment_conv, control_conv],
    [treatment_n, control_n]
)

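# confidence interval for the difference in means (Welch, matching equal_var=False)
import statsmodels.stats.api as sms
cm = sms.CompareMeans(sms.DescrStatsW(treatment_aov), sms.DescrStatsW(control_aov))
ci = cm.tconfint_diff(alpha=0.05, usevar="unequal")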

Bayesian alternative (native in GrowthBook): the posterior probability that treatment beats control.
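
To make that quantity concrete, here is a minimal Monte Carlo sketch for a binary metric. It assumes independent uniform Beta(1, 1) priors on each conversion rate. It is not GrowthBook's engine, whose priors and computation may differ, and the counts are made up.

```python
import random

def prob_treatment_beats_control(c_conv, c_n, t_conv, t_n, draws=20_000, seed=0):
    """Estimate P(rate_treatment > rate_control) from Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior under a Beta(1, 1) prior: Beta(1 + successes, 1 + failures)
        p_control = rng.betavariate(1 + c_conv, 1 + c_n - c_conv)
        p_treatment = rng.betavariate(1 + t_conv, 1 + t_n - t_conv)
        wins += p_treatment > p_control
    return wins / draws

prob = prob_treatment_beats_control(c_conv=1230, c_n=10_000, t_conv=1350, t_n=10_000)
print(f"P(treatment > control) = {prob:.3f}")
```

A decision threshold on this probability (for example 95%) would need to be pre-registered just like the frequentist rule.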

9) Decision rubric (pre-specified output)

| Primary lift | p-value | Guardrail | Decision |
| --- | --- | --- | --- |
| +2.1% | 0.003 | OK | Ship (flag to 100%, cleanup PR) |
| +0.3% | 0.04 | OK | Iterate (effect below MDE; revise the hypothesis) |
| -0.5% | 0.07 | OK | Kill (negative direction + non-significant) |
| +1.8% | 0.001 | error_rate +15% | Kill (guardrail) |
| flat | 0.45 | OK | Inconclusive (CI spans [-MDE, +MDE]; extend or kill) |
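
Encoding the rubric as a function keeps the decision mechanical. The sketch below follows the pre-registered rule (p < 0.05, ship above +1% lift, guardrail overrides everything) and checks it against the five rows above. Two branches are assumptions, since the pre-registered rule does not spell them out: a significant negative lift maps to kill, and a non-significant negative lift maps to kill as in the third row.

```python
def decide(lift, p_value, guardrail_ok, alpha=0.05, ship_lift=0.01):
    """Map an experiment result to ship / iterate / kill / inconclusive.

    lift is the relative lift on the primary metric as a fraction (0.021 = +2.1%).
    """
    if not guardrail_ok:
        return "kill"                      # guardrail overrides the primary metric
    if p_value < alpha:
        if lift > ship_lift:
            return "ship"
        if lift >= 0:
            return "iterate"
        return "kill"                      # assumption: significant and negative
    # assumption: non-significant and negative follows the third table row
    return "kill" if lift < 0 else "inconclusive"

rows = [
    (0.021, 0.003, True),
    (0.003, 0.04, True),
    (-0.005, 0.07, True),
    (0.018, 0.001, False),
    (0.0, 0.45, True),
]
decisions = [decide(*row) for row in rows]
print(decisions)
```

The rubric does not define how close to zero counts as "flat", so a team adopting this would want to pre-specify a tolerance band.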

10) Cleanup

Within 14 days after ship:

# code cleanup
rg "isFlagEnabled.*checkout-button-color" src/ --files-with-matches
# remove every reference and make the treatment code the default

Within 14 days after kill:

# delete the flag manifest, clean up the code
rm .flags/checkout-button-color.yaml

Stale flag inventory (quarterly):

select key, type, created, owner, cleanup_deadline
from feature_flags
where status = 'active'
  and cleanup_deadline < current_date
  and type in ('release', 'experiment');

Checklist

Antipattern

No pre-registration: HARKing.
No power analysis: underpowered experiment.
No guardrails: silent regressions.
No SRM check.
Sequential peeking: violates the fixed horizon.
Subgroup cherry-picking without pre-specification.
No flag cleanup: tech debt.
Release and experiment flags mixed together.
No A/A test.
No customer harm review (dark patterns, pricing manipulation).
Ignoring the novelty effect (data from the first 3 days is skewed).
Ignoring network effects (social, marketplace, B2B → cluster randomization).

Example Agent Behavior

User: /experiment-design checkout button color
Agent (experiment-designer):
1. Hypothesis: changing the checkout button from green to orange lifts conversion by +2% relative.
2. Primary checkout_completion_rate, baseline 12.3%.
3. Guardrail p99 < 500ms, error < 0.5%, refund < 2%.
4. MDE 0.0025 absolute, power 80% → n ≈ 271,000 / variant.
5. Duration 14 days (weekly seasonality).
6. Pre-register: docs/experiments/2026-05-checkout-button-color.md.
7. Flag manifest: .flags/checkout-button-color.yaml.
8. Ramp 1/5/25/50/100 + abort on SRM + p99 + error.
9. Daily SRM SQL on a Looker dashboard.
10. Analysis at T+14: frequentist + Bayesian; pre-specified decision rubric.
11. Cleanup deadline 2026-06-15 if shipped; within 14 days if killed.

Output Format

# Experiment Design: <id>

## Hypothesis
- Primary + secondary + guardrail metric
- MDE + power + sample size + duration

## Pre-register (`docs/experiments/<id>.md`)

## Flag manifest (`.flags/<key>.yaml`)

## Ramp + abort

## SRM check (daily SQL)

## Decision rubric (pre-spec)

## Cleanup deadline