A/B testing and feature flag discipline: hypothesis design, MDE + power + sample size calculation, SRM check, guardrail metrics, a ramp protected by sequential testing, a decision rubric (ship/kill/iterate), the flag lifecycle (release/experiment/ops/permission), and stale flag cleanup.
npx claudepluginhub resultakak/argos --plugin argos
This skill uses the workspace's default tool permissions.
`agents/shared/severity-rubric.md` and `agents/shared/escalation-matrix.md`
are loaded by default (agents/coordination.md §11). This skill's output
must use the Critical / High / Medium / Low scale together with the evidence format; a speculative
Critical is forbidden. Findings outside this skill's ownership are delegated to the relevant agent; if the
decision-authority threshold is exceeded, user approval is mandatory.
Pre-register by committing docs/experiments/<id>.md.
**Hypothesis**: The new checkout button color (green → orange) increases conversion.
**Primary**: checkout_completion_rate (binary, baseline 12.3%)
**Secondary**: AOV, time_to_checkout
**Guardrail**: p99_latency (< 500ms), error_rate (< 0.5%), refund_rate (< 2%)
**MDE**: +2% relative (12.3% → 12.55%)
**Power**: 80%, alpha 5% two-sided
**Sample size**: ~271,000 / variant (from the formula below with p = 0.123 and an absolute MDE of 0.0025)
**Duration**: 14 days (to cover weekly seasonality)
**Unit**: user_id (cross-device consistency)
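# binary metric
import math
from statistics import NormalDist

def sample_size_binary(p, mde, alpha=0.05, power=0.8):
    # n per variant, two-sided; mde is an absolute difference in proportions
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha 5%
    z_beta = NormalDist().inv_cdf(power)  # ≈ 0.84 for power 80%
    sigma2 = p * (1 - p)  # Bernoulli variance at the baseline rate
    n = (z_alpha + z_beta) ** 2 * 2 * sigma2 / mde ** 2
    return math.ceil(n)

# p=0.123, mde=0.0025 absolute → ~271,000 per variant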
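For reference, the function above is the usual normal-approximation sample size for comparing two proportions. The notation is an assumption introduced here: $\delta$ is the absolute MDE and $p$ the baseline rate. The worked line uses the rounded quantiles 1.96 and 0.84.

```latex
n_{\text{per variant}} \;=\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\, p\,(1-p)}{\delta^{2}}

% Worked with p = 0.123, \delta = 0.0025, z_{0.975} \approx 1.96, z_{0.80} \approx 0.84:
n \;\approx\; \frac{2 \times (1.96 + 0.84)^{2} \times 0.123 \times 0.877}{0.0025^{2}}
  \;=\; \frac{2 \times 7.84 \times 0.107871}{6.25 \times 10^{-6}}
  \;\approx\; 2.7 \times 10^{5}
```

Because $n$ scales with $1/\delta^{2}$, halving the MDE quadruples the required sample.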
CUPED variance reduction: adjusting the metric with a pre-experiment covariate (pre_period_metric) can reduce the required sample size by 30-50%.
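A minimal sketch of the CUPED adjustment, assuming one pre-period value per user. The function names `cuped_adjust` and `sample_variance` are illustrative and not part of the plugin. In practice theta would be estimated on the pooled data from both arms, and the actual reduction depends on how strongly the covariate correlates with the metric.

```python
from statistics import fmean


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user; x: pre-experiment covariate per user.
    theta = cov(x, y) / var(x), which minimises the variance of the result.
    """
    if len(y) != len(x) or len(y) < 2:
        raise ValueError("y and x must have the same length (at least 2)")
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return list(y)  # constant covariate: nothing to adjust
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


def sample_variance(values):
    """Unbiased sample variance."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Hypothetical data: pre-period and in-experiment values for six users.
pre_period_metric = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]
in_experiment = [11.0, 13.0, 9.0, 16.0, 12.0, 13.0]
adjusted = cuped_adjust(in_experiment, pre_period_metric)
```

The adjustment leaves the mean unchanged and never increases the variance, so the usual tests can be run on `adjusted` in place of the raw metric.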
# docs/experiments/2026-05-checkout-button-color.md
**Status**: Pre-registered 2026-05-09
**Owner**: @growth-team
**Author**: @ali
**Reviewers**: @sre @data @product
## Hypothesis
[the block above]
## Decision rule (pre-spec)
- Primary p < 0.05 + lift > +1% + guardrail OK → Ship
- Primary p < 0.05 + lift in [0, 1%] → Iterate
- Primary p >= 0.05 + CI covers [-MDE, +MDE] → Inconclusive
- Guardrail red (in either arm) → Kill immediately
## Subgroup analysis (pre-spec)
- iOS vs Android (device)
- New vs returning (segment)
- No cherry-picking: subgroup analyses that were not pre-specified are **not run**.
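The pre-specified decision rule above can be encoded so that the outcome is mechanical rather than argued after the fact. This is a sketch under stated assumptions: `decide` is a hypothetical helper, lift, CI bounds and MDE are all expressed as relative lifts (0.01 means 1%), and "CI covers [-MDE, +MDE]" is read as the interval containing both endpoints. Cases the rule does not mention return "Unspecified" instead of guessing.

```python
def decide(p_value, lift, guardrail_ok, ci_low, ci_high, mde,
           alpha=0.05, ship_lift=0.01):
    """Apply the pre-registered rule to one experiment result.

    Assumption: lift, ci_low, ci_high and mde are relative lifts (0.01 == 1%).
    """
    if not guardrail_ok:
        return "Kill"  # a guardrail breach overrides everything else
    if p_value < alpha and lift > ship_lift:
        return "Ship"
    if p_value < alpha and 0 <= lift <= ship_lift:
        return "Iterate"
    if p_value >= alpha and ci_low <= -mde and ci_high >= mde:
        return "Inconclusive"
    return "Unspecified"  # not covered by the pre-spec rule; needs a human call


# Hypothetical result: +2.1% lift, p = 0.003, guardrails healthy.
outcome = decide(p_value=0.003, lift=0.021, guardrail_ok=True,
                 ci_low=0.008, ci_high=0.034, mde=0.02)
```

Returning an explicit "Unspecified" makes gaps in the pre-registration visible, for example a significant negative lift.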
| Platform | Strength | Weakness |
|---|---|---|
| GrowthBook | OSS, bayesian + frequentist, SQL-native | Requires self-hosting |
| LaunchDarkly | Enterprise, SRM dashboard built-in | Cost |
| Statsig | Generous free tier, bayesian default | Vendor lock-in |
| Unleash | OSS basic feature flag | Weak statistics engine |
| In-house | Full control | Requires >50 experiments/year + a dedicated team |
Plugin preference: GrowthBook or LaunchDarkly.
# .flags/checkout-button-color.yaml
key: checkout-button-color
type: experiment # release | experiment | ops | permission | customer
created: 2026-05-09
owner: "@growth-team"
hypothesis: "docs/experiments/2026-05-checkout-button-color.md"
cleanup_deadline: 2026-06-15
kill_switch: true
default_value: control
variants:
  - { key: control, weight: 50, description: "green button" }
  - { key: treatment, weight: 50, description: "orange button" }
targeting:
  rules:
    - { attribute: country, op: in, values: [TR, US, DE] }
CI lint: if code references a flag, its manifest file is required; otherwise the PR is rejected.
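One way the lint could work is sketched below. It assumes flags are read through a call shaped like `isFlagEnabled("key")` with a string literal; the function name comes from the cleanup command later in this document, but the call shape is an assumption. `missing_manifests` is a hypothetical helper that takes file contents as input so that it stays independent of the repository layout.

```python
import re

# Assumed call shape: isFlagEnabled("some-flag-key") or isFlagEnabled('some-flag-key')
FLAG_CALL = re.compile(r"""isFlagEnabled\(\s*["']([\w-]+)["']""")


def missing_manifests(sources, manifest_keys):
    """Return flag keys referenced in code that have no manifest, sorted.

    sources: mapping of file path -> file contents.
    manifest_keys: keys that have a .flags/<key>.yaml manifest.
    """
    referenced = set()
    for text in sources.values():
        referenced.update(FLAG_CALL.findall(text))
    return sorted(referenced - set(manifest_keys))


# Hypothetical inputs; a real CI job would read these from the checkout.
sources = {
    "src/checkout.ts": 'if (isFlagEnabled("checkout-button-color")) { render(); }',
    "src/search.ts": "const on = isFlagEnabled('new-search');",
}
missing = missing_manifests(sources, {"checkout-button-color"})
```

A CI step would fail the PR whenever `missing` is non-empty and print the offending keys.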
ramp:
  - { day: 1, exposure_pct: 1, guardrail_strict: true, abort_p99_ms: 600 }
  - { day: 3, exposure_pct: 5, guardrail_strict: true, abort_p99_ms: 550 }
  - { day: 5, exposure_pct: 25, guardrail_strict: false, abort_p99_ms: 550 }
  - { day: 7, exposure_pct: 50, guardrail_strict: false }
  - { day: 14, exposure_pct: 100, analysis: true }
abort_rules:
  - srm_chi_square_p_lt: 0.001
  - error_rate_gt: 0.5%
  - revenue_decline_gt: 1%
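The ramp and abort rules above could be evaluated at each step roughly as follows. This is a sketch, not the plugin's implementation. The metric names in `metrics` are assumptions, the percentage thresholds are rewritten as fractions (0.5% becomes 0.005), and the per-step `abort_p99_ms` is treated as optional because the later ramp steps omit it.

```python
# Thresholds from abort_rules, with percentages expressed as fractions.
ABORT_RULES = {
    "srm_chi_square_p_lt": 0.001,
    "error_rate_gt": 0.005,
    "revenue_decline_gt": 0.01,
}


def triggered_aborts(step, metrics, rules=ABORT_RULES):
    """Return the names of the abort conditions that fire for one ramp step."""
    reasons = []
    if metrics["srm_p_value"] < rules["srm_chi_square_p_lt"]:
        reasons.append("srm")
    if metrics["error_rate"] > rules["error_rate_gt"]:
        reasons.append("error_rate")
    if metrics["revenue_decline"] > rules["revenue_decline_gt"]:
        reasons.append("revenue_decline")
    p99_limit = step.get("abort_p99_ms")  # absent on later ramp steps
    if p99_limit is not None and metrics["p99_ms"] > p99_limit:
        reasons.append("p99_latency")
    return reasons


# Hypothetical day-1 observation.
day1 = {"day": 1, "exposure_pct": 1, "guardrail_strict": True, "abort_p99_ms": 600}
observed = {"srm_p_value": 0.4, "error_rate": 0.008, "revenue_decline": 0.0, "p99_ms": 620}
reasons = triggered_aborts(day1, observed)
```

Any non-empty result would stop the ramp at the current exposure and flip the kill switch.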
-- daily exposure ratio
with daily as (
  select
    exposure_date::date as d,
    variant,
    count(*) as n
  from experiment_exposure
  where experiment_id = 'checkout-button-color'
    and exposure_date >= '2026-05-09'
  group by 1, 2
)
select
  d,
  sum(case when variant = 'control' then n end) as control,
  sum(case when variant = 'treatment' then n end) as treatment,
  -- z-score for a 50/50 split (z^2 equals the 1-df chi-square statistic)
  (sum(case when variant = 'control' then n end) -
   sum(case when variant = 'treatment' then n end))::float
    / sqrt(sum(n)) as z_score
from daily
group by d
order by d;
-- |z_score| > 3.29 → p < 0.001 (two-sided) → ABORT
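The same SRM check can be run outside SQL. The sketch below computes the chi-square goodness-of-fit p-value for two arms using only the standard library; `srm_p_value` is an illustrative name. For a 50/50 split the chi-square statistic equals the square of the z-score in the query above, so the two checks agree.

```python
import math


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square (1 degree of freedom) p-value for a sample ratio mismatch."""
    total = control_n + treatment_n
    if total == 0:
        raise ValueError("no exposures recorded")
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical daily counts: 5,000 control vs 5,400 treatment exposures.
p_srm = srm_p_value(5000, 5400)
should_abort = p_srm < 0.001
```

The `expected_control_share` parameter allows the check to be reused for unequal allocations, which the z-score shortcut in the SQL does not cover.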
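# frequentist Welch t-test (continuous metric, e.g. AOV)
from scipy import stats
t, p_aov = stats.ttest_ind(treatment_aov, control_aov, equal_var=False)

# binary metric: two-sample proportion z-test
from statsmodels.stats.proportion import proportions_ztest
z, p_conv = proportions_ztest(
    [treatment_conv, control_conv],  # conversions per arm
    [treatment_n, control_n],  # exposed users per arm
)

# 95% confidence interval for the difference in means (treatment - control)
import statsmodels.stats.api as sms
ci = sms.CompareMeans(
    sms.DescrStatsW(treatment_aov), sms.DescrStatsW(control_aov)
).tconfint_diff(alpha=0.05, usevar="unequal")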
Bayesian alternative (native in GrowthBook): report the posterior probability that treatment > control.
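To illustrate what "posterior probability that treatment > control" means for a binary metric, here is a small Monte Carlo sketch with Beta posteriors. It is not GrowthBook's engine. The flat Beta(1, 1) prior, the draw count and the function name are assumptions made for the example.

```python
import random


def prob_treatment_beats_control(control_conv, control_n,
                                 treatment_conv, treatment_n,
                                 draws=20000, seed=0):
    """Estimate P(p_treatment > p_control) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate: Beta(1 + successes, 1 + failures).
        p_c = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        p_t = rng.betavariate(1 + treatment_conv, 1 + treatment_n - treatment_conv)
        wins += p_t > p_c
    return wins / draws


# Hypothetical counts: 12.3% vs 13.3% conversion on 10,000 users per arm.
prob = prob_treatment_beats_control(1230, 10000, 1330, 10000)
```

A pre-registered threshold on this probability (for example 95%) would play the role that p < 0.05 plays in the frequentist rubric.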
| Primary lift | p-value | Guardrail | Decision |
|---|---|---|---|
| +2.1% | 0.003 | OK | Ship (flag 100%, cleanup PR) |
| +0.3% | 0.04 | OK | Iterate (effect below MDE; revise the hypothesis) |
| -0.5% | 0.07 | OK | Kill (negative direction + non-significant) |
| +1.8% | 0.001 | error_rate +15% | Kill (guardrail) |
| flat | 0.45 | OK | Inconclusive (CI covers [-MDE, +MDE]; extend or kill) |
Within 14 days after Ship:
# code cleanup
rg "isFlagEnabled.*checkout-button-color" src/ --files-with-matches
# remove every reference + make the treatment code the default
Within 14 days after Kill:
# delete the flag manifest, clean up the code
rm .flags/checkout-button-color.yaml
Stale flag inventory (quarterly):
select key, type, created, owner, cleanup_deadline
from feature_flags
where status = 'active'
and cleanup_deadline < current_date
and type in ('release', 'experiment');
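Where flags live only as manifests in `.flags/` and not in a `feature_flags` table, the same inventory can be approximated from the parsed manifests. The sketch assumes each manifest has already been loaded into a dict with the `key`, `type` and `cleanup_deadline` fields shown earlier. Because the manifest has no `status` field, an existing manifest is treated as active; that is an assumption.

```python
from datetime import date


def stale_flags(manifests, today):
    """Return keys of release/experiment flags whose cleanup deadline has passed."""
    return sorted(
        m["key"]
        for m in manifests
        if m["type"] in {"release", "experiment"}
        and m["cleanup_deadline"] < today
    )


# Hypothetical parsed manifests (YAML loading is left out of the sketch).
manifests = [
    {"key": "checkout-button-color", "type": "experiment",
     "cleanup_deadline": date(2026, 6, 15)},
    {"key": "maintenance-banner", "type": "ops",
     "cleanup_deadline": date(2026, 1, 1)},
]
overdue = stale_flags(manifests, today=date(2026, 7, 1))
```

Ops and permission flags are excluded on purpose, mirroring the `type in ('release', 'experiment')` filter in the query.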
docs/experiments/).flags/)
User: /experiment-design checkout button color
Agent (experiment-designer):
1. Hypothesis: green → orange checkout button conversion +2% relative.
2. Primary checkout_completion_rate, baseline 12.3%.
3. Guardrail p99 < 500ms, error < 0.5%, refund < 2%.
4. MDE 0.0025 absolute, power 80% → n ≈ 271,000 / variant.
5. Duration 14 days (weekly seasonality).
6. Pre-register: docs/experiments/2026-05-checkout-button-color.md.
7. Flag manifest: .flags/checkout-button-color.yaml.
8. Ramp 1/5/25/50/100 + abort SRM + p99 + error.
9. SRM daily SQL Looker dashboard.
10. Analysis T+14: frequentist + bayesian; pre-spec decision rubric.
11. Cleanup deadline 2026-06-15 if shipped; 14 days if killed.
# Experiment Design: <id>
## Hypothesis
- Primary + secondary + guardrail metric
- MDE + power + sample size + duration
## Pre-register (`docs/experiments/<id>.md`)
## Flag manifest (`.flags/<key>.yaml`)
## Ramp + abort
## SRM check (daily SQL)
## Decision rubric (pre-spec)
## Cleanup deadline