From argos
Chaos engineering drill: hypothesis + steady state + fault injection (pod kill, network, CPU/mem/disk/DNS) + abort condition + drill log + learning loop
npx claudepluginhub resultakak/argos --plugin argos

`<service | scenario>`

# /chaos-drill

## Purpose
A postmortem is reactive; a chaos drill is **proactive**. Use hypothesis-driven fault injection to test system resilience **without** an incident.

## When to Use
- Production gate for a new service (resilience drill)
- Organizing a game day (quarterly)
- DR drill (region failover, measure RTO)
- Postmortem action: validate a "we would have caught it if X had been in place" hypothesis
- Setting up a continuous production chaos schedule
- Compliance: SOC 2 CC7.5 (resilience) + ISO 22301 (BCP)

## Input
- `<service>`: e.g. `api-svc`, `checkout-svc`
- or `<scenario>`: `region-failover`, `db-failover`, `cache-flush`
`chaos-engineer` leads: hypothesis + blast radius + abort condition.
Sub-delegates:
- `observability-engineer`: steady state baseline + drill metrics
- `incident-commander`: game day coordination + abort decision
- `production-readiness-reviewer`: production chaos gate
- `runbook-author`: finding → runbook update
- `infrastructure-implementer`: Chaos Mesh CRD YAML
- `load-test-engineer`: combined chaos + load
- `migration-planner`: region failover DR drill

The `chaos-engineering` skill carries the procedure.
Rule: rules/chaos-engineering.md.
Load rules/chaos-engineering.md + rules/observability.md + rules/slo-sli.md + rules/kubernetes.md.
`chaos: enabled` label: opt-out is mandatory.
/chaos-drill api-svc
/chaos-drill --scenario region-failover --region eu-west-1
/chaos-drill --continuous staging # opt-in label active
# Chaos Drill: api-svc Pod Kill
## Hypothesis
- Steady state: error 0.04%, p99 380ms, 5 pods
- Experiment: pod kill (max 1 per 5 min) for 1 hour
- Expected: error spike < 1%, MTTR < 30 s
- Abort: error > 5%, p99 > 2s, MTTR > 2 min
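
The abort condition above can be read as a simple threshold check. The following is a minimal sketch, not the plugin's actual logic: the class and function names are hypothetical, and the way metrics reach the function (a Datadog query, a Prometheus scrape) is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortThresholds:
    """Abort limits copied from the hypothesis; the field names are hypothetical."""
    max_error_rate: float = 0.05      # abort if error rate > 5%
    max_p99_seconds: float = 2.0      # abort if p99 latency > 2 s
    max_mttr_seconds: float = 120.0   # abort if recovery takes > 2 min


def abort_reasons(error_rate, p99_seconds, mttr_seconds, limits=AbortThresholds()):
    """Return the list of breached abort conditions (empty list = keep going)."""
    reasons = []
    if error_rate > limits.max_error_rate:
        reasons.append(f"error rate {error_rate:.2%} exceeds {limits.max_error_rate:.0%}")
    if p99_seconds > limits.max_p99_seconds:
        reasons.append(f"p99 {p99_seconds}s exceeds {limits.max_p99_seconds}s")
    if mttr_seconds > limits.max_mttr_seconds:
        reasons.append(f"MTTR {mttr_seconds}s exceeds {limits.max_mttr_seconds}s")
    return reasons


# Steady-state values from the hypothesis: nothing is breached.
print(abort_reasons(error_rate=0.0004, p99_seconds=0.380, mttr_seconds=0.0))

# A made-up sample that breaches the error threshold only.
print(abort_reasons(error_rate=0.06, p99_seconds=1.0, mttr_seconds=20.0))
```

The comparisons are strict, matching the `>` in the abort line, so a reading exactly at a limit does not abort.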
## Tooling
- Chaos Mesh PodChaos `chaos/experiments/pod-kill-staging.yaml`
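
The contents of `pod-kill-staging.yaml` are not shown here. As a rough illustration only, a Chaos Mesh `PodChaos` object for a single pod kill could be built like this; the resource name, the `staging` namespace, and the `app: api-svc` label selector are all assumptions.

```python
import json


def pod_kill_manifest(service: str, namespace: str = "staging") -> dict:
    """Build a PodChaos object that kills one matching pod.

    The metadata name, namespace, and the `app` label are assumptions,
    not values taken from the real experiment file.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": f"{service}-pod-kill", "namespace": namespace},
        "spec": {
            "action": "pod-kill",   # kill the pod rather than inject a failure
            "mode": "one",          # pick a single pod among the matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": service},
            },
        },
    }


# JSON is valid YAML, so this output can be saved and applied as a manifest.
print(json.dumps(pod_kill_manifest("api-svc"), indent=2))
```

This covers a single kill. Repeating it at most once per 5 minutes for an hour would need a scheduling layer on top (Chaos Mesh has a `Schedule` resource for that), which is omitted here.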
## Game Day (2026-05-20 09:00-12:00 UTC)
- Steady state baseline: 09:00-09:10 OK
- 12 experiment runs (5 min interval)
- 11/12 within SLO
- 1 experiment: pod kill #7, p99 1.2s spike (HPA scale-up lag)
## Findings
- **Critical**: HPA `stabilizationWindowSeconds: 60`, so scale-up is delayed.
  Fix: `scaleUp.stabilizationWindowSeconds: 0` + `policies: [{type: Percent, value: 100, periodSeconds: 15}]`.
- **High**: PDB missing on api-svc, so a voluntary disruption can leave a single pod running.
  Fix: `PodDisruptionBudget minAvailable: 3`.
- **Medium**: No Datadog alert on `pod_restart_count`, so the chaos event passed silently.
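
The two Kubernetes fixes can be sketched as manifest fragments. This is an illustration under assumptions: the `app: api-svc` selector and the PDB name are hypothetical, and only the fields named in the findings are filled in.

```python
def hpa_scale_up_behavior() -> dict:
    """`behavior` fragment for an autoscaling/v2 HPA spec, per the Critical fix."""
    return {
        "behavior": {
            "scaleUp": {
                "stabilizationWindowSeconds": 0,  # react to load immediately
                "policies": [
                    # allow doubling the replica count every 15 seconds
                    {"type": "Percent", "value": 100, "periodSeconds": 15},
                ],
            }
        }
    }


def pdb_manifest(service: str, min_available: int = 3) -> dict:
    """PodDisruptionBudget per the High fix; name and selector label are assumed."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{service}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": service}},
        },
    }


def allowed_voluntary_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be evicted at once without violating the budget."""
    return max(healthy_pods - min_available, 0)


print(pdb_manifest("api-svc"))
print(allowed_voluntary_disruptions(healthy_pods=5, min_available=3))
```

With the 5 pods from the steady state and `minAvailable: 3`, at most 2 pods can be voluntarily evicted at the same time, assuming all 5 are healthy.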
## Action Items
| P | Action | Owner | Due | Issue |
|---|---|---|---|---|
| P0 | HPA stabilizationWindowSeconds 60 → 0 | @platform | 2026-05-21 | #14001 |
| P0 | PDB minAvailable 3 on api-svc | @platform | 2026-05-21 | #14002 |
| P1 | Datadog alert on pod_restart_count | @observability | 2026-05-28 | #14003 |
| P2 | Re-run after 3 months | @sre | 2026-08-20 | #14004 |
## Re-Run Schedule
- 2026-08-20: same experiment + 2 new ones (network latency, CPU stress).