Hypothesis-driven fault injection (pod kill, network partition, CPU/memory/disk/DNS/time stress). Steady-state baseline + blast radius (1% → 5% → 25%) + abort condition + drill log + learning loop. Game-day organization.
```
npx claudepluginhub resultakak/argos --plugin argos
```

This skill uses the workspace's default tool permissions.
Mandates invoking relevant skills via tools before any response in coding sessions. Covers access, priorities, and adaptations for Claude Code, Copilot CLI, Gemini CLI.
`agents/shared/severity-rubric.md` and `agents/shared/escalation-matrix.md` are treated as default-loaded (agents/coordination.md §11). This skill's output must follow the Critical / High / Medium / Low + evidence format; speculative Critical findings are forbidden. Findings outside this skill's ownership are delegated to the relevant agent; if a decision exceeds the authority threshold, user approval is required.
Steady state + experiment + tolerance:
## Hypothesis: Random Pod Kill

**Steady state**:
- error_rate < 0.1%
- latency p99 < 500ms
- availability 99.9%

**Experiment**: random pod kill (max 1 pod / 5 min) on the `api-svc` deployment.

**Hypothesis**:
- error_rate spike < 1% during kill window
- recovery time < 30 sec (new pod ready)
- no customer-facing 5xx beyond 5 sec

**Blast radius**: staging cluster, 100% traffic (full traffic is acceptable in staging).

**Abort condition**:
- error_rate > 5%
- latency p99 > 2 sec
- recovery > 2 min

**Duration**: 1 hour (12 experiments @ 5 min intervals).
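The abort conditions above can be checked mechanically during the drill. A minimal sketch (metric names and sample values are hypothetical; in practice they would come from the monitoring stack):

```python
# Minimal abort-condition gate for the pod-kill experiment above.
# Thresholds mirror the "Abort condition" list: error_rate > 5%,
# p99 > 2 s, recovery > 2 min.

ABORT_THRESHOLDS = {
    "error_rate": 0.05,       # abort if > 5%
    "latency_p99_sec": 2.0,   # abort if > 2 s
    "recovery_sec": 120.0,    # abort if > 2 min
}

def should_abort(metrics: dict) -> list:
    """Return the thresholds that were breached (empty list = continue)."""
    return [name for name, limit in ABORT_THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

# Hypothetical healthy sample: nothing breached, experiment continues.
breaches = should_abort({"error_rate": 0.004,
                         "latency_p99_sec": 0.38,
                         "recovery_sec": 25})
print(breaches)  # []
```

Wiring this into the drill loop lets the abort be automatic rather than a human watching dashboards.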
| Stack | Example command | Fault types |
|---|---|---|
| K8s + Chaos Mesh | `kubectl apply -f chaos.yaml` | Pod/Network/Stress/IO/DNS/Time chaos |
| K8s + Litmus | `litmusctl create experiment` | OpenChaos library |
| AWS | AWS FIS (Fault Injection Simulator) | EC2/RDS/EKS faults |
| Network | Toxiproxy | proxy-layer faults (also usable in integration tests) |
| Manual | `kubectl delete pod` | quick drill (simplest option) |
Preferred tool: Chaos Mesh (K8s-native, CRD-driven, open source).
Pod kill (`chaos/experiments/pod-kill-staging.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-api-svc
  namespace: staging
spec:
  action: pod-kill
  mode: one            # one | all | fixed | fixed-percent | random-max-percent
  duration: "30s"
  selector:
    namespaces: [staging]
    labelSelectors:
      app: api-svc
  scheduler:
    cron: "@every 5m"
```
Network latency (`chaos/experiments/network-latency.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-provider-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces: [staging]
    labelSelectors:
      app: api-svc
  delay:
    latency: "500ms"
    jitter: "100ms"
    correlation: "25"
  target:
    mode: all          # target needs its own mode alongside the selector
    selector:
      namespaces: [staging]
      labelSelectors:
        app: payment-provider-mock
  duration: "10m"
```
CPU stress (`chaos/experiments/cpu-stress.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api-svc
spec:
  mode: one
  selector:
    namespaces: [staging]
    labelSelectors:
      app: api-svc
  stressors:
    cpu:
      workers: 4
      load: 80   # 80% CPU
  duration: "5m"
```
Before the game day, measure "normal" metrics for 10 minutes:
```promql
# error rate
sum(rate(http_requests_total{service="api-svc",code=~"5.."}[1m]))
  / sum(rate(http_requests_total{service="api-svc"}[1m]))

# latency p99
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api-svc"}[1m]))
)

# pod count (kube-state-metrics exposes phase on kube_pod_status_phase, not kube_pod_info)
sum(kube_pod_status_phase{namespace="staging",pod=~"api-svc.*",phase="Running"})
```
Record the baseline at docs/chaos/<experiment>/baseline-YYYY-MM-DD.json.
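Recording the baseline can be a tiny script. A sketch following the docs/chaos/<experiment>/baseline-YYYY-MM-DD.json convention (the metric values shown are hypothetical; in practice they come from the PromQL queries above):

```python
# Persist a steady-state baseline as JSON under docs/chaos/<experiment>/.
import datetime
import json
import pathlib

def write_baseline(experiment: str, metrics: dict,
                   root: str = "docs/chaos") -> pathlib.Path:
    """Write metrics to <root>/<experiment>/baseline-YYYY-MM-DD.json."""
    day = datetime.date.today().isoformat()
    path = pathlib.Path(root) / experiment / f"baseline-{day}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))
    return path

# Hypothetical values matching the queries above.
p = write_baseline("api-svc-pod-kill",
                   {"error_rate": 0.0004, "latency_p99_sec": 0.38, "pod_count": 5})
print(p)
```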
Ramp production chaos gradually (blast radius 1% → 5% → 25%). Each phase keeps its abort condition active; roll back automatically when a metric crosses its threshold.
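The phased ramp with an abort gate between phases can be sketched as follows (the phase percentages follow the 1% → 5% → 25% progression; the 5% error-rate gate mirrors the abort condition, and the observer callback is a hypothetical stand-in for a real metric query):

```python
# Gradual blast-radius ramp: stop at the first phase whose observed
# error rate breaches the gate, and do not proceed further.

PHASES = [1, 5, 25]  # percent of pods targeted per phase

def run_ramp(observe_error_rate, gate: float = 0.05) -> list:
    """Run phases in order; return the phases completed before any breach."""
    completed = []
    for pct in PHASES:
        if observe_error_rate(pct) > gate:
            break  # automatic rollback: abort the ramp here
        completed.append(pct)
    return completed

# Hypothetical observations: the 25% phase pushes error rate past the gate.
print(run_ramp(lambda pct: 0.002 if pct < 25 else 0.09))  # [1, 5]
```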
Pre-game (T-7 days): announce in the #chaos-day-YYYY-MM-DD channel.

Game day (T-0):

Post-game (T+1 day):
```yaml
# Chaos Mesh Schedule: one random pod kill per day
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-kill-production
spec:
  schedule: "0 14 * * 1-5"   # weekdays 14:00 UTC (on-call well staffed)
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: fixed-percent
    value: "5"               # 5% of pods (1 pod per deployment)
    selector:
      namespaces: [production]
      labelSelectors:
        chaos: enabled       # opt-in label; critical services excluded
```
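Before enabling the Schedule, it is worth verifying that no critical service carries the opt-in label. A sketch (the service inventory is hypothetical; real data would come from the cluster API):

```python
# Guard: no service flagged critical may carry the chaos opt-in label.
services = [
    {"name": "api-svc", "labels": {"chaos": "enabled"}, "critical": False},
    {"name": "payments", "labels": {}, "critical": True},
]

def mislabeled(svcs: list) -> list:
    """Return names of critical services that opted in to chaos."""
    return [s["name"] for s in svcs
            if s["critical"] and s["labels"].get("chaos") == "enabled"]

print(mislabeled(services))  # [] -> safe to enable the Schedule
```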
Continuous chaos setup:
- Opt-in via the `chaos: enabled` label; critical services are excluded.
- Every execution is logged to the #chaos-alerts channel.

Every chaos finding:
- Write runbooks/<scenario>.md (delegate capacity work to /capacity-plan).
- Re-run the same chaos after 3 months; if the same finding recurs, the learning loop is broken.
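The learning-loop check above can be made explicit by diffing finding sets between runs (the finding IDs below are hypothetical):

```python
# Learning-loop check: any finding from the previous drill that recurs
# in the 3-month re-run means the loop is broken for that finding.

def recurring_findings(previous: set, rerun: set) -> set:
    """Findings present in both the original drill and the re-run."""
    return previous & rerun

prev = {"hpa-scale-up-lag", "missing-pdb"}
rerun = {"missing-pdb"}
print(recurring_findings(prev, rerun))  # {'missing-pdb'} -> loop broken here
```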
`chaos: enabled` label; opting out must be explicit.

User: /chaos-drill api-svc
Agent (chaos-engineer):
1. Hypothesis: pod kill → SLO deviation < 1%, MTTR < 30 sec.
2. Blast radius: staging, 100%; 12 experiments over 1 hour.
3. Abort: error_rate > 5%, p99 > 2s, MTTR > 2 min.
4. Chaos Mesh YAML: chaos/experiments/pod-kill-staging.yaml.
5. Stakeholder + on-call ping (Slack #engineering, T-7 days).
6. Game day 2026-05-20 09:00 UTC.
7. Steady-state baseline: error 0.04%, p99 380ms, 5 pods ready.
8. Experiment run: 12 pod kills (5 min intervals).
9. Result: 11/12 within SLO; 1 experiment spiked p99 to 1.2s (HPA scale-up lag).
10. Critical finding: HPA stabilizationWindowSeconds 60 → 0 (faster scale-up).
11. Medium finding: api-svc lacks a pod-disruption-budget (PDB).
12. Action: 2 issues opened (#14001 HPA tune, #14002 add PDB).
13. Drill log: docs/chaos/api-svc-pod-kill/drill-log-2026-05-20.md.
14. Re-run scheduled in 3 months (2026-08-20).
# Chaos Drill: <experiment-name>
## Hypothesis
- Steady state + experiment + expected + abort
## Tooling
- Chaos Mesh CRD YAML
## Game Day
- Pre-game checklist
- Day-of timeline
- Post-game drill log
## Findings (Critical/High/Medium/Low)
## Action Items
| P | Action | Owner | Due | Issue |
## Re-Run
- Re-run date 3 months after the drill