Hypothesis-driven fault injection (pod kill, network partition, CPU/memory/disk/DNS/time stress). Steady-state baseline + blast radius (1% → 5% → 25%) + abort condition + drill log + learning loop. Game-day organization.
```
npx claudepluginhub resultakak/argos --plugin argos
```

This skill uses the workspace's default tool permissions.
Mandates invoking relevant skills via tools before any response in coding sessions. Covers access, priorities, and adaptations for Claude Code, Copilot CLI, Gemini CLI.
`agents/shared/severity-rubric.md` and `agents/shared/escalation-matrix.md` are treated as default-loaded (agents/coordination.md §11). This skill's output must follow the Critical / High / Medium / Low + evidence format; speculative Critical findings are forbidden. Findings outside this skill's ownership are delegated to the relevant agent; if a decision exceeds the authority threshold, user approval is required.
Steady state + experiment + tolerance:
## Hypothesis: Random Pod Kill

**Steady state**:
- error_rate < 0.1%
- latency p99 < 500ms
- availability 99.9%

**Experiment**: random pod kill (max 1 pod / 5 min) on the `api-svc` deployment.

**Hypothesis**:
- error_rate spike < 1% during kill window
- recovery time < 30 sec (new pod ready)
- no customer-facing 5xx beyond 5 sec

**Blast radius**: staging cluster, 100% traffic (full traffic is acceptable in staging).

**Abort condition**:
- error_rate > 5%
- latency p99 > 2 sec
- recovery > 2 min

**Duration**: 1 hour (12 experiments @ 5 min intervals).
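The abort conditions above can be checked mechanically during the drill. A minimal sketch (metric names and sample values are hypothetical; in practice they would come from the monitoring stack):

```python
# Minimal abort-condition gate for the pod-kill experiment above.
# Thresholds mirror the "Abort condition" list: error_rate > 5%,
# p99 > 2 s, recovery > 2 min.

ABORT_THRESHOLDS = {
    "error_rate": 0.05,       # abort if > 5%
    "latency_p99_sec": 2.0,   # abort if > 2 s
    "recovery_sec": 120.0,    # abort if > 2 min
}

def should_abort(metrics: dict) -> list:
    """Return the thresholds that were breached (empty list = continue)."""
    return [name for name, limit in ABORT_THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

# Hypothetical healthy sample: nothing breached, experiment continues.
breaches = should_abort({"error_rate": 0.004,
                         "latency_p99_sec": 0.38,
                         "recovery_sec": 25})
print(breaches)  # []
```

Wiring this into the drill loop lets the abort be automatic rather than a human watching dashboards.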
| Stack | Example command | Fault types |
|---|---|---|
| K8s + Chaos Mesh | `kubectl apply -f chaos.yaml` | Pod/Network/Stress/IO/DNS/Time chaos |
| K8s + Litmus | `litmusctl create experiment` | OpenChaos library |
| AWS | AWS FIS (Fault Injection Simulator) | EC2/RDS/EKS faults |
| Network | Toxiproxy | proxy-layer faults (also usable in integration tests) |
| Manual | `kubectl delete pod` | quick drill (simplest option) |
Preferred tool: Chaos Mesh (K8s-native, CRD-driven, open source).
Pod kill (`chaos/experiments/pod-kill-staging.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-api-svc
  namespace: staging
spec:
  action: pod-kill
  mode: one            # one | all | fixed | fixed-percent | random-max-percent
  duration: "30s"
  selector:
    namespaces: [staging]
    labelSelectors:
      app: api-svc
  scheduler:
    cron: "@every 5m"
```
Network latency (`chaos/experiments/network-latency.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-provider-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces: [staging]
    labelSelectors:
      app: api-svc
  delay:
    latency: "500ms"
    jitter: "100ms"
    correlation: "25"
  target:
    mode: all          # target needs its own mode alongside the selector
    selector:
      namespaces: [staging]
      labelSelectors:
        app: payment-provider-mock
  duration: "10m"
```
CPU stress (`chaos/experiments/cpu-stress.yaml`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api-svc
spec:
  mode: one
  selector:
    namespaces: [staging]
    labelSelectors:
      app: api-svc
  stressors:
    cpu:
      workers: 4
      load: 80   # 80% CPU
  duration: "5m"
```
Before the game day, measure "normal" metrics for 10 minutes:
```promql
# error rate
sum(rate(http_requests_total{service="api-svc",code=~"5.."}[1m]))
  / sum(rate(http_requests_total{service="api-svc"}[1m]))

# latency p99
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api-svc"}[1m]))
)

# pod count (kube-state-metrics exposes phase on kube_pod_status_phase, not kube_pod_info)
sum(kube_pod_status_phase{namespace="staging",pod=~"api-svc.*",phase="Running"})
```
Record the baseline at docs/chaos/<experiment>/baseline-YYYY-MM-DD.json.
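Recording the baseline can be a tiny script. A sketch following the docs/chaos/<experiment>/baseline-YYYY-MM-DD.json convention (the metric values shown are hypothetical; in practice they come from the PromQL queries above):

```python
# Persist a steady-state baseline as JSON under docs/chaos/<experiment>/.
import datetime
import json
import pathlib

def write_baseline(experiment: str, metrics: dict,
                   root: str = "docs/chaos") -> pathlib.Path:
    """Write metrics to <root>/<experiment>/baseline-YYYY-MM-DD.json."""
    day = datetime.date.today().isoformat()
    path = pathlib.Path(root) / experiment / f"baseline-{day}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))
    return path

# Hypothetical values matching the queries above.
p = write_baseline("api-svc-pod-kill",
                   {"error_rate": 0.0004, "latency_p99_sec": 0.38, "pod_count": 5})
print(p)
```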
Ramp production chaos gradually (blast radius 1% → 5% → 25%). Each phase keeps its abort condition active; roll back automatically when a metric crosses its threshold.
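The phased ramp with an abort gate between phases can be sketched as follows (the phase percentages follow the 1% → 5% → 25% progression; the 5% error-rate gate mirrors the abort condition, and the observer callback is a hypothetical stand-in for a real metric query):

```python
# Gradual blast-radius ramp: stop at the first phase whose observed
# error rate breaches the gate, and do not proceed further.

PHASES = [1, 5, 25]  # percent of pods targeted per phase

def run_ramp(observe_error_rate, gate: float = 0.05) -> list:
    """Run phases in order; return the phases completed before any breach."""
    completed = []
    for pct in PHASES:
        if observe_error_rate(pct) > gate:
            break  # automatic rollback: abort the ramp here
        completed.append(pct)
    return completed

# Hypothetical observations: the 25% phase pushes error rate past the gate.
print(run_ramp(lambda pct: 0.002 if pct < 25 else 0.09))  # [1, 5]
```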
Pre-game (T-7 days): announce in the #chaos-day-YYYY-MM-DD channel.

Game day (T-0):

Post-game (T+1 day):
```yaml
# Chaos Mesh Schedule: one random pod kill per day
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-pod-kill-production
spec:
  schedule: "0 14 * * 1-5"   # weekdays 14:00 UTC (on-call well staffed)
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: fixed-percent
    value: "5"               # 5% of pods (1 pod per deployment)
    selector:
      namespaces: [production]
      labelSelectors:
        chaos: enabled       # opt-in label; critical services excluded
```
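Before enabling the Schedule, it is worth verifying that no critical service carries the opt-in label. A sketch (the service inventory is hypothetical; real data would come from the cluster API):

```python
# Guard: no service flagged critical may carry the chaos opt-in label.
services = [
    {"name": "api-svc", "labels": {"chaos": "enabled"}, "critical": False},
    {"name": "payments", "labels": {}, "critical": True},
]

def mislabeled(svcs: list) -> list:
    """Return names of critical services that opted in to chaos."""
    return [s["name"] for s in svcs
            if s["critical"] and s["labels"].get("chaos") == "enabled"]

print(mislabeled(services))  # [] -> safe to enable the Schedule
```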
Continuous chaos setup:
- Opt-in via the `chaos: enabled` label; critical services are excluded.
- Every execution is logged to the #chaos-alerts channel.

Every chaos finding:
- Write runbooks/<scenario>.md (delegate capacity work to /capacity-plan).
- Re-run the same chaos after 3 months; if the same finding recurs, the learning loop is broken.
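The learning-loop check above can be made explicit by diffing finding sets between runs (the finding IDs below are hypothetical):

```python
# Learning-loop check: any finding from the previous drill that recurs
# in the 3-month re-run means the loop is broken for that finding.

def recurring_findings(previous: set, rerun: set) -> set:
    """Findings present in both the original drill and the re-run."""
    return previous & rerun

prev = {"hpa-scale-up-lag", "missing-pdb"}
rerun = {"missing-pdb"}
print(recurring_findings(prev, rerun))  # {'missing-pdb'} -> loop broken here
```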
`chaos: enabled` label; opting out must be explicit.

User: /chaos-drill api-svc
Agent (chaos-engineer):
1. Hypothesis: pod kill → SLO deviation < 1%, MTTR < 30 sec.
2. Blast radius: staging, 100%; 12 experiments over 1 hour.
3. Abort: error_rate > 5%, p99 > 2s, MTTR > 2 min.
4. Chaos Mesh YAML: chaos/experiments/pod-kill-staging.yaml.
5. Stakeholder + on-call ping (Slack #engineering, T-7 days).
6. Game day 2026-05-20 09:00 UTC.
7. Steady-state baseline: error 0.04%, p99 380ms, 5 pods ready.
8. Experiment run: 12 pod kills (5 min intervals).
9. Result: 11/12 within SLO; 1 experiment spiked p99 to 1.2s (HPA scale-up lag).
10. Critical finding: HPA stabilizationWindowSeconds 60 → 0 (faster scale-up).
11. Medium finding: api-svc lacks a pod-disruption-budget (PDB).
12. Action: 2 issues opened (#14001 HPA tune, #14002 add PDB).
13. Drill log: docs/chaos/api-svc-pod-kill/drill-log-2026-05-20.md.
14. Re-run scheduled in 3 months (2026-08-20).
# Chaos Drill: <experiment-name>
## Hypothesis
- Steady state + experiment + expected + abort
## Tooling
- Chaos Mesh CRD YAML
## Game Day
- Pre-game checklist
- Day-of timeline
- Post-game drill log
## Findings (Critical/High/Medium/Low)
## Action Items
| P | Action | Owner | Due | Issue |
## Re-Run
- Re-run date 3 months after the drill