Chaos engineering skill. Activates when user needs to test system resilience through controlled failure injection, validate circuit breakers, plan game days, or verify disaster recovery procedures. Covers network failures, disk pressure, process crashes, dependency outages, and data corruption scenarios. Triggers on: /godmode:chaos, "chaos test", "resilience test", "failure injection", "game day", or when ship skill needs resilience validation.
From godmode (arbazkhan971/godmode). This skill uses the workspace's default tool permissions.
/godmode:chaos

Before injecting failures, establish what "healthy" looks like:
STEADY STATE DEFINITION:
System: <service name / system boundary>
Architecture: <monolith | microservices | serverless>
Health indicators (must all be true for "steady state"):
- Response success rate: > <X>% (e.g., 99.9%)
- Response time P95: < <X>ms (e.g., 500ms)
- Error rate: < <X>% (e.g., 0.1%)
- Queue depth: < <N> messages (e.g., 1000)
- CPU usage: < <X>% (e.g., 80%)
- Memory usage: < <X>% (e.g., 85%)
- Active connections: < <N> (e.g., connection pool max)
...
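The thresholds above can be spot-checked before any injection. A minimal probe sketch, assuming the service exposes an HTTP health endpoint; the endpoint URL and probe count here are placeholders:

```shell
#!/usr/bin/env bash
# Probe the health endpoint N times and report the success rate.
# ENDPOINT is a placeholder -- point it at your service's health check.
ENDPOINT="${ENDPOINT:-http://localhost:8080/health}"
total="${PROBES:-20}"
ok=0
for _ in $(seq "$total"); do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$ENDPOINT" || true)
  if [ "$code" = "200" ]; then
    ok=$((ok + 1))
  fi
done
echo "steady state: $ok/$total probes healthy"
```

Run this once before the experiment and keep it running during injection; a drop below the success-rate threshold is a steady-state violation.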
Map all the ways the system can fail:
FAILURE DOMAIN MAP:
| Category | Components | Impact if Failed |
|--|--|--|
| Network | Load balancer | Total outage |
| | DNS resolution | Total outage |
| | Inter-service network | Partial outage |
| | External API access | Feature degraded |
| Compute | Application process | Service restart |
| | Worker processes | Queue backlog |
| | Cron/scheduled jobs | Delayed tasks |
| | Container/VM host | Service relocation |
| Storage | Primary database | Read/write loss |
IF the experiment crashes the service: halt and roll back immediately. WHEN steady state is violated: record the finding.
Create specific, controlled experiments for each failure domain:
CHAOS EXPERIMENT:
Name: <descriptive name>
Hypothesis: "When <failure condition>, the system will <expected behavior>"
Blast radius: <single request | single user | single service | entire system>
Duration: <how long to inject failure>
Rollback: <how to stop the experiment immediately>
Prerequisites:
- [ ] Steady state verified
- [ ] Monitoring dashboards open
- [ ] Rollback procedure tested
- [ ] Team notified (if production)
- [ ] Incident response team on standby (if production)
...
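The rollback prerequisite can be enforced in code rather than by discipline. A minimal runner sketch with placeholder inject/rollback commands; the `trap` guarantees rollback even if the script itself dies:

```shell
#!/usr/bin/env bash
# Skeleton experiment runner: rollback always runs, even on Ctrl-C or an
# error inside the script. The inject/rollback bodies are placeholders.
set -u

rollback() {
  echo "rolling back injection"
  # e.g. tc qdisc del dev eth0 root netem
}
trap rollback EXIT        # runs on every exit path
trap 'exit 1' INT TERM    # convert signals into an exit so the EXIT trap fires

echo "verifying steady state"
# ... steady-state checks here ...

echo "injecting failure"
# e.g. tc qdisc add dev eth0 root netem delay 5000ms

sleep 2   # experiment duration (shortened for illustration)
status="complete"
echo "experiment $status"
```

Testing the rollback path (prerequisite three) is as simple as hitting Ctrl-C mid-run and confirming the cleanup message appears.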
Experiment N1: Dependency Timeout
Hypothesis: "When the payment API responds slowly (5s+), the checkout
service returns a user-friendly error within 3 seconds and does not
block other requests."
Injection:
# Using tc (traffic control) to add latency
tc qdisc add dev eth0 root netem delay 5000ms
# Or using toxiproxy
toxiproxy-cli toxic add -n latency -t latency \
-a latency=5000 payment-api
...
Experiment N2: DNS Failure
Hypothesis: "When DNS resolution fails, the system falls back to cached data."
Injection:
# Block outbound DNS queries
iptables -A OUTPUT -p udp --dport 53 -j DROP
Verify:
- Cached responses still served
- Clear error messages shown for uncached lookups
Experiment N3: Packet Loss
Hypothesis: "With 10% packet loss, request success rate stays above 95%."
Injection:
# Drop 10% of packets on the primary interface
tc qdisc add dev eth0 root netem loss 10%
Verify:
- Retry logic absorbs the dropped packets
- SLO compliance maintained
- No connection pool exhaustion
Experiment P1: Process Crash
Hypothesis: "When the application process crashes, it restarts within
30 seconds and no requests are dropped (load balancer removes unhealthy
instance)."
Injection:
# Kill application process
kill -9 $(pgrep -f "node server.js")
# Or in Kubernetes
kubectl delete pod <pod-name> --grace-period=0
Verify:
...
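The 30-second restart claim can be measured directly. A sketch, assuming the process is identified by a `pgrep -f` pattern; the pattern and timeout shown in the usage comment are placeholders:

```shell
# Wait until a process matching a pattern is running again, printing the
# elapsed seconds; gives up after a timeout so the loop cannot hang forever.
wait_for_restart() {
  local pattern="$1" timeout="${2:-60}" start elapsed
  start=$(date +%s)
  until pgrep -f "$pattern" >/dev/null 2>&1; do
    sleep 1
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "no restart within ${timeout}s" >&2
      return 1
    fi
  done
  echo $(( $(date +%s) - start ))
}

# Usage: crash the process, then time its return
# kill -9 "$(pgrep -f 'node server.js')"
# wait_for_restart 'node server.js' 30
```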
Experiment P2: Memory Pressure
Hypothesis: "At 90%+ memory usage, the application sheds load gracefully."
Injection:
# Consume 80% of memory with one worker for 5 minutes
stress-ng --vm 1 --vm-bytes 80% --timeout 300s
Verify:
- Load shedding engages
- Process is not OOM-killed
- Health check stays responsive
Experiment P3: CPU Saturation
Hypothesis: "At 95% CPU, health checks and critical request paths are prioritized."
Injection:
# Saturate all cores for 5 minutes
stress-ng --cpu $(nproc) --timeout 300s
Verify:
- Health check responds in < 1s
- Background work is deferred
- Autoscaling triggers
Experiment S1: Database Failover
Hypothesis: "When the primary database fails, the system fails over to
the replica within 30 seconds with < 1 second of write unavailability."
Injection:
# Stop primary database
docker stop postgres-primary
# Or in cloud — promote replica
aws rds failover-db-cluster --db-cluster-identifier <cluster>
Verify:
- Read traffic continues on replica immediately
...
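The write-unavailability window is easiest to measure with a probe loop running while the failover is triggered. A sketch assuming PostgreSQL; `DB_URL` and the `heartbeat` table are assumptions, not part of the skill:

```shell
# Probe write availability once per second; run this in a second terminal
# while triggering the failover, then count the FAILED lines.
# DB_URL and the heartbeat table are assumptions -- adapt to your database.
probe_writes() {
  local n="${1:-30}"
  for _ in $(seq "$n"); do
    if psql "${DB_URL:-postgres://localhost/app}" \
         -c "INSERT INTO heartbeat VALUES (now())" >/dev/null 2>&1; then
      echo "$(date +%T) write ok"
    else
      echo "$(date +%T) write FAILED"
    fi
    sleep 1
  done
}

# probe_writes 30   # 30 one-second probes spanning the failover window
```

One FAILED line at one-second resolution confirms the "< 1 second of write unavailability" hypothesis; a long run of them refutes it.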
Experiment S2: Cache Failure (Cold Cache)
Hypothesis: "When Redis is unavailable, the system falls back to direct
database queries with acceptable performance degradation (P95 < 2s
instead of < 200ms)."
Injection:
# Flush all cached data
redis-cli FLUSHALL
# Or kill Redis entirely
docker stop redis
Verify:
...
Experiment S3: Disk Full
Hypothesis: "When disk reaches 95%, the system stops non-critical writes,
alerts operators, and continues serving read traffic."
Injection:
# Fill ~90% of the remaining free space (on a half-full disk this lands near 95% usage)
fallocate -l $(df --output=avail / | tail -1 | awk '{print int($1*0.90)}')k /tmp/fill-disk
Verify:
- Log rotation and temp file cleanup triggered
- Non-critical writes (analytics, logs) paused
- Critical writes (transactions) continue to reserved space
- Alert fires with disk usage percentage
...
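The injection leaves the filler file behind, so the rollback is simply removing it and confirming free space returned:

```shell
# Rollback: delete the filler file and confirm disk usage drops
rm -f /tmp/fill-disk
df -h /
```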
Specifically test circuit breaker behavior:
CIRCUIT BREAKER VALIDATION:
State Transitions
CLOSED ──(failures > threshold)──→ OPEN
   ▲                                │
   │                            (timeout)
   │                                ▼
   └──────(success)────────── HALF-OPEN ──(failure)──→ OPEN
CLOSED: Normal operation, requests flow through
OPEN: All requests fail fast (no network call)
HALF-OPEN: Limited requests to test recovery
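Fast-fail in the OPEN state is observable from the client side: calls should return in milliseconds, not after the full dependency timeout. A timing sketch; the URL is a placeholder, and a closed local port stands in for a tripped breaker:

```shell
# Time three calls; with the breaker OPEN, each should be near-instant
# rather than approaching the 5s timeout.
URL="${URL:-http://127.0.0.1:9/}"   # closed port simulates fail-fast behavior
for i in 1 2 3; do
  t=$(curl -s -o /dev/null -w '%{time_total}' --max-time 5 "$URL" || true)
  echo "call $i: ${t}s"
done
```

Run this first with the dependency healthy (CLOSED), then during injection (OPEN): the OPEN-state timings should be orders of magnitude smaller than the timeout.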
Organize a structured resilience testing exercise:
GAME DAY PLAN:
Date: <scheduled date>
Duration: <2-4 hours>
Facilitator: <person>
Participants: <team members and roles>
OBJECTIVES:
1. Validate <specific resilience property>
2. Test <incident response procedure>
3. Verify <recovery time objective>
TIMELINE:
...
CHAOS ENGINEERING REPORT — <system>
Experiments run: <N>
Hypotheses confirmed: <N>/<total>
Surprises found: <N>
RESILIENCE SCORECARD:
| Failure Domain | Grade | Notes |
|--|--|--|
| Network latency | A/B/C/F | <detail> |
| Network partition | A/B/C/F | <detail> |
| Process crash | A/B/C/F | <detail> |
| Memory pressure | A/B/C/F | <detail> |
Writes:
- docs/chaos/<system>-experiments.md
- docs/chaos/<system>-gameday-plan.md
- docs/chaos/<system>-resilience-report.md
Commit: "chaos: <system> — <N> experiments, resilience: <grade>"
If failures were found: /godmode:fix to address, then re-test. Otherwise: /godmode:ship.
# Chaos injection tools
tc qdisc add dev eth0 root netem delay 500ms
kubectl delete pod <pod-name> --grace-period=0
stress-ng --cpu $(nproc) --vm 1 --vm-bytes 80% --timeout 60s
redis-cli FLUSHALL
| Flag | Description |
|---|---|
| (none) | Full chaos assessment — map failure domains, design experiments |
| --experiment <name> | Run a specific pre-designed experiment |
| --network | Network failure experiments only |
Experiment log fields: timestamp, experiment_name, hypothesis, blast_radius, duration, result, surprises
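A result row in that field order can be appended as tab-separated values. A sketch with hypothetical values; the log path is an assumption:

```shell
# Append one experiment result (tab-separated, field order as above)
LOG="${LOG:-/tmp/chaos-results.tsv}"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  "$(date -u +%FT%TZ)" "dependency-timeout" "checkout degrades gracefully" \
  "single-service" "300s" "confirmed" "none" >> "$LOG"
```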
On activation, automatically detect infrastructure context:
AUTO-DETECT:
1. Container orchestration:
kubectl cluster-info 2>/dev/null && echo "kubernetes"
docker info 2>/dev/null && echo "docker"
2. Cloud provider:
aws sts get-caller-identity 2>/dev/null && echo "aws"
gcloud config get-value project 2>/dev/null && echo "gcp"
3. Service mesh / proxy:
kubectl get crd | grep -i istio && echo "istio"
linkerd check 2>/dev/null && echo "linkerd"
...
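The checks above can be wrapped so that a missing CLI skips quietly instead of aborting detection. A minimal sketch combining them:

```shell
# Report every detected platform; absent or unconfigured CLIs are skipped.
detect_infra() {
  command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1 && echo "kubernetes"
  command -v docker  >/dev/null 2>&1 && docker info >/dev/null 2>&1 && echo "docker"
  command -v aws     >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1 && echo "aws"
  command -v gcloud  >/dev/null 2>&1 && gcloud config get-value project >/dev/null 2>&1 && echo "gcp"
  return 0   # detection itself never fails the caller
}
detect_infra
```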