Chaos Engineering for production resilience: steady-state hypothesis design, fault injection tools (Chaos Monkey, Litmus, Gremlin, Toxiproxy, tc netem), GameDay format, and maturity model from manual to continuous chaos.
Load tests tell you if your service handles traffic. Chaos Engineering tells you if it survives reality — network partitions, disk failures, dependency outages, and clock skew.
**Steady-state hypothesis first.** Before injecting chaos, define what "normal" looks like:

```
Hypothesis: Under normal conditions, the checkout service processes
≥99% of requests in <500ms and the error rate is <0.1%.
```
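A hypothesis like this can be checked mechanically before and after an experiment. A minimal sketch using shell arithmetic; the request counts below are hypothetical placeholders for values you would pull from your metrics system:

```shell
#!/usr/bin/env bash
# Steady-state check: do measured numbers satisfy the hypothesis?
# These counts are hypothetical; in practice, query them from your
# metrics backend (Prometheus, Datadog, ...).
total=10000
under_500ms=9950
errors=8

fast_pct=$(( under_500ms * 100 / total ))   # whole percent of fast requests
err_m=$(( errors * 100000 / total ))        # error rate in units of 0.001%

if [ "$fast_pct" -ge 99 ] && [ "$err_m" -lt 100 ]; then   # 100 = 0.1%
  echo "steady state CONFIRMED"
else
  echo "steady state VIOLATED"
fi
```

Run the same check again after removing the fault to confirm recovery, not just survival.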
Chaos engineering is NOT about breaking things randomly. It's about stating a falsifiable hypothesis, injecting one controlled fault, comparing observed behavior against the steady state, and fixing what you learn before widening the blast radius.
Always start small:
| Scope | Example | When |
|---|---|---|
| Single instance | One pod, one container | First experiment |
| Canary traffic | 5% of requests | After single instance success |
| Single AZ | One availability zone | Proven patterns |
| Full service | All instances | Only with proven fallback |
| Multi-service | Cascading failure | Advanced teams only |
**Abort criteria:** define rollback triggers before starting. If the error rate exceeds X% or p99 latency exceeds Y ms, stop the experiment and restore.
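Abort criteria are easiest to honor when encoded as a guard you poll during the experiment. A sketch with a stubbed metric source; `fetch_error_rate` is hypothetical and stands in for a real metrics query:

```shell
#!/usr/bin/env bash
# Poll-and-abort guard for a running experiment.
ABORT_ERROR_PCT=5   # stop if the error rate reaches 5%

fetch_error_rate() {   # hypothetical stub: echoes its argument as the "measured" rate
  echo "$1"
}

check_abort() {
  local rate
  rate=$(fetch_error_rate "$1")
  if [ "$rate" -ge "$ABORT_ERROR_PCT" ]; then
    echo "ABORT: error rate ${rate}% >= ${ABORT_ERROR_PCT}%"
    return 1
  fi
  echo "ok: error rate ${rate}%"
}

check_abort 1                              # within bounds
check_abort 7 || echo "removing fault..."  # trips the guard
```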
Simulate the most common production failure: network degradation.
```bash
# Linux — tc netem (no external tool needed)

# Add 100ms latency to all outbound traffic from eth0
tc qdisc add dev eth0 root netem delay 100ms

# Add 100ms latency with 30ms jitter (more realistic)
tc qdisc add dev eth0 root netem delay 100ms 30ms

# Simulate 10% packet loss
tc qdisc add dev eth0 root netem loss 10%

# Remove (cleanup)
tc qdisc del dev eth0 root
```
```bash
# macOS — comcast (wraps pfctl)
brew install comcast
comcast --device=en0 --latency=100 --target-bw=1000
comcast --device=en0 --stop
```
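A common failure mode of hand-run `tc` experiments is leaving netem rules behind when the session dies. A sketch that time-boxes the fault and removes the qdisc via a trap; the interface, delay, and duration are assumptions, and `DRY_RUN` defaults to only printing the commands (set `DRY_RUN=0` to apply for real, as root):

```shell
#!/usr/bin/env bash
# Time-boxed latency injection with guaranteed cleanup.
set -euo pipefail
DEV=eth0                    # assumed interface
DRY_RUN="${DRY_RUN:-1}"     # set DRY_RUN=0 to actually apply (needs root)

run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Remove the qdisc on ANY exit: normal end, Ctrl-C, or script error.
trap 'run tc qdisc del dev "$DEV" root' EXIT

run tc qdisc add dev "$DEV" root netem delay 100ms 30ms
run sleep 600               # 10-minute observation window
```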
```bash
# Install
brew install toxiproxy
# or: docker run -d -p 8474:8474 -p 5432:5432 ghcr.io/shopify/toxiproxy

# Create a proxy for your database (flags go before the proxy name)
toxiproxy-cli create -l localhost:25432 -u your-db-host:5432 postgres_proxy

# Add latency toxic
toxiproxy-cli toxic add -t latency -a latency=500 postgres_proxy

# Close connections after 0 bytes (simulate broken connections)
toxiproxy-cli toxic add -t limit_data -a bytes=0 postgres_proxy

# Simulate connection reset
toxiproxy-cli toxic add -t reset_peer postgres_proxy

# Remove toxic (restore); unnamed toxics default to <type>_downstream
toxiproxy-cli toxic remove -n latency_downstream postgres_proxy
```
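Toxics compose well with the start-small rule: begin with mild latency and escalate while re-checking steady state at each step. A sketch; the proxy name, health endpoint, and the `DRY_RUN` echo mode are assumptions:

```shell
#!/usr/bin/env bash
# Escalating latency against an existing postgres_proxy, with cleanup.
set -euo pipefail
DRY_RUN="${DRY_RUN:-1}"     # set DRY_RUN=0 to run for real
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

run toxiproxy-cli toxic add -t latency -a latency=100 postgres_proxy
trap 'run toxiproxy-cli toxic remove -n latency_downstream postgres_proxy' EXIT

for ms in 250 500 1000; do
  run toxiproxy-cli toxic update -n latency_downstream -a latency="$ms" postgres_proxy
  run curl -fsS localhost:8080/health   # hypothetical steady-state probe
done
```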
```bash
# CPU stress (Linux/macOS) — stress-ng
stress-ng --cpu 4 --timeout 60s

# Memory pressure
stress-ng --vm 2 --vm-bytes 512M --timeout 60s

# Go: generate sustained CPU load (and capture a profile) via benchmarks
go test -bench=. -benchtime=30s -cpuprofile=cpu.out
```
```bash
# Manual: kill random pods in a deployment
# (--force is required with --grace-period=0; simulates an unclean kill)
kubectl delete pod -l app=my-service --grace-period=0 --force -n production

# Chaos Mesh: declarative pod failure
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: my-service
  scheduler:
    cron: "@every 5m"
EOF
```
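One caveat: the inline `scheduler` field above belongs to pre-2.0 Chaos Mesh. In Chaos Mesh 2.x, recurring experiments use a separate `Schedule` resource instead; a sketch of the equivalent, worth checking against the CRDs of your installed version:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-every-5m
spec:
  schedule: "@every 5m"
  type: PodChaos
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [production]
      labelSelectors:
        app: my-service
```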
```bash
# Install Chaos Mesh via Helm
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing
```

Litmus publishes a catalog of ready-made experiments at https://hub.litmuschaos.io.

GameDay runbook template:

# Chaos Experiment Protocol — [Date]
## Service: [name]
## Hypothesis: [specific statement]
## Steady State: [measurable: p99 <Xms, error rate <Y%]
## Fault to Inject
- Type: [network latency / pod kill / CPU pressure]
- Scope: [single instance / canary 5%]
- Duration: [10 minutes]
- Tool: [Toxiproxy / Chaos Mesh / manual tc]
## Abort Criteria
- Error rate exceeds [X%]
- Latency p99 exceeds [Yms]
- On-call pages triggered
## Timeline
- [HH:MM] Baseline confirmed
- [HH:MM] Fault injected
- [HH:MM] Observation period
- [HH:MM] Fault removed
- [HH:MM] Recovery confirmed
## Findings
- Hypothesis: [CONFIRMED / REJECTED]
- Observed behavior: [...]
- Action items: [...]
| Level | Characteristics |
|---|---|
| 0 — Unprepared | No chaos testing. "It'll be fine." |
| 1 — Manual | Occasional GameDays, no tooling, learning phase |
| 2 — Structured | Regular GameDays, hypothesis-driven, tooling (Toxiproxy/Litmus) |
| 3 — Automated | Chaos in staging CI, experiments in pre-prod before every release |
| 4 — Continuous | Chaos in production, automated rollback, SLO-gated experiments |
Start at Level 1. Most teams get the best return from Levels 2–3.
References:
- Principles of Chaos Engineering: https://principlesofchaos.org
- Chaos Mesh: https://chaos-mesh.org
- LitmusChaos Hub: https://hub.litmuschaos.io
- Gremlin: https://www.gremlin.com
- Toxiproxy: https://github.com/Shopify/toxiproxy