Chaos engineering for distributed systems: design and run chaos experiments to proactively discover weaknesses before they cause outages. Use when the user mentions chaos, fault injection, resilience testing, GameDays, chaos monkey, LitmusChaos, Gremlin, steady-state hypothesis, or when building reliability into microservices.
**"Break things on purpose to build unbreakable systems."**
Chaos engineering is the discipline of experimenting on a system to build confidence in its resilience. Unlike reactive debugging, it proactively injects faults so that weaknesses surface before users are affected.
Every experiment starts with a steady-state hypothesis — a measurable statement about your system's behavior under normal conditions.
Steady State = "Normal behavior we expect to observe"
Hypothesis = "Injecting fault X will [not change / degrade] the steady state"
Never run chaos without first defining steady state.
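The steady-state idea can be sketched as a small, testable check. The metric names and thresholds below are illustrative assumptions, not tied to any particular monitoring API:

```python
# Sketch of a steady-state definition: measurable bounds for "normal".
# Thresholds here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SteadyState:
    """Measurable bounds that define 'normal' for the system."""
    max_p99_latency_ms: float
    max_error_rate: float

    def holds(self, p99_latency_ms: float, error_rate: float) -> bool:
        """True if the observed metrics stay within the steady-state bounds."""
        return (p99_latency_ms <= self.max_p99_latency_ms
                and error_rate <= self.max_error_rate)

# Hypothesis: injecting 100ms of latency will NOT break this steady state.
baseline = SteadyState(max_p99_latency_ms=500.0, max_error_rate=0.01)
```

A fault experiment then reduces to: measure, inject, and check whether `holds()` is still true.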
1. Define steady-state hypothesis
2. Inject the smallest possible fault
3. Monitor for deviation from steady state
4. Roll back immediately if hypothesis fails
5. Automate and repeat
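The five steps above can be sketched as a single loop. `hypothesis_holds`, `inject_fault`, `rollback`, and `measure` are hypothetical callables supplied by your tooling:

```python
# Sketch of the five-step experiment loop. All four callables are
# assumptions supplied by your tooling, not a specific library's API.
def run_experiment(hypothesis_holds, inject_fault, rollback, measure):
    """Inject a fault, monitor for deviation, always roll back."""
    inject_fault()                      # step 2: smallest possible fault
    try:
        observed = measure()            # step 3: monitor metrics
        ok = hypothesis_holds(observed) # compare against the hypothesis
    finally:
        rollback()                      # step 4: roll back unconditionally
    return ok                           # step 5: automate and repeat
```

Running this on a schedule (CI job, cron) is what turns a one-off GameDay into continuous chaos testing.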
**Network faults**

| Fault | Command/Method | Impact |
|---|---|---|
| Latency | `tc qdisc add dev eth0 root netem delay 100ms` | API timeouts |
| Packet loss | `tc qdisc add dev eth0 root netem loss 5%` | Intermittent errors |
| DNS failure | `/etc/hosts` manipulation | Service discovery breaks |
| Partition | `iptables -A INPUT -s <ip> -j DROP` | Network split |
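A sketch of wrapping the `tc` latency commands in a context manager so rollback is guaranteed even if the experiment body raises. Requires root; the `runner` parameter is an assumption added so the commands can be captured in tests:

```python
# Latency injection with guaranteed rollback, wrapping `tc qdisc`.
# Requires root privileges to actually run against a real interface.
import subprocess
from contextlib import contextmanager

@contextmanager
def netem_latency(interface="eth0", delay_ms=100, runner=subprocess.run):
    """Add a netem delay on entry; always remove it on exit."""
    runner(["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"], check=True)
    try:
        yield
    finally:
        # Rollback even if the experiment body raised an exception.
        runner(["tc", "qdisc", "del", "dev", interface, "root"], check=False)

# Usage (as root):
# with netem_latency("eth0", 100):
#     run_probe_requests()
```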
**Resource faults**

| Fault | Method | Impact |
|---|---|---|
| CPU spike | `stress-ng --cpu 1 --timeout 30s` | Throttling, slow responses |
| Memory leak | Allocate until OOM | Pod evictions |
| Disk full | `dd if=/dev/zero of=/tmp/bigfile` | Write failures |
| File descriptor exhaustion | `ulimit -n 1024` in container | Connection pooling breaks |
**State faults (Kubernetes / Docker)**

| Fault | Method | Impact |
|---|---|---|
| Kill pod | `kubectl delete pod` | Restart tolerance |
| Container restart | `docker restart` | Session/state recovery |
| Config map change | `kubectl patch` | Behavior drift |
| Secret deletion | `kubectl delete secret` | Auth failures |
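A sketch of randomized pod killing (chaos-monkey style). Victim selection is a pure function so it can be tested offline; the deletion itself uses the official `kubernetes` Python client, which is assumed to be installed:

```python
# Randomized pod-kill sketch. `kill_pods` assumes `pip install kubernetes`
# and a reachable cluster; `pick_victims` is pure and testable offline.
import random

def pick_victims(pod_names, fraction=0.25, rng=random):
    """Choose a random subset of pods to delete (at least one)."""
    count = max(1, int(len(pod_names) * fraction))
    return rng.sample(pod_names, count)

def kill_pods(namespace, label_selector, fraction=0.25):
    """Delete a random fraction of pods matching the label selector."""
    from kubernetes import client, config  # assumed dependency
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    names = [p.metadata.name for p in pods.items]
    for name in pick_victims(names, fraction):
        v1.delete_namespaced_pod(name, namespace)
```

Escalating `fraction` from 0.25 to 0.5 mirrors the GameDay escalation pattern described later.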
**External dependency faults**

| Fault | Method | Impact |
|---|---|---|
| Downstream timeout | `tc` delay on egress | Cascade failure |
| 5xx simulation | `iptables` redirect | Error budget burn |
| Dependency pod kill | `kubectl delete` | Circuit breaker validation |
Example LitmusChaos `ChaosEngine` running two experiments, `pod-delete` and `pod-network-latency` (the latter name is inferred here from its `NETWORK_LATENCY` env var):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-chaos
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: "app=api-gateway"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: '6000' # 6000ms = 6s latency
            - name: TARGET_PODS
              value: 'http-api' # pods matching label
            - name: DESTINATION_PORTS
              value: '8080'
```

Apply it with `kubectl apply -f chaos-engine.yaml`.
```python
import pytest
from unittest.mock import patch
from requests.exceptions import HTTPError  # assumed source of HTTPError
from orders import order_service, TEST_ORDER  # system under test (paths illustrative)

@pytest.mark.parametrize("fault", ["timeout", "500", "connection_refused"])
def test_order_service_resilience(fault):
    """Order service should degrade gracefully under payment API faults."""
    with patch("orders.clients.payment") as mock_payment:
        if fault == "timeout":
            mock_payment.charge.side_effect = TimeoutError()
        elif fault == "500":
            mock_payment.charge.side_effect = HTTPError(500)
        else:
            mock_payment.charge.side_effect = ConnectionRefusedError()

        # System under test
        result = order_service.process_order(TEST_ORDER)

        # Assertions: graceful degradation, no crash
        assert result.status in ("pending_retry", "failed_isolated")
        assert not result.charged  # don't charge if payment is uncertain
```
```go
func TestInventoryService_TimeoutResilience(t *testing.T) {
    // Simulate slow DB with context deadline
    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Millisecond)
    defer cancel()

    svc := NewInventoryService(mockDB{latency: 5 * time.Second})

    // Should return context deadline exceeded, not panic
    _, err := svc.GetStock(ctx, "SKU-123")
    if err == nil {
        t.Fatal("expected context deadline error under slow DB")
    }
    if !strings.Contains(err.Error(), "deadline") {
        t.Errorf("wrong error type: %v", err)
    }
}
```
```bash
#!/bin/bash
# Inject 500ms latency on eth0 for 60 seconds, then auto-rollback
INTERFACE="${INTERFACE:-eth0}"
DELAY_MS="${DELAY_MS:-500}"

cleanup() {
    echo "[chaos] Removing netem rule..."
    tc qdisc del dev "$INTERFACE" root 2>/dev/null || true
}
trap cleanup EXIT

echo "[chaos] Injecting ${DELAY_MS}ms latency on $INTERFACE..."
tc qdisc add dev "$INTERFACE" root netem delay "${DELAY_MS}ms" limit 1000000
echo "[chaos] Running for 60s... (Ctrl+C to abort)"
sleep 60
echo "[chaos] Experiment complete."
```
A GameDay is a team-wide chaos experiment event.
Pre-GameDay (1 week before):
- Define hypotheses and acceptance criteria
- Identify blast radius — must be contained
- Prepare rollback procedures
- Notify on-call and stakeholders
- Set up monitoring dashboards
GameDay:
Step 1: Baseline measurement (10 min)
→ Capture normal metrics: latency p50/p95/p99, error rate, throughput
Step 2: Small fault injection (20 min)
→ Start with latency injection (lowest blast radius)
→ Observe monitoring dashboards in real time
→ Document observations vs. hypothesis
Step 3: Escalate fault scope (30 min)
→ Kill a single pod → kill 25% → kill 50%
→ Document failure cascade patterns
Step 4: Roll back and reflect (10 min)
Post-GameDay:
- Write formal experiment report
- Prioritize discovered gaps by blast radius × frequency
- Create follow-up tickets for each gap
- Update runbooks with new failure modes
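The "blast radius × frequency" prioritization above can be sketched as a simple scoring pass. The 1-5 scales and the example gaps are illustrative assumptions:

```python
# Sketch of post-GameDay gap prioritization: score = blast radius x frequency.
# The 1-5 scales and the gap names below are illustrative assumptions.
def prioritize(gaps):
    """gaps: list of (name, blast_radius 1-5, frequency 1-5), highest score first."""
    return sorted(gaps, key=lambda g: g[1] * g[2], reverse=True)

gaps = [
    ("missing circuit breaker", 5, 3),  # score 15
    ("slow pod restart",        2, 4),  # score 8
    ("no retry budget",         4, 4),  # score 16 -> top priority
]
```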
## Chaos Experiment Report: [Name]
### Hypothesis
[What we expected to happen]
### Steady State Baseline
- Metric A: [value] (pre-experiment)
- Metric B: [value] (pre-experiment)
### Fault Injected
- Type: [network/resource/app/external]
- Details: [specific configuration]
### Results
| Metric | Baseline | During Experiment | Deviation |
|--------|----------|-------------------|-----------|
| p99 latency | 120ms | 3400ms | +2733% |
| Error rate | 0.1% | 8.3% | +8200% |
### Conclusion
✅ Hypothesis confirmed / ❌ Hypothesis disproved
### Findings
1. [Finding 1]
2. [Finding 2]
### Action Items
| Action | Severity | Owner |
|--------|----------|-------|
| Add circuit breaker | P1 | @team-backend |
| Increase replica count | P2 | @team-infra |
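The Deviation column in the report is just percent change from baseline, which is worth computing consistently rather than by hand:

```python
# Percent change from baseline, as used in the report's Deviation column.
def deviation_pct(baseline, during):
    """Positive means the metric worsened relative to baseline."""
    return (during - baseline) / baseline * 100

# e.g. p99 latency 120ms -> 3400ms is roughly +2733%
```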
| Environment | Recommended Tool |
|---|---|
| Kubernetes (open source) | LitmusChaos or Chaos Mesh |
| AWS (managed) | AWS Fault Injection Simulator (FIS) |
| Docker containers | Pumba |
| Local dev / testing | Toxiproxy |
| Go services | Programmatic injection in unit tests |
| Python services | unittest.mock + pytest parametrization |
| Full stack (mixed) | Gremlin (commercial) |
This skill activates when the user mentions chaos engineering, fault injection, resilience testing, GameDays, chaos monkey, LitmusChaos, Gremlin, or steady-state hypotheses.
Last updated: 2026-03-28