Injects controlled faults like network latency, service failures, pod crashes, and resource exhaustion to validate resilience, graceful degradation, and automatic recovery.
```
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude
```

This skill uses the workspace's default tool permissions.
> Chaos engineering, fault injection, and resilience validation. Systematically introduces failures to verify that systems degrade gracefully, recover automatically, and maintain availability under real-world fault conditions.
Executes chaos engineering experiments that inject failures such as network latency, service crashes, and resource exhaustion into distributed systems. Guides the full workflow: define steady state, map failure domains, and design experiments for network failures, timeouts, and DNS issues using tc and toxiproxy. Also designs failure-injection frameworks and game day exercises, producing runbooks, manifests, rollbacks, and post-mortems. Use for resilience testing, fault injection, and game days.
Map the system architecture. Identify services, data stores, external dependencies, and the critical request paths between them.
Define steady-state behavior. Establish measurable indicators of normal operation, such as request latency percentiles, error rates, and throughput; a minimal probe sketch follows.
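A minimal baseline-probe sketch, reusing the order-service health endpoint from the PLAN example below; the sample count and the 5s threshold are illustrative assumptions:

```bash
# Sketch: sample the steady state before any fault is injected.
# Endpoint, sample count, and SLA threshold are illustrative placeholders.
for i in $(seq 1 50); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
    http://localhost:3000/api/orders/health
  sleep 1
done > baseline.log

# Pass only if every sample returned 200 within the assumed 5s SLA.
awk '$1 != 200 || $2 > 5 { bad++ } END { print (bad ? "FAIL" : "PASS"), bad + 0, "violations" }' baseline.log
```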
Enumerate failure modes. For each dependency, define what can go wrong: added latency, dropped or refused connections, error responses, process crashes, and resource exhaustion.
Scope the blast radius. For each experiment, define the target environment, the services and share of traffic affected, the experiment duration, and the abort criteria.
Prioritize experiments by risk and value. Start with small, reversible faults in staging before expanding to larger blast radii or production.
Document the experiment plan. For each experiment, write the steady-state hypothesis, the fault to inject, the rollback procedure, and the expected outcome.
Select the chaos tooling. Based on the infrastructure: tc or toxiproxy for network-level faults on hosts and proxies, and an operator such as Litmus for pod-level chaos in Kubernetes.
Configure the experiment. Write the experiment definition as a declarative artifact containing the hypothesis, the fault actions, and the rollbacks (see the PLAN example below).
Verify the pre-experiment steady state. Before injecting any fault, confirm that the steady-state probes pass and record the baseline metrics.
Inject the fault. Execute the experiment with the rollback path ready; a minimal host-level sketch follows.
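For a host-level network fault, the tc-based injection mentioned above can be a single netem rule; the interface name and delay value are assumptions to adapt to the experiment plan:

```bash
# Sketch: inject 2000ms of latency on all egress traffic via tc netem.
# Requires root; eth0 and the delay value are illustrative.
tc qdisc add dev eth0 root netem delay 2000ms

# Run the experiment probes while the fault is active, then remove it
# (see the termination sketch later in this workflow):
# tc qdisc del dev eth0 root netem
```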
Verify the abort mechanism works. Before running experiments with a larger blast radius, confirm that the rollback actually removes the fault and the system returns to steady state, as in the rehearsal sketch below.
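One way to rehearse the abort path, reusing the toxiproxy-cli calls from the PLAN example below; the proxy name postgresql and the default toxic name are assumptions to confirm against your setup:

```bash
# Sketch: rehearse the rollback before widening the blast radius.
toxiproxy-cli toxic add -t latency -a latency=2000 postgresql

# toxiproxy names a downstream latency toxic "latency_downstream" by default;
# confirm the actual name with: toxiproxy-cli inspect postgresql
toxiproxy-cli toxic remove -n latency_downstream postgresql

# The probe should pass again once the toxic is gone.
curl -sf http://localhost:3000/api/orders/health && echo "abort path OK"
```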
Collect metrics during the experiment. Capture latency percentiles, error rates, throughput, and resource utilization for the target service and its dependents.
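If the metrics live in Prometheus (an assumption; substitute your own monitoring stack), the p99 during the fault window can be sampled through the standard query API; the host, metric name, and label are placeholders:

```bash
# Sketch: sample p99 request latency during the fault window.
# Prometheus location, metric name, and service label are illustrative assumptions.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="order-service"}[1m])) by (le))'
```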
Verify the steady-state hypothesis. Compare observed metrics against the hypothesis tolerances and record whether each probe passed.
Check for cascading failures. Monitor downstream services for secondary timeouts, retry storms, and queue growth triggered by the injected fault.
Record the timeline. Document when the fault was injected, when monitoring detected it, when recovery began, and when the system returned to baseline.
Terminate the experiment. Remove the injected fault and verify that the system returns to its pre-experiment steady state, as sketched below.
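A termination sketch matching the tc injection above; the final probe should reproduce the pre-experiment baseline:

```bash
# Sketch: remove the injected fault and confirm return to steady state.
tc qdisc del dev eth0 root netem

# Re-run the steady-state probe; results should match the baseline log.
curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
  http://localhost:3000/api/orders/health
```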
Classify findings. For each experiment, record whether the hypothesis held, failed, or surfaced a weakness that needs remediation.
Recommend resilience improvements. For each finding, propose a concrete fix, such as a circuit breaker, timeout, retry budget, or cached fallback, and assign an owner.
Update runbooks and incident response documentation. For each experiment, capture the symptoms observed, the detection signals, and the recovery steps that worked.
Plan follow-up experiments. Based on findings, expand the blast radius, add newly discovered failure modes, or re-run experiments after fixes land.
Run harness validate. Confirm the project passes all harness checks after any code changes made for resilience improvements.
Generate an experiment report. Summarize the hypothesis, the fault injected, the observed behavior, the verdict, and the recommended improvements.
If a knowledge graph exists at .harness/graph/, refresh it after code changes to keep graph queries accurate:
```
harness scan [path]
```
harness validate -- Run in IMPROVE phase after resilience changes are implemented. Confirms project health.
harness check-deps -- Run after INJECT phase setup to verify chaos tooling dependencies do not leak into production bundles.
emit_interaction -- Used at checkpoints to present experiment plans for human approval before fault injection, and to present findings for prioritization.

Success criterion: harness validate passes after resilience improvements.

PLAN -- Experiment definition:
```json
{
  "title": "Database latency does not cause cascading timeout failures",
  "description": "Inject 2-second latency on PostgreSQL connections and verify the order service responds within 5 seconds using cached data",
  "steady-state-hypothesis": {
    "title": "Order service responds within SLA",
    "probes": [
      {
        "type": "probe",
        "name": "order-api-responds",
        "tolerance": true,
        "provider": {
          "type": "http",
          "url": "http://localhost:3000/api/orders/health",
          "timeout": 5
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "inject-db-latency",
      "provider": {
        "type": "process",
        "path": "toxiproxy-cli",
        "arguments": "toxic add -t latency -a latency=2000 postgresql"
      }
    },
    {
      "type": "probe",
      "name": "check-order-response-time",
      "provider": {
        "type": "http",
        "url": "http://localhost:3000/api/orders?limit=10",
        "timeout": 5
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "remove-db-latency",
      "provider": {
        "type": "process",
        "path": "toxiproxy-cli",
        "arguments": "toxic remove -n inject-db-latency_latency_downstream postgresql"
      }
    }
  ]
}
```
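The definition above follows the Chaos Toolkit experiment format (steady-state-hypothesis, method, rollbacks). Assuming that toolchain, it runs with the chaos CLI:

```bash
# Assumes the Chaos Toolkit CLI is installed (pip install chaostoolkit).
chaos run experiment.json
# By default the CLI also plays the rollbacks at the end of the run,
# so the latency toxic is removed even when the hypothesis fails.
```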
INJECT -- Pod kill experiment:
```yaml
# litmus/pod-kill-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-pod-kill
  namespace: staging
spec:
  appinfo:
    appns: staging
    applabel: app=order-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
            - name: PODS_AFFECTED_PERC
              value: '50'
```
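Assuming a cluster with the Litmus chaos operator installed, the engine is applied and the verdict read back from the ChaosResult resource Litmus creates (named engine-experiment by convention):

```bash
# Assumes the Litmus chaos operator and the pod-delete ChaosExperiment CR
# are installed in the staging cluster.
kubectl apply -f litmus/pod-kill-experiment.yaml

# Litmus writes the verdict to a ChaosResult named <engine>-<experiment>:
kubectl describe chaosresult order-service-pod-kill-pod-delete -n staging
```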
OBSERVE -- Expected behavior timeline:
```
T+0s:  Fault injected - 50% of order-service pods killed
T+3s:  Kubernetes detects pod failure, starts replacement pods
T+5s:  Load balancer routes traffic to surviving pods
T+8s:  Response latency increases from 50ms to 200ms (surviving pods absorb load)
T+15s: Replacement pods pass readiness probe, rejoin the pool
T+20s: Latency returns to baseline (50ms)
T+60s: Experiment ends

Result: PASSED - System maintained availability throughout.
P99 latency spiked to 450ms (within 500ms SLA).
Zero 5xx errors observed. No data loss.
```
| Rationalization | Reality |
|---|---|
| "Our circuit breakers are already tested in unit tests — we don't need chaos experiments" | Unit tests verify that circuit breaker code executes. Chaos experiments verify that the circuit breaker actually opens under real load, that the fallback produces an acceptable user-facing response, and that monitoring detects the transition. These are different things. |
| "We can't run chaos experiments in production — it's too risky" | Avoiding chaos experiments does not reduce risk — it defers the discovery of failure modes to real incidents. Chaos experiments in staging with defined abort criteria and short durations are lower risk than discovering failure modes at 2am during a real outage. |
| "The experiment passed in staging so we know it'll work in production" | Staging differences in traffic volume, data distribution, and infrastructure scale can mask failure modes. Staging experiments validate the mechanism; production experiments (with tightly scoped blast radius) validate the system under real conditions. |
| "We injected the fault and the system recovered — the experiment is done" | Recovery alone does not validate resilience. The experiment must also confirm: detection time (did monitoring catch it?), recovery time (did it meet the SLA?), and no data loss or corruption. A system that recovers after 10 minutes of data inconsistency has not passed. |
| "We have runbooks for these failure modes, so game days aren't necessary" | A runbook that has never been executed under pressure is a hypothesis, not a procedure. Game days reveal whether runbooks are complete, whether on-call engineers can execute them accurately under stress, and whether the estimated recovery times are realistic. |