Design and document chaos engineering experiments. Guide steady state baseline, hypothesis formation, failure injection plans, and results analysis. Use for resilience testing, game days, failure injection experiments, and building confidence in system stability.
```
/plugin marketplace add rjmurillo/ai-agents
/plugin install project-toolkit@ai-agents
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Design rigorous chaos engineering experiments that build confidence in system resilience.
Keywords: chaos experiment, test resilience, failure injection, game day, chaos engineering

| Phase | Purpose | Output |
|---|---|---|
| 1. Scope | Define system boundaries and objectives | System under test, success criteria |
| 2. Baseline | Establish steady state metrics | Quantified normal behavior |
| 3. Hypothesis | Form falsifiable hypothesis | Clear prediction statement |
| 4. Injection | Design failure scenarios | Injection plan with blast radius |
| 5. Execute | Run controlled experiment | Observation log |
| 6. Analyze | Compare actual vs expected | Findings and action items |
Use this skill when:
Use threat-modeling instead when:
Use pre-mortem instead when:
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
Define the experiment boundaries.
Inputs: System architecture, historical incidents, monitoring data
Questions to Answer:
Output: Scoped experiment definition with stakeholder sign-off
Quantify normal system behavior.
Collect Steady State Metrics:
| Metric Category | Examples | Collection Period |
|---|---|---|
| Throughput | Requests/second, transactions/minute | 7-30 days |
| Error Rates | 4xx rate, 5xx rate, exception count | 7-30 days |
| Latency | P50, P95, P99 response times | 7-30 days |
| Resource | CPU%, Memory%, Disk I/O, Network I/O | 7-30 days |
| Business | Orders/hour, active sessions, conversion rate | 7-30 days |
Define Tolerance Thresholds:
Output: Baseline document with metric values and thresholds
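One way to turn collected baseline samples into tolerance thresholds is sketched below. The metric name and the 20% tolerance are illustrative; agree on per-metric tolerances with the service owners.

```python
import statistics

def tolerance_thresholds(samples: dict[str, list[float]],
                         tolerance: float = 0.20) -> dict:
    """Compute a baseline value and an upper tolerance bound per metric.

    samples maps a metric name (e.g. "p99_latency_ms") to raw observations
    collected over the baseline period (7-30 days).
    """
    thresholds = {}
    for metric, values in samples.items():
        baseline = statistics.median(values)  # median is robust to outliers
        thresholds[metric] = {
            "baseline": baseline,
            "upper_bound": baseline * (1 + tolerance),
        }
    return thresholds

print(tolerance_thresholds({"p99_latency_ms": [420, 450, 480, 440, 460]}))
```

Using the median rather than the mean keeps a single incident spike in the baseline window from inflating the threshold.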
Create a falsifiable hypothesis.
Hypothesis Template:
Given [system in steady state],
When [specific failure is injected],
Then [system behavior remains within tolerance]
Because [specific resilience mechanism exists].
Example Hypotheses:
Hypothesis Quality Checklist:
Output: Documented hypothesis with measurable predictions
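The Given/When/Then/Because template can be captured as data with a single checkable prediction, which is what makes the hypothesis falsifiable. A minimal sketch (field values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """A falsifiable chaos hypothesis; fields mirror the template above."""
    given: str
    when: str
    then_metric: str   # the observable metric the prediction constrains
    then_max: float    # the tolerance bound that makes it falsifiable
    because: str

    def holds(self, observed: float) -> bool:
        return observed <= self.then_max

h = Hypothesis(
    given="steady state: p99 latency 450ms",
    when="500ms latency injected on the database connection",
    then_metric="p99_latency_ms",
    then_max=700.0,
    because="circuit breaker sheds load after consecutive timeouts",
)
print(h.holds(650.0))
```

If `then_max` cannot be written down as a number, the hypothesis fails the quality checklist and the experiment cannot produce a verdict.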
Plan the controlled failure injection.
Common Failure Categories:
| Category | Examples | Tools |
|---|---|---|
| Instance Failure | Kill process, terminate VM, evict pod | chaos-monkey, kill, kubectl delete |
| Network | Partition, latency, packet loss, DNS failure | tc, iptables, toxiproxy, chaos-mesh |
| Resource Exhaustion | CPU spike, memory pressure, disk fill | stress-ng, dd, memory hogs |
| Dependency | External service unavailable, slow response | fault injection proxy, mock services |
| Time | Clock skew, NTP failure | faketime, chrony manipulation |
| State | Data corruption, cache invalidation | Custom scripts |
Injection Plan Elements:
Blast Radius Containment:
Output: Detailed injection plan with rollback procedures
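A latency injection with guaranteed rollback can be sketched as a context manager. This assumes a Linux host with `tc`/netem and root privileges; the interface name is illustrative, and the `run` parameter exists so the commands can be dry-run or tested without touching the network.

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def latency_injection(interface: str = "eth0", delay_ms: int = 500,
                      run=subprocess.run):
    """Inject fixed egress latency via tc/netem and always roll it back."""
    run(["tc", "qdisc", "add", "dev", interface, "root", "netem",
         "delay", f"{delay_ms}ms"], check=True)
    try:
        yield
    finally:
        # Rollback runs even if the experiment body raises.
        run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)
```

Pairing injection and rollback in one construct enforces the "define rollback before starting" rule: the experiment cannot be written without it.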
Run the controlled experiment.
Pre-Execution Checklist:
During Execution:
Observation Log Format:
```
[HH:MM:SS] - [Metric/Event]: [Value/Description]

[00:00:00] - Experiment started: Injected 500ms latency to database connection
[00:00:15] - P99 latency: 450ms -> 650ms
[00:00:30] - Circuit breaker: OPEN on database connection pool
[00:01:00] - Retry queue depth: 0 -> 247
[00:01:30] - Auto-recovery initiated
[00:02:00] - P99 latency: 650ms -> 480ms
[00:02:30] - Circuit breaker: CLOSED
[00:03:00] - Experiment ended: Removed latency injection
```
Output: Timestamped observation log
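A small helper for producing entries in the log format above; a sketch, using experiment-relative timestamps so entries line up with the injection timeline rather than wall-clock time:

```python
import time

class ObservationLog:
    """Timestamped observation log in [HH:MM:SS] - Event format."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()          # t=0 is experiment start
        self.entries: list[str] = []

    def record(self, event: str) -> str:
        elapsed = int(self._clock() - self._start)
        stamp = time.strftime("%H:%M:%S", time.gmtime(elapsed))
        line = f"[{stamp}] - {event}"
        self.entries.append(line)
        return line
```

An injectable clock keeps the logger deterministic in tests and replayable during analysis.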
Compare actual behavior against hypothesis.
Analysis Questions:
Verdict Options:
| Verdict | Meaning | Action |
|---|---|---|
| VALIDATED | Hypothesis confirmed | Document and expand scope |
| INVALIDATED | Hypothesis falsified | File bugs, prioritize fixes |
| INCONCLUSIVE | Unable to determine | Refine experiment design |
Finding Categories:
Output: Analysis document with prioritized action items
| Script | Purpose | Usage |
|---|---|---|
| `generate_experiment.py` | Create experiment document from inputs | `python scripts/generate_experiment.py --name "API Gateway Resilience"` |
| `validate_experiment.py` | Validate experiment document completeness | `python scripts/validate_experiment.py path/to/experiment.md` |
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General failure |
| 2 | Invalid arguments |
| 10 | Validation failure (missing required sections) |
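A CI gate can branch on these exit codes; a sketch, assuming the validator script path shown above:

```python
import subprocess
import sys

EXIT_MESSAGES = {
    0: "experiment document is complete",
    1: "general failure",
    2: "invalid arguments",
    10: "validation failure: missing required sections",
}

def explain(code: int) -> str:
    return EXIT_MESSAGES.get(code, f"unknown exit code {code}")

def validate(path: str) -> int:
    """Run the validator and report its exit code."""
    result = subprocess.run(
        [sys.executable, "scripts/validate_experiment.py", path])
    print(explain(result.returncode))
    return result.returncode
```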
Experiments are saved to `.agents/chaos/`:

```
.agents/chaos/
  YYYY-MM-DD-experiment-name.md
  YYYY-MM-DD-experiment-name-results.md
```
| Avoid | Why | Instead |
|---|---|---|
| Testing in staging only | Production has different traffic patterns | Start small in production |
| No rollback plan | Cannot recover if things go wrong | Define rollback before starting |
| Vague hypothesis | Cannot determine success | Use quantifiable predictions |
| Measuring internal metrics only | Do not reflect customer experience | Focus on observable outputs |
| Big bang experiments | Blast radius too large | Start with smallest scope |
| No baseline | Cannot compare results | Collect 7+ days of metrics first |
| Skipping stakeholder buy-in | Creates political problems | Get approval before execution |
Use `templates/experiment-template.md` or generate with:

```shell
python scripts/generate_experiment.py \
  --name "Database Failover Resilience" \
  --system "Payment Service" \
  --owner "Jane Smith" \
  --output .agents/chaos/
```
Before executing any chaos experiment:
| Skill | Relationship |
|---|---|
| security | Security review for production experiments |
| devops | CI/CD integration for automated chaos |
| qa | Test strategy alignment |
| analyst | Root cause analysis of findings |