From agentic-qe-fleet
Guides chaos engineering experiments for distributed systems: define steady states, inject failures (network latency, instance kills, app exceptions), observe metrics (error rate, latency), and validate recovery.
npx claudepluginhub proffesor-for-testing/agentic-qe --plugin agentic-qe-fleet
This skill is limited to using the following tools:
<default_to_action>
<default_to_action> When testing system resilience or injecting failures:
Quick Chaos Steps:
Critical Success Factors:
| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
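As a rough illustration of the Application row, exception injection can be sketched in plain JavaScript without any of the tools listed. The names `withInjectedExceptions`, `errorRate`, and `random` are hypothetical and chosen for this sketch; they are not part of Gremlin or LitmusChaos.

```javascript
// Hypothetical helper: wraps a function so that a fraction of calls throw.
// `random` is injectable so an experiment can be replayed deterministically.
function withInjectedExceptions(fn, { errorRate, random = Math.random }) {
  return function (...args) {
    if (random() < errorRate) {
      throw new Error('chaos: injected exception');
    }
    return fn(...args);
  };
}

// Example: a stand-in price lookup where roughly 20% of calls fail.
const lookupPrice = (sku) => ({ sku, price: 42 });
const flakyLookup = withInjectedExceptions(lookupPrice, { errorRate: 0.2 });
```

Callers of `flakyLookup` should exercise the same retry and fallback paths they would need during a real dependency failure.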
Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
    ↓           ↓         ↓                   ↓
  Learn     Validate   Careful         Full confidence
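The progression above can be sketched as a loop that only advances while each stage stays healthy. This is a minimal sketch: `STAGES`, `escalate`, and the `runStage` callback are hypothetical names, and a real run would wait and observe metrics between stages.

```javascript
// Stage names mirror the progression above; the identifiers are illustrative.
const STAGES = ['dev', 'staging', 'prod-1%', 'prod-10%', 'prod-50%', 'prod-100%'];

// Runs the experiment stage by stage and stops at the first stage whose
// result violates steady state. `runStage` is supplied by the caller and
// must return an object of the form { healthy: boolean }.
function escalate(stages, runStage) {
  const completed = [];
  for (const stage of stages) {
    const result = runStage(stage);
    if (!result.healthy) {
      return { completed, haltedAt: stage };
    }
    completed.push(stage);
  }
  return { completed, haltedAt: null };
}
```

Halting at the first unhealthy stage keeps the blast radius at the smallest scope that exposed the weakness.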
| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | > 20% below baseline |
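A check against the alert thresholds in the table might look like the sketch below. It assumes metrics arrive as plain numbers (error rate as a fraction, latency in milliseconds) and reads the throughput threshold as a drop of more than 20% from baseline; `THRESHOLDS` and `findViolations` are hypothetical names.

```javascript
// Alert thresholds taken from the table above.
const THRESHOLDS = {
  maxErrorRate: 0.01,      // alert when error rate > 1%
  maxP99LatencyMs: 500,    // alert when p99 latency > 500ms
  maxThroughputDrop: 0.2   // alert when throughput is > 20% below baseline
};

// Returns the names of all metrics that currently breach their threshold.
function findViolations(current, baseline, thresholds = THRESHOLDS) {
  const violations = [];
  if (current.errorRate > thresholds.maxErrorRate) violations.push('errorRate');
  if (current.p99LatencyMs > thresholds.maxP99LatencyMs) violations.push('p99Latency');
  const throughputFloor = baseline.throughput * (1 - thresholds.maxThroughputDrop);
  if (current.throughput < throughputFloor) violations.push('throughput');
  return violations;
}
```

An empty result means the system is still inside its alert limits; any entry is a signal to stop the experiment.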
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
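The definition above stores durations and the rollback trigger as strings. One way to interpret them is sketched below; `parseDuration` and `shouldRollback` are hypothetical helpers, and the accepted string grammar is an assumption based only on the values shown ('500ms', '5m', 'errorRate > 5%').

```javascript
// Converts strings such as '500ms', '30s', or '5m' to milliseconds.
function parseDuration(text) {
  const match = /^(\d+)(ms|s|m)$/.exec(text);
  if (!match) throw new Error(`Unsupported duration: ${text}`);
  const unitMs = { ms: 1, s: 1000, m: 60000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Evaluates a trigger of the form '<metric> > <number>%' against observed
// metrics expressed as fractions (0.05 means 5%).
function shouldRollback(trigger, metrics) {
  const match = /^(\w+)\s*>\s*(\d+(?:\.\d+)?)%$/.exec(trigger);
  if (!match) throw new Error(`Unsupported trigger: ${trigger}`);
  const [, metric, percent] = match;
  return metrics[metric] > Number(percent) / 100;
}
```

With these helpers, `shouldRollback('errorRate > 5%', { errorRate: 0.08 })` evaluates to true, so an 8% error rate would end the experiment early.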
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
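To make `blastRadius` and `steadyStateHypothesis` concrete, here is a minimal sketch of how such values could be applied. It does not show how qe-chaos-engineer works internally: `instancesInBlastRadius` and `hypothesisHolds` are hypothetical, and the sketch takes the first instances in the list where the real experiment terminates a random one.

```javascript
// Returns the instances covered by a blast radius such as '10%'.
// At least one instance is always selected so that a fault is injected.
function instancesInBlastRadius(instances, blastRadius) {
  const percent = parseFloat(blastRadius); // '10%' -> 10
  const count = Math.max(1, Math.floor((instances.length * percent) / 100));
  return instances.slice(0, count);
}

// Checks a success-rate hypothesis, e.g. threshold 0.99 for 99% success.
function hypothesisHolds(successes, total, threshold) {
  if (total === 0) return false; // no traffic means nothing was verified
  return successes / total >= threshold;
}
```

For a 20-instance service and a 10% blast radius this selects two instances; the hypothesis holds while at least 99% of requests succeed during the fault.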
aqe/chaos-engineering/
├── experiments/* - Experiment definitions & results
├── steady-states/* - Baseline measurements
├── runbooks/* - Generated recovery procedures
└── blast-radius/* - Impact analysis
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});
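A sequential topology can be pictured as each step running in list order and seeing the results of the steps before it. The sketch below illustrates that idea only; `runSequential` and the step shape are hypothetical and say nothing about how `FleetManager.coordinate` is implemented.

```javascript
// Runs steps one after another. Each step receives the context accumulated
// so far, and its return value is stored under the step's name.
function runSequential(steps, initialContext = {}) {
  return steps.reduce(
    (context, step) => ({ ...context, [step.name]: step.run(context) }),
    initialContext
  );
}

// Illustrative stand-ins: the second step reads the first step's output.
const demoSteps = [
  { name: 'baseline', run: () => ({ p99LatencyMs: 180 }) },
  { name: 'experiment', run: (ctx) => ({ latencyBudgetMs: ctx.baseline.p99LatencyMs * 2 }) }
];
const demoResult = runSequential(demoSteps);
```

Passing the accumulated context forward is what lets a later step compare its observations against an earlier baseline.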
Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and resilience validation, and generates runbooks from experiment results.