Skill

resilience-design

Designs resilience patterns including circuit breakers, retries, bulkheads, timeouts, and chaos engineering practices. Trigger: "resilience design", "circuit breaker", "fault tolerance", "chaos engineering", "bulkhead pattern", "retry strategy".

From sovereign-architect

Install

Run in your terminal

npx claudepluginhub javimontano/mao-sovereign-architect

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, Bash, Agent

Supporting Assets

evals/evals.json

examples/sample-output.md

prompts/use-case-prompts.md

references/body-of-knowledge.md

Stats

Stars: 0

Forks: 0

Last Commit: Mar 28, 2026

Resilience Design

Designs fault-tolerant architectures using circuit breakers, retry policies, bulkheads, timeouts, and chaos engineering practices to ensure systems degrade gracefully under failure conditions.

Guiding Principle

"Hope is not a strategy. Design for failure, because failure is not a possibility — it is an inevitability."

Procedure

Step 1 — Failure Mode Analysis

Map all external dependencies: databases, APIs, message brokers, DNS, CDN, third-party services.
For each dependency, identify failure modes: timeout, error response, partial failure, data corruption, total outage.
Classify failures by frequency (common, rare, black swan) and impact (degraded, partial outage, total outage).
Identify cascading failure paths: where does one failure trigger another?
Document the current blast radius for each failure mode.
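
One way to record the output of this step is a small failure-mode register. The sketch below is illustrative only: the class names, the frequency-times-impact score, and the `triggers` field are assumptions made for this example, not part of the skill itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Frequency(IntEnum):
    BLACK_SWAN = 1
    RARE = 2
    COMMON = 3


class Impact(IntEnum):
    DEGRADED = 1
    PARTIAL_OUTAGE = 2
    TOTAL_OUTAGE = 3


@dataclass(frozen=True)
class FailureMode:
    dependency: str                  # hypothetical name, e.g. "orders-db"
    mode: str                        # timeout, error response, partial failure, ...
    frequency: Frequency
    impact: Impact
    triggers: tuple[str, ...] = ()   # components that fail as a consequence

    @property
    def risk_score(self) -> int:
        # Assumed heuristic used only for ordering: frequency x impact.
        return int(self.frequency) * int(self.impact)


def rank_by_risk(modes: list[FailureMode]) -> list[FailureMode]:
    """Highest risk first; ties keep their input order (sorted is stable)."""
    return sorted(modes, key=lambda m: m.risk_score, reverse=True)


def blast_radius(start: str, modes: list[FailureMode]) -> set[str]:
    """Follow 'triggers' edges to find everything a failure of `start` can reach."""
    edges: dict[str, set[str]] = {}
    for m in modes:
        edges.setdefault(m.dependency, set()).update(m.triggers)
    reached: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt != start and nxt not in reached:
                reached.add(nxt)
                stack.append(nxt)
    return reached
```

Walking the `triggers` edges makes cascading paths explicit: a dependency whose blast radius includes a critical path is a strong candidate for a circuit breaker or bulkhead in Step 2.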

Step 2 — Apply Resilience Patterns

Circuit Breaker: Prevent cascading failures by stopping calls to a failing dependency after a threshold; periodically test for recovery.
Retry with Backoff: Retry transient failures with exponential backoff and jitter so that clients do not retry in lockstep and create a thundering herd.
Timeout: Set explicit timeouts for all external calls; never wait indefinitely.
Bulkhead: Isolate failure domains so that one failing component cannot exhaust resources needed by others.
Fallback: Define degraded responses when a dependency is unavailable (cached data, default values, feature disablement).
Select patterns per dependency based on failure mode analysis.
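
As a rough illustration of the first two patterns, the sketch below shows a minimal circuit breaker and a retry helper with exponential backoff and full jitter. The names (`CircuitBreaker`, `retry_with_backoff`), thresholds, and defaults are assumptions for this example. The code is single-threaded and omits the metrics, locking, and limited half-open probing that a production library would add.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"   # recovery window elapsed: allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open trial, or too many failures while closed, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)             # full jitter: uniform in [0, cap)
    raise AssertionError("unreachable")
```

Because `CircuitOpenError` is not in `retry_on`, wrapping `breaker.call(...)` inside `retry_with_backoff` stops retrying as soon as the circuit opens. That is one way to keep the two patterns from working against each other.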

Step 3 — Design for Graceful Degradation

Define the degradation hierarchy: which features are sacrificed first, second, third?
Implement load shedding: reject low-priority requests under extreme load to protect critical paths.
Design rate limiting: per-client, per-endpoint, and global rate limits with clear HTTP 429 (Too Many Requests) responses.
Specify the health check architecture: liveness (is the process running?) vs. readiness (can it serve traffic?).
Plan for data center / availability zone failure: multi-AZ deployment, cross-region failover.
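
The rate-limiting item above can be sketched with a token bucket. This is a minimal single-process example: `TokenBucket`, `handle`, and the rate and capacity values are hypothetical, and the injectable clock is there only to make the behavior testable. A real deployment would typically keep one bucket per client or endpoint and share state across instances.

```python
import math
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def try_acquire(self) -> float:
        """Return 0.0 if the request is admitted, else seconds until a token is available."""
        now = self._clock()
        refill = (now - self._updated) * self.rate
        self._tokens = min(self.capacity, self._tokens + refill)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return 0.0
        return (1.0 - self._tokens) / self.rate


def handle(bucket: TokenBucket) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for one request; the handler shape is an assumption."""
    wait = bucket.try_acquire()
    if wait == 0.0:
        return 200, {}
    # Tell the client when to come back instead of letting it retry blindly.
    return 429, {"Retry-After": str(math.ceil(wait))}
```

Returning `Retry-After` with the 429 gives well-behaved clients a concrete backoff hint, which complements the retry policy from Step 2.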

Step 4 — Chaos Engineering

Define steady-state hypotheses: what does "normal" look like in measurable terms?
Design chaos experiments: inject latency, kill instances, simulate network partitions, exhaust resources.
Start small: single-service chaos in staging before production experiments.
Establish the blast radius controls: abort conditions, rollback procedures, scope limits.
Schedule regular game days to practice incident response.
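
A chaos experiment can be reduced to three parts: a fault injector, a steady-state hypothesis, and an abort condition. The sketch below is a toy in-process version under assumed names (`with_faults`, `run_experiment`, `InjectedFault`). Real experiments usually inject faults at the network or infrastructure layer rather than by wrapping a function.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Marks a failure deliberately introduced by the experiment."""


def with_faults(dependency: Callable[[], object], fault_rate: float,
                rng: random.Random) -> Callable[[], object]:
    """Wrap a dependency call so that a fraction of calls fail on purpose."""
    def wrapped() -> object:
        if rng.random() < fault_rate:
            raise InjectedFault("simulated dependency failure")
        return dependency()
    return wrapped


def run_experiment(request: Callable[[], object], calls: int,
                   max_error_rate: float, min_sample: int = 20) -> dict:
    """Steady-state hypothesis: user-facing error rate stays <= max_error_rate.

    Aborts early (blast radius control) once the hypothesis is clearly violated.
    """
    errors = 0
    for done in range(1, calls + 1):
        try:
            request()
        except Exception:
            errors += 1
        if done >= min_sample and errors / done > max_error_rate:
            return {"aborted": True, "calls": done, "error_rate": errors / done}
    rate = errors / calls if calls else 0.0
    return {"aborted": False, "calls": calls, "error_rate": rate}
```

For example, a service that falls back to cached data should keep its error rate at zero even when `fault_rate` is 1.0. The same experiment run against a service without a fallback aborts as soon as `min_sample` calls have been observed.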

Quality Criteria

Every external dependency has an explicit timeout, retry policy, and circuit breaker configuration.
Graceful degradation is defined for at least the top 3 failure scenarios.
Chaos experiments are documented and run at least quarterly in production.
Health checks distinguish between liveness and readiness, preventing premature traffic routing.
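
The first criterion lends itself to an automated check. The sketch below assumes a hypothetical configuration layout in which each dependency maps to a dict with `timeout_seconds`, `retry`, and `circuit_breaker` keys. These key names are invented for the example, so adapt them to whatever format the system actually uses.

```python
# Hypothetical setting names; not taken from any specific framework.
REQUIRED_KEYS = ("timeout_seconds", "retry", "circuit_breaker")


def missing_resilience_settings(config: dict[str, dict]) -> dict[str, list[str]]:
    """Return, per dependency, the required settings that are absent or None."""
    gaps: dict[str, list[str]] = {}
    for name, settings in config.items():
        missing = [key for key in REQUIRED_KEYS if settings.get(key) is None]
        if missing:
            gaps[name] = missing
    return gaps


example = {
    "payments-api": {
        "timeout_seconds": 2.0,
        "retry": {"max_attempts": 3},
        "circuit_breaker": {"failure_threshold": 5},
    },
    "search": {"timeout_seconds": 1.0},
}
```

Running `missing_resilience_settings(example)` flags `search` as lacking `retry` and `circuit_breaker`. A check like this could run in CI so that new dependencies cannot ship without explicit settings.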

Anti-Patterns

Retrying without backoff or jitter, amplifying the problem on a struggling service ("retry storm").
Circuit breakers that never open because thresholds are set too high ("decoration-only circuit breaker").
Infinite timeouts or no timeouts on external HTTP calls ("waiting forever").
Testing only happy paths and never simulating dependency failures in integration tests.
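
To counter the last anti-pattern, a failure-path test can be as small as the sketch below. It uses the standard library's `unittest.mock` to simulate a timing-out dependency. `get_recommendations`, the `client.fetch` signature, and the empty-list fallback are hypothetical and stand in for whatever degraded response the design defines.

```python
from unittest import mock


def get_recommendations(client, user_id: str) -> list:
    """Return recommendations, degrading to an empty list if the dependency fails."""
    try:
        return client.fetch(user_id, timeout=0.5)   # explicit timeout on the external call
    except (TimeoutError, ConnectionError):
        return []                                   # fallback: degraded but valid response


def test_timeout_falls_back_to_empty_list() -> None:
    client = mock.Mock()
    client.fetch.side_effect = TimeoutError("simulated timeout")
    assert get_recommendations(client, "u1") == []
    client.fetch.assert_called_once_with("u1", timeout=0.5)


def test_happy_path_returns_dependency_result() -> None:
    client = mock.Mock()
    client.fetch.return_value = ["a", "b"]
    assert get_recommendations(client, "u1") == ["a", "b"]
```

Pairing each happy-path test with at least one simulated failure keeps the fallback code exercised, so it is less likely to rot unnoticed.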