Provides patterns for building fault-tolerant distributed systems that handle failures gracefully. Use when implementing circuit breakers, retries with exponential backoff, bulkheads, timeouts, or fallback strategies to prevent cascading failures and maintain availability.
Patterns for building systems that handle failures, degrade gracefully, and recover automatically.
In distributed systems, failure is not exceptional—it's normal.
Networks fail. Services crash. Databases time out.
The question isn't IF but WHEN.
Resilience = The ability to handle failures gracefully
Goals:
- Prevent cascading failures
- Degrade gracefully
- Recover automatically
- Maintain availability
What: Automatically retry failed operations
When: Transient failures (network blips, temporary unavailability)
Simple retry:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Request │────►│ Failure │────►│  Retry  │───► Success
└─────────┘     └─────────┘     └─────────┘
With backoff:
Request → Fail → Wait 100ms → Retry
Fail → Wait 200ms → Retry
Fail → Wait 400ms → Retry
Fail → Give up
Backoff strategies:
- Fixed: Wait same time each retry
- Linear: 100ms, 200ms, 300ms...
- Exponential: 100ms, 200ms, 400ms, 800ms...
- Exponential + Jitter: Add randomness to prevent thundering herd
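A minimal Python sketch of retry with exponential backoff and full jitter; `TransientError` is a hypothetical stand-in for whichever exceptions your client treats as retryable:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for retryable failures (timeouts, 503s); client errors (400s) should not be retried."""

def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Call `operation`, retrying transient failures with exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                                        # out of attempts: give up
            # Exponential backoff: 100ms, 200ms, 400ms, ... capped at max_delay
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Full jitter: sleep a random fraction of the delay to avoid a thundering herd
            time.sleep(random.uniform(0, delay))
```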
Do:
- Add jitter to prevent thundering herd
- Set maximum retry count
- Use exponential backoff
- Only retry transient failures
- Log retries for visibility
Don't:
- Retry non-idempotent operations blindly
- Retry client errors (400s)
- Retry indefinitely
- Use same delay for all retries
What: Stop calling a failing service temporarily
When: Service is consistently failing
States:
  ┌────────┐   failures exceed threshold   ┌────────┐
  │ CLOSED │ ─────────────────────────────►│  OPEN  │
  └────────┘                               └────────┘
      ▲                                      │     ▲
      │ success                      timeout │     │ failure
      │                                      ▼     │
      │                                 ┌──────────┴┐
      └─────────────────────────────────│ HALF-OPEN │
                                        └───────────┘
CLOSED: Normal operation, requests flow through
OPEN: Failures exceeded threshold, fail fast
HALF-OPEN: Testing if service recovered
Key parameters:
Failure threshold: How many failures before the breaker opens
- Too low: Opens on minor issues
- Too high: Doesn't protect enough
- Typical: 5-10 failures or 50% error rate
Timeout (open duration): How long to stay open
- Too short: May retry too quickly
- Too long: Slow recovery
- Typical: 30-60 seconds
Success threshold: How many successes to close from half-open
- Typically 1-3 successful requests
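A minimal Python sketch of the state machine above; the default parameters mirror the typical values listed, and a production implementation would add thread safety and a sliding error-rate window:

```python
import time

class CircuitBreaker:
    """Counts consecutive failures, fails fast while open, and probes again after a cooldown."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"              # cooldown elapsed: allow a trial request
            self.successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"                   # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"             # dependency looks healthy again
                self.failures = 0
        else:
            self.failures = 0                     # reset the consecutive-failure count
```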
What: Isolate components to contain failures
When: Prevent one failure from taking down everything
Ship analogy:
┌─────────────────────────────────────────────┐
│ Ship without bulkheads │
│ ┌───────────────────────────────────────┐ │
│ │ One hole → Entire ship floods │ │
│ └───────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Ship with bulkheads │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ │ │ X │ │ │ │ │ │
│ │ OK │ │Flood │ │ OK │ │ OK │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ One compartment floods, others stay dry │
└─────────────────────────────────────────────┘
Thread pool isolation:
┌────────────────────────────────────────────────────────┐
│ Application │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Service A Pool │ │ Service B Pool │ │
│ │ [10 threads] │ │ [10 threads] │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ ▼ ▼ │
│ Service A Service B │
│ (slow) (healthy) │
└────────────────────────────────────────────────────────┘
Service A being slow doesn't exhaust threads for Service B
Semaphore isolation:
- Limit concurrent requests per dependency
- Lighter weight than thread pools
- Good for async operations
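A sketch of semaphore isolation in Python; one instance per dependency, with the pool size of 10 mirroring the diagram above:

```python
import threading

class Bulkhead:
    """At most max_concurrent in-flight calls per dependency; excess calls are rejected
    immediately instead of queuing and tying up threads."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return operation()
        finally:
            self._slots.release()

# One bulkhead per dependency, so a slow Service A cannot starve calls to Service B.
service_a_bulkhead = Bulkhead(max_concurrent=10)
service_b_bulkhead = Bulkhead(max_concurrent=10)
```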
What: Limit how long to wait for operations
When: Always (every external call needs a timeout)
Without timeout:
Request → Service hangs → Caller waits forever → Resources exhausted
With timeout:
Request → Service hangs → Timeout after 5s → Caller handles failure
Timeout types:
- Connection timeout: Time to establish connection
- Read timeout: Time to receive response
- Overall timeout: Total time allowed
Setting timeouts:
- Connection: 1-5 seconds (fast fail)
- Read: Based on p99 latency + buffer
- Overall: Sum of connection + read + processing
Example:
Connection timeout: 2s
Read timeout: 10s (p99 is 5s, 2x buffer)
Overall timeout: 15s
Cascade consideration:
If A calls B calls C:
- C timeout < B timeout < A timeout
- Each layer needs a buffer for its own retries
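For illustration, the Python `requests` library accepts a (connection, read) timeout tuple matching the example above; the endpoint URL is hypothetical, and an overall deadline still has to be enforced by the caller because the read timeout only bounds the wait between bytes:

```python
import requests

try:
    response = requests.get(
        "https://pricing.internal/api/quote",   # hypothetical endpoint
        timeout=(2.0, 10.0),                    # (connection timeout, read timeout)
    )
    response.raise_for_status()
except requests.Timeout:
    ...  # treat as a failure: retry, trip the breaker, or fall back
```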
What: Provide alternative when primary fails
When: There's a degraded but acceptable alternative
Fallback options:
┌────────────────────────────────────────────────────────┐
│ Primary fails? Options: │
│ │
│ 1. Cached data: Return stale but valid data │
│ 2. Default value: Return safe default │
│ 3. Degraded service: Reduced functionality │
│ 4. Alternative service: Different provider │
│ 5. Graceful error: Friendly error message │
└────────────────────────────────────────────────────────┘
Example:
Primary: Real-time price service
Fallback 1: Cached price (< 5 min old)
Fallback 2: Last known price with warning
Fallback 3: "Price temporarily unavailable"
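A sketch of that fallback chain; `price_client` and `cache` are illustrative stand-ins for the real-time service and a local cache of `(price, fetched_at)` entries:

```python
import time

def get_price(product_id, price_client, cache, max_age_seconds=300):
    """Return (price, source) so the caller can add a staleness warning when needed."""
    try:
        return price_client.current_price(product_id), "live"      # primary: real-time service
    except Exception:
        entry = cache.get(product_id)                               # (price, fetched_at) or None
        if entry:
            price, fetched_at = entry
            age = time.time() - fetched_at
            return price, ("cached" if age <= max_age_seconds else "stale")
        return None, "unavailable"        # caller renders "Price temporarily unavailable"
```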
What: Control the rate of requests
When: Protect services from overload
Client-side rate limiting:
- Limit outgoing requests
- Prevent overwhelming dependencies
Server-side rate limiting:
- Limit incoming requests
- Protect from traffic spikes
See: rate-limiting-patterns skill for details
Request Flow:
┌─ Timeout ──────────────────────────────┐   ← Overall request timeout
│ ┌─ Retry ────────────────────────────┐ │   ← With exponential backoff
│ │ ┌─ Circuit Breaker ──────────────┐ │ │   ← Fail fast if service down
│ │ │ ┌─ Bulkhead ─────────────────┐ │ │ │   ← Isolate from other calls
│ │ │ │        Service call        │ │ │ │
│ │ │ └────────────────────────────┘ │ │ │
│ │ └────────────────────────────────┘ │ │
│ └────────────────────────────────────┘ │
└────────────────────────────────────────┘
                    │
                    ▼
           Failure? ──► Fallback
Outer to inner:
1. Timeout: Overall time limit
2. Retry: Attempt recovery from transient failures
3. Circuit Breaker: Stop calling failed services
4. Bulkhead: Isolate this call from others
5. [Call service]
6. Fallback: Handle failures gracefully
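A sketch of that ordering, reusing the `CircuitBreaker` and `Bulkhead` sketches from the earlier sections; the parameter values are illustrative:

```python
import random
import time

def resilient_call(operation, breaker, bulkhead, fallback,
                   attempts=3, base_delay=0.1, deadline_seconds=15.0):
    """Overall deadline > retries > circuit breaker > bulkhead > the call, then fallback."""
    start = time.monotonic()
    for attempt in range(1, attempts + 1):
        if time.monotonic() - start > deadline_seconds:
            break                                             # 1. overall timeout exhausted
        try:
            # 3. circuit breaker wraps 4. bulkhead wraps 5. the actual call
            return breaker.call(lambda: bulkhead.call(operation))
        except Exception:
            if attempt < attempts:                            # 2. retry with backoff + jitter
                time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    return fallback()                                         # 6. graceful fallback
```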
When system is overloaded:
- Accept what you can handle
- Reject the rest gracefully
- Better to serve some users well than all users poorly
Priority-based shedding:
- High priority: Never shed
- Medium: Shed under moderate load
- Low: Shed first
Approaches:
1. Queue-based:
- Fixed-size queue
- Reject when queue full
- Serve based on priority
2. Rate-based:
- Maximum requests per second
- Reject when exceeded
- Return 503 or 429
3. Adaptive:
- Monitor latency/error rate
- Reduce acceptance as stress increases
- Recover as system stabilizes
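A sketch of priority-based shedding using an in-flight request counter; the limits are illustrative, and an adaptive variant would adjust them from latency and error-rate metrics:

```python
import threading

class LoadShedder:
    """Sheds low-priority requests first as the number of in-flight requests rises."""

    def __init__(self, soft_limit=80, hard_limit=100):
        self.soft_limit = soft_limit       # above this, shed low-priority requests
        self.hard_limit = hard_limit       # above this, shed everything except high priority
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self, priority):
        """priority: 'high', 'medium', or 'low'. False means reject (e.g. return 503/429)."""
        with self._lock:
            if priority == "high":
                admit = True
            elif priority == "medium":
                admit = self.in_flight < self.hard_limit
            else:
                admit = self.in_flight < self.soft_limit
            if admit:
                self.in_flight += 1
            return admit

    def release(self):
        with self._lock:
            self.in_flight -= 1
```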
Level 0: Full functionality
└── Everything works normally
Level 1: Non-essential features disabled
└── Recommendations off, analytics delayed
Level 2: Reduced functionality
└── Read-only mode, cached data only
Level 3: Minimal functionality
└── Core features only, no personalization
Level 4: Maintenance mode
└── Static page, "be back soon"
Transition:
Automatic based on system health metrics
or manual via feature flags
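A sketch of an automatic transition rule; the metric thresholds are illustrative and would normally be tuned per system or overridden by a feature flag:

```python
def degradation_level(p99_latency_ms, error_rate):
    """Map current health metrics to the degradation levels above."""
    if error_rate > 0.50:
        return 4   # maintenance mode
    if error_rate > 0.20 or p99_latency_ms > 5000:
        return 3   # core features only
    if error_rate > 0.05 or p99_latency_ms > 2000:
        return 2   # read-only / cached data
    if p99_latency_ms > 1000:
        return 1   # disable non-essential features
    return 0       # full functionality
```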
E-commerce during high load:
Full feature:
- Real-time inventory
- Personalized recommendations
- Live chat support
- Detailed analytics
Degraded:
- Cached inventory (5 min delay)
- Generic recommendations
- Contact form only
- Analytics queued
Minimal:
- Static "in stock" status
- No recommendations
- Email support only
- Analytics dropped
1. Liveness check:
"Is the process alive?"
- Simple ping
- Returns 200 if running
- Used for restart decisions
2. Readiness check:
"Can it handle traffic?"
- Checks dependencies
- Returns 200 if ready
- Used for load balancer
3. Deep health check:
"Is everything working?"
- Comprehensive checks
- May be slower
- Used for monitoring/debugging
Do:
- Keep liveness checks simple and fast
- Check all critical dependencies in readiness
- Include version/build info in response
- Return appropriate status codes
Don't:
- Block liveness on dependencies
- Include heavy operations in health checks
- Expose sensitive information
- Forget to handle dependency timeouts
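A sketch of separate liveness and readiness endpoints using Flask; the `/healthz` and `/readyz` paths follow common Kubernetes convention, and the dependency checks are placeholders:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Placeholder: in practice, run a cheap query with its own short timeout.
    return True

def check_cache():
    # Placeholder: in practice, ping the cache with its own short timeout.
    return True

@app.route("/healthz")
def liveness():
    # Liveness: is the process alive? Keep it trivial; never block on dependencies.
    return jsonify(status="ok", version="1.2.3"), 200      # version string is illustrative

@app.route("/readyz")
def readiness():
    # Readiness: can we serve traffic right now? Check critical dependencies.
    checks = {"database": check_database(), "cache": check_cache()}
    healthy = all(checks.values())
    return jsonify(status="ok" if healthy else "degraded", checks=checks), (200 if healthy else 503)
```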
1. Unit tests:
- Test each pattern in isolation
- Mock failures
- Verify behavior
2. Integration tests:
- Test pattern combinations
- Inject failures
- Verify recovery
3. Chaos engineering:
- Test in production-like environment
- Random failures
- Verify system behavior
See: chaos-engineering-fundamentals skill
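A unit-test sketch using pytest that injects failures and verifies the `CircuitBreaker` sketch from earlier trips and then fails fast:

```python
import pytest

def test_breaker_opens_after_threshold():
    breaker = CircuitBreaker(failure_threshold=3, open_seconds=30.0)

    def always_fails():
        raise ConnectionError("simulated dependency outage")

    for _ in range(3):                              # inject three consecutive failures
        with pytest.raises(ConnectionError):
            breaker.call(always_fails)

    assert breaker.state == "OPEN"
    with pytest.raises(RuntimeError):               # now it fails fast without calling the dependency
        breaker.call(always_fails)
```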
Libraries (recommended):
- Polly (.NET)
- Resilience4j (Java)
- Hystrix (Java, deprecated)
- go-resiliency (Go)
Benefits:
- Battle-tested
- Well-documented
- Community support
- Metrics built-in
Custom implementation:
- Only when you have needs existing libraries don't cover
- High maintenance burden
- Risk of subtle bugs
Metrics to track:
Circuit Breaker:
- State changes
- Open duration
- Failure rate
Retries:
- Retry count
- Retry success rate
- Final success/failure
Bulkhead:
- Concurrent calls
- Rejections
- Queue depth
Timeouts:
- Timeout count
- Latency distribution
1. Every external call needs a timeout
No call should wait forever
2. Retry only transient failures
Don't retry 400 errors
3. Circuit breaker per dependency
Different services need different protection
4. Bulkhead critical paths
Isolate important from less important
5. Plan fallbacks
Know what to do when things fail
6. Monitor everything
Can't fix what you can't see
7. Test failure paths
Happy path tests aren't enough
chaos-engineering-fundamentals - Testing resilience
distributed-transactions - Handling failures in transactions
rate-limiting-patterns - Controlling request rates