Provides patterns for building fault-tolerant distributed systems that handle failures gracefully. Use when implementing circuit breakers, retries with exponential backoff, bulkheads, timeouts, or fallback strategies to prevent cascading failures and maintain availability.
Patterns for building systems that handle failures, degrade gracefully, and recover automatically.
In distributed systems, failure is not exceptional—it's normal.
Networks fail. Services crash. Databases time out.
The question isn't IF but WHEN.
Resilience = The ability to handle failures gracefully
Goals:
- Prevent cascading failures
- Degrade gracefully
- Recover automatically
- Maintain availability
What: Automatically retry failed operations
When: Transient failures (network blips, temporary unavailability)
Simple retry:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Request │────►│ Failure │────►│  Retry  │───► Success
└─────────┘     └─────────┘     └─────────┘
With backoff:
Request → Fail → Wait 100ms → Retry
Fail → Wait 200ms → Retry
Fail → Wait 400ms → Retry
Fail → Give up
Backoff strategies:
- Fixed: Wait same time each retry
- Linear: 100ms, 200ms, 300ms...
- Exponential: 100ms, 200ms, 400ms, 800ms...
- Exponential + Jitter: Add randomness to prevent thundering herd
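A minimal Python sketch of retry with exponential backoff and full jitter; `TransientError` is a hypothetical stand-in for whichever exceptions your client treats as retryable:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for retryable failures (timeouts, 503s); client errors (400s) should not be retried."""

def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Call `operation`, retrying transient failures with exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                                        # out of attempts: give up
            # Exponential backoff: 100ms, 200ms, 400ms, ... capped at max_delay
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Full jitter: sleep a random fraction of the delay to avoid a thundering herd
            time.sleep(random.uniform(0, delay))
```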
Do:
- Add jitter to prevent thundering herd
- Set maximum retry count
- Use exponential backoff
- Only retry transient failures
- Log retries for visibility
Don't:
- Retry non-idempotent operations blindly
- Retry client errors (400s)
- Retry indefinitely
- Use same delay for all retries
What: Stop calling a failing service temporarily
When: Service is consistently failing
States:
  ┌────────┐   failures exceed threshold   ┌────────┐
  │ CLOSED │ ─────────────────────────────►│  OPEN  │
  └────────┘                               └────────┘
      ▲                                      │     ▲
      │ success                      timeout │     │ failure
      │                                      ▼     │
      │                                 ┌──────────┴┐
      └─────────────────────────────────│ HALF-OPEN │
                                        └───────────┘
CLOSED: Normal operation, requests flow through
OPEN: Failures exceeded threshold, fail fast
HALF-OPEN: Testing if service recovered
Key parameters:
Failure threshold: How many failures before the breaker opens
- Too low: Opens on minor issues
- Too high: Doesn't protect enough
- Typical: 5-10 failures or 50% error rate
Timeout (open duration): How long to stay open
- Too short: May retry too quickly
- Too long: Slow recovery
- Typical: 30-60 seconds
Success threshold: How many successes to close from half-open
- Typically 1-3 successful requests
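A minimal Python sketch of the state machine above; the default parameters mirror the typical values listed, and a production implementation would add thread safety and a sliding error-rate window:

```python
import time

class CircuitBreaker:
    """Counts consecutive failures, fails fast while open, and probes again after a cooldown."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"              # cooldown elapsed: allow a trial request
            self.successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"                   # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"             # dependency looks healthy again
                self.failures = 0
        else:
            self.failures = 0                     # reset the consecutive-failure count
```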
What: Isolate components to contain failures
When: Prevent one failure from taking down everything
Ship analogy:
┌─────────────────────────────────────────────┐
│ Ship without bulkheads │
│ ┌───────────────────────────────────────┐ │
│ │ One hole → Entire ship floods │ │
│ └───────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Ship with bulkheads │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ │ │ X │ │ │ │ │ │
│ │ OK │ │Flood │ │ OK │ │ OK │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ One compartment floods, others stay dry │
└─────────────────────────────────────────────┘
Thread pool isolation:
┌────────────────────────────────────────────────────────┐
│ Application │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Service A Pool │ │ Service B Pool │ │
│ │ [10 threads] │ │ [10 threads] │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ ▼ ▼ │
│ Service A Service B │
│ (slow) (healthy) │
└────────────────────────────────────────────────────────┘
Service A being slow doesn't exhaust threads for Service B
Semaphore isolation:
- Limit concurrent requests per dependency
- Lighter weight than thread pools
- Good for async operations
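A sketch of semaphore isolation in Python; one instance per dependency, with the pool size of 10 mirroring the diagram above:

```python
import threading

class Bulkhead:
    """At most max_concurrent in-flight calls per dependency; excess calls are rejected
    immediately instead of queuing and tying up threads."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return operation()
        finally:
            self._slots.release()

# One bulkhead per dependency, so a slow Service A cannot starve calls to Service B.
service_a_bulkhead = Bulkhead(max_concurrent=10)
service_b_bulkhead = Bulkhead(max_concurrent=10)
```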
What: Limit how long to wait for operations
When: Always (every external call needs a timeout)
Without timeout:
Request → Service hangs → Caller waits forever → Resources exhausted
With timeout:
Request → Service hangs → Timeout after 5s → Caller handles failure
Timeout types:
- Connection timeout: Time to establish connection
- Read timeout: Time to receive response
- Overall timeout: Total time allowed
Setting timeouts:
- Connection: 1-5 seconds (fast fail)
- Read: Based on p99 latency + buffer
- Overall: Sum of connection + read + processing
Example:
Connection timeout: 2s
Read timeout: 10s (p99 is 5s, 2x buffer)
Overall timeout: 15s
Cascade consideration:
If A calls B calls C:
- C timeout < B timeout < A timeout
- Each layer needs a buffer for its own retries
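For illustration, the Python `requests` library accepts a (connection, read) timeout tuple matching the example above; the endpoint URL is hypothetical, and an overall deadline still has to be enforced by the caller because the read timeout only bounds the wait between bytes:

```python
import requests

try:
    response = requests.get(
        "https://pricing.internal/api/quote",   # hypothetical endpoint
        timeout=(2.0, 10.0),                    # (connection timeout, read timeout)
    )
    response.raise_for_status()
except requests.Timeout:
    ...  # treat as a failure: retry, trip the breaker, or fall back
```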
What: Provide alternative when primary fails
When: There's a degraded but acceptable alternative
Fallback options:
┌────────────────────────────────────────────────────────┐
│ Primary fails? Options: │
│ │
│ 1. Cached data: Return stale but valid data │
│ 2. Default value: Return safe default │
│ 3. Degraded service: Reduced functionality │
│ 4. Alternative service: Different provider │
│ 5. Graceful error: Friendly error message │
└────────────────────────────────────────────────────────┘
Example:
Primary: Real-time price service
Fallback 1: Cached price (< 5 min old)
Fallback 2: Last known price with warning
Fallback 3: "Price temporarily unavailable"
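A sketch of that fallback chain; `price_client` and `cache` are illustrative stand-ins for the real-time service and a local cache of `(price, fetched_at)` entries:

```python
import time

def get_price(product_id, price_client, cache, max_age_seconds=300):
    """Return (price, source) so the caller can add a staleness warning when needed."""
    try:
        return price_client.current_price(product_id), "live"      # primary: real-time service
    except Exception:
        entry = cache.get(product_id)                               # (price, fetched_at) or None
        if entry:
            price, fetched_at = entry
            age = time.time() - fetched_at
            return price, ("cached" if age <= max_age_seconds else "stale")
        return None, "unavailable"        # caller renders "Price temporarily unavailable"
```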
What: Control the rate of requests
When: Protect services from overload
Client-side rate limiting:
- Limit outgoing requests
- Prevent overwhelming dependencies
Server-side rate limiting:
- Limit incoming requests
- Protect from traffic spikes
See: rate-limiting-patterns skill for details
Request Flow:
┌─ Timeout ──────────────────────────────┐   ← Overall request timeout
│ ┌─ Retry ────────────────────────────┐ │   ← With exponential backoff
│ │ ┌─ Circuit Breaker ──────────────┐ │ │   ← Fail fast if service down
│ │ │ ┌─ Bulkhead ─────────────────┐ │ │ │   ← Isolate from other calls
│ │ │ │        Service call        │ │ │ │
│ │ │ └────────────────────────────┘ │ │ │
│ │ └────────────────────────────────┘ │ │
│ └────────────────────────────────────┘ │
└────────────────────────────────────────┘
                    │
                    ▼
           Failure? ──► Fallback
Outer to inner:
1. Timeout: Overall time limit
2. Retry: Attempt recovery from transient failures
3. Circuit Breaker: Stop calling failed services
4. Bulkhead: Isolate this call from others
5. [Call service]
6. Fallback: Handle failures gracefully
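A sketch of that ordering, reusing the `CircuitBreaker` and `Bulkhead` sketches from the earlier sections; the parameter values are illustrative:

```python
import random
import time

def resilient_call(operation, breaker, bulkhead, fallback,
                   attempts=3, base_delay=0.1, deadline_seconds=15.0):
    """Overall deadline > retries > circuit breaker > bulkhead > the call, then fallback."""
    start = time.monotonic()
    for attempt in range(1, attempts + 1):
        if time.monotonic() - start > deadline_seconds:
            break                                             # 1. overall timeout exhausted
        try:
            # 3. circuit breaker wraps 4. bulkhead wraps 5. the actual call
            return breaker.call(lambda: bulkhead.call(operation))
        except Exception:
            if attempt < attempts:                            # 2. retry with backoff + jitter
                time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    return fallback()                                         # 6. graceful fallback
```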
When system is overloaded:
- Accept what you can handle
- Reject the rest gracefully
- Better to serve some users well than all users poorly
Priority-based shedding:
- High priority: Never shed
- Medium: Shed under moderate load
- Low: Shed first
Approaches:
1. Queue-based:
- Fixed-size queue
- Reject when queue full
- Serve based on priority
2. Rate-based:
- Maximum requests per second
- Reject when exceeded
- Return 503 or 429
3. Adaptive:
- Monitor latency/error rate
- Reduce acceptance as stress increases
- Recover as system stabilizes
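A sketch of priority-based shedding using an in-flight request counter; the limits are illustrative, and an adaptive variant would adjust them from latency and error-rate metrics:

```python
import threading

class LoadShedder:
    """Sheds low-priority requests first as the number of in-flight requests rises."""

    def __init__(self, soft_limit=80, hard_limit=100):
        self.soft_limit = soft_limit       # above this, shed low-priority requests
        self.hard_limit = hard_limit       # above this, shed everything except high priority
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self, priority):
        """priority: 'high', 'medium', or 'low'. False means reject (e.g. return 503/429)."""
        with self._lock:
            if priority == "high":
                admit = True
            elif priority == "medium":
                admit = self.in_flight < self.hard_limit
            else:
                admit = self.in_flight < self.soft_limit
            if admit:
                self.in_flight += 1
            return admit

    def release(self):
        with self._lock:
            self.in_flight -= 1
```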
Level 0: Full functionality
└── Everything works normally
Level 1: Non-essential features disabled
└── Recommendations off, analytics delayed
Level 2: Reduced functionality
└── Read-only mode, cached data only
Level 3: Minimal functionality
└── Core features only, no personalization
Level 4: Maintenance mode
└── Static page, "be back soon"
Transition:
Automatic based on system health metrics
or manual via feature flags
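A sketch of an automatic transition rule; the metric thresholds are illustrative and would normally be tuned per system or overridden by a feature flag:

```python
def degradation_level(p99_latency_ms, error_rate):
    """Map current health metrics to the degradation levels above."""
    if error_rate > 0.50:
        return 4   # maintenance mode
    if error_rate > 0.20 or p99_latency_ms > 5000:
        return 3   # core features only
    if error_rate > 0.05 or p99_latency_ms > 2000:
        return 2   # read-only / cached data
    if p99_latency_ms > 1000:
        return 1   # disable non-essential features
    return 0       # full functionality
```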
E-commerce during high load:
Full feature:
- Real-time inventory
- Personalized recommendations
- Live chat support
- Detailed analytics
Degraded:
- Cached inventory (5 min delay)
- Generic recommendations
- Contact form only
- Analytics queued
Minimal:
- Static "in stock" status
- No recommendations
- Email support only
- Analytics dropped
1. Liveness check:
"Is the process alive?"
- Simple ping
- Returns 200 if running
- Used for restart decisions
2. Readiness check:
"Can it handle traffic?"
- Checks dependencies
- Returns 200 if ready
- Used for load balancer
3. Deep health check:
"Is everything working?"
- Comprehensive checks
- May be slower
- Used for monitoring/debugging
Do:
- Keep liveness checks simple and fast
- Check all critical dependencies in readiness
- Include version/build info in response
- Return appropriate status codes
Don't:
- Block liveness on dependencies
- Include heavy operations in health checks
- Expose sensitive information
- Forget to handle dependency timeouts
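A sketch of separate liveness and readiness endpoints using Flask; the `/healthz` and `/readyz` paths follow common Kubernetes convention, and the dependency checks are placeholders:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Placeholder: in practice, run a cheap query with its own short timeout.
    return True

def check_cache():
    # Placeholder: in practice, ping the cache with its own short timeout.
    return True

@app.route("/healthz")
def liveness():
    # Liveness: is the process alive? Keep it trivial; never block on dependencies.
    return jsonify(status="ok", version="1.2.3"), 200      # version string is illustrative

@app.route("/readyz")
def readiness():
    # Readiness: can we serve traffic right now? Check critical dependencies.
    checks = {"database": check_database(), "cache": check_cache()}
    healthy = all(checks.values())
    return jsonify(status="ok" if healthy else "degraded", checks=checks), (200 if healthy else 503)
```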
1. Unit tests:
- Test each pattern in isolation
- Mock failures
- Verify behavior
2. Integration tests:
- Test pattern combinations
- Inject failures
- Verify recovery
3. Chaos engineering:
- Test in production-like environment
- Random failures
- Verify system behavior
See: chaos-engineering-fundamentals skill
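A unit-test sketch using pytest that injects failures and verifies the `CircuitBreaker` sketch from earlier trips and then fails fast:

```python
import pytest

def test_breaker_opens_after_threshold():
    breaker = CircuitBreaker(failure_threshold=3, open_seconds=30.0)

    def always_fails():
        raise ConnectionError("simulated dependency outage")

    for _ in range(3):                              # inject three consecutive failures
        with pytest.raises(ConnectionError):
            breaker.call(always_fails)

    assert breaker.state == "OPEN"
    with pytest.raises(RuntimeError):               # now it fails fast without calling the dependency
        breaker.call(always_fails)
```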
Libraries (recommended):
- Polly (.NET)
- Resilience4j (Java)
- Hystrix (Java, deprecated)
- go-resiliency (Go)
Benefits:
- Battle-tested
- Well-documented
- Community support
- Metrics built-in
Custom implementation:
- Only when you have needs existing libraries don't cover
- High maintenance burden
- Risk of subtle bugs
Metrics to track:
Circuit Breaker:
- State changes
- Open duration
- Failure rate
Retries:
- Retry count
- Retry success rate
- Final success/failure
Bulkhead:
- Concurrent calls
- Rejections
- Queue depth
Timeouts:
- Timeout count
- Latency distribution
1. Every external call needs a timeout
No call should wait forever
2. Retry only transient failures
Don't retry 400 errors
3. Circuit breaker per dependency
Different services need different protection
4. Bulkhead critical paths
Isolate important from less important
5. Plan fallbacks
Know what to do when things fail
6. Monitor everything
Can't fix what you can't see
7. Test failure paths
Happy path tests aren't enough
chaos-engineering-fundamentals - Testing resilience
distributed-transactions - Handling failures in transactions
rate-limiting-patterns - Controlling request rates