Use when designing globally distributed systems, multi-region architectures, or disaster recovery strategies. Covers region selection, active-active vs active-passive, data replication, and failover patterns.
Designs globally distributed multi-region deployments for high availability, low latency, and disaster recovery.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Comprehensive guide to deploying applications across multiple geographic regions for availability, performance, and disaster recovery.
Reasons for Multi-Region:
1. High Availability
└── Survive region-wide failures
└── Natural disasters, power outages
└── Target: 99.99%+ uptime
2. Low Latency
└── Serve users from nearest region
└── Reduce round-trip time
└── Better user experience
3. Data Residency
└── GDPR, data sovereignty laws
└── Keep data in specific countries
└── Compliance requirements
4. Disaster Recovery
└── Business continuity
└── RTO/RPO requirements
└── Regulatory requirements
Trade-offs:
+ Higher availability
+ Lower latency globally
+ Compliance capability
- Higher cost (2x-3x or more)
- Increased complexity
- Data consistency challenges
Model 1: Active-Passive (DR)
┌──────────────────┐            ┌─────────────────────┐
│ PRIMARY (Active) │            │ SECONDARY (Passive) │
│  ┌─────────────┐ │            │  ┌─────────────┐    │
│  │ App         │ │     ──►    │  │ App         │    │
│  │ (Live)      │ │    Sync    │  │ (Standby)   │    │
│  └─────────────┘ │            │  └─────────────┘    │
│  ┌─────────────┐ │            │  ┌─────────────┐    │
│  │ DB          │ │     ──►    │  │ DB          │    │
│  │ (Primary)   │ │    Replic  │  │ (Replica)   │    │
│  └─────────────┘ │            │  └─────────────┘    │
└──────────────────┘            └─────────────────────┘
    All traffic                     Failover only
Model 2: Active-Active (Load Distributed)
┌──────────────────┐            ┌──────────────────┐
│     REGION A     │            │     REGION B     │
│  ┌─────────────┐ │    ◄──►    │  ┌─────────────┐ │
│  │ App         │ │   Users    │  │ App         │ │
│  │ (Active)    │ │  routed by │  │ (Active)    │ │
│  └─────────────┘ │  location  │  └─────────────┘ │
│  ┌─────────────┐ │            │  ┌─────────────┐ │
│  │ DB          │ │    ◄──►    │  │ DB          │ │
│  │ (Primary)   │ │   Replic   │  │ (Primary)   │ │
│  └─────────────┘ │  both ways │  └─────────────┘ │
└──────────────────┘            └──────────────────┘
  Serves Region A                 Serves Region B
Model 3: Active-Active-Active (Global)
┌──────┐    ┌──────┐    ┌──────┐
│  US  │◄──►│  EU  │◄──►│ APAC │
│Active│    │Active│    │Active│
└──┬───┘    └──┬───┘    └──┬───┘
   │           │           │
   └───────────┼───────────┘
               │
     Global Load Balancer
      routes by location
Region Selection Factors:
1. User Location
□ Where are your users?
□ Latency requirements per region?
□ User concentration (80/20 rule)?
2. Compliance Requirements
□ Data residency laws (GDPR, etc.)
□ Government regulations
□ Industry requirements (HIPAA, PCI)
3. Cloud Provider Availability
□ Not all services in all regions
□ Service feature parity
□ Regional pricing differences
4. Network Connectivity
□ Internet exchange points
□ Direct connect options
□ Cross-region latency
5. Disaster Risk
□ Natural disaster patterns
□ Political stability
□ Infrastructure reliability
6. Cost
□ Compute/storage pricing varies
□ Data transfer costs (egress)
□ Support availability
Region Pair Strategy:
Americas:
- Primary: US East (N. Virginia)
- Secondary: US West (Oregon) or US East (Ohio)
- Distance: 2,500-3,000 km
- Latency: ~60ms
Europe:
- Primary: EU West (Ireland)
- Secondary: EU Central (Frankfurt) or EU West (London)
- Distance: ~1,000-1,500 km
- Latency: ~20-30ms
Asia Pacific:
- Primary: Singapore or Tokyo
- Secondary: Sydney or Mumbai
- Distance: 5,000-7,000 km
- Latency: ~100-150ms
Global Triad:
- US East + EU West + Singapore/Tokyo
- Covers most global users
- <100ms to 80%+ of users
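To make region selection concrete, here is a minimal sketch that maps a user to the nearest region of the global triad by great-circle distance. The coordinates and region names are illustrative assumptions; production systems should prefer measured latency (see latency-based routing below) over raw distance.

```python
# Minimal sketch: pick the nearest region of the triad for a user.
# Region names and coordinates are illustrative, not authoritative.
from math import radians, sin, cos, asin, sqrt

REGIONS = {
    "us-east":      (38.9, -77.0),   # N. Virginia (approx.)
    "eu-west":      (53.3, -6.3),    # Ireland (approx.)
    "ap-southeast": (1.35, 103.8),   # Singapore (approx.)
}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points (Earth radius ~6371 km).
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_region(user_lat, user_lon):
    return min(REGIONS,
               key=lambda r: haversine_km(user_lat, user_lon, *REGIONS[r]))

# A user in Paris should land in eu-west.
assert nearest_region(48.9, 2.35) == "eu-west"
```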
Pattern 1: Async Replication (Most Common)
Primary ──────► Replica
         (lag: ms to seconds)
+ Lower latency for writes
+ Primary not blocked by replica
- Potential data loss on failover (RPO > 0)
- Replication lag visible
Pattern 2: Sync Replication
Primary ◄─────► Replica
         (both confirm)
+ No data loss on failover (RPO = 0)
+ Strong consistency
- Higher write latency
- Availability coupled to both regions
Pattern 3: Semi-Sync Replication
Primary ──────► At least 1 Replica (sync)
   └──────────► Other Replicas (async)
+ Guaranteed durability for some replicas
+ Balance of latency and durability
- More complex failure handling
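A minimal sketch of the semi-sync pattern, assuming a hypothetical `send_to_replica` transport stub: the write returns once at least `sync_count` replicas acknowledge, while the remaining sends keep running in the background.

```python
# Minimal semi-sync replication sketch. `send_to_replica` is a stub
# standing in for shipping a log entry over the network, not a real
# driver API.
import concurrent.futures

POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def send_to_replica(replica, record):
    # Stub: in reality this ships the WAL/change entry to `replica`.
    return True

def replicate_write(record, replicas, sync_count=1, timeout_s=0.5):
    futures = [POOL.submit(send_to_replica, r, record) for r in replicas]
    acked = 0
    try:
        # Semi-sync: block only until `sync_count` replicas confirm;
        # the remaining sends complete asynchronously after we return.
        for f in concurrent.futures.as_completed(futures, timeout=timeout_s):
            if f.result():
                acked += 1
                if acked >= sync_count:
                    return True   # durable on primary + >=1 replica
    except concurrent.futures.TimeoutError:
        pass
    return False  # not enough acks in time; caller may abort the commit
```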
Multi-Primary Conflict Resolution:
Scenario: Same record updated in two regions simultaneously
Resolution Strategies:
1. Last Write Wins (LWW)
└── Timestamp-based
└── Simple but can lose data
└── Clock sync important
2. First Write Wins
└── First committed wins
└── Later writes rejected or queued
└── Good for "create once" data
3. Application-Level Resolution
└── Custom merge logic
└── Most flexible
└── Most complex
4. CRDTs (Conflict-free Replicated Data Types)
└── Mathematically guaranteed convergence
└── Counters, sets, maps
└── Good for specific use cases (see the G-Counter sketch below)
Best Practice:
- Design to avoid conflicts where possible
- Partition data by region when appropriate
- Use single-primary for conflict-sensitive data
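The sketch referenced above: a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so concurrent updates in different regions converge without coordination.

```python
# Minimal G-Counter CRDT sketch.
class GCounter:
    def __init__(self, region):
        self.region = region
        self.counts = {}  # region -> count

    def increment(self, n=1):
        # Each region only ever writes its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is what guarantees convergence.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)   # concurrent updates...
eu.increment(2)   # ...in two regions
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5  # both sides converge
```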
Failover Types:
1. DNS-Based Failover
┌─────────────────────────────────────────┐
│ DNS Health Check                        │
│ ├── Check primary every 10-30s          │
│ ├── 3 consecutive failures = unhealthy  │
│ └── Update DNS to point to secondary    │
└─────────────────────────────────────────┘
RTO: 60-300 seconds (DNS TTL + propagation)
Pros: Simple, works with any app
Cons: Slow failover, DNS caching issues
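A minimal sketch of the health-check loop driving DNS failover. `point_dns_at` is a hypothetical stand-in for a DNS provider API call; managed offerings (e.g. Route 53 health checks) do this for you.

```python
# Minimal DNS-failover health checker: N consecutive probe failures
# mark the primary unhealthy and trigger a DNS update.
import time
import urllib.request

def probe(url, timeout_s=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:   # timeouts, refused connections, DNS errors
        return False

def point_dns_at(region):
    print(f"updating DNS record to {region}")  # provider-API stand-in

def watch(primary_url, failures_to_trip=3, interval_s=30):
    consecutive = 0
    while True:
        if probe(primary_url):
            consecutive = 0            # any success resets the counter
        else:
            consecutive += 1
            if consecutive >= failures_to_trip:
                point_dns_at("secondary")
                return  # failback is a separate, deliberate step
        time.sleep(interval_s)
```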
2. Load Balancer Failover
┌─────────────────────────────────────────┐
│ Global Load Balancer                    │
│ ├── Continuous health checks            │
│ ├── Instant routing changes             │
│ └── No DNS propagation wait             │
└─────────────────────────────────────────┘
RTO: 10-60 seconds
Pros: Fast, reliable
Cons: Requires GLB, potential single point of failure
3. Application-Level Failover
┌─────────────────────────────────────────┐
│ Client/App Aware                        │
│ ├── Client retries to alternate region  │
│ ├── SDK handles failover                │
│ └── No infrastructure dependency        │
└─────────────────────────────────────────┘
RTO: 1-10 seconds
Pros: Fastest, most control
Cons: Requires client changes
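A minimal sketch of application-level failover, assuming illustrative regional endpoints: the client walks an ordered list of regions and falls through to the next on connection errors.

```python
# Minimal client-side failover sketch. Endpoint URLs are placeholders.
import urllib.request

REGION_ENDPOINTS = [
    "https://us-east.api.example.com",   # home region first
    "https://eu-west.api.example.com",   # alternate region
]

def get_with_failover(path, timeout_s=2):
    last_error = None
    for base in REGION_ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout_s) as resp:
                return resp.read()
        except OSError as exc:   # refused, timeout, DNS failure...
            last_error = exc     # ...fall through to the next region
    raise last_error  # every region failed; surface the final error
```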
Recovery Objectives:
RTO (Recovery Time Objective):
└── Maximum acceptable downtime
└── Time from failure to recovery
└── Drives failover automation investment
RPO (Recovery Point Objective):
└── Maximum acceptable data loss
└── Time between last backup and failure
└── Drives replication strategy
Common Targets:
┌──────────────┬───────────┬───────────┬───────────────────┐
│ Tier         │ RTO       │ RPO       │ Strategy          │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Critical     │ <1 min    │ 0         │ Active-active     │
│              │           │           │ Sync replication  │
├──────────────┼───────────┼───────────┼───────────────────┤
│ High         │ <15 min   │ <1 min    │ Active-passive    │
│              │           │           │ Hot standby       │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Medium       │ <4 hours  │ <1 hour   │ Warm standby      │
│              │           │           │ Async replication │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Low          │ <24 hours │ <24 hours │ Backup/Restore    │
│              │           │           │ Pilot light       │
└──────────────┴───────────┴───────────┴───────────────────┘
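A small sketch that encodes the table above as data and checks a measured incident against a tier's targets; the numbers mirror the table and are assumptions to tune, not fixed standards.

```python
# Tier targets in seconds, mirroring the table above (assumed values).
TIERS = {
    "critical": {"rto": 60,        "rpo": 0},
    "high":     {"rto": 15 * 60,   "rpo": 60},
    "medium":   {"rto": 4 * 3600,  "rpo": 3600},
    "low":      {"rto": 24 * 3600, "rpo": 24 * 3600},
}

def meets_tier(tier, downtime_s, data_loss_window_s):
    t = TIERS[tier]
    return downtime_s <= t["rto"] and data_loss_window_s <= t["rpo"]

# A 10-minute outage that lost 30s of writes satisfies "high"
# but not "critical".
assert meets_tier("high", 600, 30)
assert not meets_tier("critical", 600, 30)
```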
GLB Routing Policies:
1. Geolocation Routing
└── Route by user's geographic location
└── Europe users → EU region
└── Fallback for unmapped locations
2. Latency-Based Routing
└── Route to lowest latency region
└── Based on real measurements
└── Adapts to network conditions
3. Weighted Routing
└── Split traffic by percentage
└── Good for rollouts, testing
└── Example: 90% primary, 10% secondary (see sketch below)
4. Failover Routing
└── Primary region until unhealthy
└── Automatic switch to secondary
└── Health check driven
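The weighted-routing sketch referenced above: pick a region with probability proportional to its weight, e.g. a 90/10 split during a rollout. The weights are illustrative.

```python
# Minimal weighted-routing sketch.
import random

def pick_region(weights):
    # Sample one region with probability proportional to its weight.
    regions, values = zip(*weights.items())
    return random.choices(regions, weights=values, k=1)[0]

region = pick_region({"us-east": 90, "eu-west": 10})
```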
Cloud Implementations:
- AWS: Route 53, Global Accelerator
- Azure: Traffic Manager, Front Door
- GCP: Cloud Load Balancing
- Cloudflare: Load Balancing
Session Affinity in Multi-Region:
Challenge: User session state across regions
Option 1: Sticky Sessions
└── User stays in same region for session
└── Failover loses session
└── Simple but limited DR
Option 2: Centralized Session Store
└── Session in Redis/database
└── All regions access same store
└── Adds latency, single point of failure
Option 3: Distributed Session Store
└── Redis Cluster across regions
└── Session replicated
└── Complex but resilient
Option 4: Stateless (JWT/Token)
└── Session in client-side token
└── No server-side state
└── Best for multi-region (see the token sketch below)
Recommendation:
- Prefer stateless where possible
- If stateful, use distributed store
- Design for session loss on failover
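The token sketch referenced above: a minimal HMAC-signed, client-held token built from the Python standard library. It is a simplified stand-in for a real JWT library; the signing secret must be shared across every region (e.g. via a secrets manager).

```python
# Minimal stateless-session sketch: any region holding the shared
# secret can validate the token with no session store.
import base64, hashlib, hmac, json

SECRET = b"rotate-me"  # placeholder; never hard-code in production

def issue_token(claims: dict) -> str:
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"

def verify_token(token: str) -> dict | None:
    body, _, sig = token.partition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different secret
    return json.loads(base64.urlsafe_b64decode(body))

token = issue_token({"user": "u123", "exp": 1735689600})
assert verify_token(token)["user"] == "u123"
```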
Option 1: Single Primary + Read Replicas
┌───────────────┐            ┌───────────────┐
│    US-EAST    │            │    EU-WEST    │
│  ┌─────────┐  │    ───►    │  ┌─────────┐  │
│  │ Primary │  │    Async   │  │ Replica │  │
│  │ (R/W)   │  │    Replic  │  │ (Read)  │  │
│  └─────────┘  │            │  └─────────┘  │
└───────────────┘            └───────────────┘
- Writes go to primary region
- Reads served locally
- Failover promotes replica
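A minimal sketch of read-local/write-primary routing for Option 1. The connection objects are hypothetical stand-ins for real database handles, and the read/write classification is deliberately naive.

```python
# Minimal single-primary router: reads go to the nearby replica,
# writes always go to the primary region.
class RegionalRouter:
    def __init__(self, local_region, primary_region, connections):
        self.local = connections[local_region]      # replica (read-only)
        self.primary = connections[primary_region]  # sole writer

    def connection_for(self, query: str):
        # Naive classification: anything that isn't a SELECT is a write.
        is_read = query.lstrip().upper().startswith("SELECT")
        return self.local if is_read else self.primary
```

On failover, repointing `primary` at the promoted replica is the only change the router needs.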
Option 2: Multi-Primary (Active-Active)
┌───────────────┐            ┌───────────────┐
│    US-EAST    │◄──────────►│    EU-WEST    │
│  ┌─────────┐  │   Bi-dir   │  ┌─────────┐  │
│  │ Primary │  │   Replic   │  │ Primary │  │
│  │ (R/W)   │  │            │  │ (R/W)   │  │
│  └─────────┘  │            │  └─────────┘  │
└───────────────┘            └───────────────┘
- Writes accepted in both regions
- Conflict resolution required
- Complex but lowest latency
Option 3: Globally Distributed Database
┌─────────────────────────────────────────┐
│   CockroachDB / Spanner / YugabyteDB    │
│   ┌─────┐      ┌─────┐      ┌─────┐     │
│   │ US  │──────│ EU  │──────│ APAC│     │
│   └─────┘      └─────┘      └─────┘     │
│   Automatic sharding and replication    │
└─────────────────────────────────────────┘
- Database handles distribution
- Strong consistency available
- Higher latency for writes
Multi-Region Chaos Tests:
1. Region Failover Test
□ Fail primary region completely
□ Measure failover time (see the drill sketch below)
□ Verify data integrity
□ Test user experience
2. Network Partition Test
□ Block inter-region communication
□ Verify split-brain handling
□ Test conflict resolution
3. Partial Failure Test
□ Fail subset of services in region
□ Test degraded operation
□ Verify monitoring/alerting
4. Data Replication Lag Test
□ Introduce artificial lag
□ Test application behavior
□ Verify consistency expectations
5. Failback Test
□ Restore failed region
□ Test data sync
□ Test traffic redistribution
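The failover-drill sketch referenced above, with hypothetical `fail_region`, `restore_region`, and `probe_serving_region` hooks into your own tooling; it returns the measured RTO in seconds.

```python
# Minimal region-failover drill: mark the primary failed, then measure
# how long until traffic lands in the secondary.
import time

def failover_drill(fail_region, restore_region, probe_serving_region,
                   deadline_s=300, poll_s=1.0):
    fail_region("us-east")            # simulate total region loss
    start = time.monotonic()
    try:
        while time.monotonic() - start < deadline_s:
            if probe_serving_region() == "eu-west":
                return time.monotonic() - start  # measured RTO (seconds)
            time.sleep(poll_s)
        raise AssertionError("failover did not complete within deadline")
    finally:
        restore_region("us-east")     # always fail back after the drill
```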
Schedule:
- Failover tests: Monthly
- Full DR drill: Quarterly
- Chaos experiments: Weekly
Multi-Region Best Practices:
1. Design for Failure
□ Assume any region can fail
□ No single points of failure
□ Automated failover
□ Regular testing
2. Data Strategy
□ Define consistency requirements
□ Choose appropriate replication
□ Plan for conflicts
□ Consider data residency
3. Observability
□ Cross-region metrics
□ Distributed tracing
□ Centralized logging
□ Region-aware alerting
4. Cost Management
□ Right-size standby resources
□ Use reserved capacity wisely
□ Monitor data transfer costs
□ Consider traffic patterns
5. Operational Readiness
□ Runbooks for failover
□ Regular DR drills
□ On-call training
□ Post-incident reviews
Related skills:
- latency-optimization - Reducing global latency
- distributed-consensus - Consistency patterns
- cdn-architecture - Edge caching for multi-region
- chaos-engineering-fundamentals - Testing resilience