Use when designing globally distributed systems, multi-region architectures, or disaster recovery strategies. Covers region selection, active-active vs active-passive, data replication, and failover patterns.
Designs globally distributed multi-region deployments for high availability, low latency, and disaster recovery.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Comprehensive guide to deploying applications across multiple geographic regions for availability, performance, and disaster recovery.
Reasons for Multi-Region:
1. High Availability
└── Survive region-wide failures
└── Natural disasters, power outages
└── Target: 99.99%+ uptime
2. Low Latency
└── Serve users from nearest region
└── Reduce round-trip time
└── Better user experience
3. Data Residency
└── GDPR, data sovereignty laws
└── Keep data in specific countries
└── Compliance requirements
4. Disaster Recovery
└── Business continuity
└── RTO/RPO requirements
└── Regulatory requirements
Trade-offs:
+ Higher availability
+ Lower latency globally
+ Compliance capability
- Higher cost (2x-3x or more)
- Increased complexity
- Data consistency challenges
Model 1: Active-Passive (DR)
┌──────────────────┐            ┌─────────────────────┐
│ PRIMARY (Active) │            │ SECONDARY (Passive) │
│  ┌─────────────┐ │            │  ┌─────────────┐    │
│  │ App         │ │     ──►    │  │ App         │    │
│  │ (Live)      │ │    Sync    │  │ (Standby)   │    │
│  └─────────────┘ │            │  └─────────────┘    │
│  ┌─────────────┐ │            │  ┌─────────────┐    │
│  │ DB          │ │     ──►    │  │ DB          │    │
│  │ (Primary)   │ │    Replic  │  │ (Replica)   │    │
│  └─────────────┘ │            │  └─────────────┘    │
└──────────────────┘            └─────────────────────┘
    All traffic                     Failover only
Model 2: Active-Active (Load Distributed)
┌──────────────────┐            ┌──────────────────┐
│     REGION A     │            │     REGION B     │
│  ┌─────────────┐ │    ◄──►    │  ┌─────────────┐ │
│  │ App         │ │   Users    │  │ App         │ │
│  │ (Active)    │ │  routed by │  │ (Active)    │ │
│  └─────────────┘ │  location  │  └─────────────┘ │
│  ┌─────────────┐ │            │  ┌─────────────┐ │
│  │ DB          │ │    ◄──►    │  │ DB          │ │
│  │ (Primary)   │ │   Replic   │  │ (Primary)   │ │
│  └─────────────┘ │  both ways │  └─────────────┘ │
└──────────────────┘            └──────────────────┘
  Serves Region A                 Serves Region B
Model 3: Active-Active-Active (Global)
┌──────┐    ┌──────┐    ┌──────┐
│  US  │◄──►│  EU  │◄──►│ APAC │
│Active│    │Active│    │Active│
└──┬───┘    └──┬───┘    └──┬───┘
   │           │           │
   └───────────┼───────────┘
               │
     Global Load Balancer
      routes by location
Region Selection Factors:
1. User Location
□ Where are your users?
□ Latency requirements per region?
□ User concentration (80/20 rule)?
2. Compliance Requirements
□ Data residency laws (GDPR, etc.)
□ Government regulations
□ Industry requirements (HIPAA, PCI)
3. Cloud Provider Availability
□ Not all services in all regions
□ Service feature parity
□ Regional pricing differences
4. Network Connectivity
□ Internet exchange points
□ Direct connect options
□ Cross-region latency
5. Disaster Risk
□ Natural disaster patterns
□ Political stability
□ Infrastructure reliability
6. Cost
□ Compute/storage pricing varies
□ Data transfer costs (egress)
□ Support availability
Region Pair Strategy:
Americas:
- Primary: US East (N. Virginia)
- Secondary: US West (Oregon) or US East (Ohio)
- Distance: 2,500-3,000 km
- Latency: ~60ms
Europe:
- Primary: EU West (Ireland)
- Secondary: EU Central (Frankfurt) or EU West (London)
- Distance: ~1,000-1,500 km
- Latency: ~20-30ms
Asia Pacific:
- Primary: Singapore or Tokyo
- Secondary: Sydney or Mumbai
- Distance: 5,000-7,000 km
- Latency: ~100-150ms
Global Triad:
- US East + EU West + Singapore/Tokyo
- Covers most global users
- <100ms to 80%+ of users
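To make region selection concrete, here is a minimal sketch that maps a user to the nearest region of the global triad by great-circle distance. The coordinates and region names are illustrative assumptions; production systems should prefer measured latency (see latency-based routing below) over raw distance.

```python
# Minimal sketch: pick the nearest region of the triad for a user.
# Region names and coordinates are illustrative, not authoritative.
from math import radians, sin, cos, asin, sqrt

REGIONS = {
    "us-east":      (38.9, -77.0),   # N. Virginia (approx.)
    "eu-west":      (53.3, -6.3),    # Ireland (approx.)
    "ap-southeast": (1.35, 103.8),   # Singapore (approx.)
}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points (Earth radius ~6371 km).
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_region(user_lat, user_lon):
    return min(REGIONS,
               key=lambda r: haversine_km(user_lat, user_lon, *REGIONS[r]))

# A user in Paris should land in eu-west.
assert nearest_region(48.9, 2.35) == "eu-west"
```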
Pattern 1: Async Replication (Most Common)
Primary ──────► Replica
         (lag: ms to seconds)
+ Lower latency for writes
+ Primary not blocked by replica
- Potential data loss on failover (RPO > 0)
- Replication lag visible
Pattern 2: Sync Replication
Primary ◄─────► Replica
         (both confirm)
+ No data loss on failover (RPO = 0)
+ Strong consistency
- Higher write latency
- Availability coupled to both regions
Pattern 3: Semi-Sync Replication
Primary ──────► At least 1 Replica (sync)
   └──────────► Other Replicas (async)
+ Guaranteed durability for some replicas
+ Balance of latency and durability
- More complex failure handling
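A minimal sketch of the semi-sync pattern, assuming a hypothetical `send_to_replica` transport stub: the write returns once at least `sync_count` replicas acknowledge, while the remaining sends keep running in the background.

```python
# Minimal semi-sync replication sketch. `send_to_replica` is a stub
# standing in for shipping a log entry over the network, not a real
# driver API.
import concurrent.futures

POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def send_to_replica(replica, record):
    # Stub: in reality this ships the WAL/change entry to `replica`.
    return True

def replicate_write(record, replicas, sync_count=1, timeout_s=0.5):
    futures = [POOL.submit(send_to_replica, r, record) for r in replicas]
    acked = 0
    try:
        # Semi-sync: block only until `sync_count` replicas confirm;
        # the remaining sends complete asynchronously after we return.
        for f in concurrent.futures.as_completed(futures, timeout=timeout_s):
            if f.result():
                acked += 1
                if acked >= sync_count:
                    return True   # durable on primary + >=1 replica
    except concurrent.futures.TimeoutError:
        pass
    return False  # not enough acks in time; caller may abort the commit
```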
Multi-Primary Conflict Resolution:
Scenario: Same record updated in two regions simultaneously
Resolution Strategies:
1. Last Write Wins (LWW)
└── Timestamp-based
└── Simple but can lose data
└── Clock sync important
2. First Write Wins
└── First committed wins
└── Later writes rejected or queued
└── Good for "create once" data
3. Application-Level Resolution
└── Custom merge logic
└── Most flexible
└── Most complex
4. CRDTs (Conflict-free Replicated Data Types)
└── Mathematically guaranteed convergence
└── Counters, sets, maps
└── Good for specific use cases (see the G-Counter sketch below)
Best Practice:
- Design to avoid conflicts where possible
- Partition data by region when appropriate
- Use single-primary for conflict-sensitive data
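The sketch referenced above: a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so concurrent updates in different regions converge without coordination.

```python
# Minimal G-Counter CRDT sketch.
class GCounter:
    def __init__(self, region):
        self.region = region
        self.counts = {}  # region -> count

    def increment(self, n=1):
        # Each region only ever writes its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is what guarantees convergence.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)   # concurrent updates...
eu.increment(2)   # ...in two regions
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5  # both sides converge
```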
Failover Types:
1. DNS-Based Failover
┌─────────────────────────────────────────┐
│ DNS Health Check                        │
│ ├── Check primary every 10-30s          │
│ ├── 3 consecutive failures = unhealthy  │
│ └── Update DNS to point to secondary    │
└─────────────────────────────────────────┘
RTO: 60-300 seconds (DNS TTL + propagation)
Pros: Simple, works with any app
Cons: Slow failover, DNS caching issues
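A minimal sketch of the health-check loop driving DNS failover. `point_dns_at` is a hypothetical stand-in for a DNS provider API call; managed offerings (e.g. Route 53 health checks) do this for you.

```python
# Minimal DNS-failover health checker: N consecutive probe failures
# mark the primary unhealthy and trigger a DNS update.
import time
import urllib.request

def probe(url, timeout_s=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:   # timeouts, refused connections, DNS errors
        return False

def point_dns_at(region):
    print(f"updating DNS record to {region}")  # provider-API stand-in

def watch(primary_url, failures_to_trip=3, interval_s=30):
    consecutive = 0
    while True:
        if probe(primary_url):
            consecutive = 0            # any success resets the counter
        else:
            consecutive += 1
            if consecutive >= failures_to_trip:
                point_dns_at("secondary")
                return  # failback is a separate, deliberate step
        time.sleep(interval_s)
```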
2. Load Balancer Failover
┌─────────────────────────────────────────┐
│ Global Load Balancer                    │
│ ├── Continuous health checks            │
│ ├── Instant routing changes             │
│ └── No DNS propagation wait             │
└─────────────────────────────────────────┘
RTO: 10-60 seconds
Pros: Fast, reliable
Cons: Requires GLB, potential single point of failure
3. Application-Level Failover
┌─────────────────────────────────────────┐
│ Client/App Aware                        │
│ ├── Client retries to alternate region  │
│ ├── SDK handles failover                │
│ └── No infrastructure dependency        │
└─────────────────────────────────────────┘
RTO: 1-10 seconds
Pros: Fastest, most control
Cons: Requires client changes
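A minimal sketch of application-level failover, assuming illustrative regional endpoints: the client walks an ordered list of regions and falls through to the next on connection errors.

```python
# Minimal client-side failover sketch. Endpoint URLs are placeholders.
import urllib.request

REGION_ENDPOINTS = [
    "https://us-east.api.example.com",   # home region first
    "https://eu-west.api.example.com",   # alternate region
]

def get_with_failover(path, timeout_s=2):
    last_error = None
    for base in REGION_ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout_s) as resp:
                return resp.read()
        except OSError as exc:   # refused, timeout, DNS failure...
            last_error = exc     # ...fall through to the next region
    raise last_error  # every region failed; surface the final error
```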
Recovery Objectives:
RTO (Recovery Time Objective):
└── Maximum acceptable downtime
└── Time from failure to recovery
└── Drives failover automation investment
RPO (Recovery Point Objective):
└── Maximum acceptable data loss
└── Time between last backup and failure
└── Drives replication strategy
Common Targets:
┌──────────────┬───────────┬───────────┬───────────────────┐
│ Tier         │ RTO       │ RPO       │ Strategy          │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Critical     │ <1 min    │ 0         │ Active-active     │
│              │           │           │ Sync replication  │
├──────────────┼───────────┼───────────┼───────────────────┤
│ High         │ <15 min   │ <1 min    │ Active-passive    │
│              │           │           │ Hot standby       │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Medium       │ <4 hours  │ <1 hour   │ Warm standby      │
│              │           │           │ Async replication │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Low          │ <24 hours │ <24 hours │ Backup/Restore    │
│              │           │           │ Pilot light       │
└──────────────┴───────────┴───────────┴───────────────────┘
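A small sketch that encodes the table above as data and checks a measured incident against a tier's targets; the numbers mirror the table and are assumptions to tune, not fixed standards.

```python
# Tier targets in seconds, mirroring the table above (assumed values).
TIERS = {
    "critical": {"rto": 60,        "rpo": 0},
    "high":     {"rto": 15 * 60,   "rpo": 60},
    "medium":   {"rto": 4 * 3600,  "rpo": 3600},
    "low":      {"rto": 24 * 3600, "rpo": 24 * 3600},
}

def meets_tier(tier, downtime_s, data_loss_window_s):
    t = TIERS[tier]
    return downtime_s <= t["rto"] and data_loss_window_s <= t["rpo"]

# A 10-minute outage that lost 30s of writes satisfies "high"
# but not "critical".
assert meets_tier("high", 600, 30)
assert not meets_tier("critical", 600, 30)
```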
GLB Routing Policies:
1. Geolocation Routing
└── Route by user's geographic location
└── Europe users → EU region
└── Fallback for unmapped locations
2. Latency-Based Routing
└── Route to lowest latency region
└── Based on real measurements
└── Adapts to network conditions
3. Weighted Routing
└── Split traffic by percentage
└── Good for rollouts, testing
└── Example: 90% primary, 10% secondary (see sketch below)
4. Failover Routing
└── Primary region until unhealthy
└── Automatic switch to secondary
└── Health check driven
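The weighted-routing sketch referenced above: pick a region with probability proportional to its weight, e.g. a 90/10 split during a rollout. The weights are illustrative.

```python
# Minimal weighted-routing sketch.
import random

def pick_region(weights):
    # Sample one region with probability proportional to its weight.
    regions, values = zip(*weights.items())
    return random.choices(regions, weights=values, k=1)[0]

region = pick_region({"us-east": 90, "eu-west": 10})
```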
Cloud Implementations:
- AWS: Route 53, Global Accelerator
- Azure: Traffic Manager, Front Door
- GCP: Cloud Load Balancing
- Cloudflare: Load Balancing
Session Affinity in Multi-Region:
Challenge: User session state across regions
Option 1: Sticky Sessions
└── User stays in same region for session
└── Failover loses session
└── Simple but limited DR
Option 2: Centralized Session Store
└── Session in Redis/database
└── All regions access same store
└── Adds latency, single point of failure
Option 3: Distributed Session Store
└── Redis Cluster across regions
└── Session replicated
└── Complex but resilient
Option 4: Stateless (JWT/Token)
└── Session in client-side token
└── No server-side state
└── Best for multi-region (see the token sketch below)
Recommendation:
- Prefer stateless where possible
- If stateful, use distributed store
- Design for session loss on failover
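The token sketch referenced above: a minimal HMAC-signed, client-held token built from the Python standard library. It is a simplified stand-in for a real JWT library; the signing secret must be shared across every region (e.g. via a secrets manager).

```python
# Minimal stateless-session sketch: any region holding the shared
# secret can validate the token with no session store.
import base64, hashlib, hmac, json

SECRET = b"rotate-me"  # placeholder; never hard-code in production

def issue_token(claims: dict) -> str:
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"

def verify_token(token: str) -> dict | None:
    body, _, sig = token.partition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different secret
    return json.loads(base64.urlsafe_b64decode(body))

token = issue_token({"user": "u123", "exp": 1735689600})
assert verify_token(token)["user"] == "u123"
```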
Option 1: Single Primary + Read Replicas
┌───────────────┐            ┌───────────────┐
│    US-EAST    │            │    EU-WEST    │
│  ┌─────────┐  │    ───►    │  ┌─────────┐  │
│  │ Primary │  │    Async   │  │ Replica │  │
│  │ (R/W)   │  │    Replic  │  │ (Read)  │  │
│  └─────────┘  │            │  └─────────┘  │
└───────────────┘            └───────────────┘
- Writes go to primary region
- Reads served locally
- Failover promotes replica
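A minimal sketch of read-local/write-primary routing for Option 1. The connection objects are hypothetical stand-ins for real database handles, and the read/write classification is deliberately naive.

```python
# Minimal single-primary router: reads go to the nearby replica,
# writes always go to the primary region.
class RegionalRouter:
    def __init__(self, local_region, primary_region, connections):
        self.local = connections[local_region]      # replica (read-only)
        self.primary = connections[primary_region]  # sole writer

    def connection_for(self, query: str):
        # Naive classification: anything that isn't a SELECT is a write.
        is_read = query.lstrip().upper().startswith("SELECT")
        return self.local if is_read else self.primary
```

On failover, repointing `primary` at the promoted replica is the only change the router needs.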
Option 2: Multi-Primary (Active-Active)
┌───────────────┐            ┌───────────────┐
│    US-EAST    │◄──────────►│    EU-WEST    │
│  ┌─────────┐  │   Bi-dir   │  ┌─────────┐  │
│  │ Primary │  │   Replic   │  │ Primary │  │
│  │ (R/W)   │  │            │  │ (R/W)   │  │
│  └─────────┘  │            │  └─────────┘  │
└───────────────┘            └───────────────┘
- Writes accepted in both regions
- Conflict resolution required
- Complex but lowest latency
Option 3: Globally Distributed Database
┌─────────────────────────────────────────┐
│   CockroachDB / Spanner / YugabyteDB    │
│   ┌─────┐      ┌─────┐      ┌─────┐     │
│   │ US  │──────│ EU  │──────│ APAC│     │
│   └─────┘      └─────┘      └─────┘     │
│   Automatic sharding and replication    │
└─────────────────────────────────────────┘
- Database handles distribution
- Strong consistency available
- Higher latency for writes
Multi-Region Chaos Tests:
1. Region Failover Test
□ Fail primary region completely
□ Measure failover time (see the drill sketch below)
□ Verify data integrity
□ Test user experience
2. Network Partition Test
□ Block inter-region communication
□ Verify split-brain handling
□ Test conflict resolution
3. Partial Failure Test
□ Fail subset of services in region
□ Test degraded operation
□ Verify monitoring/alerting
4. Data Replication Lag Test
□ Introduce artificial lag
□ Test application behavior
□ Verify consistency expectations
5. Failback Test
□ Restore failed region
□ Test data sync
□ Test traffic redistribution
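The failover-drill sketch referenced above, with hypothetical `fail_region`, `restore_region`, and `probe_serving_region` hooks into your own tooling; it returns the measured RTO in seconds.

```python
# Minimal region-failover drill: mark the primary failed, then measure
# how long until traffic lands in the secondary.
import time

def failover_drill(fail_region, restore_region, probe_serving_region,
                   deadline_s=300, poll_s=1.0):
    fail_region("us-east")            # simulate total region loss
    start = time.monotonic()
    try:
        while time.monotonic() - start < deadline_s:
            if probe_serving_region() == "eu-west":
                return time.monotonic() - start  # measured RTO (seconds)
            time.sleep(poll_s)
        raise AssertionError("failover did not complete within deadline")
    finally:
        restore_region("us-east")     # always fail back after the drill
```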
Schedule:
- Failover tests: Monthly
- Full DR drill: Quarterly
- Chaos experiments: Weekly
Multi-Region Best Practices:
1. Design for Failure
□ Assume any region can fail
□ No single points of failure
□ Automated failover
□ Regular testing
2. Data Strategy
□ Define consistency requirements
□ Choose appropriate replication
□ Plan for conflicts
□ Consider data residency
3. Observability
□ Cross-region metrics
□ Distributed tracing
□ Centralized logging
□ Region-aware alerting
4. Cost Management
□ Right-size standby resources
□ Use reserved capacity wisely
□ Monitor data transfer costs
□ Consider traffic patterns
5. Operational Readiness
□ Runbooks for failover
□ Regular DR drills
□ On-call training
□ Post-incident reviews
Related skills:
- latency-optimization - Reducing global latency
- distributed-consensus - Consistency patterns
- cdn-architecture - Edge caching for multi-region
- chaos-engineering-fundamentals - Testing resilience