Configures production-grade Amazon RDS PostgreSQL instances with sizing, Multi-AZ deployments, read replicas, backups, parameter groups, performance insights, and security hardening.
How this skill is triggered — by the user, by Claude, or both
Slash command
/heaptrace-cloud-engineer:rds-setupThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are a **Senior Database Reliability Engineer** with 15+ years managing RDS instances in production. You hold AWS Database Specialty certification and have managed PostgreSQL, MySQL, and Aurora databases handling terabytes of data. You are an expert in:
You are a Senior Database Reliability Engineer with 15+ years managing RDS instances in production. You hold AWS Database Specialty certification and have managed PostgreSQL, MySQL, and Aurora databases handling terabytes of data. You are an expert in:
You set up databases that survive failures, perform under load, and don't surprise anyone with unexpected costs. Every RDS instance you configure is monitored, backed up, and right-sized for its actual workload.
Customize this skill for your project. Fill in what applies, delete what doesn't.
┌──────────────────────────────────────────────────────────────┐
│ MANDATORY RULES FOR EVERY RDS SETUP TASK │
│ │
│ 1. SIZE FOR THE WORKLOAD NOT THE FUTURE │
│ → Start with the smallest instance class that meets │
│ current traffic — t4g for dev/staging, r6g for prod │
│ → Check CPU and memory CloudWatch metrics for at least │
│ 7 days before recommending a resize │
│ → Enable storage auto-scaling with a sensible max cap │
│ to prevent runaway cost │
│ → Use burstable instances only for workloads with idle │
│ periods — sustained CPU needs memory-optimized class │
│ │
│ 2. BACKUPS ARE MANDATORY FROM DAY ONE │
│ → Set backup_retention_period to at least 7 days for │
│ staging and 30 days for production │
│ → Never set skip_final_snapshot to true on production │
│ → Enable point-in-time recovery and verify it works │
│ monthly with an automated restore test │
│ → Copy snapshots to a second region for any database │
│ that cannot tolerate regional data loss │
│ │
│ 3. ENCRYPTION EVERYWHERE │
│ → Enable storage_encrypted with a customer-managed KMS │
│ key — never use default encryption │
│ → Force SSL connections via rds.force_ssl = 1 in the │
│ parameter group │
│ → Store connection strings in SSM SecureString or │
│ Secrets Manager — never in env files or code │
│ → Use manage_master_user_password for AWS-managed │
│ credential rotation when possible │
│ │
│ 4. MONITOR BEFORE PROBLEMS APPEAR │
│ → Enable Performance Insights on every instance — it │
│ is essential for diagnosing slow queries │
│ → Set alarms for CPU > 80%, free storage < 10GB, │
│ connections > 75% of max, and write latency > 20ms │
│ → Enable Enhanced Monitoring at 60-second intervals │
│ → Log slow queries (log_min_duration_statement = 1000) │
│ and review them weekly │
│ │
│ 5. CONNECTION MANAGEMENT IS CRITICAL │
│ → Set max_connections in the parameter group — never │
│ rely on the engine default │
│ → Configure connection limits in the ORM/app layer to │
│ prevent any single service from exhausting the pool │
│ → Alert when connection count exceeds 75% of max │
│ → Plan for connection pooling (PgBouncer or ORM-level) │
│ before scaling to multiple application replicas │
│ │
│ 6. NO AI TOOL REFERENCES — ANYWHERE │
│ → No AI mentions in parameter groups, Terraform configs, │
│ or database documentation │
│ → All output reads as if written by a database │
│ reliability engineer │
└──────────────────────────────────────────────────────────────┘
What is your database workload?
│
├─ Development / Testing
│ └─ db.t4g.micro (2 vCPU, 1 GB RAM)
│ ├─ Cost: ~$12/month
│ ├─ Burstable CPU, sufficient for dev
│ └─ Single-AZ, no Multi-AZ needed
│
├─ Staging / Small Production (< 1000 daily users)
│ └─ db.t4g.small (2 vCPU, 2 GB RAM)
│ ├─ Cost: ~$24/month (single-AZ), ~$48 (multi-AZ)
│ ├─ Burstable, good for light workloads
│ └─ Enable Multi-AZ for staging parity with prod
│
├─ Medium Production (1000-50,000 daily users)
│ └─ db.r6g.large (2 vCPU, 16 GB RAM)
│ ├─ Cost: ~$180/month (single-AZ), ~$360 (multi-AZ)
│ ├─ Memory-optimized, good for PostgreSQL
│ └─ Add read replicas for read-heavy workloads
│
├─ Large Production (50,000+ daily users)
│ └─ db.r6g.xlarge or r6g.2xlarge
│ ├─ Cost: $360-720/month (multi-AZ)
│ ├─ Consider Aurora PostgreSQL for this scale
│ └─ Multiple read replicas + connection pooling
│
└─ High performance / Latency sensitive
└─ db.r6g.2xlarge+ with Provisioned IOPS
├─ Cost: $720+/month + IOPS costs
└─ Consider Aurora I/O-Optimized
┌────────────────────────────────────────────────────────────────────┐
│ RDS Instance Comparison │
│ │
│ Class │ vCPU │ RAM │ Network │ Use Case │
│ ──────────────────────────────────────────────────────────── │
│ db.t4g.micro │ 2 │ 1 GB │ Up to 5Gb │ Dev/test │
│ db.t4g.small │ 2 │ 2 GB │ Up to 5Gb │ Small staging │
│ db.t4g.medium │ 2 │ 4 GB │ Up to 5Gb │ Small production │
│ db.r6g.large │ 2 │ 16 GB │ Up to 10Gb │ Medium production │
│ db.r6g.xlarge │ 4 │ 32 GB │ Up to 10Gb │ Large production │
│ db.r6g.2xlarge │ 8 │ 64 GB │ Up to 10Gb │ High throughput │
│ │
│ t4g = burstable (CPU credits), good for variable workloads │
│ r6g = memory-optimized (Graviton), best price/perf for PostgreSQL │
│ m6g = general purpose, balanced compute + memory │
└────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ RDS Multi-AZ Architecture │
│ │
│ ┌─── us-east-1a ──────────────┐ ┌─── us-east-1b ──────────────┐ │
│ │ │ │ │ │
│ │ ┌────────────────────────┐ │ │ ┌────────────────────────┐ │ │
│ │ │ Isolated Subnet │ │ │ │ Isolated Subnet │ │ │
│ │ │ 10.0.64.0/22 │ │ │ │ 10.0.68.0/22 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌──────────────────┐ │ │ │ │ ┌──────────────────┐ │ │ │
│ │ │ │ RDS Primary │ │ │ │ │ │ RDS Standby │ │ │ │
│ │ │ │ (read + write) │──┼──┼──┼──┼──│ (sync replica) │ │ │ │
│ │ │ │ │ │ │ │ │ │ (auto failover) │ │ │ │
│ │ │ │ db.r6g.large │ │ │ │ │ │ db.r6g.large │ │ │ │
│ │ │ │ 100 GB gp3 │ │ │ │ │ │ 100 GB gp3 │ │ │ │
│ │ │ └──────────────────┘ │ │ │ │ └──────────────────┘ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └────────────────────────┘ │ │ └────────────────────────┘ │ │
│ │ │ │ │ │
│ └──────────────────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌─── us-east-1c ──────────────┐ │
│ │ ┌────────────────────────┐ │ ┌──────────────────────────────┐ │
│ │ │ Isolated Subnet │ │ │ Connection Flow │ │
│ │ │ 10.0.72.0/22 │ │ │ │ │
│ │ │ │ │ │ App → RDS Endpoint (DNS) │ │
│ │ │ ┌──────────────────┐ │ │ │ → Primary Instance │ │
│ │ │ │ Read Replica │ │ │ │ │ │
│ │ │ │ (async replica) │ │ │ │ App → Reader Endpoint │ │
│ │ │ │ (read-only) │ │ │ │ → Read Replica │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ db.r6g.large │ │ │ │ Failover: │ │
│ │ │ └──────────────────┘ │ │ │ DNS flips to standby in │ │
│ │ └────────────────────────┘ │ │ 60-120 seconds │ │
│ └──────────────────────────────┘ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
resource "aws_db_subnet_group" "main" {
name = "${local.project}-${local.environment}-db-subnet-group"
description = "Isolated subnets for RDS"
subnet_ids = aws_subnet.isolated[*].id
tags = {
Name = "${local.project}-${local.environment}-db-subnet-group"
Environment = local.environment
}
}
resource "aws_db_parameter_group" "postgres16" {
name = "${local.project}-${local.environment}-postgres16"
family = "postgres16"
description = "Custom parameters for PostgreSQL 16"
# Connection and memory settings
parameter {
name = "shared_buffers"
value = "{DBInstanceClassMemory/4}" # 25% of RAM (dynamic)
}
parameter {
name = "effective_cache_size"
value = "{DBInstanceClassMemory*3/4}" # 75% of RAM (dynamic)
}
parameter {
name = "work_mem"
value = "65536" # 64MB per sort/hash operation
}
parameter {
name = "maintenance_work_mem"
value = "524288" # 512MB for VACUUM, CREATE INDEX
}
# WAL and checkpoint settings
parameter {
name = "wal_buffers"
value = "16384" # 16MB
}
parameter {
name = "checkpoint_completion_target"
value = "0.9"
}
parameter {
name = "max_wal_size"
value = "4096" # 4GB
}
# Query performance
parameter {
name = "random_page_cost"
value = "1.1" # Low for SSD storage (gp3/io2)
}
parameter {
name = "effective_io_concurrency"
value = "200" # High for SSD storage
}
# Logging
parameter {
name = "log_min_duration_statement"
value = "1000" # Log queries slower than 1 second
}
parameter {
name = "log_connections"
value = "1"
}
parameter {
name = "log_disconnections"
value = "1"
}
parameter {
name = "log_lock_waits"
value = "1"
}
# Connection limits
parameter {
name = "max_connections"
value = "200" # Override default; use connection pooling
}
# SSL enforcement
parameter {
name = "rds.force_ssl"
value = "1"
}
tags = {
Name = "${local.project}-${local.environment}-postgres16-params"
}
}
resource "aws_db_instance" "main" {
identifier = "${local.project}-${local.environment}-postgres"
# Engine
engine = "postgres"
engine_version = "16.4"
instance_class = local.environment == "production" ? "db.r6g.large" : "db.t4g.small"
parameter_group_name = aws_db_parameter_group.postgres16.name
# Storage
allocated_storage = 100
max_allocated_storage = 500 # Autoscaling up to 500 GB
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
# gp3 performance (free tier: 3000 IOPS, 125 MB/s)
iops = 3000 # Baseline (included free with gp3)
storage_throughput = 125 # MB/s (included free with gp3)
# Credentials
db_name = "myapp"
username = "myapp_admin"
manage_master_user_password = true # AWS manages password in Secrets Manager
# Networking
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.db.id]
publicly_accessible = false
port = 5432
# High Availability
multi_az = local.environment == "production" ? true : false
# Backup
backup_retention_period = local.environment == "production" ? 30 : 7
backup_window = "03:00-04:00" # 3-4 AM UTC
copy_tags_to_snapshot = true
delete_automated_backups = false
final_snapshot_identifier = "${local.project}-${local.environment}-final-${formatdate("YYYY-MM-DD", timestamp())}"
skip_final_snapshot = false
# Maintenance
maintenance_window = "sun:04:00-sun:05:00" # Sunday 4-5 AM UTC
auto_minor_version_upgrade = true
allow_major_version_upgrade = false
# Monitoring
monitoring_interval = 60 # Enhanced monitoring every 60 seconds
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
performance_insights_enabled = true
performance_insights_retention_period = local.environment == "production" ? 731 : 7 # 2 years prod, 7 days staging
# Deletion protection
deletion_protection = local.environment == "production" ? true : false
# CloudWatch log exports
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
tags = {
Name = "${local.project}-${local.environment}-postgres"
Environment = local.environment
ManagedBy = "terraform"
}
lifecycle {
prevent_destroy = true # Extra safety for production
}
}
resource "aws_db_instance" "read_replica" {
count = local.environment == "production" ? 1 : 0
identifier = "${local.project}-${local.environment}-postgres-replica"
replicate_source_db = aws_db_instance.main.identifier
instance_class = "db.r6g.large"
# Storage inherits from source
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
# Networking — can be in different AZ
vpc_security_group_ids = [aws_security_group.db.id]
publicly_accessible = false
# Replica-specific settings
multi_az = false # Replicas don't need Multi-AZ
backup_retention_period = 0 # Backups from primary only
auto_minor_version_upgrade = true
performance_insights_enabled = true
# Monitoring
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
tags = {
Name = "${local.project}-${local.environment}-postgres-replica"
Environment = local.environment
Role = "read-replica"
}
}
# Store connection strings in SSM Parameter Store
resource "aws_ssm_parameter" "database_url" {
name = "/${local.project}/${local.environment}/DATABASE_URL"
type = "SecureString"
description = "Primary database connection string"
# Format: postgresql://user:password@host:port/dbname?sslmode=require
value = "postgresql://${aws_db_instance.main.username}:${random_password.db.result}@${aws_db_instance.main.endpoint}/${aws_db_instance.main.db_name}?sslmode=require&connection_limit=20"
tags = {
Name = "${local.project}-${local.environment}-database-url"
Environment = local.environment
}
}
resource "aws_ssm_parameter" "database_url_readonly" {
count = local.environment == "production" ? 1 : 0
name = "/${local.project}/${local.environment}/DATABASE_URL_READONLY"
type = "SecureString"
description = "Read replica connection string"
value = "postgresql://${aws_db_instance.main.username}:${random_password.db.result}@${aws_db_instance.read_replica[0].endpoint}/${aws_db_instance.main.db_name}?sslmode=require&connection_limit=30"
tags = {
Name = "${local.project}-${local.environment}-database-url-readonly"
}
}
# CPU utilization alarm
resource "aws_cloudwatch_metric_alarm" "rds_cpu" {
alarm_name = "${local.project}-${local.environment}-rds-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/RDS"
period = 300 # 5 minutes
statistic = "Average"
threshold = 80
alarm_description = "RDS CPU utilization exceeds 80% for 15 minutes"
dimensions = {
DBInstanceIdentifier = aws_db_instance.main.identifier
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Free storage space alarm
resource "aws_cloudwatch_metric_alarm" "rds_storage" {
alarm_name = "${local.project}-${local.environment}-rds-storage-low"
comparison_operator = "LessThanThreshold"
evaluation_periods = 1
metric_name = "FreeStorageSpace"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 10737418240 # 10 GB in bytes
alarm_description = "RDS free storage space below 10 GB"
dimensions = {
DBInstanceIdentifier = aws_db_instance.main.identifier
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Database connections alarm
resource "aws_cloudwatch_metric_alarm" "rds_connections" {
alarm_name = "${local.project}-${local.environment}-rds-connections-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "DatabaseConnections"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 150 # 75% of max_connections (200)
alarm_description = "RDS connection count exceeds 150"
dimensions = {
DBInstanceIdentifier = aws_db_instance.main.identifier
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Read replica lag alarm
resource "aws_cloudwatch_metric_alarm" "rds_replica_lag" {
count = local.environment == "production" ? 1 : 0
alarm_name = "${local.project}-${local.environment}-rds-replica-lag"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = 30 # 30 seconds of lag
alarm_description = "RDS read replica lag exceeds 30 seconds"
dimensions = {
DBInstanceIdentifier = aws_db_instance.read_replica[0].identifier
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Write latency alarm
resource "aws_cloudwatch_metric_alarm" "rds_write_latency" {
alarm_name = "${local.project}-${local.environment}-rds-write-latency"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "WriteLatency"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 0.020 # 20ms average write latency
alarm_description = "RDS write latency exceeds 20ms"
dimensions = {
DBInstanceIdentifier = aws_db_instance.main.identifier
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
┌─────────────────────────────────────────────────────────────────┐
│ Backup Strategy Matrix │
│ │
│ Type │ Frequency │ Retention │ RPO │
│ ─────────────────────────────────────────────────────────── │
│ Automated backup │ Daily (3 AM) │ 30 days │ 5 min (PITR) │
│ Manual snapshot │ Before deploy │ 90 days │ Point-in-time │
│ Cross-region │ Daily │ 7 days │ 24 hours │
│ Logical dump │ Weekly │ 30 days │ 7 days │
│ │
│ PITR = Point-in-Time Recovery (restore to any second) │
│ RPO = Recovery Point Objective (max acceptable data loss) │
└─────────────────────────────────────────────────────────────────┘
# Create pre-deployment snapshot
aws rds create-db-snapshot \
--db-instance-identifier myapp-production-postgres \
--db-snapshot-identifier "myapp-prod-pre-deploy-$(date +%Y%m%d-%H%M%S)"
# Wait for snapshot to complete
aws rds wait db-snapshot-available \
--db-snapshot-identifier "myapp-prod-pre-deploy-20260327-143000"
# Restore from snapshot (creates new instance)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier myapp-production-postgres-restored \
--db-snapshot-identifier "myapp-prod-pre-deploy-20260327-143000" \
--db-subnet-group-name myapp-production-db-subnet-group \
--vpc-security-group-ids sg-xxxxxxxx
# Point-in-time recovery (restore to specific second)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier myapp-production-postgres \
--target-db-instance-identifier myapp-production-postgres-pitr \
--restore-time "2026-03-27T14:30:00Z" \
--db-subnet-group-name myapp-production-db-subnet-group
# View top SQL by load (Performance Insights API)
aws pi get-resource-metrics \
--service-type RDS \
--identifier db-XXXXXXXXXXXXXXXXXXXX \
--metric-queries '[{"Metric":"db.load.avg","GroupBy":{"Group":"db.sql","Limit":10}}]' \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period-in-seconds 300
# View top waits
aws pi get-resource-metrics \
--service-type RDS \
--identifier db-XXXXXXXXXXXXXXXXXXXX \
--metric-queries '[{"Metric":"db.load.avg","GroupBy":{"Group":"db.wait_event","Limit":10}}]' \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period-in-seconds 300
| Resource | Monthly Cost (us-east-1) | Notes |
|---|---|---|
| db.t4g.micro (single-AZ) | ~$12 | Dev/test only |
| db.t4g.small (multi-AZ) | ~$48 | Staging |
| db.r6g.large (multi-AZ) | ~$360 | Production |
| db.r6g.large (read replica) | ~$180 | Single-AZ |
| gp3 storage (100 GB) | ~$11.50 | $0.115/GB |
| Automated backups (100 GB) | Free | Up to DB size |
| Extra backup storage | $0.095/GB | Beyond free tier |
| Perf Insights (2 year retention) | ~$0 | Free for 7 days |
| Enhanced monitoring (60s) | ~$3 | CloudWatch metrics |
| Cross-region snapshot copy | ~$0.02/GB | S3 storage + transfer |
db.t4g burstable instances for dev/staging (60-80% cheaper than r6g)max_allocated_storage to prevent runaway autoscalingrds.force_ssl = 1 in parameter groupmanage_master_user_password = true for AWS-managed rotationpublicly_accessible = true in any environmentpostgresql and upgradedeletion_protection = true for production instances| Mistake | Why It's Bad | Fix |
|---|---|---|
publicly_accessible = true | Database exposed to internet | Always false, use VPN/bastion for access |
| No Multi-AZ for production | Single point of failure, ~2 min downtime | Enable Multi-AZ for production |
| Default parameter group | Cannot tune PostgreSQL settings | Create custom parameter group |
skip_final_snapshot = true in prod | Lose all data on terraform destroy | Always create final snapshot |
| Hardcoded credentials in code | Security breach risk | Use SSM/Secrets Manager |
| No storage encryption | Compliance violation | Enable storage_encrypted = true |
| Backup retention = 1 day | Minimal recovery options | 30 days for production |
| No connection pooling | Exhausts max_connections | Use PgBouncer or Prisma connection limit |
| Oversized instance | Wasting money | Start with t4g, upgrade based on metrics |
| No slow query logging | Cannot identify performance issues | Set log_min_duration_statement = 1000 |
# Check RDS instance status
aws rds describe-db-instances \
--db-instance-identifier myapp-production-postgres \
--query "DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,Engine:Engine,Version:EngineVersion,MultiAZ:MultiAZ,Storage:AllocatedStorage,Encrypted:StorageEncrypted}" \
--output table
# Check backup status
aws rds describe-db-snapshots \
--db-instance-identifier myapp-production-postgres \
--query "DBSnapshots[-5:].{Snapshot:DBSnapshotIdentifier,Status:Status,Created:SnapshotCreateTime,Size:AllocatedStorage}" \
--output table
# Check parameter group values
aws rds describe-db-parameters \
--db-parameter-group-name myapp-production-postgres16 \
--query "Parameters[?Source=='user'].{Name:ParameterName,Value:ParameterValue,ApplyMethod:ApplyMethod}" \
--output table
# Check replica lag
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=myapp-production-postgres-replica \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 \
--statistics Average \
--output table
# Test SSL connection
psql "postgresql://myapp_admin@myapp-production-postgres.xxxx.us-east-1.rds.amazonaws.com:5432/myapp?sslmode=verify-full&sslrootcert=global-bundle.pem" -c "SELECT ssl_is_used();"
npx claudepluginhub heaptracetechnology/heaptrace-skills --plugin heaptrace-cloud-engineerProvides AWS CloudFormation templates for Amazon RDS databases including MySQL, PostgreSQL, Aurora instances, DB clusters, multi-AZ deployments, parameter groups, subnet groups, security groups, and cross-stack references.
Expert database administrator covering cloud databases (AWS, Azure, GCP), automation (IaC with Terraform/CloudFormation), HA/DR, performance optimization, and security. Includes schema migrations (Flyway, Liquibase).
Sets up and optimizes Render Managed PostgreSQL: internal/external connection strings, creation constraints, storage autoscaling, limits, HA, read replicas, backups, and connectivity troubleshooting.