# SOP-201: P0 - Database Down (CRITICAL)

🚨 **EMERGENCY INCIDENT RESPONSE**

You are responding to a **P0 CRITICAL incident**: PostgreSQL database is down.

## Severity: P0 - CRITICAL

- **Impact:** ALL customers affected
- **Response Time:** IMMEDIATE
- **Resolution Target:** <15 minutes

## Your Mission

Guide rapid diagnosis and recovery with:

- Systematic troubleshooting steps
- Clear commands for each check
- Fast recovery procedures
- Customer communication templates
- Post-incident documentation

## IMMEDIATE ACTIONS (First 60 seconds)

### 1. Verify the Issue
```bash
# Is PostgreSQL running?
sudo systemctl status postgresql

# Can we connect?
sudo -u postgres psql -c "SELECT 1;"

# Check recent logs
sudo tail -100 /var/log/postgresql/postgresql-16-main.log
```
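If psql hangs instead of failing outright, a bounded probe avoids losing time. A minimal sketch using the coreutils `timeout` wrapper and `pg_isready` (the host/port shown are assumptions; adjust for your setup):

```bash
# Fail fast instead of hanging: cap each probe at 5 seconds
timeout 5 sudo -u postgres psql -c "SELECT 1;" || echo "Connect check failed or timed out"

# pg_isready probes the postmaster without authenticating
pg_isready -h localhost -p 5432 -t 5
```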
### 2. Alert Stakeholders

Post to incident channel IMMEDIATELY:

```
🚨 P0 INCIDENT - Database Down
Time: [TIMESTAMP]
Server: VPS-XXX
Impact: All customers unable to connect
Status: Investigating
ETA: TBD
```
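If the incident channel is Slack (or anything else that accepts incoming webhooks), the alert can be posted straight from the shell. A sketch, assuming a pre-provisioned `SLACK_WEBHOOK_URL` variable, which is not part of this runbook:

```bash
# Post the P0 alert to the incident channel via an incoming webhook
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"P0 INCIDENT - Database Down | Time: $(date -u +%FT%TZ) | Server: $(hostname) | Status: Investigating\"}"
```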
## DIAGNOSTIC PROTOCOL

### Check 1: Service Status

```bash
sudo systemctl status postgresql
sudo systemctl status pgbouncer  # If installed
```

Possible states:

- `inactive (dead)` → Service stopped
- `failed` → Service crashed
- `active (running)` → Service running but not responding

```bash
# Check for PostgreSQL processes
ps aux | grep postgres

# Check listening ports
sudo ss -tlnp | grep 5432
sudo ss -tlnp | grep 6432  # pgBouncer
```
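`pg_isready`'s exit status cleanly separates "down" from "up but rejecting connections", which determines the recovery path. A sketch (documented exit codes: 0 accepting, 1 rejecting, 2 no response, 3 invalid invocation):

```bash
# Classify server state from pg_isready's exit code
pg_isready -h localhost -p 5432 -q
case $? in
  0) echo "Accepting connections" ;;
  1) echo "Up but rejecting connections (startup or recovery in progress?)" ;;
  2) echo "No response - service likely down" ;;
  *) echo "Invalid invocation - check parameters" ;;
esac
```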
### Check 2: Disk Space

```bash
df -h /var/lib/postgresql
```

⚠️ If the disk is full (100%), jump to Recovery 3 (Disk Full) below.
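For a scripted threshold check rather than eyeballing the df output, one option (a sketch; the 90% trigger is an arbitrary example):

```bash
# Extract the usage percentage for the PostgreSQL volume as a bare number
usage=$(df --output=pcent /var/lib/postgresql | tail -1 | tr -dc '0-9')
if [ "$usage" -ge 90 ]; then
  echo "CRITICAL: data volume at ${usage}% - see Recovery 3 (Disk Full)"
fi
```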
### Check 3: Review Logs

```bash
# Check for errors in PostgreSQL log
sudo grep -i "error\|fatal\|panic" /var/log/postgresql/postgresql-16-main.log | tail -50

# Check system logs
sudo journalctl -u postgresql -n 100 --no-pager

# Check for OOM (Out of Memory) kills
sudo grep -i "killed process" /var/log/syslog | grep postgres
```
### Check 4: Configuration

```bash
# Test PostgreSQL config
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --check -D /var/lib/postgresql/16/main
```

### Check 5: Lock Files

```bash
# Check for lock files
ls -la /var/run/postgresql/
ls -la /var/lib/postgresql/16/main/postmaster.pid
```
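Whether postmaster.pid points at a live process decides between Recovery 1 and Recovery 2 below. A sketch (the first line of postmaster.pid is the postmaster's PID; paths match this SOP's Debian layout):

```bash
# Read the recorded PID and test whether that process still exists
pid=$(sudo head -1 /var/lib/postgresql/16/main/postmaster.pid 2>/dev/null)
if [ -n "$pid" ] && ps -p "$pid" > /dev/null 2>&1; then
  echo "PID $pid is alive - do NOT delete the PID file"
else
  echo "PID file stale or absent - Recovery 2 applies"
fi
```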
## RECOVERY PROCEDURES

### Recovery 1: Simple Restart

If the service is stopped but there are no obvious errors:

```bash
# Start PostgreSQL
sudo systemctl start postgresql

# Check status
sudo systemctl status postgresql

# Test connection
sudo -u postgres psql -c "SELECT version();"

# Monitor logs
sudo tail -f /var/log/postgresql/postgresql-16-main.log
```

✅ If successful: jump to the Post-Recovery section.
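A freshly started server may spend time in crash recovery before accepting connections, so a short wait loop avoids declaring failure too early. A sketch (the 60-second cap is arbitrary):

```bash
# Poll for up to 60 seconds for the server to accept connections
for i in $(seq 1 12); do
  if pg_isready -h localhost -p 5432 -q; then
    echo "PostgreSQL is accepting connections"
    break
  fi
  echo "Waiting for PostgreSQL... ($i/12)"
  sleep 5
done
```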
### Recovery 2: Stale PID File

If the error mentions "postmaster.pid already exists":

```bash
# Stop PostgreSQL (if running)
sudo systemctl stop postgresql

# Remove stale PID file
sudo rm /var/lib/postgresql/16/main/postmaster.pid

# Start PostgreSQL
sudo systemctl start postgresql

# Verify
sudo systemctl status postgresql
sudo -u postgres psql -c "SELECT 1;"
```
### Recovery 3: Disk Full

If the disk is 100% full:

```bash
# Find largest files
sudo du -sh /var/lib/postgresql/16/main/* | sort -rh | head -10

# Option A: Clear old logs
sudo find /var/log/postgresql/ -name "*.log" -mtime +7 -delete

# Option B: Vacuum to reclaim space
# (requires a running server, and VACUUM FULL needs temporary extra disk space)
sudo -u postgres vacuumdb --all --full

# Option C: Archive/delete old WAL files (DANGER!)
# Only if you have confirmed backups!
sudo -u postgres pg_archivecleanup /var/lib/postgresql/16/main/pg_wal 000000010000000000000010

# Check space
df -h /var/lib/postgresql

# Start PostgreSQL
sudo systemctl start postgresql
```
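The segment name passed to pg_archivecleanup in Option C must come from the cluster itself; the one shown above is only a placeholder. pg_controldata reports the oldest segment the last checkpoint still needs, and -n gives a dry run (a sketch using this SOP's paths):

```bash
# Oldest WAL segment still required by the latest checkpoint
sudo -u postgres /usr/lib/postgresql/16/bin/pg_controldata /var/lib/postgresql/16/main \
  | grep "REDO WAL file"

# Dry run: -n lists what would be removed without deleting anything
sudo -u postgres pg_archivecleanup -n /var/lib/postgresql/16/main/pg_wal <REDO_WAL_FILE>
```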
### Recovery 4: Bad Configuration

If the config test fails:

```bash
# Restore backup config
sudo cp /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
sudo cp /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf

# Start PostgreSQL
sudo systemctl start postgresql
```
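Before overwriting the live files, it is worth seeing what actually changed; a quick sketch (assumes the .backup copies referenced above exist):

```bash
# Show what differs between the last known-good backup and the live config
diff -u /etc/postgresql/16/main/postgresql.conf.backup /etc/postgresql/16/main/postgresql.conf
diff -u /etc/postgresql/16/main/pg_hba.conf.backup /etc/postgresql/16/main/pg_hba.conf
```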
### Recovery 5: Suspected Corruption

If the logs show corruption errors:

```bash
# Stop PostgreSQL
sudo systemctl stop postgresql

# Run filesystem check (if safe to do so)
# sudo fsck /dev/sdX  # Only if unmounted!

# Try single-user mode recovery (exit the backend prompt with Ctrl-D)
sudo -u postgres /usr/lib/postgresql/16/bin/postgres --single -D /var/lib/postgresql/16/main

# If that fails, restore from backup (SOP-204)
```

⚠️ At this point, escalate to the backup restoration procedure!
## POST-RECOVERY VERIFICATION

```bash
# Test connections
sudo -u postgres psql -c "SELECT version();"

# Check all databases
sudo -u postgres psql -c "\l"

# Test customer database access (example)
sudo -u postgres psql -d customer_db_001 -c "SELECT 1;"

# Check active connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"

# Run health check
/opt/fairdb/scripts/pg-health-check.sh
```
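To extend the single customer_db_001 spot check to every database on the instance, a sketch:

```bash
# Run a trivial query against every non-template database
for db in $(sudo -u postgres psql -Atc \
    "SELECT datname FROM pg_database WHERE NOT datistemplate;"); do
  if sudo -u postgres psql -d "$db" -c "SELECT 1;" > /dev/null 2>&1; then
    echo "OK:   $db"
  else
    echo "FAIL: $db"
  fi
done
```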
Post the all-clear to the incident channel:

```
✅ RESOLVED - Database Restored
Resolution Time: [X minutes]
Root Cause: [Brief description]
Recovery Method: [Which recovery procedure used]
Customer Impact: [Duration of outage]
Follow-up: [Post-mortem scheduled]
```
## CUSTOMER COMMUNICATION

Template:

```
Subject: [RESOLVED] Database Service Interruption

Dear FairDB Customer,

We experienced a brief service interruption affecting database
connectivity from [START_TIME] to [END_TIME] ([DURATION]).

The issue has been fully resolved and all services are operational.

Root Cause: [Brief explanation]
Resolution: [What we did]
Prevention: [Steps to prevent recurrence]

We apologize for any inconvenience. If you continue to experience
issues, please contact support@fairdb.io.

- FairDB Operations Team
```
## POST-INCIDENT DOCUMENTATION

Create an incident report at /opt/fairdb/incidents/YYYY-MM-DD-database-down.md:

```markdown
# Incident Report: Database Down

**Incident ID:** INC-YYYYMMDD-001
**Severity:** P0 - Critical
**Date:** YYYY-MM-DD
**Duration:** X minutes

## Timeline

- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Resolution implemented
- HH:MM - Service restored
- HH:MM - Verified functionality

## Root Cause

[Detailed explanation]

## Impact

- Customers affected: X
- Downtime: X minutes
- Data loss: None / [describe if any]

## Resolution

[Detailed steps taken]

## Prevention

[Action items to prevent recurrence]

## Follow-up Tasks

- [ ] Review monitoring alerts
- [ ] Update runbooks
- [ ] Implement preventive measures
- [ ] Schedule post-mortem meeting
```
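To avoid typing the skeleton by hand mid-incident, the report header can be scaffolded from the shell; a minimal sketch (the -001 suffix is left for the operator to adjust):

```bash
# Scaffold today's incident report header from the template above
d=$(date +%F)
f="/opt/fairdb/incidents/${d}-database-down.md"
sudo tee "$f" > /dev/null <<EOF
# Incident Report: Database Down

**Incident ID:** INC-$(date +%Y%m%d)-001
**Severity:** P0 - Critical
**Date:** ${d}
**Duration:** X minutes
EOF
echo "Created $f - fill in the remaining sections"
```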
## ESCALATION

Escalate if:

- The recovery procedures above fail to restore service
- Data corruption is confirmed and backup restoration (SOP-204) is required
- Resolution is on track to exceed the 15-minute target

Escalation contacts: [Document your escalation chain]
## START HERE

Begin by asking:

1. Which server (VPS) is affected?
2. When was the outage first detected, and by what alert or report?

Then immediately execute the Diagnostic Protocol, starting with Check 1.

Remember: Speed is critical. Every minute counts. Stay calm, work systematically.