Runbook Creation Skill

When to Use This Skill

Use this skill when:

Runbook creation tasks - You are writing or updating operational runbooks or templates for incident response and routine procedures
Planning or design - You need guidance on how to structure or scope a runbook
Best practices - You want to follow established SRE patterns and standards

Overview

Create operational runbooks for incident response, maintenance procedures, and operational tasks.

MANDATORY: Documentation-First Approach

Before creating runbooks:

Invoke the docs-management skill for runbook patterns
Verify SRE best practices via MCP servers (perplexity)
Base guidance on Google SRE principles

Runbook Types

Runbook Categories:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Incident Response Runbooks                                                  │
│  • Alert-triggered procedures                                                │
│  • Escalation paths                                                          │
│  • Communication templates                                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Operational Runbooks                                                        │
│  • Deployment procedures                                                     │
│  • Maintenance tasks                                                         │
│  • Backup/restore operations                                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│  Troubleshooting Runbooks                                                    │
│  • Diagnostic procedures                                                     │
│  • Common issue resolution                                                   │
│  • Debug workflows                                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│  Emergency Runbooks                                                          │
│  • Disaster recovery                                                         │
│  • Security incident response                                                │
│  • Business continuity                                                       │
└─────────────────────────────────────────────────────────────────────────────┘

Standard Runbook Template

# Runbook: [TITLE]

| Property | Value |
|----------|-------|
| **ID** | RB-[NUMBER] |
| **Category** | [Incident/Operational/Troubleshooting/Emergency] |
| **Service** | [Service Name] |
| **Owner** | [Team/Individual] |
| **Last Updated** | [YYYY-MM-DD] |
| **Last Tested** | [YYYY-MM-DD] |
| **Review Frequency** | [Quarterly/Monthly/Annually] |

---

## Overview

**Purpose:** [What this runbook helps you accomplish]

**When to Use:** [Conditions that trigger this runbook]

**Expected Outcome:** [What success looks like]

**Estimated Duration:** [Time to complete]

---

## Prerequisites

### Required Access

- [ ] [System/Tool 1] - [Role/Permission needed]
- [ ] [System/Tool 2] - [Role/Permission needed]

### Required Knowledge

- [Skill/Knowledge 1]
- [Skill/Knowledge 2]

### Tools Needed

| Tool | Purpose | Access URL |
|------|---------|------------|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |

---

## Quick Reference

```text
Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace]          │
│ View logs: kubectl logs -f [pod-name] -n [namespace]           │
│ Restart service: kubectl rollout restart deployment/[name]     │
│ Check metrics: [monitoring-url]                                │
└────────────────────────────────────────────────────────────────┘
```

Procedure

Step 1: [Step Name]

Objective: [What this step accomplishes]

Actions:

[Action 1]

# Command example
kubectl get pods -n production

[Action 2]

Expected Result: [What you should see]

If This Fails: Go to Troubleshooting Section

Step 2: [Step Name]

Objective: [What this step accomplishes]

Actions:

[Action 1]
[Action 2]

Decision Point:

┌─────────────────────────────────────┐
│ Is the service responding?          │
│                                     │
│ YES → Continue to Step 3            │
│ NO  → Go to the Escalation section  │
└─────────────────────────────────────┘

Step 3: [Verification]

Objective: Verify the issue is resolved

Verification Checklist:

- [ ] Service is responding to health checks
- [ ] Metrics show normal values
- [ ] No new errors in logs
- [ ] Users can access the service

Troubleshooting

Issue: [Common Issue 1]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

[Step 1]
[Step 2]

Issue: [Common Issue 2]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

[Step 1]
[Step 2]

Escalation

When to Escalate

- Issue not resolved after [X] minutes
- Impact affects [threshold]
- Required access not available
- Unsure of next steps

Escalation Path

| Level | Contact | Method | Response Time |
|-------|---------|--------|---------------|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |

Communication

Status Updates

Template:

[TIMESTAMP] - [SERVICE] - [STATUS]

Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Planned action]

Stakeholder Notification

| Stakeholder | When to Notify | Method |
|-------------|----------------|--------|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | Email |
| Leadership | If SEV1/SEV2 | Phone |

Post-Incident

Cleanup Tasks

- [ ] Remove any temporary fixes
- [ ] Update monitoring/alerts if needed
- [ ] Document any new learnings

Post-Incident Review

- [ ] Schedule post-mortem meeting
- [ ] Gather timeline and evidence
- [ ] Identify action items

Appendix

Related Runbooks

[RB-XXX: Related Runbook 1]
[RB-YYY: Related Runbook 2]

Reference Documentation

[Link to architecture docs]
[Link to service docs]

Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |

Incident Response Runbook Template

# Incident Runbook: [Alert Name]

| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |

---

## Alert Details

**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```

Alert Meaning: [What this alert indicates]

False Positive Indicators: [Signs this might be a false alarm]
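
A condition such as "error_rate > 1% for 5 minutes" only holds when every sample in the window breaches the threshold, which is what separates a sustained problem from a brief spike. The sketch below illustrates that logic in plain shell. The function name `sustained_breach` and the `errors/total` per-minute sample format are assumptions made for illustration; they are not part of any alerting tool.

```bash
# Hypothetical helper: succeeds when every one of the last WINDOW samples
# breaches the threshold, i.e. the condition held "for WINDOW minutes".
# Each sample is "errors/total" for one minute (an assumed input format).
sustained_breach() {
  threshold_pct=$1
  window=$2
  shift 2
  # Not enough data to say the condition held for the whole window.
  [ "$#" -ge "$window" ] || return 1
  # Drop older samples so only the most recent WINDOW remain.
  shift $(( $# - window ))
  for sample in "$@"; do
    errors=${sample%/*}
    total=${sample#*/}
    # Integer form of errors/total > threshold_pct/100. An idle minute
    # (total = 0) gives 0 > 0, so it never counts as a breach.
    [ $(( errors * 100 )) -gt $(( total * threshold_pct )) ] || return 1
  done
  return 0
}

# Usage (1% threshold, 5-minute window, one sample per minute):
#   sustained_breach 1 5 2/100 3/100 2/100 5/100 2/100 && echo "alert would fire"
```

Comparing `errors * 100` against `total * threshold` keeps the check in integer arithmetic, which avoids floating-point handling in the shell.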

Immediate Actions (First 5 Minutes)

1. Acknowledge Alert

# Acknowledge in PagerDuty
pd incident:acknowledge

# Or via Slack
/pd ack

2. Assess Impact

Quick Health Checks:

# Check service status
curl -s https://api.example.com/health | jq .

# Check error rate
kubectl logs -l app=service -n production --tail=100 | grep -c ERROR

# Check pod status
kubectl get pods -n production -l app=service

Impact Assessment:

| Check | Command | Expected | Actual |
|-------|---------|----------|--------|
| Health endpoint | `curl /health` | 200 OK | [Result] |
| Error rate | `grep ERROR` | < 10 | [Result] |
| Pod status | `kubectl get pods` | Running | [Result] |
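
The error check in the table can be made pass/fail so the "Actual" column is filled in the same way every time. This is a minimal sketch, assuming log lines arrive on stdin and that the limit of 10 from the table applies to whatever log tail is piped in. `error_count_ok` is a hypothetical helper name.

```bash
# Hypothetical helper: read log lines on stdin and succeed when the number
# of lines containing ERROR is under the limit (default 10, the "Expected"
# value used in the impact assessment table).
error_count_ok() {
  limit=${1:-10}
  # grep -c prints the count but exits 1 when the count is zero,
  # so its exit status is deliberately ignored here.
  count=$(grep -c ERROR || true)
  echo "errors=$count limit=$limit"
  [ "$count" -lt "$limit" ]
}

# Usage:
#   kubectl logs -l app=service -n production --tail=100 | error_count_ok
```

The function prints the observed count before returning, so the same line can be pasted into the incident channel as evidence.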

3. Initial Communication

Post in #incidents:

🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]

Diagnosis

Common Causes and Checks

Cause 1: High Traffic

# Check request rate
kubectl top pods -n production -l app=service

# Check HPA status
kubectl get hpa -n production

If traffic spike confirmed:

Scale replicas: kubectl scale deployment/service --replicas=10 -n production
Enable rate limiting if available

Cause 2: Database Issues

# Check database connections
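kubectl exec -it [pod] -n production -- psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check slow queries (with -l, kubectl logs shows only the last 10 lines per pod unless --tail is set)
kubectl logs -l app=service -n production --tail=500 | grep "slow query"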

If database issues:

Check connection pool exhaustion
Look for long-running queries
Consider read replica failover

Cause 3: Dependency Failure

# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .

# Check circuit breaker status (set --tail, since -l defaults to the last 10 lines per pod)
kubectl logs -l app=service -n production --tail=500 | grep "circuit"

If dependency failure:

Verify external service status
Check for timeout configuration
Consider enabling fallback behavior

Resolution Steps

Quick Fixes

| Issue | Quick Fix | Command |
|-------|-----------|---------|
| Pod crash loop | Restart deployment | `kubectl rollout restart deployment/service -n production` |
| Memory pressure | Increase limits | `kubectl edit deployment/service -n production` |
| Config error | Roll back deployment | `kubectl rollout undo deployment/service -n production` |

Rollback Procedure

# List recent deployments
kubectl rollout history deployment/service -n production

# Rollback to previous version
kubectl rollout undo deployment/service -n production

# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2

Resolution Verification

Verification Checklist:

Monitoring Period: Monitor for 15 minutes after resolution

Closure

Update Status

✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]

Post-Incident Tasks

- [ ] Update incident timeline
- [ ] Create post-mortem doc if SEV1/SEV2
- [ ] File tickets for follow-up work
- [ ] Update runbook if needed

Database Failover Runbook

# Runbook: Database Failover

| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |

---

## Overview

**Purpose:** Fail over from the primary database to a replica when the primary is unavailable.

**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover

**Expected Outcome:** Application traffic routed to new primary

**Estimated Duration:** 15-30 minutes

---

## Prerequisites

### Required Access

- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser

### Pre-Failover Checks

```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary

# Check replication lag
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```

Acceptable lag: < 1MB
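
The lag threshold can be turned into a go/no-go gate before promotion. This is a minimal sketch, assuming the query result has already been captured as an integer byte count and that "1MB" is read as 1048576 bytes. The function name `lag_within_limit` is hypothetical.

```bash
# Hypothetical helper: succeeds when replication lag is under the limit.
# Assumption: "1MB" means 1048576 bytes; pass a second argument to override.
lag_within_limit() {
  lag_bytes=$1
  limit_bytes=${2:-1048576}
  # Reject empty or non-numeric input rather than treating it as zero lag
  # (an empty value can mean the replica is not receiving WAL at all).
  case $lag_bytes in
    ''|*[!0-9]*) echo "invalid lag value: '$lag_bytes'" >&2; return 2 ;;
  esac
  [ "$lag_bytes" -lt "$limit_bytes" ]
}

# Usage: capture the lag with psql -At (unaligned, tuples only), then gate on it.
#   lag=$(psql -h pg-replica.postgres.database.azure.com -U postgres -At -c \
#     "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn();")
#   if lag_within_limit "$lag"; then echo "safe to promote"; else echo "do not promote yet"; fi
```

Returning a distinct status (2) for unusable input keeps "lag is too high" separate from "lag could not be measured", which call for different responses during a failover.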

Failover Procedure

Step 1: Confirm Primary is Unavailable

# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"

# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"

Expected: Connection timeout or error state

Step 2: Notify Stakeholders

🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes

Step 3: Promote Replica

# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
  --resource-group rg-prod \
  --name pg-replica

# Verify promotion
az postgres flexible-server show \
  --resource-group rg-prod \
  --name pg-replica \
  --query "replicationRole"

Expected: replicationRole: None (standalone)

Step 4: Update Connection Strings

# Update Kubernetes secret
kubectl create secret generic db-connection \
  --from-literal=host=pg-replica.postgres.database.azure.com \
  -n production \
  --dry-run=client -o yaml | kubectl apply -n production -f -

# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production

Step 5: Verify Application Connectivity

# Check application logs
kubectl logs -l app=api-service -n production --tail=50 | grep -i database

# Test application health
curl -s https://api.example.com/health | jq .database
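
Applications may take a little while to reconnect after the rollout restart, so a single health check run immediately afterwards can fail even when the failover worked. A small retry loop, sketched below, repeats a check until it passes or a limit is reached. `retry_check` and the attempt and delay values in the usage line are illustrative assumptions; tune them to the service's real startup time.

```bash
# Hypothetical helper: run a check up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Succeeds as soon as the check command succeeds.
retry_check() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      echo "check passed on attempt $n"
      return 0
    fi
    # No point sleeping after the final failed attempt.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$(( n + 1 ))
  done
  echo "check still failing after $attempts attempts" >&2
  return 1
}

# Usage: poll the health endpoint every 30 seconds, up to 10 times.
#   retry_check 10 30 curl -fsS -o /dev/null https://api.example.com/health
```

Passing the check as a command (rather than hard-coding `curl`) lets the same loop wrap a `kubectl` or `psql` check as well.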

Post-Failover

Immediate Tasks

- [ ] Verify all applications connected to new primary
- [ ] Check for data consistency
- [ ] Monitor error rates

Recovery Tasks (Next 24 Hours)

- [ ] Investigate original primary failure
- [ ] Create new replica from new primary
- [ ] Update DNS/connection strings permanently
- [ ] Document incident and learnings

Rollback

If failover causes issues:

# If original primary is recoverable
# Note: writes made to the promoted replica are not present on the original primary
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production

# Start the original primary if it is stopped
az postgres flexible-server start --resource-group rg-prod --name pg-primary

# Revert connection strings
kubectl create secret generic db-connection \
  --from-literal=host=pg-primary.postgres.database.azure.com \
  -n production \
  --dry-run=client -o yaml | kubectl apply -n production -f -

# Scale applications back up so they start with the reverted connection
# (a rollout restart alone would leave them at 0 replicas)
kubectl scale deployment --replicas=[previous-replica-count] -l uses-database=true -n production

Runbook Quality Checklist

| Criterion | Description | Check |
|-----------|-------------|-------|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
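
Some of these criteria can be checked mechanically. As one illustration, the sketch below reports which metadata rows from the standard template are absent from a runbook file, which touches on the Current and Versioned criteria. The function name `missing_runbook_fields` is hypothetical, and the check assumes the runbook keeps the template's `| **Field** | value |` table rows.

```bash
# Hypothetical lint: print each metadata field from the standard template
# that is missing from a runbook file, one per line. Succeeds only when
# nothing is missing.
missing_runbook_fields() {
  file=$1
  missing=0
  for field in "ID" "Category" "Service" "Owner" \
               "Last Updated" "Last Tested" "Review Frequency"; do
    # -F matches the table cell literally, so the asterisks are not a regex.
    if ! grep -qF "| **${field}** |" "$file"; then
      echo "$field"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage:
#   missing_runbook_fields runbooks/RB-DB-001.md || echo "metadata incomplete"
```

A check like this only confirms that the fields exist. Whether the dates are recent enough still needs a human review or a separate date comparison.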

Workflow

When creating runbooks:

1. Identify Need: What operation/incident needs documentation?
2. Gather Information: Interview operators, review past incidents
3. Draft Runbook: Use appropriate template
4. Validate Steps: Walk through with subject matter expert
5. Test in Non-Prod: Execute runbook in staging
6. Publish: Add to runbook collection
7. Train Team: Ensure operators know where to find it
8. Maintain: Review and update regularly

Last Updated: 2025-12-26
