Creates standardized runbook templates for incident response, operational maintenance, troubleshooting, and emergency procedures using SRE best practices.
npx claudepluginhub melodic-software/claude-code-plugins --plugin documentation-standards
This skill is limited to using the following tools:
Use this skill when:
Create operational runbooks for incident response, maintenance procedures, and operational tasks.
Before creating runbooks:
- Use the docs-management skill for runbook patterns
Runbook Categories:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Response Runbooks │
│ • Alert-triggered procedures │
│ • Escalation paths │
│ • Communication templates │
├─────────────────────────────────────────────────────────────────────────────┤
│ Operational Runbooks │
│ • Deployment procedures │
│ • Maintenance tasks │
│ • Backup/restore operations │
├─────────────────────────────────────────────────────────────────────────────┤
│ Troubleshooting Runbooks │
│ • Diagnostic procedures │
│ • Common issue resolution │
│ • Debug workflows │
├─────────────────────────────────────────────────────────────────────────────┤
│ Emergency Runbooks │
│ • Disaster recovery │
│ • Security incident response │
│ • Business continuity │
└─────────────────────────────────────────────────────────────────────────────┘
# Runbook: [TITLE]
| Property | Value |
|----------|-------|
| **ID** | RB-[NUMBER] |
| **Category** | [Incident/Operational/Troubleshooting/Emergency] |
| **Service** | [Service Name] |
| **Owner** | [Team/Individual] |
| **Last Updated** | [YYYY-MM-DD] |
| **Last Tested** | [YYYY-MM-DD] |
| **Review Frequency** | [Quarterly/Monthly/Annually] |
---
## Overview
**Purpose:** [What this runbook helps you accomplish]
**When to Use:** [Conditions that trigger this runbook]
**Expected Outcome:** [What success looks like]
**Estimated Duration:** [Time to complete]
---
## Prerequisites
### Required Access
- [ ] [System/Tool 1] - [Role/Permission needed]
- [ ] [System/Tool 2] - [Role/Permission needed]
### Required Knowledge
- [Skill/Knowledge 1]
- [Skill/Knowledge 2]
### Tools Needed
| Tool | Purpose | Access URL |
|------|---------|------------|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |
---
## Quick Reference
```text
Quick Commands:
┌─────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace]           │
│ View logs:            kubectl logs -f [pod-name] -n [namespace] │
│ Restart service:      kubectl rollout restart deployment/[name] │
│ Check metrics:        [monitoring-url]                          │
└─────────────────────────────────────────────────────────────────┘
```
## Procedure
### Step 1: [Step Title]
**Objective:** [What this step accomplishes]
**Actions:**
1. [Action 1]
   Command example: `kubectl get pods -n production`
2. [Action 2]
**Expected Result:** [What you should see]
**If This Fails:** Go to the Troubleshooting section
### Step 2: [Step Title]
**Objective:** [What this step accomplishes]
**Actions:**
**Decision Point:**

    ┌─────────────────────────────────────┐
    │ Is the service responding?          │
    │                                     │
    │ YES → Continue to Step 3            │
    │ NO  → Go to Step 4 (Escalation)     │
    └─────────────────────────────────────┘

### Step 3: Verification
**Objective:** Verify the issue is resolved
**Verification Checklist:**
## Troubleshooting
### Issue 1: [Issue Title]
**Symptoms:** [What you observe]
**Cause:** [Root cause]
**Resolution:**
### Issue 2: [Issue Title]
**Symptoms:** [What you observe]
**Cause:** [Root cause]
**Resolution:**
| Level | Contact | Method | Response Time |
|---|---|---|---|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |
Template:
[TIMESTAMP] - [SERVICE] - [STATUS]
Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]
Actions Taken:
- [Action 1]
- [Action 2]
Next Steps:
- [Planned action]
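As a rough illustration, the status update template above can be rendered by a small shell function so that every update has the same shape. This is a sketch under stated assumptions, not part of the skill: `render_status_update` and the `TIMESTAMP` override are hypothetical names, and the Next Steps list is left out for brevity.

```bash
#!/bin/sh
# Hypothetical helper (not provided by any tool named in this runbook).
# Usage: render_status_update SERVICE STATUS IMPACT NEXT_UPDATE [ACTION...]
render_status_update() {
    if [ "$#" -lt 4 ]; then
        echo "usage: render_status_update SERVICE STATUS IMPACT NEXT_UPDATE [ACTION...]" >&2
        return 2
    fi
    service=$1; status=$2; impact=$3; next_update=$4
    shift 4
    # Accept only the four statuses listed in the template.
    case $status in
        Investigating|Identified|Monitoring|Resolved) ;;
        *) echo "invalid status: $status" >&2; return 1 ;;
    esac
    # TIMESTAMP may be preset for reproducible output; otherwise use current UTC time.
    ts=${TIMESTAMP:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
    printf '%s - %s - %s\n' "$ts" "$service" "$status"
    printf 'Current Status: %s\n' "$status"
    printf 'Impact: %s\n' "$impact"
    printf 'Next Update: %s\n' "$next_update"
    printf 'Actions Taken:\n'
    # Remaining arguments are the actions taken so far, one bullet each.
    for action in "$@"; do
        printf '%s %s\n' '-' "$action"
    done
}
```

For example, `render_status_update api Investigating "Elevated 5xx errors" "15:30 UTC" "Restarted pods"` prints a header line followed by the labelled fields and one bullet per action.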
| Stakeholder | When to Notify | Method |
|---|---|---|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | [Method] |
| Leadership | If SEV1/SEV2 | Phone |
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |
# Incident Runbook: [Alert Name]
| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |
---
## Alert Details
**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```
**Alert Meaning:** [What this alert indicates]
**False Positive Indicators:** [Signs this might be a false alarm]
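One way to read a condition such as `error_rate > 1% for 5 minutes` is that every sample in the window must exceed the threshold. The sketch below assumes one `errors total` pair per minute on stdin, oldest first. `alert_fires` is a hypothetical helper for reasoning about the condition; a real alerting system evaluates it in its own query language.

```bash
#!/bin/sh
# Hypothetical helper. Usage: alert_fires THRESHOLD_PCT WINDOW
# Reads "errors total" lines on stdin; succeeds only if the last WINDOW
# samples all have an error rate strictly above THRESHOLD_PCT.
alert_fires() {
    awk -v thr="$1" -v win="$2" '
        # Samples with zero total requests are skipped rather than counted.
        NF >= 2 && $2 > 0 { n++; rate[n] = 100 * $1 / $2 }
        END {
            if (n < win) exit 1              # not enough samples to cover the window
            for (i = n - win + 1; i <= n; i++)
                if (rate[i] <= thr) exit 1   # one sample at or below the threshold breaks the streak
            exit 0
        }'
}
```

A single sample at or below the threshold inside the window means the condition is not met, which is also a useful way to spot the false-positive case of a brief spike.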
Acknowledge the alert:

    # Acknowledge in PagerDuty
    pd incident:acknowledge
    # Or via Slack
    /pd ack

**Quick Health Checks:**

    # Check service status
    curl -s https://api.example.com/health | jq .
    # Check error rate (count of ERROR lines in recent logs)
    kubectl logs -l app=service -n production --tail=100 | grep -c ERROR
    # Check pod status
    kubectl get pods -n production -l app=service

**Impact Assessment:**

| Check | Command | Expected | Actual |
|---|---|---|---|
| Health endpoint | `curl /health` | 200 OK | [Result] |
| Error rate | `grep -c ERROR` | < 10 | [Result] |
| Pod status | `kubectl get pods` | Running | [Result] |
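The error-rate row above can be turned into a pass/fail check instead of a number to eyeball. A minimal sketch, assuming the limit of 10 from the table and log lines arriving on stdin; `error_count_ok` is a hypothetical name.

```bash
#!/bin/sh
# Hypothetical helper. Usage: <log source> | error_count_ok MAX
# Succeeds when the number of lines containing ERROR is below MAX.
error_count_ok() {
    max=$1
    # grep -c prints the match count but exits 1 when it is 0, hence "|| true".
    count=$(grep -c 'ERROR' || true)
    printf 'ERROR lines: %s (limit: < %s)\n' "$count" "$max"
    [ "$count" -lt "$max" ]
}

# Intended use during triage (not executed here):
#   kubectl logs -l app=service -n production --tail=100 | error_count_ok 10
```

The function prints the observed count so it can be copied into the Actual column, and its exit status can gate the next step of a script.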
Post in #incidents:
🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]
Traffic checks:

    # Check pod CPU and memory usage (a proxy for load, not a request-rate metric)
    kubectl top pods -n production -l app=service
    # Check HPA status
    kubectl get hpa -n production

If traffic spike confirmed:

    kubectl scale deployment/service -n production --replicas=10

Database checks:

    # Check database connections
    kubectl exec -it [pod] -n production -- psql -c "SELECT count(*) FROM pg_stat_activity;"
    # Check slow queries
    kubectl logs -l app=service -n production | grep "slow query"

If database issues:

Dependency checks:

    # Check external dependencies
    curl -s https://status.dependency.com/api/v2/status.json | jq .
    # Check circuit breaker status
    kubectl logs -l app=service -n production | grep "circuit"

If dependency failure:
| Issue | Quick Fix | Command |
|---|---|---|
| Pod crash loop | Restart deployment | `kubectl rollout restart deployment/service` |
| Memory pressure | Increase limits | `kubectl edit deployment/service` |
| Config error | Rollback config | `kubectl rollout undo deployment/service` |

    # List recent deployments
    kubectl rollout history deployment/service -n production
    # Rollback to previous version
    kubectl rollout undo deployment/service -n production
    # Rollback to specific revision
    kubectl rollout undo deployment/service -n production --to-revision=2

Verification Checklist:
Monitoring Period: Monitor for 15 minutes after resolution
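The monitoring period can be scripted as a repeated health check that stops at the first failure. This is a sketch only: `monitor_health` is a hypothetical helper, and the check command, count, and interval are assumptions to adapt per service.

```bash
#!/bin/sh
# Hypothetical helper. Usage: monitor_health CHECKS INTERVAL_SECONDS COMMAND [ARG...]
# Runs COMMAND up to CHECKS times, sleeping INTERVAL_SECONDS between runs.
monitor_health() {
    checks=$1; interval=$2
    shift 2
    i=1
    while [ "$i" -le "$checks" ]; do
        if ! "$@"; then
            echo "check $i/$checks failed; treat the incident as not resolved" >&2
            return 1
        fi
        # Sleep between checks, but not after the final one.
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    echo "all $checks checks passed"
}

# Roughly the 15-minute window, one check per minute (not executed here):
#   monitor_health 15 60 curl -fsS -o /dev/null https://api.example.com/health
```

Failing fast on the first bad check keeps the resolution message from being posted while the service is still flapping.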
✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]
# Runbook: Database Failover
| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |
---
## Overview
**Purpose:** Failover from primary database to replica when primary is unavailable.
**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover
**Expected Outcome:** Application traffic routed to new primary
**Estimated Duration:** 15-30 minutes
---
## Prerequisites
### Required Access
- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser
### Pre-Failover Checks
```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary
# Check replay lag on the replica (bytes received but not yet replayed)
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```
**Acceptable lag:** < 1 MB
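The lag threshold can be enforced in a script rather than judged by eye. A minimal sketch, assuming "1 MB" means 1 MiB (1048576 bytes) and that the lag value comes from the query above run with `psql -At` for bare output; `lag_acceptable` is a hypothetical name.

```bash
#!/bin/sh
# Hypothetical helper. Usage: lag_acceptable LAG_BYTES [MAX_BYTES]
# Exit 0: lag below limit. Exit 1: lag too high. Exit 2: unusable input.
lag_acceptable() {
    lag=$1
    max=${2:-1048576}   # assumed interpretation of the 1 MB limit
    # Refuse empty or non-numeric input, for example when the query returns NULL.
    case $lag in
        ''|*[!0-9]*) echo "unexpected lag value: '$lag'" >&2; return 2 ;;
    esac
    [ "$lag" -lt "$max" ]
}

# Intended use before promoting the replica (not executed here):
#   lag=$(psql -At -h pg-replica.postgres.database.azure.com -U postgres -c \
#     "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn();")
#   lag_acceptable "$lag" || echo "replica is behind; do not fail over yet"
```

Treating unusable input as a distinct exit code avoids promoting a replica when the lag could not actually be measured.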
Confirm that the primary is unavailable:

    # Test primary connectivity
    psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"
    # Check Azure status
    az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"

**Expected:** Connection timeout or error state
🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes

    # Promote replica to primary (Azure Flexible Server)
    az postgres flexible-server replica stop-replication \
      --resource-group rg-prod \
      --name pg-replica
    # Verify promotion
    az postgres flexible-server show \
      --resource-group rg-prod \
      --name pg-replica \
      --query "replicationRole"

**Expected:** `replicationRole: None` (standalone)

    # Update Kubernetes secret (in the namespace of the deployments that read it)
    kubectl create secret generic db-connection \
      --from-literal=host=pg-replica.postgres.database.azure.com \
      -n production --dry-run=client -o yaml | kubectl apply -f -
    # Restart applications to pick up new connection
    kubectl rollout restart deployment -l uses-database=true -n production
    # Check application logs
    kubectl logs -l app=api-service -n production --tail=50 | grep -i database
    # Test application health
    curl -s https://api.example.com/health | jq .database

If failover causes issues:

    # Only if the original primary is recoverable
    # Stop writes to new primary
    kubectl scale deployment --replicas=0 -l uses-database=true -n production
    # Bring the original primary back online (assumes it was stopped; adjust to the actual failure)
    az postgres flexible-server start --resource-group rg-prod --name pg-primary
    # Revert connection strings
    kubectl create secret generic db-connection \
      --from-literal=host=pg-primary.postgres.database.azure.com \
      -n production --dry-run=client -o yaml | kubectl apply -f -
    # Scale applications back up (they were scaled to 0 above; new pods read the reverted secret)
    kubectl scale deployment --replicas=[replicas] -l uses-database=true -n production

| Criterion | Description | Check |
|---|---|---|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
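Parts of this checklist can be checked mechanically. The sketch below only covers the metadata table: it assumes rows of the form `| **Field** | value |` as in the template above, and `check_runbook_metadata` is a hypothetical name. Criteria such as Actionable or Testable still need a human review.

```bash
#!/bin/sh
# Hypothetical helper. Usage: check_runbook_metadata FILE
# Prints each missing metadata field and fails if any are absent.
check_runbook_metadata() {
    file=$1
    missing=0
    for field in ID Category Service Owner "Last Updated" "Last Tested"; do
        # Match metadata table rows of the form: | **Field** | value |
        if ! grep -q "^| \*\*$field\*\* |" "$file"; then
            echo "missing field: $field"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}

# Possible use in CI over a runbooks directory (not executed here):
#   for f in runbooks/*.md; do check_runbook_metadata "$f" || exit 1; done
```

Run against the Database Failover example above, a check like this would report a missing Last Updated field, which is the kind of drift the Current and Versioned criteria are meant to catch.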
When creating runbooks:
For detailed guidance:
Last Updated: 2025-12-26