Operational runbook templates for incident response and procedures
Creates operational runbooks for incident response and procedures using standardized templates. Triggers when working on incident response, maintenance procedures, or troubleshooting documentation.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install documentation-standards@melodic-software
This skill is limited to using the following tools:
Use this skill when:
Create operational runbooks for incident response, maintenance procedures, and operational tasks.
Before creating runbooks:
docs-management skill for runbook patterns
Runbook Categories:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Response Runbooks │
│ • Alert-triggered procedures │
│ • Escalation paths │
│ • Communication templates │
├─────────────────────────────────────────────────────────────────────────────┤
│ Operational Runbooks │
│ • Deployment procedures │
│ • Maintenance tasks │
│ • Backup/restore operations │
├─────────────────────────────────────────────────────────────────────────────┤
│ Troubleshooting Runbooks │
│ • Diagnostic procedures │
│ • Common issue resolution │
│ • Debug workflows │
├─────────────────────────────────────────────────────────────────────────────┤
│ Emergency Runbooks │
│ • Disaster recovery │
│ • Security incident response │
│ • Business continuity │
└─────────────────────────────────────────────────────────────────────────────┘
# Runbook: [TITLE]
| Property | Value |
|----------|-------|
| **ID** | RB-[NUMBER] |
| **Category** | [Incident/Operational/Troubleshooting/Emergency] |
| **Service** | [Service Name] |
| **Owner** | [Team/Individual] |
| **Last Updated** | [YYYY-MM-DD] |
| **Last Tested** | [YYYY-MM-DD] |
| **Review Frequency** | [Quarterly/Monthly/Annually] |
---
## Overview
**Purpose:** [What this runbook helps you accomplish]
**When to Use:** [Conditions that trigger this runbook]
**Expected Outcome:** [What success looks like]
**Estimated Duration:** [Time to complete]
---
## Prerequisites
### Required Access
- [ ] [System/Tool 1] - [Role/Permission needed]
- [ ] [System/Tool 2] - [Role/Permission needed]
### Required Knowledge
- [Skill/Knowledge 1]
- [Skill/Knowledge 2]
### Tools Needed
| Tool | Purpose | Access URL |
|------|---------|------------|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |
---
## Quick Reference
```text
Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace] │
│ View logs: kubectl logs -f [pod-name] -n [namespace] │
│ Restart service: kubectl rollout restart deployment/[name] │
│ Check metrics: [monitoring-url] │
└────────────────────────────────────────────────────────────────┘
```
Objective: [What this step accomplishes]
Actions:
[Action 1]
# Command example
kubectl get pods -n production
[Action 2]
Expected Result: [What you should see]
If This Fails: Go to Troubleshooting Section
Objective: [What this step accomplishes]
Actions:
Decision Point:
┌─────────────────────────────────────┐
│ Is the service responding? │
│ │
│ YES → Continue to Step 3 │
│ NO → Go to Step 4 (Escalation) │
└─────────────────────────────────────┘
Objective: Verify the issue is resolved
Verification Checklist:
Symptoms: [What you observe]
Cause: [Root cause]
Resolution:
Symptoms: [What you observe]
Cause: [Root cause]
Resolution:
| Level | Contact | Method | Response Time |
|---|---|---|---|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |
Template:
[TIMESTAMP] - [SERVICE] - [STATUS]
Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]
Actions Taken:
- [Action 1]
- [Action 2]
Next Steps:
- [Planned action]
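The header fields of this template can be filled in from a script so that every update has the same shape. The following is a minimal sketch, assuming a hypothetical helper named `status_update`; the function name and argument order are illustrative and not part of the template, and the "Actions Taken" and "Next Steps" lists are still written by hand.

```bash
# Hypothetical helper: renders the header of the status update template.
# Usage: status_update SERVICE STATUS IMPACT NEXT_UPDATE [TIMESTAMP]
status_update() {
  local service="$1"
  local status="$2"
  local impact="$3"
  local next_update="$4"
  local timestamp="${5:-}"

  # Default to the current UTC time when no timestamp is supplied.
  if [ -z "$timestamp" ]; then
    timestamp=$(date -u +"%Y-%m-%d %H:%M UTC")
  fi

  # Reject statuses outside the four values the template allows.
  case "$status" in
    Investigating|Identified|Monitoring|Resolved) ;;
    *) echo "unknown status: $status" >&2; return 1 ;;
  esac

  printf '%s - %s - %s\n' "$timestamp" "$service" "$status"
  printf 'Current Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'Next Update: %s\n' "$next_update"
}
```

Validating the status up front keeps updates limited to the four states the template lists, which makes an incident channel easier to scan.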
| Stakeholder | When to Notify | Method |
|---|---|---|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | |
| Leadership | If SEV1/SEV2 | Phone |
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |
# Incident Runbook: [Alert Name]
| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |
---
## Alert Details
**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```
Alert Meaning: [What this alert indicates]
False Positive Indicators: [Signs this might be a false alarm]
# Acknowledge in PagerDuty
pd incident:acknowledge
# Or via Slack
/pd ack
Quick Health Checks:
# Check service status
curl -s https://api.example.com/health | jq .
# Check error rate
kubectl logs -l app=service --tail=100 | grep -c ERROR
# Check pod status
kubectl get pods -n production -l app=service
Impact Assessment:
| Check | Command | Expected | Actual |
|---|---|---|---|
| Health endpoint | curl /health | 200 OK | [Result] |
| Error rate | grep ERROR | < 10 | [Result] |
| Pod status | kubectl get pods | Running | [Result] |
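The error-rate row compares a count of `ERROR` lines against an expected value of fewer than 10. A minimal sketch of that comparison follows, assuming a hypothetical function `error_check` that reads log lines from standard input; in practice the input would come from the `kubectl logs` command in the quick health checks.

```bash
# Hypothetical helper: counts ERROR lines on stdin and compares the count
# with a threshold (default 10, the "Expected" value in the table).
error_check() {
  local threshold="${1:-10}"
  local count
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that exit code.
  count=$(grep -c ERROR || true)
  if [ "$count" -lt "$threshold" ]; then
    echo "OK: $count ERROR lines (expected < $threshold)"
  else
    echo "ALERT: $count ERROR lines (expected < $threshold)"
    return 1
  fi
}

# Usage (assumes the label selector from the health checks):
# kubectl logs -l app=service --tail=100 | error_check
```

The non-zero return code on an alert lets the check be chained into a larger script with `&&` or `||`.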
Post in #incidents:
🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]
# Check request rate
kubectl top pods -n production -l app=service
# Check HPA status
kubectl get hpa -n production
If traffic spike confirmed:
kubectl scale deployment/service --replicas=10
# Check database connections
kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;"
# Check slow queries
kubectl logs -l app=service | grep "slow query"
If database issues:
# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .
# Check circuit breaker status
kubectl logs -l app=service | grep "circuit"
If dependency failure:
| Issue | Quick Fix | Command |
|---|---|---|
| Pod crash loop | Restart deployment | kubectl rollout restart deployment/service |
| Memory pressure | Increase limits | kubectl edit deployment/service |
| Config error | Rollback config | kubectl rollout undo deployment/service |
# List recent deployments
kubectl rollout history deployment/service -n production
# Rollback to previous version
kubectl rollout undo deployment/service -n production
# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2
Verification Checklist:
Monitoring Period: Monitor for 15 minutes after resolution
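The monitoring period can be automated with a simple polling loop. The sketch below assumes a hypothetical `monitor` function that takes the health check as a command, so the real check (for example the `curl` health probe from the quick health checks) can be swapped in. The 30-second interval in the usage line is an assumption, not a value from this runbook.

```bash
# Hypothetical helper: runs a check repeatedly for a fixed period.
# Usage: monitor DURATION_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
monitor() {
  local duration="$1"
  local interval="$2"
  shift 2
  local elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    # Stop at the first failed check so the responder can reopen the incident.
    if ! "$@"; then
      echo "check failed after ${elapsed}s"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "stable for ${duration}s"
}

# Usage: 15 minutes (900 s), polling every 30 s (the interval is an assumption).
# monitor 900 30 curl -sf -o /dev/null https://api.example.com/health
```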
✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]
# Runbook: Database Failover
| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |
---
## Overview
**Purpose:** Fail over from the primary database to a replica when the primary is unavailable.
**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover
**Expected Outcome:** Application traffic routed to new primary
**Estimated Duration:** 15-30 minutes
---
## Prerequisites
### Required Access
- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser
### Pre-Failover Checks
```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary
# Check replication lag
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```
Acceptable lag: < 1MB
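Once the lag query has returned a byte count, the threshold can be checked mechanically. A minimal sketch, assuming a hypothetical `lag_ok` function and taking 1 MB to mean 1,048,576 bytes (the runbook does not say whether it means MB or MiB):

```bash
# Hypothetical helper: succeeds when replication lag is under the limit.
# Usage: lag_ok LAG_BYTES [LIMIT_BYTES]
lag_ok() {
  local lag_bytes="$1"
  local limit="${2:-1048576}"   # assumption: 1 MB = 1024 * 1024 bytes
  # Refuse anything that is not a plain non-negative integer,
  # for example an empty result when the query returns NULL.
  case "$lag_bytes" in
    ''|*[!0-9]*) echo "invalid lag value: '$lag_bytes'" >&2; return 2 ;;
  esac
  if [ "$lag_bytes" -lt "$limit" ]; then
    echo "lag ${lag_bytes} bytes: within limit"
  else
    echo "lag ${lag_bytes} bytes: above limit"
    return 1
  fi
}
```

The byte count could be captured by running the lag query with `psql -At`, which prints only the value without headers or alignment, and passing the result to `lag_ok`.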
# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"
# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"
Expected: Connection timeout or error state
🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes
# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
--resource-group rg-prod \
--name pg-replica
# Verify promotion
az postgres flexible-server show \
--resource-group rg-prod \
--name pg-replica \
--query "replicationRole"
Expected: replicationRole: None (standalone)
# Update Kubernetes secret
kubectl create secret generic db-connection \
--from-literal=host=pg-replica.postgres.database.azure.com \
--dry-run=client -o yaml | kubectl apply -f -
# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production
# Check application logs
kubectl logs -l app=api-service --tail=50 | grep -i database
# Test application health
curl -s https://api.example.com/health | jq .database
If failover causes issues:
# If original primary is recoverable
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production
# Restore original primary
az postgres flexible-server update --resource-group rg-prod --name pg-primary --state Enabled
# Revert connection strings
kubectl create secret generic db-connection \
--from-literal=host=pg-primary.postgres.database.azure.com \
--dry-run=client -o yaml | kubectl apply -f -
# Restart applications
kubectl rollout restart deployment -l uses-database=true -n production
| Criterion | Description | Check |
|---|---|---|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
When creating runbooks:
For detailed guidance:
Last Updated: 2025-12-26