Runbook Authoring Skill

When to Use This Skill

Use this skill when:

Runbook Authoring tasks - Authoring operational runbooks for incident response and troubleshooting
Planning or design - Need guidance on Runbook Authoring approaches
Best practices - Want to follow established patterns and standards

Overview

Create operational runbooks for incident response and troubleshooting.

MANDATORY: Documentation-First Approach

Before authoring runbooks:

Invoke docs-management skill for runbook patterns
Verify SRE practices via MCP servers (perplexity)
Base guidance on operational best practices

Runbook Purpose

RUNBOOK GOALS:

┌─────────────────────────────────────────────────────────────────┐
│                    WHY RUNBOOKS?                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  REDUCE MTTR (Mean Time To Recovery)                             │
│  ├── Pre-documented steps save diagnosis time                    │
│  ├── No need to figure it out during incident                    │
│  └── Consistent approach every time                              │
│                                                                  │
│  ENABLE ANYONE TO RESPOND                                        │
│  ├── On-call doesn't need to be expert                           │
│  ├── Knowledge transfer from senior to junior                    │
│  ├── Reduces bus factor                                          │
│  └── New team members can respond effectively                    │
│                                                                  │
│  DOCUMENT TRIBAL KNOWLEDGE                                       │
│  ├── Capture what experts know                                   │
│  ├── Make implicit knowledge explicit                            │
│  └── Preserve knowledge when people leave                        │
│                                                                  │
│  IMPROVE OVER TIME                                               │
│  ├── Each incident improves the runbook                          │
│  ├── Capture new failure modes                                   │
│  └── Evolve with the system                                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Runbook Types

RUNBOOK CATEGORIES:

ALERT RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per alert                                            │
│                                                                  │
│ Purpose: What to do when this specific alert fires               │
│ Linked from: Alert annotations                                   │
│ Contains: Triage, diagnosis, remediation for this alert          │
└─────────────────────────────────────────────────────────────────┘

SYMPTOM RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per symptom/problem                                  │
│                                                                  │
│ Purpose: How to diagnose/fix a type of problem                   │
│ Examples: "High Latency", "Out of Memory", "Connection Errors"   │
│ Contains: Decision tree for multiple potential causes            │
└─────────────────────────────────────────────────────────────────┘

PROCEDURE RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per operational procedure                            │
│                                                                  │
│ Purpose: How to perform maintenance tasks                        │
│ Examples: "Database Failover", "Certificate Rotation"            │
│ Contains: Step-by-step procedures                                │
└─────────────────────────────────────────────────────────────────┘

SERVICE RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per service                                          │
│                                                                  │
│ Purpose: Service overview and common operations                  │
│ Contains: Architecture, dependencies, common issues              │
└─────────────────────────────────────────────────────────────────┘

Runbook Structure

RUNBOOK ANATOMY:

┌─────────────────────────────────────────────────────────────────┐
│ HEADER                                                           │
│ ├── Title, service, last updated                                 │
│ ├── Alert link (if alert runbook)                                │
│ └── Quick summary                                                │
├─────────────────────────────────────────────────────────────────┤
│ OVERVIEW                                                         │
│ ├── What this runbook covers                                     │
│ ├── When to use it                                               │
│ └── Expected outcome                                             │
├─────────────────────────────────────────────────────────────────┤
│ QUICK ACTIONS (Optimize for speed)                               │
│ ├── 2-3 most common fixes                                        │
│ ├── Copy-paste commands                                          │
│ └── "Try these first" section                                    │
├─────────────────────────────────────────────────────────────────┤
│ DIAGNOSIS                                                        │
│ ├── How to verify the problem                                    │
│ ├── Key metrics/logs to check                                    │
│ ├── Decision tree for root cause                                 │
│ └── Common causes and indicators                                 │
├─────────────────────────────────────────────────────────────────┤
│ REMEDIATION                                                      │
│ ├── Step-by-step fix for each cause                              │
│ ├── Rollback procedures                                          │
│ └── Verification steps                                           │
├─────────────────────────────────────────────────────────────────┤
│ ESCALATION                                                       │
│ ├── When to escalate                                             │
│ ├── Who to contact                                               │
│ └── What information to provide                                  │
├─────────────────────────────────────────────────────────────────┤
│ REFERENCES                                                       │
│ ├── Related dashboards                                           │
│ ├── Architecture docs                                            │
│ └── Related runbooks                                             │
└─────────────────────────────────────────────────────────────────┘
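
A minimal sketch of how the anatomy above might be turned into a starting skeleton. The function name `new_runbook`, the `Last updated` label, and the exact heading text are assumptions made for illustration; only the section order comes from the diagram.

```bash
# Hypothetical helper: print a runbook skeleton with the sections listed above.
# Usage: new_runbook "<title>" "<service>" > runbook.md
new_runbook() {
  local title="$1" service="$2"
  cat <<EOF
# Runbook: ${title}

**Service:** ${service}
**Last updated:** $(date +%Y-%m-%d)

## Overview

## Quick Actions

## Diagnosis

## Remediation

## Escalation

## References
EOF
}
```

Starting every runbook from the same skeleton keeps the section order predictable, so a responder finds Quick Actions in the same place each time.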

Alert Runbook Template

# Runbook: {Alert Name}

**Service:** {Service Name}
**Alert:** {Alert Name}
**Severity:** {P1/P2/P3/P4}
**Owner:** {Team/Person}

---

## Overview

**What this alert means:**
{One sentence explanation of what triggered this alert}

**User impact:**
{How users are affected when this fires}

**Expected resolution time:**
{Typical time to resolve}

---

## Quick Actions

> **Try these first before deep diagnosis**

### 1. Check if it's a known issue

```bash
# Check recent incidents
open https://status.example.com

# Check deployment history
kubectl rollout history deployment/orders-api -n production
```

### 2. Quick restart (if safe)

```bash
# Rolling restart (no downtime)
kubectl rollout restart deployment/orders-api -n production

# Wait for rollout
kubectl rollout status deployment/orders-api -n production
```

### 3. Rollback recent deployment

```bash
# Rollback to previous version
kubectl rollout undo deployment/orders-api -n production
```
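## Diagnosis

### Step 1: Verify the alert is real

```bash
# Check current metric value (-G with --data-urlencode stops curl from globbing the {} and [] in the query)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])' | jq
```

**Dashboard:** Grafana - Orders API Overview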

### Step 2: Identify the scope

| Check | Command/Action |
|-------|----------------|
| All pods affected? | `kubectl get pods -n production -l app=orders-api` |
| All endpoints? | Check per-endpoint error rate in Grafana |
| Started when? | Check Grafana annotations for deployments |

### Step 3: Determine root cause

DECISION TREE:

Is it all pods or specific pods?
├── All pods → Likely code/config issue
│   ├── Recent deployment? → Rollback
│   └── No deployment? → Check dependencies
│
└── Specific pods → Likely infrastructure issue
    ├── Same node? → Node issue
    └── Random? → Check pod logs
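
The decision tree above can also be written down as a small function, which makes the branches easy to review and test. This is a sketch only: `suggest_next_step` and its argument convention are hypothetical, and the responder supplies the answers by hand; nothing is detected automatically.

```bash
# Hypothetical helper that encodes the decision tree above.
#   $1 scope          "all" or "some"  (are all pods affected?)
#   $2 recent_deploy  "yes" or "no"    (only used when scope=all)
#   $3 same_node      "yes" or "no"    (only used when scope=some)
suggest_next_step() {
  local scope="$1" recent_deploy="${2:-no}" same_node="${3:-no}"
  case "$scope" in
    all)
      if [ "$recent_deploy" = "yes" ]; then
        echo "Likely code/config issue: roll back the deployment"
      else
        echo "Likely code/config issue: check dependencies"
      fi
      ;;
    some)
      if [ "$same_node" = "yes" ]; then
        echo "Likely infrastructure issue: investigate the node"
      else
        echo "Likely infrastructure issue: check pod logs"
      fi
      ;;
    *)
      echo "usage: suggest_next_step all|some [recent_deploy] [same_node]" >&2
      return 2
      ;;
  esac
}
```

For example, `suggest_next_step some no yes` follows the "specific pods, same node" branch.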

### Common Causes

| Cause | Indicators | Solution |
|-------|------------|----------|
| Bad deployment | Errors started after deploy | Rollback |
| Database issues | Connection errors in logs | Check DB |
| Memory pressure | OOMKilled pods | Scale up or fix leak |
| Dependency down | Timeout errors | Check dependency status |

## Remediation

### Cause 1: Bad Deployment

**Indicators:**

- Errors started immediately after deployment
- New error types appearing

**Fix:**

```bash
# 1. Rollback to previous version
kubectl rollout undo deployment/orders-api -n production

# 2. Verify rollback
kubectl rollout status deployment/orders-api -n production

# 3. Check error rate dropping
watch -n5 'curl -s "http://prometheus:9090/..." | jq'
```

**Verification:**

- Error rate returning to normal
- No new 5xx errors in logs
- Alert auto-resolves
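
Watching for the error rate to drop can be made less manual with a small polling helper. The sketch below rests on assumptions: `wait_until_below` is a hypothetical name, and the caller passes in whatever command prints the current metric value, so no particular Prometheus query or response format is assumed.

```bash
# Hypothetical helper: poll a command until its numeric output drops below a
# threshold, or give up after a fixed number of attempts.
#   $1 threshold   $2 max attempts   $3 seconds between attempts   $4... command
wait_until_below() {
  local threshold="$1" attempts="$2" delay="$3" value="" i=1
  shift 3
  while [ "$i" -le "$attempts" ]; do
    value="$("$@")" || value=""
    case "$value" in
      '' | *[!0-9.eE+-]*) ;; # empty or non-numeric output counts as "not yet"
      *)
        # awk does the floating-point comparison that [ ] cannot.
        if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 < t + 0) }'; then
          echo "ok: ${value} < ${threshold} (attempt ${i})"
          return 0
        fi
        ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still at or above ${threshold} after ${attempts} attempts" >&2
  return 1
}
```

For example, `wait_until_below 0.05 24 5 fetch_error_rate` would call a caller-defined `fetch_error_rate` every 5 seconds for up to 2 minutes and return non-zero if the value never drops below 0.05.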

### Cause 2: Database Connection Issues

**Indicators:**

- "Connection refused" or "Connection timeout" in logs
- Database metrics showing high connections

**Fix:**

```bash
# 1. Check database status
psql -h db.example.com -U admin -c "SELECT count(*) FROM pg_stat_activity"

# 2. If connection pool exhausted, restart pods
kubectl rollout restart deployment/orders-api -n production

# 3. If database is down, escalate to DBA
```

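### Cause 3: Memory Pressure

**Indicators:**

- OOMKilled in pod status or events
- Memory usage climbing before crash

**Fix:**

```bash
# 1. Check for OOMKilled containers. The reason is recorded in pod status
#    (lastState.terminated.reason); an event filter on reason=OOMKilled may return nothing.
kubectl get pods -n production -l app=orders-api \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# 2. Increase memory limit temporarily (this changes the pod template and triggers a rolling update)
kubectl set resources deployment/orders-api -n production \
  --limits=memory=2Gi

# 3. File ticket for memory leak investigation
```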

## Escalation

### When to escalate

- Quick actions didn't resolve the issue
- Root cause unclear after 15 minutes
- Data loss or corruption suspected
- Multiple services affected
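
The 15-minute criterion is easy to lose track of during an incident. As a sketch, assuming the responder records the time diagnosis started as Unix epoch seconds, a helper like the hypothetical `escalation_due` below can answer whether the window has elapsed.

```bash
# Hypothetical helper: has the 15-minute diagnosis window elapsed?
#   $1 = time diagnosis started, $2 = current time (Unix epoch seconds;
#        $2 defaults to now). Exit status 0 means "escalate".
escalation_due() {
  local started="$1" now="${2:-$(date +%s)}" limit=$((15 * 60))
  [ $((now - started)) -ge "$limit" ]
}
```

A typical use would be to run `started=$(date +%s)` when diagnosis begins and later check `if escalation_due "$started"; then echo "Escalate"; fi`.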

### Who to contact

| Situation | Contact | Method |
|-----------|---------|--------|
| Database issues | DBA on-call | PagerDuty: dba-oncall |
| Network issues | Platform team | Slack: #platform-oncall |
| Security concern | Security team | PagerDuty: security |
| Unknown | Engineering Manager | Phone: XXX-XXX-XXXX |

### Information to provide

When escalating, include:

- Alert name and time started
- Actions already taken
- Current hypothesis
- Link to incident channel

## References

- **Dashboard:** Grafana - Orders API
- **Logs:** Kibana - Orders API Errors
- **Architecture:** Confluence - Orders API Design
- **Related Runbooks:**
  - Database Connection Issues
  - High Latency Troubleshooting

## Revision History

| Date | Change | Author |
|------|--------|--------|
| 2024-01-15 | Added memory pressure section | @engineer |
| 2024-01-01 | Initial version | @oncall |

Procedure Runbook Template

# Procedure: {Procedure Name}

**Purpose:** {What this procedure accomplishes}
**When to use:** {Circumstances requiring this procedure}
**Duration:** {Expected time to complete}
**Risk Level:** {Low/Medium/High}

---

## Prerequisites

- [ ] Access to production Kubernetes cluster
- [ ] Database admin credentials
- [ ] Approval from {approver} if during business hours

## Pre-Flight Checks

```bash
# Verify you have correct access
kubectl auth can-i '*' '*' -n production

# Check current system state
kubectl get pods -n production

# Confirm no ongoing incidents
open https://status.example.com
```

## Procedure Steps

### Step 1: {First Step Title}

**Purpose:** {Why this step}

```bash
# Commands to execute
{command}
```

**Expected output:**

```text
{expected output}
```

**Verification:**

- [ ] {Check to confirm step succeeded}

### Step 2: {Second Step Title}

**Purpose:** {Why this step}

```bash
{command}
```

**Expected output:**

```text
{expected output}
```

**Verification:**

- [ ] {Check to confirm step succeeded}

### Step 3: {Third Step Title}

...

## Rollback Procedure

If something goes wrong, follow these steps to restore previous state:

### Step 1: {Rollback Step}

```bash
{rollback command}
```

### Step 2: {Verify Rollback}

```bash
{verification command}
```

## Post-Procedure Verification

- [ ] Service is healthy (check dashboard)
- [ ] No new errors in logs
- [ ] Dependent services unaffected
- [ ] Monitoring shows expected state
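
Where some of the verification items above can be expressed as commands, they could be run in sequence by a small wrapper. This is a sketch under assumptions: `run_checks` and its `label::command` argument format are hypothetical, and the commands to pass in depend on the service.

```bash
# Hypothetical helper: run named checks in order and stop at the first failure.
# Each argument has the form "label::command"; the command is run with `sh -c`
# and only its exit status is inspected.
run_checks() {
  local item label cmd
  for item in "$@"; do
    label="${item%%::*}" # text before the first "::"
    cmd="${item#*::}"    # text after the first "::"
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "PASS ${label}"
    else
      echo "FAIL ${label}"
      return 1
    fi
  done
}
```

For example, `run_checks "pods listed::kubectl get pods -n production"` reports `PASS pods listed` when the command exits with status 0. Items that need human judgment, such as reading a dashboard, stay manual.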

## Cleanup

```bash
# Remove temporary resources
{cleanup commands}
```

## Troubleshooting

### Issue: {Common Issue 1}

**Symptom:** {What you see}
**Cause:** {Why it happens}
**Solution:** {How to fix}

### Issue: {Common Issue 2}

...

## Template References

- {Related documentation}
- {Architecture diagrams}

Runbook Quality Checklist

# Runbook Quality Checklist

## Structure
- [ ] Clear title and service identification
- [ ] Last updated date is recent
- [ ] Owner/team identified
- [ ] Severity/risk level stated

## Quick Actions
- [ ] 2-3 most common fixes at the top
- [ ] Commands are copy-paste ready
- [ ] No unnecessary explanation before actions

## Diagnosis
- [ ] Decision tree for root cause
- [ ] Key metrics/logs to check
- [ ] Links to dashboards
- [ ] Common causes listed

## Remediation
- [ ] Step-by-step for each cause
- [ ] Verification after each fix
- [ ] Rollback procedures included
- [ ] Commands are tested and work

## Escalation
- [ ] Clear escalation criteria
- [ ] Contact information current
- [ ] Information to provide when escalating

## Usability
- [ ] Can be followed at 3am under stress
- [ ] No jargon without explanation
- [ ] Links are not broken
- [ ] Screenshots where helpful
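
Part of the checklist can be checked mechanically. The sketch below is one possible lint, not an established tool: `check_runbook` is a hypothetical name, and it assumes the runbook is a Markdown file whose sections are level-2 headings named exactly as in the alert template.

```bash
# Hypothetical lint: report template sections missing from a Markdown runbook.
# Prints one line per missing section; exit status is 1 if any are missing.
check_runbook() {
  local file="$1" missing=0 section
  for section in "Quick Actions" "Diagnosis" "Remediation" "Escalation"; do
    # -x matches the whole line, -F treats the heading as a literal string.
    if ! grep -qxF "## ${section}" "$file"; then
      echo "missing section: ${section}"
      missing=1
    fi
  done
  return "$missing"
}
```

A check like this only covers the structural items. Whether the commands work and the runbook can be followed under stress still needs a human review or a practice run.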

Workflow

When authoring runbooks:

Start from Incidents: Create runbooks after incidents
Optimize for Speed: Quick actions first
Include Commands: Copy-paste ready
Add Decision Trees: Help diagnose root cause
Define Escalation: When and who
Test Regularly: Verify runbooks work
Update After Incidents: Improve based on learnings
