Skill

writing-runbooks-recipe

Creates operational runbooks for incident response, investigation, and resolution. Use when writing runbooks, documenting incident procedures, or creating operational guides for monitoring alerts. Based on Google SRE Book and SRE Workbook best practices.

npx claudepluginhub andercore-labs/claudes-kitchen --plugin operational-excellence

Tool Access

This skill uses the workspace's default tool permissions.

Preview

**SCOPE:** Incident response documentation and operational procedures.

Supporting Assets

nodejs-runbooks.md

SKILL.md

Similar Skills

github-deep-research

63.9k

Conducts multi-round deep research on GitHub repos via API and web searches, generating markdown reports with executive summaries, timelines, metrics, and Mermaid diagrams.

2 files

bytedance-deer-flow-1

surprise-me

63.9k

Dynamically discovers and combines enabled skills into cohesive, unexpected delightful experiences like interactive HTML or themed artifacts. Activates on 'surprise me', inspiration, or boredom cues.

bytedance-deer-flow-1

image-generation

63.9k

Generates images from structured JSON prompts via Python script execution. Supports reference images and aspect ratios for characters, scenes, products, visuals.

2 files

bytedance-deer-flow-1

Stats

Parent Repo Stars0

Parent Repo Forks0

Last CommitApr 20, 2026

Actions

View Source View Plugin View on GitHub View README

Operational Runbooks

SCOPE: Incident response documentation and operational procedures.

PHILOSOPHY: "Thinking through and recording best practices ahead of time produces roughly a 3x improvement in MTTR vs 'winging it.'" — Google SRE Workbook

Quick Reference

docs/runbooks/{alert-name}.md
├── Alert Details (trigger, severity, impact)
├── Triage & Verification (is this real?)
├── Impact Assessment (user/business consequences)
├── Mitigation (stop the bleeding)
├── Investigation (find root cause)
├── Resolution (fix permanently)
├── Validation (confirm health restored)
└── Escalation (when to page someone)

When to Use

Alert without runbook | Incident procedure unclear | New service deployment | Monitor creation | Operational documentation | Knowledge transfer | On-call training

SRE Best Practices

Practice	Rationale
1 runbook per alert	Reduces MTTR, stress, human error
Target sleep-deprived engineer	Assume reader is tired, stressed, new to system
Actionable steps	Clear commands to run, not theory
Update after incidents	Fresh information from responders
Link to dashboards	Direct access to relevant monitoring
Include warnings	Prevent escalation from well-intentioned actions
Avoid how-to guides	Runbooks for incidents, not general ops

Five Essential Qualities (SRE Workbook)

Runbooks must be:

Quality	Definition	Test
Actionable	Clear steps to reduce MTTR	Can new hire follow without help?
Accessible	Easily discoverable when needed	Linked from alert? Searchable?
Accurate	Current and reliable information	Tested in last 90 days?
Authoritative	Single source of truth	No conflicting docs?
Adaptable	Straightforward to update	Updated after last incident?

When to Automate Instead

⚠️ Automation Signal: If runbook is deterministic list of commands run every time, automate it instead.

Automate:

Rollback procedures (predictable steps)
Scaling decisions (based on metrics)
Service restarts (no human judgment needed)
Log collection (repetitive data gathering)

Keep as Runbook:

Complex judgment calls (escalation decisions)
Context-dependent actions (depends on root cause)
Rare edge cases (not worth automation cost)
Learning exercises (teaches system behavior)

Runbook Structure

Required Sections (SRE Workbook)

Section	Purpose	Content	SRE Principle
Alert Details	Trigger definition	Name, severity, threshold, query, dashboard links	Single source of truth
Triage	Verify reality	Is alert accurate? False positive checks	Reduce noise
Impact	Business consequence	User/revenue/SLA impact, blast radius	Justify urgency
Mitigation	Stop the bleeding	Quick actions to stabilize system	Reduce MTTR
Investigation	Find root cause	Diagnostic commands, correlation, logs	Enable learning
Resolution	Fix permanently	Steps to resolve, rollback procedures	Prevent recurrence
Validation	Confirm health	Metrics showing system recovered	Avoid premature closure
Escalation	When to escalate	Conditions, contacts, SLA	Clear ownership

Template (SRE Workbook Format)

# {Alert Name} Runbook

## Alert Details

**Monitor:** {Datadog monitor name/ID} | [Dashboard](link) | [SLO](link)
**Urgency:** P1 (Critical) | P2 (High) | P3 (Medium) | P4 (Low)
**Threshold:** {trigger condition}
**Query:** `{Datadog query}`
**Last Updated:** {date} by {responder}

## Triage & Verification

**Goal:** Confirm this is a real incident, not a false positive.

### Quick Checks
```bash
# Verify alert is still active
dog metric query "avg:http.server.error_rate{service:my-service,env:prod}"

# Check if this is a known issue
# Navigate to: Incident Slack channel → Search for service name

False Positive Indicators

Deployment in progress (expected errors during rollout)
Scheduled maintenance window active
Load test running (known traffic spike)
Alert fired <2 minutes (wait for confirmation)

If false positive: Acknowledge alert, document in #incidents, return to monitoring.

Impact Assessment

User Impact: {How users are affected} Business Impact: {Revenue, SLA, compliance consequences} Blast Radius: {Affected services, customers, regions}

Example:

- Users unable to complete checkout
- Est. $X,XXX/min revenue loss
- SLO burn: 10% of monthly budget in 1 hour
- Affects: Production EU region, all tenants

Incident Severity Justification: Why this triggers SEV-1/SEV-2/SEV-3/SEV-4 incident.

Mitigation (Stop the Bleeding)

Goal: Stabilize system immediately, reduce customer impact.

⚠️ WARNING: These are temporary fixes. Full resolution required afterward.

Quick Mitigation Actions

Option A: Rollback (if recent deployment)

# Rollback to previous version
kubectl rollout undo deployment/my-service -n prod

# Monitor rollback progress
kubectl rollout status deployment/my-service -n prod

Option B: Scale horizontally (if capacity issue)

# Scale up replicas
kubectl scale deployment/my-service --replicas=8 -n prod

# Monitor pod health
watch kubectl get pods -n prod -l app=my-service

Option C: Traffic shedding (if overload)

# Enable rate limiting (if available)
curl -X POST http://my-service/admin/ratelimit --data '{"enabled":true,"limit":1000}'

# Or route traffic away temporarily
# Update load balancer / service mesh configuration

Mitigation Time Target: <15 minutes for P1 alerts, <30 minutes for P2 alerts

Investigation (Find Root Cause)

Goal: Identify why the incident occurred while system is stable.

1. Timeline Correlation

# Recent deployments
kubectl rollout history deployment/my-service -n prod

# Recent config changes
git log --since="1 hour ago" --all -- config/

# Check Datadog events for deployments, scaling, alerts
# Navigate to: Datadog Events → Filter by service

2. Dependency Health

# Database connectivity
dog service_check check db.connection db-primary 0
dog metric query "max:db.pool.active{service:my-service}"

# Kafka consumer lag
dog metric query "max:kafka.consumer_lag{service:my-service,consumer_group:*}"

# Redis availability
dog metric query "avg:redis.connections.active{service:my-service}"

3. Log Analysis

# Recent error logs
dog search query "service:my-service status:error" --from "1h"

# Look for patterns: timeouts, connection errors, OOM kills
kubectl logs -n prod deployment/my-service --tail=100 | grep -i error

4. Trace Analysis

# Navigate to: Datadog APM → Filter service + error:true
# Look for: High latency spans, error rates by endpoint, downstream failures

Resolution (Permanent Fix)

Goal: Implement lasting solution, prevent recurrence.

Root Cause → Resolution Map

If: Database connection pool exhausted

Increase pool size in config: DB_POOL_SIZE=50
Optimize slow queries identified in investigation
Add connection pool monitoring

If: Memory leak causing restarts

Keep rollback in place (mitigation from above)
Profile heap usage in staging: node --inspect app.js
Fix leak in code, add heap monitoring, redeploy

If: Downstream dependency timeout

Increase timeout: HTTP_TIMEOUT=5000
Add circuit breaker pattern
Implement retries with exponential backoff

Resolution Steps

{Specific action to fix root cause}
```
{command or code change}
```
{Deploy fix}
```
{deployment command}
```

{Monitor for recurrence}

# Watch metrics for 30 minutes
dog metric query "avg:http.server.error_rate{service:my-service}"

Validation (Confirm Health Restored)

Goal: Verify system fully recovered before closing incident.

Health Indicators

# Error rate normalized (<1%)
dog metric query "avg:http.server.error_rate{service:my-service,env:prod}"

# Latency back to baseline (p99 <500ms)
dog metric query "p99:http.server.duration{service:my-service,env:prod}"

# No active alerts for this service
dog monitor show_all --tags "service:my-service" --group_states "alert"

Business Validation

User impact resolved (check support tickets, customer reports)
SLO burn rate returned to normal
Downstream services recovered
No related alerts firing

Validation Time: Monitor for 2x mitigation duration before closing.

Escalation

Escalate if:

Problem persists > {duration}
Multiple services affected
Unable to identify root cause

Contacts:

Primary: {Team} ({Slack channel} / {PagerDuty})
Secondary: {Engineering Manager}
Exec: {VP Engineering} (for SEV-1 incidents only)

SLA: {Response time by alert urgency / incident severity}

Post-Incident (SRE Learning Loop)

Goal: Learn from incident, improve system reliability.

Phase 1: Document Incident

Update this runbook with fresh information from incident response
Create blameless postmortem (if SEV-1/SEV-2)
Document what worked and what didn't during response

Phase 2: Fix Systemic Issues

File bugs for root cause fixes and monitoring gaps
Review alert thresholds (too sensitive? too late?)
Add missing dashboards or metrics

Phase 3: Store Metrics (SRE Reliability Data)

Record MTTR (mean time to recovery)
Record MTTA (mean time to acknowledge)
Record MTTD (mean time to detect)
Log incident severity and service tier impact
Track error budget burn during incident
Update service incident history in Rootly/tracking system

Metrics Storage:

# Example: Store incident metrics in tracking system
# Format: {timestamp, service, severity, mttr_minutes, mtta_minutes, error_budget_burn}
echo "2025-01-15T14:30:00Z,my-service,SEV-2,45,5,2.5%" >> incidents.log

# Or use API to update incident management system
curl -X POST https://rootly.com/api/incidents \
  -H "Authorization: Bearer $ROOTLY_TOKEN" \
  -d '{
    "service": "my-service",
    "severity": "high",
    "mttr_minutes": 45,
    "mtta_minutes": 5,
    "error_budget_burn_percent": 2.5
  }'

Phase 4: Team Learning

Run "Wheel of Misfortune" exercise with team (SRE Workbook)
Share debugging techniques learned during incident
Update on-call training materials with new patterns

Metrics Usage:

Trend MTTR over time → measure runbook effectiveness
Analyze repeat incidents → identify systemic failures
Calculate SLO impact → prioritize reliability work
Inform tier decisions → data-driven service criticality


## Language-Specific Runbooks

Runtime-specific runbook patterns available:
- [nodejs-runbooks.md](nodejs-runbooks.md): Node.js patterns (event loop, GC, heap)
- Add Python, Go, Java patterns as needed

## Runbook-Alert Linking

### In Datadog Monitor

```json
{
  "name": "High Error Rate - My Service",
  "message": "Error rate exceeded threshold\n\nRunbook: https://github.com/org/repo/blob/main/docs/runbooks/high-error-rate.md\n\n@slack-oncall",
  "tags": [
    "service:my-service",
    "env:prod",
    "runbook_url:docs/runbooks/high-error-rate.md"
  ]
}

In IaC (Terraform)

resource "datadog_monitor" "high_error_rate" {
  name    = "High Error Rate - My Service"
  type    = "metric alert"
  message = <<-EOT
    Error rate exceeded threshold

    Runbook: https://github.com/org/repo/blob/main/docs/runbooks/high-error-rate.md

    @slack-oncall
  EOT

  query = "avg(last_5m):sum:http.server.errors{service:my-service,env:prod}.as_count() / sum:http.server.requests{service:my-service,env:prod}.as_count() > 0.05"

  tags = [
    "service:my-service",
    "env:prod",
    "runbook_url:docs/runbooks/high-error-rate.md"
  ]
}

Validation Checklist (SRE Quality Gates)

Completeness

Runbook exists for every critical/major alert (1:1 mapping)
Alert message links to runbook with dashboard links
Alert has runbook_url tag
All 8 required sections present (Alert, Triage, Impact, Mitigation, Investigation, Resolution, Validation, Escalation)
"Last Updated" timestamp and responder name

Actionability (Target: Sleep-Deprived Engineer)

Commands are copy-paste ready
No assumptions about system knowledge
False positive checks included
Warnings for dangerous actions
Mitigation time targets specified

Accuracy

Commands tested in non-prod environment
Dashboard links verified
Escalation contacts current
Updated after last incident response

Maintainability

Runtime-specific runbooks present (see language-specific patterns)
Avoids generic "check logs" advice
Not a general how-to guide (incident-focused only)
Scheduled quarterly review date set


## Common Patterns

### Database Connection Pool Exhausted

```bash
# Investigation
dog metric query "max:db.pool.active{service:my-service}"
dog metric query "max:db.pool.waiting{service:my-service}"

# Resolution
# 1. Scale service horizontally
kubectl scale deployment/my-service --replicas=8 -n prod

# 2. Or increase pool size (if DB can handle)
# Update config: DB_POOL_SIZE=50

# 3. Find slow queries
kubectl logs -n prod deployment/my-service | grep "slow query"

Kafka Consumer Lag

# Investigation
dog metric query "max:kafka.consumer_lag{service:my-service,consumer_group:*}"

# Resolution
# 1. Scale consumers
kubectl scale deployment/my-service-consumer --replicas=10 -n prod

# 2. Check processing time
dog metric query "avg:kafka.message.processing_time{service:my-service}"

# 3. Optimize message handlers (if slow)

Circuit Breaker Open

# Investigation
dog metric query "sum:circuit_breaker.state{service:my-service,state:open}"

# Check downstream service health
dog service_check check downstream.health downstream-service 0

# Resolution
# 1. Fix downstream service first
# 2. Reset circuit breaker (if healthy)
curl -X POST http://my-service/admin/circuit-breaker/reset

# 3. Monitor recovery
dog metric query "sum:circuit_breaker.state{service:my-service}"

Reference

Datadog:

Monitor API: https://docs.datadoghq.com/api/latest/monitors/
Service checks: dog service_check check
Metric queries: dog metric query
APM traces: Navigate to Datadog APM → Filter by service

Kubernetes:

Deployment history: kubectl rollout history deployment/SERVICE
Pod status: kubectl get pods -l app=SERVICE
Logs: kubectl logs deployment/SERVICE
Scale: kubectl scale deployment/SERVICE --replicas=N

Escalation:

PagerDuty integration
Slack alerts (@slack-oncall)
Alert urgency: P1 (Critical), P2 (High), P3 (Medium), P4 (Low)
Incident severity: SEV-1 (Critical), SEV-2 (High), SEV-3 (Medium), SEV-4 (Low)

Language-Specific:

See nodejs-runbooks.md for Node.js patterns
Add Python, Go, Java references as needed