Author operational runbooks for incident response and troubleshooting
Creates operational runbooks for incident response and troubleshooting with structured templates.
/plugin marketplace add melodic-software/claude-code-plugins

/plugin install observability-planning@melodic-software

This skill is limited to using the following tools:
Use this skill when:
Before authoring runbooks:
- docs-management skill for runbook patterns

RUNBOOK GOALS:
┌─────────────────────────────────────────────────────────────────┐
│ WHY RUNBOOKS? │
├─────────────────────────────────────────────────────────────────┤
│ │
│ REDUCE MTTR (Mean Time To Recovery) │
│ ├── Pre-documented steps save diagnosis time │
│ ├── No need to figure it out during incident │
│ └── Consistent approach every time │
│ │
│ ENABLE ANYONE TO RESPOND │
│ ├── On-call doesn't need to be expert │
│ ├── Knowledge transfer from senior to junior │
│ ├── Reduces bus factor │
│ └── New team members can respond effectively │
│ │
│ DOCUMENT TRIBAL KNOWLEDGE │
│ ├── Capture what experts know │
│ ├── Make implicit knowledge explicit │
│ └── Preserve knowledge when people leave │
│ │
│ IMPROVE OVER TIME │
│ ├── Each incident improves the runbook │
│ ├── Capture new failure modes │
│ └── Evolve with the system │
│ │
└─────────────────────────────────────────────────────────────────┘
RUNBOOK CATEGORIES:
ALERT RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per alert │
│ │
│ Purpose: What to do when this specific alert fires │
│ Linked from: Alert annotations │
│ Contains: Triage, diagnosis, remediation for this alert │
└─────────────────────────────────────────────────────────────────┘
SYMPTOM RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per symptom/problem │
│ │
│ Purpose: How to diagnose/fix a type of problem │
│ Examples: "High Latency", "Out of Memory", "Connection Errors" │
│ Contains: Decision tree for multiple potential causes │
└─────────────────────────────────────────────────────────────────┘
PROCEDURE RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per operational procedure │
│ │
│ Purpose: How to perform maintenance tasks │
│ Examples: "Database Failover", "Certificate Rotation" │
│ Contains: Step-by-step procedures │
└─────────────────────────────────────────────────────────────────┘
SERVICE RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per service │
│ │
│ Purpose: Service overview and common operations │
│ Contains: Architecture, dependencies, common issues │
└─────────────────────────────────────────────────────────────────┘
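
Because each alert runbook is linked from the alert's annotations, a predictable URL scheme helps keep those links consistent. The following is a minimal sketch of one possible naming convention, not something the skill prescribes: the `runbook_url` function, the `RUNBOOK_BASE_URL` variable, and the `runbooks.example.com` address are all hypothetical.

```bash
# Hypothetical base URL; replace with wherever your runbooks are published.
RUNBOOK_BASE_URL="${RUNBOOK_BASE_URL:-https://runbooks.example.com/alerts}"

# Turn an alert name such as "OrdersApiHighErrorRate" or "High Error Rate"
# into a lowercase, hyphen-separated slug and append it to the base URL.
runbook_url() {
  slug=$(printf '%s' "$1" \
    | sed -e 's/\([a-z0-9]\)\([A-Z]\)/\1-\2/g' \
          -e 's/[^A-Za-z0-9]\{1,\}/-/g' \
          -e 's/^-//' -e 's/-$//' \
    | tr '[:upper:]' '[:lower:]')
  printf '%s/%s\n' "$RUNBOOK_BASE_URL" "$slug"
}
```

Calling `runbook_url "High Error Rate"` prints the base URL followed by `/high-error-rate`; the resulting link can then be pasted into the alert's annotation.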
RUNBOOK ANATOMY:
┌─────────────────────────────────────────────────────────────────┐
│ HEADER │
│ ├── Title, service, last updated │
│ ├── Alert link (if alert runbook) │
│ └── Quick summary │
├─────────────────────────────────────────────────────────────────┤
│ OVERVIEW │
│ ├── What this runbook covers │
│ ├── When to use it │
│ └── Expected outcome │
├─────────────────────────────────────────────────────────────────┤
│ QUICK ACTIONS (Optimize for speed) │
│ ├── 2-3 most common fixes │
│ ├── Copy-paste commands │
│ └── "Try these first" section │
├─────────────────────────────────────────────────────────────────┤
│ DIAGNOSIS │
│ ├── How to verify the problem │
│ ├── Key metrics/logs to check │
│ ├── Decision tree for root cause │
│ └── Common causes and indicators │
├─────────────────────────────────────────────────────────────────┤
│ REMEDIATION │
│ ├── Step-by-step fix for each cause │
│ ├── Rollback procedures │
│ └── Verification steps │
├─────────────────────────────────────────────────────────────────┤
│ ESCALATION │
│ ├── When to escalate │
│ ├── Who to contact │
│ └── What information to provide │
├─────────────────────────────────────────────────────────────────┤
│ REFERENCES │
│ ├── Related dashboards │
│ ├── Architecture docs │
│ └── Related runbooks │
└─────────────────────────────────────────────────────────────────┘
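
The anatomy above can be checked mechanically. The sketch below assumes runbooks are Markdown files in which each anatomy section (other than the header) appears as a heading with exactly that name; `check_runbook_sections` is a hypothetical helper, and the section list should be adjusted to whatever your templates actually use.

```bash
# Print the anatomy sections that are missing from a Markdown runbook.
# Exit status is 0 when every section is present, 1 otherwise.
check_runbook_sections() {
  file=$1
  missing=0
  for section in "Overview" "Quick Actions" "Diagnosis" \
                 "Remediation" "Escalation" "References"; do
    # Match a heading of any level whose text is exactly the section name.
    if ! grep -Eiq "^#+ +${section} *\$" "$file"; then
      echo "missing: $section"
      missing=1
    fi
  done
  return "$missing"
}

# Example (hypothetical path):
#   check_runbook_sections runbooks/orders-api-high-error-rate.md
```

Running a check like this in CI is one way to catch runbooks that have drifted from the template before they are needed during an incident.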
# Runbook: {Alert Name}
**Service:** {Service Name}
**Alert:** {Alert Name}
**Severity:** {P1/P2/P3/P4}
**Owner:** {Team/Person}
---
## Overview
**What this alert means:**
{One sentence explanation of what triggered this alert}
**User impact:**
{How users are affected when this fires}
**Expected resolution time:**
{Typical time to resolve}
---
## Quick Actions
> **Try these first before deep diagnosis**
### 1. Check if it's a known issue
```bash
# Check recent incidents
open https://status.example.com
# Check deployment history
kubectl rollout history deployment/orders-api -n production
```
### 2. Rolling restart
```bash
# Rolling restart (no downtime)
kubectl rollout restart deployment/orders-api -n production
# Wait for rollout
kubectl rollout status deployment/orders-api -n production
```
### 3. Rollback to previous version
```bash
# Rollback to previous version
kubectl rollout undo deployment/orders-api -n production
```
---
## Diagnosis
```bash
# Check current metric value
# (-G with --data-urlencode keeps curl from treating {} and [] as URL globs)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[5m])' | jq
```
**Dashboard:** Grafana - Orders API Overview

| Check | Command/Action |
|---|---|
| All pods affected? | kubectl get pods -n production -l app=orders-api |
| All endpoints? | Check per-endpoint error rate in Grafana |
| Started when? | Check Grafana annotations for deployments |

DECISION TREE:
```
Is it all pods or specific pods?
├── All pods → Likely code/config issue
│   ├── Recent deployment? → Rollback
│   └── No deployment? → Check dependencies
│
└── Specific pods → Likely infrastructure issue
    ├── Same node? → Node issue
    └── Random? → Check pod logs
```

| Cause | Indicators | Solution |
|---|---|---|
| Bad deployment | Errors started after deploy | Rollback |
| Database issues | Connection errors in logs | Check DB |
| Memory pressure | OOMKilled pods | Scale up or fix leak |
| Dependency down | Timeout errors | Check dependency status |

---
## Remediation
### Bad deployment
**Indicators:** Errors started after deploy

**Fix:**
```bash
# 1. Rollback to previous version
kubectl rollout undo deployment/orders-api -n production
# 2. Verify rollback
kubectl rollout status deployment/orders-api -n production
# 3. Check error rate dropping
watch -n5 'curl -s "http://prometheus:9090/..." | jq'
```
**Verification:** {How to confirm the fix worked}

### Database issues
**Indicators:** Connection errors in logs

**Fix:**
```bash
# 1. Check database status
psql -h db.example.com -U admin -c "SELECT count(*) FROM pg_stat_activity"
# 2. If connection pool exhausted, restart pods
kubectl rollout restart deployment/orders-api -n production
# 3. If database is down, escalate to DBA
```
### Memory pressure
**Indicators:** OOMKilled pods

**Fix:**
```bash
# 1. Check for OOMKilled
kubectl get events -n production --field-selector reason=OOMKilled
# 2. Increase memory limit temporarily
kubectl set resources deployment/orders-api -n production \
  --limits=memory=2Gi
# 3. File ticket for memory leak investigation
```
---
## Escalation
| Situation | Contact | Method |
|---|---|---|
| Database issues | DBA on-call | PagerDuty: dba-oncall |
| Network issues | Platform team | Slack: #platform-oncall |
| Security concern | Security team | PagerDuty: security |
| Unknown | Engineering Manager | Phone: XXX-XXX-XXXX |
When escalating, include:
| Date | Change | Author |
|---|---|---|
| 2024-01-15 | Added memory pressure section | @engineer |
| 2024-01-01 | Initial version | @oncall |
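
The DECISION TREE in the alert runbook template above can also be written as a small lookup, which is handy when triage is scripted or embedded in a chat bot. This is a sketch only: `next_action` and its argument names are hypothetical, and the branches simply mirror the tree in the template.

```bash
# Map the two triage answers to the next action in the decision tree.
#   $1: scope of affected pods: "all" or "specific"
#   $2: for "all":      was there a recent deployment? ("yes" or "no")
#       for "specific": are the affected pods on the same node? ("yes" or "no")
next_action() {
  case "$1:$2" in
    all:yes)      echo "Rollback" ;;
    all:no)       echo "Check dependencies" ;;
    specific:yes) echo "Node issue" ;;
    specific:no)  echo "Check pod logs" ;;
    *)            echo "usage: next_action all|specific yes|no" >&2; return 2 ;;
  esac
}
```

For example, `next_action all yes` prints `Rollback`, matching the "All pods, recent deployment" branch.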
# Procedure: {Procedure Name}
**Purpose:** {What this procedure accomplishes}
**When to use:** {Circumstances requiring this procedure}
**Duration:** {Expected time to complete}
**Risk Level:** {Low/Medium/High}
---
## Prerequisites
- [ ] Access to production Kubernetes cluster
- [ ] Database admin credentials
- [ ] Approval from {approver} if during business hours
## Pre-Flight Checks
```bash
# Verify you have correct access
kubectl auth can-i '*' '*' -n production
# Check current system state
kubectl get pods -n production
# Confirm no ongoing incidents
open https://status.example.com
```
## Procedure
### Step 1: {Step Name}
**Purpose:** {Why this step}
```bash
# Commands to execute
{command}
```
**Expected output:**
```
{expected output}
```
**Verification:** {How to confirm the step succeeded}

### Step 2: {Step Name}
**Purpose:** {Why this step}
```bash
{command}
```
**Expected output:**
```
{expected output}
```
**Verification:** {How to confirm the step succeeded}

...

## Rollback
If something goes wrong, follow these steps to restore previous state:
```bash
{rollback command}
{verification command}
```
## Cleanup
```bash
# Remove temporary resources
{cleanup commands}
```
## Troubleshooting
**Symptom:** {What you see}
**Cause:** {Why it happens}
**Solution:** {How to fix}

...
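
The command / expected output / verification pattern used by each procedure step can be enforced with a small wrapper, so that a step whose output does not match stops the procedure before the next step runs. The sketch below is an illustration under assumptions: `run_step` is a hypothetical helper, and it treats "expected output" as a substring match, which may be too loose or too strict for a given step.

```bash
# Run one procedure step: execute a command, compare its output with the
# expected text, and return non-zero on a failed command or a mismatch.
#   $1: step name   $2: command to run   $3: text the output must contain
run_step() {
  name=$1
  cmd=$2
  expected=$3
  echo "==> $name"
  output=$(sh -c "$cmd" 2>&1) || {
    echo "FAILED: $name (command exited non-zero)"
    return 1
  }
  case "$output" in
    *"$expected"*) echo "OK: $name" ;;
    *)             echo "FAILED: $name (expected output not found)"
                   return 1 ;;
  esac
}

# Example usage (not executed here):
#   run_step "Check access" "kubectl auth can-i get pods -n production" "yes" \
#     || echo "Stop and follow the rollback steps"
```

Chaining steps with `&&` stops at the first failure, which is the point at which the operator would switch to the rollback steps.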
# Runbook Quality Checklist
## Structure
- [ ] Clear title and service identification
- [ ] Last updated date is recent
- [ ] Owner/team identified
- [ ] Severity/risk level stated
## Quick Actions
- [ ] 2-3 most common fixes at the top
- [ ] Commands are copy-paste ready
- [ ] No unnecessary explanation before actions
## Diagnosis
- [ ] Decision tree for root cause
- [ ] Key metrics/logs to check
- [ ] Links to dashboards
- [ ] Common causes listed
## Remediation
- [ ] Step-by-step for each cause
- [ ] Verification after each fix
- [ ] Rollback procedures included
- [ ] Commands are tested and work
## Escalation
- [ ] Clear escalation criteria
- [ ] Contact information current
- [ ] Information to provide when escalating
## Usability
- [ ] Can be followed at 3am under stress
- [ ] No jargon without explanation
- [ ] Links are not broken
- [ ] Screenshots where helpful
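
The "Last updated date is recent" item lends itself to an automated check. A minimal sketch follows, assuming dates are recorded in ISO 8601 form (YYYY-MM-DD), which compares correctly as plain strings; `is_stale` is a hypothetical helper and the 90-day window in the usage comment is an arbitrary example, not a recommendation from this skill.

```bash
# Succeed (exit 0) when a last-updated date is strictly older than the cutoff.
# Both arguments must be ISO 8601 dates (YYYY-MM-DD); expr compares them as
# strings, so no date arithmetic is needed.
is_stale() {
  expr "$1" \< "$2" >/dev/null
}

# Example usage (not executed here; `date -d` is GNU-specific):
#   cutoff=$(date -d '90 days ago' +%F)
#   is_stale "2024-01-15" "$cutoff" && echo "runbook needs review"
```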
When authoring runbooks:
For detailed guidance:
Last Updated: 2025-12-26