Creates operational runbooks for incident response, investigation, and resolution. Use when writing runbooks, documenting incident procedures, or creating operational guides for monitoring alerts. Based on Google SRE Book and SRE Workbook best practices.
`npx claudepluginhub andercore-labs/claudes-kitchen --plugin operational-excellence`

This skill uses the workspace's default tool permissions.
**SCOPE:** Incident response documentation and operational procedures.
**PHILOSOPHY:** "Thinking through and recording best practices ahead of time produces roughly a 3x improvement in MTTR vs 'winging it.'" — Google SRE Workbook
```
docs/runbooks/{alert-name}.md
├── Alert Details (trigger, severity, impact)
├── Triage & Verification (is this real?)
├── Impact Assessment (user/business consequences)
├── Mitigation (stop the bleeding)
├── Investigation (find root cause)
├── Resolution (fix permanently)
├── Validation (confirm health restored)
└── Escalation (when to page someone)
```
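A skeleton with these sections can be scaffolded mechanically. A minimal sketch; the `new-runbook.sh` name and the `docs/runbooks/` location are illustrative assumptions matching the layout above:

```bash
#!/usr/bin/env bash
# Hypothetical scaffold: creates a runbook skeleton with the sections above.
ALERT=${1:?usage: new-runbook.sh <alert-name>}
mkdir -p docs/runbooks
cat > "docs/runbooks/${ALERT}.md" <<'EOF'
# {Alert Name} Runbook

## Alert Details
## Triage & Verification
## Impact Assessment
## Mitigation
## Investigation
## Resolution
## Validation
## Escalation
EOF
echo "Created docs/runbooks/${ALERT}.md"
```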
Use when:

- Alert without runbook
- Incident procedure unclear
- New service deployment
- Monitor creation
- Operational documentation
- Knowledge transfer
- On-call training
| Practice | Rationale |
|---|---|
| 1 runbook per alert | Reduces MTTR, stress, human error |
| Target sleep-deprived engineer | Assume reader is tired, stressed, new to system |
| Actionable steps | Clear commands to run, not theory |
| Update after incidents | Fresh information from responders |
| Link to dashboards | Direct access to relevant monitoring |
| Include warnings | Prevent escalation from well-intentioned actions |
| Avoid how-to guides | Runbooks for incidents, not general ops |
Runbooks must be:
| Quality | Definition | Test |
|---|---|---|
| Actionable | Clear steps to reduce MTTR | Can new hire follow without help? |
| Accessible | Easily discoverable when needed | Linked from alert? Searchable? |
| Accurate | Current and reliable information | Tested in last 90 days? |
| Authoritative | Single source of truth | No conflicting docs? |
| Adaptable | Straightforward to update | Updated after last incident? |
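The "Accurate" test ("Tested in last 90 days?") can be spot-checked from version control. A minimal sketch, assuming runbooks are git-tracked under `docs/runbooks/`:

```bash
# List runbooks whose last commit is older than 90 days (assumed layout).
for f in docs/runbooks/*.md; do
  last=$(git log -1 --format=%ct -- "$f")
  [ -z "$last" ] && { echo "UNTRACKED: $f"; continue; }
  age_days=$(( ($(date +%s) - last) / 86400 ))
  [ "$age_days" -gt 90 ] && echo "STALE (${age_days}d): $f"
done
```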
⚠️ **Automation Signal:** If a runbook is a deterministic list of commands run the same way every time, automate it instead.
**Automate:** deterministic, repeatable command sequences (e.g. restarts, rollbacks, scaling), as in the sketch below.
**Keep as Runbook:** steps that require human judgment, diagnosis, or risk trade-offs.
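What "automate it" can look like in practice: a deterministic rollback-and-verify sequence promoted from runbook prose to a script, so the runbook shrinks to "run the script". A sketch only; service name, namespace, and metric are illustrative:

```bash
#!/usr/bin/env bash
# Deterministic mitigation promoted to a script (illustrative names).
set -euo pipefail
kubectl rollout undo deployment/my-service -n prod
kubectl rollout status deployment/my-service -n prod --timeout=5m
# Confirm the error rate is recovering before declaring success
dog metric query "avg:http.server.error_rate{service:my-service,env:prod}"
```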
| Section | Purpose | Content | SRE Principle |
|---|---|---|---|
| Alert Details | Trigger definition | Name, severity, threshold, query, dashboard links | Single source of truth |
| Triage | Verify reality | Is alert accurate? False positive checks | Reduce noise |
| Impact | Business consequence | User/revenue/SLA impact, blast radius | Justify urgency |
| Mitigation | Stop the bleeding | Quick actions to stabilize system | Reduce MTTR |
| Investigation | Find root cause | Diagnostic commands, correlation, logs | Enable learning |
| Resolution | Fix permanently | Steps to resolve, rollback procedures | Prevent recurrence |
| Validation | Confirm health | Metrics showing system recovered | Avoid premature closure |
| Escalation | When to escalate | Conditions, contacts, SLA | Clear ownership |
# {Alert Name} Runbook
## Alert Details
**Monitor:** {Datadog monitor name/ID} | [Dashboard](link) | [SLO](link)
**Urgency:** P1 (Critical) | P2 (High) | P3 (Medium) | P4 (Low)
**Threshold:** {trigger condition}
**Query:** `{Datadog query}`
**Last Updated:** {date} by {responder}
## Triage & Verification
**Goal:** Confirm this is a real incident, not a false positive.
### Quick Checks
```bash
# Verify alert is still active
dog metric query "avg:http.server.error_rate{service:my-service,env:prod}"
# Check if this is a known issue
# Navigate to: Incident Slack channel → Search for service name
```

**If false positive:** Acknowledge the alert, document it in #incidents, and return to monitoring.
## Impact Assessment

**User Impact:** {How users are affected}
**Business Impact:** {Revenue, SLA, compliance consequences}
**Blast Radius:** {Affected services, customers, regions}
Example:
- Users unable to complete checkout
- Est. $X,XXX/min revenue loss
- SLO burn: 10% of monthly budget in 1 hour
- Affects: Production EU region, all tenants
**Incident Severity Justification:** Why this triggers a SEV-1/SEV-2/SEV-3/SEV-4 incident.
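The SLO-burn figure in the example above can be sanity-checked with burn-rate arithmetic. A sketch assuming a 99.9% monthly SLO, which allows roughly 43.2 minutes of full downtime per 30 days:

```bash
# Assumed: 99.9% SLO over 30 days => 0.1% error budget
budget_min=$(echo "30*24*60*0.001" | bc)           # 43.2 minutes/month
# Burning 10% of the budget in 1 hour:
burn_min_per_hr=$(echo "0.10*$budget_min" | bc)    # 4.32 minutes/hour
sustainable=$(echo "$budget_min/720" | bc -l)      # 0.06 minutes/hour
echo "burn rate: $(echo "$burn_min_per_hr/$sustainable" | bc -l)x"  # 72x
```

At a 72x burn rate, the entire monthly budget is gone in about 10 hours, which is the quantitative case for a high-severity page.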
## Mitigation

**Goal:** Stabilize the system immediately and reduce customer impact.

⚠️ **WARNING:** These are temporary fixes. Full resolution is required afterward.
**Option A: Rollback** (if recent deployment)

```bash
# Roll back to the previous version
kubectl rollout undo deployment/my-service -n prod

# Monitor rollback progress
kubectl rollout status deployment/my-service -n prod
```

**Option B: Scale horizontally** (if capacity issue)

```bash
# Scale up replicas
kubectl scale deployment/my-service --replicas=8 -n prod

# Monitor pod health
watch kubectl get pods -n prod -l app=my-service
```

**Option C: Traffic shedding** (if overload)

```bash
# Enable rate limiting (if available)
curl -X POST http://my-service/admin/ratelimit --data '{"enabled":true,"limit":1000}'

# Or route traffic away temporarily
# Update load balancer / service mesh configuration
```

**Mitigation Time Target:** <15 minutes for P1 alerts, <30 minutes for P2 alerts.
## Investigation

**Goal:** Identify why the incident occurred while the system is stable.
**Recent changes**

```bash
# Recent deployments
kubectl rollout history deployment/my-service -n prod

# Recent config changes
git log --since="1 hour ago" --all -- config/

# Check Datadog events for deployments, scaling, alerts
# Navigate to: Datadog Events → Filter by service
```

**Dependency health**

```bash
# Database connectivity
dog service_check check db.connection db-primary 0
dog metric query "max:db.pool.active{service:my-service}"

# Kafka consumer lag
dog metric query "max:kafka.consumer_lag{service:my-service,consumer_group:*}"

# Redis availability
dog metric query "avg:redis.connections.active{service:my-service}"
```

**Logs**

```bash
# Recent error logs
dog search query "service:my-service status:error" --from "1h"

# Look for patterns: timeouts, connection errors, OOM kills
kubectl logs -n prod deployment/my-service --tail=100 | grep -i error
```

**Traces**

Navigate to: Datadog APM → filter by service and `error:true`. Look for high-latency spans, error rates by endpoint, and downstream failures.
## Resolution

**Goal:** Implement a lasting solution and prevent recurrence.
- **If** database connection pool exhausted: increase the pool size (e.g. `DB_POOL_SIZE=50`)
- **If** memory leak causing restarts: profile the heap (e.g. `node --inspect app.js`)
- **If** downstream dependency timeout: tune the client timeout (e.g. `HTTP_TIMEOUT=5000`)

```bash
{command or code change}
{deployment command}

# Watch metrics for 30 minutes
dog metric query "avg:http.server.error_rate{service:my-service}"
```
## Validation

**Goal:** Verify the system has fully recovered before closing the incident.
```bash
# Error rate normalized (<1%)
dog metric query "avg:http.server.error_rate{service:my-service,env:prod}"

# Latency back to baseline (p99 <500ms)
dog metric query "p99:http.server.duration{service:my-service,env:prod}"

# No active alerts for this service
dog monitor show_all --tags "service:my-service" --group_states "alert"
```
**Validation Time:** Monitor for 2x the mitigation duration before closing.
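To make the validation window mechanical rather than a judgment call, the health queries above can be polled on a timer. A sketch assuming the same `dog` CLI used throughout and a 30-minute window:

```bash
#!/usr/bin/env bash
# Sketch: timestamped record of the validation queries, once per minute.
WINDOW_MINUTES=${1:-30}   # e.g. 2x a 15-minute mitigation
for _ in $(seq 1 "$WINDOW_MINUTES"); do
  echo "--- $(date -u +%Y-%m-%dT%H:%M:%SZ) ---"
  dog metric query "avg:http.server.error_rate{service:my-service,env:prod}"
  dog metric query "p99:http.server.duration{service:my-service,env:prod}"
  sleep 60
done
```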
## Escalation

**Escalate if:** {mitigation exceeds the time target, impact is growing, or the root cause is outside this team's ownership}
**Contacts:** {primary on-call, team lead, vendor support}
**SLA:** {Response time by alert urgency / incident severity}
## Post-Incident

**Goal:** Learn from the incident and improve system reliability.

**Metrics Storage:**
```bash
# Example: store incident metrics in a tracking system
# Format: {timestamp, service, severity, mttr_minutes, mtta_minutes, error_budget_burn}
echo "2025-01-15T14:30:00Z,my-service,SEV-2,45,5,2.5%" >> incidents.log

# Or use the API of your incident management system
curl -X POST https://rootly.com/api/incidents \
  -H "Authorization: Bearer $ROOTLY_TOKEN" \
  -d '{
    "service": "my-service",
    "severity": "high",
    "mttr_minutes": 45,
    "mtta_minutes": 5,
    "error_budget_burn_percent": 2.5
  }'
```
**Metrics Usage:** Track MTTR/MTTA trends and error-budget burn over time to prioritize reliability work.
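As one concrete example of usage, mean MTTR per service can be pulled straight from the `incidents.log` CSV format defined above:

```bash
# Fields: timestamp,service,severity,mttr_minutes,mtta_minutes,error_budget_burn
awk -F, '{ sum[$2] += $4; n[$2]++ }
         END { for (s in sum) printf "%s: avg MTTR %.1f min over %d incidents\n", s, sum[s]/n[s], n[s] }' incidents.log
```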
## Language-Specific Runbooks
Runtime-specific runbook patterns available:
- [nodejs-runbooks.md](nodejs-runbooks.md): Node.js patterns (event loop, GC, heap)
- Add Python, Go, Java patterns as needed
## Runbook-Alert Linking
### In Datadog Monitor
```json
{
"name": "High Error Rate - My Service",
"message": "Error rate exceeded threshold\n\nRunbook: https://github.com/org/repo/blob/main/docs/runbooks/high-error-rate.md\n\n@slack-oncall",
"tags": [
"service:my-service",
"env:prod",
"runbook_url:docs/runbooks/high-error-rate.md"
]
}
```

### In Terraform

```hcl
resource "datadog_monitor" "high_error_rate" {
name = "High Error Rate - My Service"
type = "metric alert"
message = <<-EOT
Error rate exceeded threshold
Runbook: https://github.com/org/repo/blob/main/docs/runbooks/high-error-rate.md
@slack-oncall
EOT
query = "avg(last_5m):sum:http.server.errors{service:my-service,env:prod}.as_count() / sum:http.server.requests{service:my-service,env:prod}.as_count() > 0.05"
tags = [
"service:my-service",
"env:prod",
"runbook_url:docs/runbooks/high-error-rate.md"
]
}
```
## Common Patterns
### Database Connection Pool Exhausted
```bash
# Investigation
dog metric query "max:db.pool.active{service:my-service}"
dog metric query "max:db.pool.waiting{service:my-service}"
# Resolution
# 1. Scale service horizontally
kubectl scale deployment/my-service --replicas=8 -n prod
# 2. Or increase pool size (if DB can handle)
# Update config: DB_POOL_SIZE=50
# 3. Find slow queries
kubectl logs -n prod deployment/my-service | grep "slow query"
```

### Kafka Consumer Lag

```bash
# Investigation
dog metric query "max:kafka.consumer_lag{service:my-service,consumer_group:*}"
# Resolution
# 1. Scale consumers
kubectl scale deployment/my-service-consumer --replicas=10 -n prod
# 2. Check processing time
dog metric query "avg:kafka.message.processing_time{service:my-service}"
# 3. Optimize message handlers (if slow)
```

### Circuit Breaker Open

```bash
# Investigation
dog metric query "sum:circuit_breaker.state{service:my-service,state:open}"
# Check downstream service health
dog service_check check downstream.health downstream-service 0
# Resolution
# 1. Fix downstream service first
# 2. Reset circuit breaker (if healthy)
curl -X POST http://my-service/admin/circuit-breaker/reset
# 3. Monitor recovery
dog metric query "sum:circuit_breaker.state{service:my-service}"
## Quick Reference

**Datadog:** `dog service_check check`, `dog metric query`
**Kubernetes:** `kubectl rollout history deployment/SERVICE`, `kubectl get pods -l app=SERVICE`, `kubectl logs deployment/SERVICE`, `kubectl scale deployment/SERVICE --replicas=N`
**Escalation:** see the Escalation section of the template above
**Language-Specific:** [nodejs-runbooks.md](nodejs-runbooks.md)