Runbook template: symptom -> probable cause -> diagnostic steps -> fix procedure -> escalation for each alert. Use when creating on-call documentation for a service.
A runbook is the operational guide for an alert: what it means, how to diagnose it, and how to fix it. A good runbook means a 2am alert doesn't require waking up an expert.
Service or alert to document: $ARGUMENTS
# [Service Name] On-Call Runbook
**Last updated:** [Date]
**Owner:** [Team]
**Escalation:** [Who to call if runbook doesn't resolve it]
**Postmortems:** [Link to past postmortems for context]
---
## Service Overview
[1-2 sentences: what does this service do, what does it depend on, what depends on it]
**Architecture:**
[Client] -> [Service] -> [Database, Cache, External APIs]
**Critical dependencies:**
- PostgreSQL: stores [what]
- Redis: used for [what]
- Stripe API: used for [what]
**SLO:** 99.9% of requests return 2xx within 500ms (28-day window)
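To make the SLO concrete during an incident, it can help to translate it into an error budget. The following is a minimal sketch, not part of any existing tooling: the function names are hypothetical, the request count in the usage note is an illustrative figure, and the outage-minutes figure assumes traffic is spread evenly across the window.

```bash
# Error budget implied by a request-based SLO.
# Usage: error_budget_requests <slo_percent> <total_requests_in_window>
error_budget_requests() {
  local slo_percent="$1" total_requests="$2"
  awk -v slo="$slo_percent" -v total="$total_requests" 'BEGIN {
    # Requests allowed to miss the SLO, rounded to the nearest whole request.
    printf "%.0f\n", (100 - slo) / 100 * total
  }'
}

# Equivalent minutes of total outage, ASSUMING uniform traffic over the window.
# Usage: error_budget_outage_minutes <slo_percent> <window_days>
error_budget_outage_minutes() {
  local slo_percent="$1" window_days="$2"
  awk -v slo="$slo_percent" -v days="$window_days" 'BEGIN {
    printf "%.2f\n", (100 - slo) / 100 * days * 24 * 60
  }'
}
```

For example, `error_budget_requests 99.9 1000000` gives the number of requests that may fail or exceed 500ms out of one million, and `error_budget_outage_minutes 99.9 28` gives the equivalent full-outage time for the 28-day window.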
---
## Alert Reference
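| Alert Name | SEV | Likely Cause | Quick Fix |
|---|---|---|---|
| HighErrorRate | 2 | Bad deploy | `kubectl rollout undo deployment/[service]` |
| [Alert name] | [SEV] | [Likely cause] | [Quick fix] |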
---
## Alert: HighErrorRate
**Fires when:** HTTP 5xx error rate > 1% for 5 minutes
**Expected baseline:** < 0.1% error rate
**SEV:** 2
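When the dashboard is unavailable and only raw request counts are at hand, the thresholds above can be checked by hand. This is a rough sketch under stated assumptions: the function name and the `elevated` label for rates between the baseline and the alert threshold are hypothetical, and it does not model the 5-minute duration condition.

```bash
# Classify a 5xx error rate against the HighErrorRate thresholds.
# Usage: high_error_rate_status <5xx_count> <total_requests>
high_error_rate_status() {
  local errors="$1" total="$2"
  awk -v e="$errors" -v t="$total" 'BEGIN {
    if (t <= 0) { print "no-traffic"; exit }
    rate = 100 * e / t
    if (rate > 1)         state = "firing"    # alert threshold: > 1%
    else if (rate >= 0.1) state = "elevated"  # above the expected baseline
    else                  state = "baseline"  # expected: < 0.1%
    printf "%.2f%% %s\n", rate, state
  }'
}
```

For instance, `high_error_rate_status 15 1000` reports a 1.50% rate as `firing`.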
### Triage
```bash
# Step 1: Check error logs (last 15 minutes)
# Run in the CloudWatch Logs Insights console with the time range set to the last 15 minutes:
fields @timestamp, level, errorMessage, requestId, path
| filter level = "error"
| stats count(*) as cnt by errorMessage
| sort cnt desc | limit 10
# Step 2: Check recent deployments
git log --oneline --since="2 hours ago"
# Was there a deploy in the last 2 hours? -> consider rollback
# Step 3: Check downstream services
# Stripe status: https://status.stripe.com
# AWS status: https://health.aws.amazon.com
# Check circuit breaker state in metrics dashboard
```
### Probable Causes and Fixes
**A. Bad deploy (most common)**
- `kubectl rollout undo deployment/[service]` or re-deploy the previous tag

**B. Database connections exhausted**
- `SELECT count(*) FROM pg_stat_activity WHERE datname = '[db]';`

**C. Downstream service (e.g., Stripe) unavailable**

**D. Disk full**
- `df -h` on the affected instance

---
## Alert: [Latency alert name]
**Fires when:** P99 request duration > 2000ms for 5 minutes
**Expected baseline:** P99 < 500ms
**SEV:** 3 (2 if revenue-impacting)
### Triage
```bash
# Step 1: Identify which endpoints are slow
# Run in the CloudWatch Logs Insights console with the time range set to the last 15 minutes:
fields path, durationMs
| stats pct(durationMs, 99) as p99 by path
| sort p99 desc | limit 10
# Step 2: Check DB query times
# Look for: slow_query in logs, or check pg_stat_statements
# (column names below are for PostgreSQL 13+; older versions use mean_time and total_time)
SELECT query, mean_exec_time, calls, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
# Step 3: Check external API latency
# Look for: action="external_call" and high durationMs in logs
```
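If Logs Insights is unreachable and only a local log extract is available, the P99 can be approximated on the command line. The sketch below uses the nearest-rank method and assumes one numeric duration per line on stdin; the function name `percentile` is hypothetical, and `p` is expected to be an integer.

```bash
# Nearest-rank percentile of numeric values read from stdin, one per line.
# Usage: percentile <p> < durations.txt
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    NF { v[++n] = $1 }                  # collect non-empty lines in sorted order
    END {
      if (n == 0) exit 1                # no samples: signal failure
      rank = int((p * n + 99) / 100)    # ceil(p * n / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

Assuming JSON logs with a `durationMs` field and a hypothetical file `app.log`, something like `jq -r '.durationMs' app.log | percentile 99` yields a value to compare against the 2000ms threshold.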
### Probable Causes and Fixes
**A. Slow database queries**
- `CREATE INDEX CONCURRENTLY ...` (non-blocking)

**B. N+1 query problem (after code change)**

**C. External service slow (Stripe, etc.)**

**D. Memory pressure / GC pauses**

---
## Alert: [Service-down alert name]
**Fires when:** Service health check fails for 3 consecutive checks (90 seconds)
**SEV:** 1
### Triage
1. `kubectl get pods -n [namespace]`
2. `kubectl describe pod [pod-name]` -> if the pod was OOM-killed, increase memory
3. `kubectl logs [pod-name] --previous` -> look for fatal error
4. `kubectl rollout restart deployment/[service]`

Send within 5 minutes: "We are experiencing a service outage for [service]. Impact: [users affected]. Engineering team is actively investigating. Next update in 15 minutes."
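With many replicas, scanning the pod list by eye is slow. As a small sketch of the first triage step, the filter below reads `kubectl get pods --no-headers` output and prints only the pods that need attention. The function name is hypothetical, and it assumes the default column order (NAME, READY, STATUS, ...).

```bash
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "<pod> <status> <ready>" for pods that are not healthy.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")               # READY column, e.g. "0/1"
    if ($3 == "Completed") next         # finished jobs are not failures
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}
```

Usage: `kubectl get pods -n [namespace] --no-headers | unhealthy_pods`. Each pod it prints is a candidate for `kubectl describe pod` and `kubectl logs --previous`.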

---
## Maintenance Procedures
### Restart the service
```bash
# Kubernetes: rolling restart (replaces pods one at a time)
kubectl rollout restart deployment/[service] -n [namespace]
# Verify rollout status
kubectl rollout status deployment/[service] -n [namespace]
```
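The restart and verification commands can be combined so that a stalled rollout is noticed rather than left hanging. This is a hedged sketch: the function name, the default timeout, and the `KUBECTL` override (used here only to print commands for a dry run) are conventions of this example, not of kubectl itself.

```bash
# Restart a deployment and wait for the rollout to finish.
# Usage: restart_and_verify <service> <namespace> [timeout, default 300s]
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
restart_and_verify() {
  local service="$1" namespace="$2" timeout="${3:-300s}"
  local kubectl_cmd="${KUBECTL:-kubectl}"
  # Intentionally unquoted so that "echo kubectl" splits into two words.
  $kubectl_cmd rollout restart "deployment/${service}" -n "$namespace" || return 1
  if ! $kubectl_cmd rollout status "deployment/${service}" -n "$namespace" --timeout="$timeout"; then
    echo "Rollout of ${service} did not complete within ${timeout}; consider: kubectl rollout undo deployment/${service} -n ${namespace}" >&2
    return 1
  fi
}
```

A dry run such as `KUBECTL="echo kubectl" restart_and_verify [service] [namespace] 120s` prints the two commands without touching the cluster, which is a cheap way to check the arguments before running it for real.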
### Scale up
```bash
kubectl scale deployment/[service] --replicas=10 -n [namespace]
```
### Enable read-only mode
[Step-by-step to enable read-only mode if writes must be suspended]
---
### Output Format
Produce a complete runbook document for the service/alert, following the template above. Include:
1. Service overview and dependency map
2. SLO definition
3. One section per alert with: fires when, triage steps, probable causes, fixes, escalation
4. Maintenance procedures for common operational tasks
5. Escalation contacts and when to use them