Use when creating incident response procedures and on-call playbooks. Covers incident management, communication protocols, and post-mortem documentation.
Generates incident response procedures and on-call playbooks for production outages.
Install via the Han plugin marketplace: `/plugin marketplace add TheBushidoCollective/han`, then `/plugin install jutsu-runbooks@han`.
This skill covers creating effective incident response procedures for handling production incidents and on-call scenarios.

Severity levels:

- **SEV-1 (Critical)** - respond immediately
- **SEV-2 (High)** - respond within 15 minutes
- **SEV-3 (Medium)** - respond within 1 hour
- **SEV-4 (Low)** - respond next day
# Incident Response: [Alert/Issue Name]
**Severity:** SEV-1/SEV-2/SEV-3/SEV-4
**Response Time:** Immediate / 15 min / 1 hour / Next day
**Owner:** On-call Engineer
## Incident Detection
**This runbook is triggered by:**
- PagerDuty alert: `api_error_rate_high`
- Customer report in #support
- Monitoring dashboard showing anomaly
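If you want to confirm what the alert is seeing before dashboards load, a rough spot check from recent pod logs can help. This is a sketch only; the deployment name, namespace, and log format (JSON access logs with a numeric `status` field) are assumptions, not part of the runbook's tooling.

```bash
# Rough error-rate spot check over the last 5 minutes of API logs.
# Assumes JSON access logs with a "status" field; adjust to your log format.
kubectl logs deployment/api-server -n production --since=5m > /tmp/api-recent.log
total=$(grep -c '"status":' /tmp/api-recent.log)
errors=$(grep -c '"status":5' /tmp/api-recent.log)
echo "errors=${errors} total=${total}"
```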
## Initial Response (First 5 Minutes)
### 1. Acknowledge & Assess
```bash
# Check current status
curl https://api.example.com/health
kubectl get pods -n production
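
# Optional quick triage (a sketch; deployment/namespace names are assumptions):
# did anything just deploy, and is the problem cluster-wide or limited to a few pods?
kubectl rollout history deployment/api-server -n production | tail -3
kubectl get pods -n production --sort-by=.metadata.creationTimestamp | tail -10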
```

**Determine severity:**

- **SEV-1:** open an incident channel: `/incident create SEV-1 API Outage`
- **SEV-2:** respond within 15 minutes
- **SEV-3:** respond within 1 hour
**Create incident doc (copy the template):**

```markdown
Incident: API Outage
Started: 2025-01-15 14:30 UTC
Severity: SEV-1

Timeline:
14:30 - Alert fired
14:31 - On-call acknowledged
14:32 - Assessed as SEV-1
14:33 - Created incident channel
```
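One way to spin up the doc quickly, assuming you keep the template in the repo (the paths here are hypothetical):

```bash
# Copy the shared incident template into a dated doc (paths are assumptions).
cp docs/templates/incident.md "docs/incidents/$(date -u +%Y-%m-%d)-api-outage.md"
```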
## Mitigation

**Goal:** Stop the bleeding and restore service. Pick the option that matches the symptoms:
### Option A: Rollback Recent Deploy

```bash
# Check recent deploys
kubectl rollout history deployment/api-server

# Roll back if the deploy landed less than 30 minutes ago
kubectl rollout undo deployment/api-server
```

**When to use:** Deploy coincides with incident start.
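After undoing the rollout, confirm the rollback actually finished before moving on. A minimal check, assuming the same deployment name:

```bash
# Wait for the rollback to complete and confirm the running revision.
kubectl rollout status deployment/api-server --timeout=120s
kubectl rollout history deployment/api-server | tail -3
```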
### Option B: Scale Up

```bash
# Increase replicas
kubectl scale deployment/api-server --replicas=20
```

**When to use:** High traffic, resource exhaustion.
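Before scaling, a quick look at resource usage can confirm that exhaustion is actually the problem. This sketch assumes metrics-server is installed and the namespace name matches:

```bash
# Check current resource usage and any autoscaler limits before overriding them.
kubectl top pods -n production --sort-by=memory
kubectl get hpa -n production
```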
### Option C: Restart Services

```bash
# Restart pods
kubectl rollout restart deployment/api-server
```

**When to use:** Memory leak, connection pool issues.
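A quick check that the restart is warranted, and that it completes cleanly afterwards; the deployment and namespace names are assumptions:

```bash
# Look for pods that are crash-looping or were OOM-killed.
kubectl get pods -n production | grep -E 'CrashLoopBackOff|OOMKilled|Error' || echo "no obviously unhealthy pods"

# After triggering the restart, wait for the new pods to become ready.
kubectl rollout status deployment/api-server --timeout=120s
```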
### Option D: Enable Circuit Breaker

```bash
# Disable failing external service calls
kubectl set env deployment/api-server FEATURE_EXTERNAL_API=false
```

**When to use:** Third-party service degraded.
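To confirm the flag actually rolled out and that the third party really is the culprit, something like the following can help; the env var comes from the example above, and the vendor status URL is a placeholder:

```bash
# Verify the feature flag is now set on the deployment.
kubectl set env deployment/api-server --list | grep FEATURE_EXTERNAL_API

# Spot-check the third-party dependency directly (replace with the real vendor endpoint).
curl -s -o /dev/null -w "%{http_code}\n" https://status.external-vendor.example/api/health
```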
## Communication Updates

Post updates to the incident channel on a regular cadence:

- SEV-1: every 10 minutes
- SEV-2: every 30 minutes
- SEV-3: hourly

**Example update:**
**[14:45] UPDATE**
**Status:** Investigating
**Impact:** API returning 503 errors. ~75% of requests failing.
**Actions Taken:**
- Rolled back deploy from 14:25
- Increased pod replicas to 15
**Next Steps:**
- Monitoring rollback impact
- Investigating database connection issues
**ETA:** Unknown
**Customer Impact:** Users cannot place orders.
**Workaround:** None available.
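If updates are posted by hand, a small helper using a Slack incoming webhook makes the cadence easier to hit. This is a sketch, not part of the runbook's tooling; `SLACK_WEBHOOK_URL` is a placeholder for your channel's webhook.

```bash
# Post a status update to the incident channel via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder; set it to your channel's webhook URL.
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text":"[14:45] UPDATE - Status: Investigating. Rolled back deploy, monitoring error rate."}' \
  "$SLACK_WEBHOOK_URL"
```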
## Status Messages
**Investigating:**
> We are aware of elevated error rates on the API.
> Investigating the root cause. Updates every 10 minutes.
**Identified:**
> Root cause identified: database connection pool exhausted.
> Implementing fix now.
**Monitoring:**
> Fix deployed. Error rate dropping.
> Monitoring for 30 minutes before declaring resolved.
**Resolved:**
> Incident resolved. Error rate back to baseline.
> Post-mortem to follow.
While the service is recovering, capture evidence for the root-cause investigation before it disappears:

```bash
# Capture logs before they rotate
kubectl logs deployment/api-server > incident-logs.txt

# Snapshot metrics
curl -H "Authorization: Bearer $DD_API_KEY" \
  "https://api.datadoghq.com/api/v1/graph/snapshot?..." > metrics.png

# Database state
psql -c "SELECT * FROM pg_stat_activity" > db-state.txt
```
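A couple of extra artifacts are easy to lose once pods are replaced; capturing them now costs little. A sketch, with deployment and namespace names assumed:

```bash
# Cluster events and the previous container's logs disappear quickly after recovery.
kubectl get events -n production --sort-by=.lastTimestamp > events.txt
kubectl logs deployment/api-server -n production --previous > previous-container-logs.txt || true
```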
## Timeline
| Time | Event | Evidence |
|------|-------|----------|
| 14:20 | Deploy started | GitHub Actions log |
| 14:25 | Deploy completed | ArgoCD |
| 14:30 | Error rate spike | Datadog alert |
| 14:32 | Database connections maxed | CloudWatch |
| 14:35 | Rollback initiated | kubectl history |
| 14:38 | Service recovered | Datadog metrics |
## Root Cause
**Immediate Cause:**
Deploy introduced N+1 query pattern in user endpoint.
**Contributing Factors:**
- Missing database index on users.created_at
- No query performance testing in CI
- Database connection pool too small for traffic spike
**Why It Wasn't Caught:**
- Staging has 10x less traffic than production
- Load testing doesn't cover this endpoint
- No alerting on query performance
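For the query-performance gap, a manual spot check is possible if the `pg_stat_statements` extension is enabled. This is a sketch; the column names vary by Postgres version (`mean_exec_time` on 13+, `mean_time` earlier).

```bash
# List the slowest queries by mean execution time (requires pg_stat_statements).
psql -c "SELECT calls, round(mean_exec_time::numeric, 1) AS mean_ms, left(query, 80) AS query
         FROM pg_stat_statements
         ORDER BY mean_exec_time DESC
         LIMIT 10" > slow-queries.txt
```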
## Declaring Resolution

Criteria (ALL must be met):

- Error rate back to baseline
- Fix deployed (or rollback complete) and verified
- Stable under monitoring for at least 30 minutes
## Immediate (Within 1 hour)
- [ ] Post resolution update to #incidents
- [ ] Update status page to "operational"
- [ ] Thank responders
- [ ] Close PagerDuty incident
## Short-term (Within 24 hours)
- [ ] Create post-mortem ticket
- [ ] Schedule post-mortem meeting
- [ ] Extract action items
- [ ] Update runbook with learnings
## Long-term (Within 1 week)
- [ ] Complete action items from post-mortem
- [ ] Add monitoring/alerting to prevent recurrence
- [ ] Document in incident database
# Post-Mortem: API Outage - 2025-01-15
**Date:** 2025-01-15
**Duration:** 14:30 UTC - 14:45 UTC (15 minutes)
**Severity:** SEV-1
**Impact:** 75% of API requests failing
**Authors:** On-call engineer, Team lead
## Summary
On January 15th at 14:30 UTC, our API experienced a complete outage affecting
75% of requests. The incident lasted 15 minutes and was caused by a database
connection pool exhaustion triggered by an N+1 query in a recent deploy.
## Impact
**Customer Impact:**
- ~1,500 users unable to complete purchases
- Estimated revenue loss: $50,000
- 47 support tickets filed
**Internal Impact:**
- 3 engineers pulled from other work
- 15 minutes of complete outage
- Engineering manager paged
## Timeline (All times UTC)
**14:20** - Deploy #1234 merged and started deployment
**14:25** - Deploy completed, new code serving traffic
**14:30** - Alert fired: `api_error_rate_high`
**14:31** - On-call engineer acknowledged
**14:32** - Assessed as SEV-1, created incident channel
**14:33** - Identified database connection pool exhausted
**14:35** - Initiated rollback to previous version
**14:38** - Rollback complete, error rate dropping
**14:40** - Service stabilized, monitoring
**14:45** - Declared resolved
## Root Cause
The deploy introduced an N+1 query in the `/users/recent` endpoint. For each
user returned, the code made an additional database query to fetch their
profile picture URL. With 50 concurrent requests, this resulted in 50 × 20 =
1,000 database queries, exhausting the connection pool (configured for 100
connections).
**Code change:**
```diff
- user.profile_picture_url # Preloaded in query
+ user.get_profile_picture() # Additional query per user
```

The problem was compounded by `users.created_at` not being indexed, which made the base query slow.

## Action Items

| Action | Owner | Deadline | Priority |
|---|---|---|---|
| Add database index on users.created_at | Alice | 2025-01-16 | P0 |
| Increase connection pool to 200 | Bob | 2025-01-16 | P0 |
| Add query performance test to CI | Charlie | 2025-01-20 | P1 |
| Implement automatic rollback on error spike | Dave | 2025-01-30 | P1 |
| Create ORM query linter to detect N+1 | Eve | 2025-02-15 | P2 |
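A sketch of what the two P0 items might look like in practice; the index name, the pool-size variable, and the use of `CONCURRENTLY` are assumptions, not decisions recorded in the post-mortem:

```bash
# P0: add the missing index without locking writes (cannot run inside a transaction).
psql -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_created_at ON users (created_at)"

# P0: raise the connection pool limit, assuming it is driven by an env var.
kubectl set env deployment/api-server DB_POOL_SIZE=200
```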
## On-Call Playbook
```markdown
# On-Call Playbook
## Before Your On-Call Shift
**1 week before:**
- [ ] Review recent incidents
- [ ] Update on-call runbooks if needed
- [ ] Test PagerDuty notifications
**1 day before:**
- [ ] Verify laptop ready (charged, VPN working)
- [ ] Test access to all systems (see the access-check sketch after this playbook)
- [ ] Review current system status
- [ ] Check calendar for conflicting events
## During Your Shift
### When You Get Paged
**Within 1 minute:**
1. Acknowledge alert in PagerDuty
2. Check alert details for severity
3. Open relevant runbook
**Within 5 minutes:**
4. Assess severity (is it really SEV-1?)
5. Create incident channel if SEV-1/SEV-2
6. Post initial status update
### Escalation Decision Tree
1. Get paged.
2. Can I handle this alone?
   - **Yes:** work the issue.
     - Fixed? Close the incident.
     - Stuck and need help? Escalate.
   - **No:** escalate, loop in the team, and work it together until it is fixed, then close.
### Handoff Procedure
**End of shift checklist:**
- [ ] No active incidents
- [ ] Status doc updated
- [ ] Next on-call acknowledged handoff
- [ ] Brief next on-call on any ongoing issues
**Handoff template:**
Hey @next-oncall! Handing off on-call. Here's the status:

- Active Issues: None
- Watch Items:
- Recent Incidents:
- System Status:

Let me know if you have questions!
## After Your Shift
- [ ] Update runbooks with any new learnings
- [ ] Complete post-mortems for incidents
- [ ] File bug tickets for issues found
- [ ] Share feedback on alerting/runbooks
```
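To make the "test access to all systems" item in the pre-shift checklist concrete, here is a minimal access check; the cluster context, namespace, and health endpoint are assumptions based on the examples above.

```bash
# Verify cluster access, permissions, and that the production API answers.
kubectl config current-context
kubectl auth can-i get pods -n production
curl -sf https://api.example.com/health && echo "prod API reachable"
```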
## Best Practices

```
# Bad: Reactive chaos
EVERYTHING IS DOWN! RESTART ALL THE THINGS!

# Good: Calm assessment
Service is degraded. Let me check:
1. What's the actual impact?
2. When did it start?
3. What's the quickest safe mitigation?
```

```
# Bad: Silent fixing
*Fixes issue without telling anyone*
*Marks incident as resolved*

# Good: Regular updates
[14:30] Investigating API errors
[14:40] Root cause identified, deploying fix
[14:45] Fix deployed, monitoring
[15:00] Service stable, incident resolved
```

```
# Bad: Move on quickly
Fixed it! Moving on to next task.

# Good: Learn from incidents
- Document what happened
- Identify action items
- Prevent recurrence
- Share learnings with team
```