Plan incident management processes, postmortems, and blameless culture
Plans incident management processes, postmortems, and blameless culture workflows. Use when creating incident response procedures, defining severity levels, or conducting blameless postmortems after incidents.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install observability-planning@melodic-software
Use this skill when:
- Creating incident response procedures
- Defining severity levels
- Conducting blameless postmortems after incidents

Before planning incident response, consult the docs-management skill for incident management patterns.

INCIDENT LIFECYCLE:
┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DETECTION │
│ ├── Alert fires │
│ ├── Customer report │
│ └── Internal discovery │
│ │ │
│ ▼ │
│ 2. TRIAGE │
│ ├── Assess severity │
│ ├── Assign incident commander │
│ └── Open incident channel │
│ │ │
│ ▼ │
│ 3. RESPONSE │
│ ├── Diagnose root cause │
│ ├── Implement fix │
│ ├── Communicate status │
│ └── Coordinate resources │
│ │ │
│ ▼ │
│ 4. RESOLUTION │
│ ├── Verify fix │
│ ├── Close incident │
│ └── Initial timeline │
│ │ │
│ ▼ │
│ 5. POSTMORTEM │
│ ├── Blameless analysis │
│ ├── Identify contributing factors │
│ ├── Define action items │
│ └── Share learnings │
│ │
└─────────────────────────────────────────────────────────────────┘
SEVERITY DEFINITIONS:
┌─────────┬────────────────────────────────────────────────────────┐
│ SEV1 │ CRITICAL - Major user/business impact │
│ │ │
│ │ Examples: │
│ │ - Complete service outage │
│ │ - Data breach/security incident │
│ │ - Revenue-impacting failure │
│ │ - SLA breach imminent │
│ │ │
│ │ Response: │
│ │ - Page immediately │
│ │ - All hands on deck │
│ │ - Exec communication │
│ │ - Status page update │
├─────────┼────────────────────────────────────────────────────────┤
│ SEV2 │ HIGH - Significant impact, workaround possible │
│ │ │
│ │ Examples: │
│ │ - Partial outage │
│ │ - Major feature unavailable │
│ │ - Performance severely degraded │
│ │ │
│ │ Response: │
│ │ - Page on-call │
│ │ - Incident commander assigned │
│ │ - Customer communication │
├─────────┼────────────────────────────────────────────────────────┤
│ SEV3 │ MEDIUM - Limited impact, non-critical │
│ │ │
│ │ Examples: │
│ │ - Minor feature broken │
│ │ - Small subset of users affected │
│ │ - Non-urgent degradation │
│ │ │
│ │ Response: │
│ │ - Business hours response │
│ │ - Track in ticket │
├─────────┼────────────────────────────────────────────────────────┤
│ SEV4 │ LOW - Minimal impact, cosmetic issues │
│ │ │
│ │ Examples: │
│ │ - UI glitch │
│ │ - Non-critical bug │
│ │ │
│ │ Response: │
│ │ - Normal ticket workflow │
└─────────┴────────────────────────────────────────────────────────┘
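The severity definitions above can be encoded as a small triage helper. This is a hypothetical sketch: the function name, parameters, and thresholds are assumptions chosen to mirror the SEV1–SEV4 table, not an official mapping.

```python
def classify_severity(full_outage: bool, users_affected_pct: float,
                      revenue_impacting: bool, workaround_exists: bool) -> str:
    """Map coarse impact signals to a SEV1-SEV4 label.

    Thresholds are illustrative assumptions; tune them to your own
    severity definitions.
    """
    # SEV1: complete outage or revenue-impacting failure
    if full_outage or revenue_impacting:
        return "SEV1"
    # Broad impact with no workaround escalates to SEV1
    if users_affected_pct >= 25 and not workaround_exists:
        return "SEV1"
    # SEV2: significant impact, workaround possible
    if users_affected_pct >= 25:
        return "SEV2"
    # SEV3: limited impact, small subset of users
    if users_affected_pct >= 1:
        return "SEV3"
    # SEV4: minimal/cosmetic impact
    return "SEV4"
```

A helper like this is most useful as a starting point for the triage conversation, not a replacement for human judgment: the incident commander can always upgrade or downgrade severity as new information arrives.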
INCIDENT ROLES:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT COMMANDER (IC) │
│ ├── Owns the incident │
│ ├── Coordinates response │
│ ├── Makes decisions │
│ ├── Delegates tasks │
│ └── Keeps everyone informed │
│ │
│ TECHNICAL LEAD │
│ ├── Leads technical investigation │
│ ├── Diagnoses root cause │
│ ├── Implements fixes │
│ └── Advises IC on technical matters │
│ │
│ COMMUNICATIONS LEAD │
│ ├── Manages external communication │
│ ├── Updates status page │
│ ├── Drafts customer notifications │
│ └── Handles stakeholder updates │
│ │
│ SCRIBE │
│ ├── Documents timeline │
│ ├── Records decisions │
│ ├── Captures actions taken │
│ └── Prepares postmortem draft │
│ │
│ SUBJECT MATTER EXPERTS (SMEs) │
│ ├── Provide domain expertise │
│ ├── Execute specific tasks │
│ └── Advise on their area │
│ │
└─────────────────────────────────────────────────────────────────┘
# Incident Response Checklist
## Detection (T+0)
- [ ] Alert acknowledged
- [ ] Initial assessment of severity
- [ ] Incident channel created (#incident-YYYY-MM-DD-{name})
## Triage (T+5 min)
- [ ] Severity assigned (SEV1/2/3/4)
- [ ] Incident Commander identified
- [ ] Technical Lead identified
- [ ] Initial status posted to channel
## Response (T+10 min)
### For SEV1/SEV2:
- [ ] Communications Lead assigned
- [ ] Status page updated (Investigating)
- [ ] Stakeholders notified (Slack #incidents)
- [ ] Customer-facing communication drafted (if needed)
### Technical Response:
- [ ] Scope identified (which services/users affected)
- [ ] Root cause hypothesis formed
- [ ] Fix being implemented or workaround in place
- [ ] Monitoring for improvement
## Communication Cadence
| Severity | Internal Update | External Update |
|----------|-----------------|-----------------|
| SEV1 | Every 15 min | Every 30 min |
| SEV2 | Every 30 min | Every hour |
| SEV3 | Every hour | As needed |
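The cadence table can back a simple reminder check, e.g. in a bot that nags the Communications Lead. A sketch under the assumption that `None` means "as needed" (no scheduled reminder):

```python
# Update intervals in minutes, mirroring the cadence table above.
# None means "as needed" (no scheduled reminder).
CADENCE = {
    "SEV1": {"internal": 15, "external": 30},
    "SEV2": {"internal": 30, "external": 60},
    "SEV3": {"internal": 60, "external": None},
}

def update_overdue(severity: str, audience: str, minutes_since_last: int) -> bool:
    """Return True if the next update for this audience is due."""
    interval = CADENCE[severity][audience]
    return interval is not None and minutes_since_last >= interval
```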
## Resolution
- [ ] Fix deployed and verified
- [ ] Metrics returning to normal
- [ ] Status page updated (Resolved)
- [ ] Customer notification sent (if applicable)
- [ ] Incident channel archived
## Follow-up
- [ ] Postmortem scheduled (within 48h for SEV1/2)
- [ ] Timeline documented
- [ ] Action items created
- [ ] Postmortem shared
BLAMELESS CULTURE PRINCIPLES:
┌─────────────────────────────────────────────────────────────────┐
│ BLAMELESS POSTMORTEMS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CORE BELIEF: │
│ People don't come to work to do a bad job. When incidents │
│ happen, the system failed, not the person. │
│ │
│ FOCUS ON: │
│ ✓ What happened (facts, not fault) │
│ ✓ Why the system allowed it (systemic issues) │
│ ✓ How to prevent recurrence (improvements) │
│ ✓ What we learned (knowledge sharing) │
│ │
│ AVOID: │
│ ✗ "Who made the mistake?" │
│ ✗ "Why didn't they check?" │
│ ✗ "They should have known better" │
│ ✗ Assigning personal blame │
│ │
│ REFRAME TO: │
│ "Why did the system allow this?" │
│ "What safeguards were missing?" │
│ "How can we make this impossible to happen again?" │
│ │
└─────────────────────────────────────────────────────────────────┘
BLAMELESS LANGUAGE:
Instead of... Say...
─────────────────────────────────────────────────────────────────
"John broke production" "A configuration change caused..."
"She didn't test it" "Testing didn't catch this because..."
"They should have known" "The system didn't make this obvious"
"Human error" "The process allowed this to happen"
# Postmortem: {Incident Title}
**Date:** {YYYY-MM-DD}
**Severity:** {SEV1/2/3/4}
**Duration:** {X hours Y minutes}
**Authors:** {Names}
**Status:** {Draft/Reviewed/Published}
---
## Executive Summary
{2-3 sentence summary of what happened, impact, and key learnings}
---
## Impact
| Metric | Value |
|--------|-------|
| Duration | {X hours Y minutes} |
| Users affected | {Number or percentage} |
| Revenue impact | {$X or "Minimal"} |
| SLO impact | {Error budget consumed} |
| Support tickets | {Number} |
---
## Timeline
All times in UTC.
| Time | Event |
|------|-------|
| 14:00 | Deployment of orders-api v2.3.1 started |
| 14:05 | Deployment completed |
| 14:12 | Error rate alert fired |
| 14:15 | On-call acknowledged, began investigation |
| 14:22 | Root cause identified (database migration issue) |
| 14:25 | Rollback initiated |
| 14:32 | Rollback completed, service recovering |
| 14:45 | Error rate returned to normal |
| 14:50 | Incident resolved |
---
## Root Cause Analysis
### What happened
{Detailed technical explanation of what failed and why}
### Contributing factors
1. **{Factor 1}**
- {Explanation}
- {Why it contributed}
2. **{Factor 2}**
- {Explanation}
- {Why it contributed}
3. **{Factor 3}**
- {Explanation}
- {Why it contributed}
### 5 Whys Analysis
1. **Why did users see errors?**
- Because the API was returning 500 errors
2. **Why was the API returning 500 errors?**
- Because database queries were failing
3. **Why were database queries failing?**
- Because the migration added a NOT NULL column without a default
4. **Why didn't we catch this before production?**
- Because staging didn't have representative data
5. **Why didn't staging have representative data?**
- Because we don't have a data anonymization pipeline
---
## What Went Well
- {Positive thing 1}
- {Positive thing 2}
- {Positive thing 3}
## What Could Have Gone Better
- {Improvement area 1}
- {Improvement area 2}
- {Improvement area 3}
---
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | Add database migration validation to CI | @engineer | P1 | 2024-01-20 | Open |
| 2 | Create staging data pipeline | @data-team | P2 | 2024-02-01 | Open |
| 3 | Add rollback automation | @platform | P2 | 2024-01-25 | Open |
| 4 | Update deployment runbook | @oncall | P3 | 2024-01-22 | Open |
---
## Lessons Learned
### Technical
- {Technical lesson 1}
- {Technical lesson 2}
### Process
- {Process lesson 1}
- {Process lesson 2}
### Communication
- {Communication lesson 1}
---
## Appendix
### Related Links
- [Incident Slack Channel](#)
- [Deployment Dashboard](#)
- [Error Logs Query](#)
### Supporting Data
{Graphs, screenshots, log snippets}
---
## Sign-off
| Role | Name | Date |
|------|------|------|
| Author | {Name} | {Date} |
| Reviewer | {Name} | {Date} |
| Approved | {Name} | {Date} |
INCIDENT METRICS TO TRACK:
MTTR (Mean Time To Recovery):
┌─────────────────────────────────────────────────────────────────┐
│ Time from incident start to resolution │
│ │
│ Breakdown: │
│ - Time to detect (TTD) │
│ - Time to acknowledge (TTA) │
│ - Time to diagnose (TTDiag) │
│ - Time to fix (TTF) │
│ │
│ MTTR = TTD + TTA + TTDiag + TTF │
└─────────────────────────────────────────────────────────────────┘
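The MTTR breakdown above falls out directly from the incident timestamps. A minimal sketch computing each phase in minutes (function and field names are assumptions):

```python
from datetime import datetime

def mttr_breakdown(started: datetime, detected: datetime,
                   acknowledged: datetime, diagnosed: datetime,
                   resolved: datetime) -> dict:
    """Split recovery time into TTD + TTA + TTDiag + TTF (minutes).

    MTTR for a single incident is the sum of the four phases; averaging
    across incidents gives the mean.
    """
    minutes = lambda a, b: (b - a).total_seconds() / 60
    phases = {
        "ttd": minutes(started, detected),        # time to detect
        "tta": minutes(detected, acknowledged),   # time to acknowledge
        "ttdiag": minutes(acknowledged, diagnosed),  # time to diagnose
        "ttf": minutes(diagnosed, resolved),      # time to fix
    }
    phases["mttr"] = sum(phases.values())
    return phases
```

Tracking the phases separately matters because they have different remedies: a long TTD points at alerting gaps, a long TTA at paging/on-call issues, and a long TTF at rollback and deployment tooling.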
Other Key Metrics:
┌─────────────────────────────────────────────────────────────────┐
│ Incident frequency │ Incidents per week/month │
│ Severity distribution │ % SEV1 vs SEV2 vs SEV3 │
│ Time in incident │ Engineer hours spent on incidents │
│ Repeat incidents │ Same root cause recurring │
│ Action item completion │ % of postmortem items completed │
│ Customer impact │ Users affected, revenue lost │
└─────────────────────────────────────────────────────────────────┘
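Two of the metrics above, severity distribution and repeat incidents, are easy to compute from an incident log. A sketch assuming each incident record carries `severity` and `root_cause` fields (the schema is an assumption):

```python
from collections import Counter

def incident_summary(incidents: list[dict]) -> dict:
    """Summarize severity distribution and recurring root causes.

    Assumes each incident dict has 'severity' and 'root_cause' keys.
    """
    by_severity = Counter(i["severity"] for i in incidents)
    causes = Counter(i["root_cause"] for i in incidents)
    # A root cause seen more than once is a repeat incident signal
    repeats = {cause: n for cause, n in causes.items() if n > 1}
    return {"by_severity": dict(by_severity), "repeat_root_causes": repeats}
```

Repeat root causes are the strongest signal that postmortem action items are not being completed, so this pairs naturally with tracking action item completion rate.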
PROACTIVE INCIDENT PREVENTION:
PRE-PRODUCTION:
┌─────────────────────────────────────────────────────────────────┐
│ - Code review with reliability focus │
│ - Automated testing (unit, integration, E2E) │
│ - Chaos engineering in staging │
│ - Load testing before major releases │
│ - Feature flags for gradual rollout │
│ - Deployment checklists │
└─────────────────────────────────────────────────────────────────┘
PRODUCTION SAFEGUARDS:
┌─────────────────────────────────────────────────────────────────┐
│ - Canary deployments │
│ - Progressive rollouts │
│ - Automated rollback on error spike │
│ - Circuit breakers │
│ - Rate limiting │
│ - Redundancy and failover │
└─────────────────────────────────────────────────────────────────┘
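"Automated rollback on error spike" usually reduces to comparing the post-deploy error rate against a pre-deploy baseline. A minimal sketch of the decision logic; the thresholds and the absolute floor are assumptions, and a real system would also require a minimum sample size before deciding:

```python
def should_rollback(baseline_error_rate: float,
                    current_error_rate: float,
                    multiplier: float = 3.0,
                    floor: float = 0.01) -> bool:
    """Decide whether to trigger an automated rollback.

    Rolls back only when the error rate exceeds BOTH an absolute floor
    (to ignore noise when the baseline is near zero) and a multiple of
    the pre-deploy baseline.
    """
    return (current_error_rate >= floor
            and current_error_rate >= baseline_error_rate * multiplier)
```

The absolute floor matters: with a near-zero baseline, a pure ratio check would roll back on a single stray error.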
DETECTION:
┌─────────────────────────────────────────────────────────────────┐
│ - SLO-based alerting │
│ - Anomaly detection │
│ - Synthetic monitoring │
│ - Real user monitoring (RUM) │
│ - Error tracking │
└─────────────────────────────────────────────────────────────────┘
LEARNING:
┌─────────────────────────────────────────────────────────────────┐
│ - Blameless postmortems │
│ - Action item follow-through │
│ - Incident pattern analysis │
│ - GameDays and chaos experiments │
│ - Cross-team incident reviews │
└─────────────────────────────────────────────────────────────────┘
Last Updated: 2025-12-26