Postmortem
When NOT to use: Incident still ongoing (focus on resolution first), looking to assign blame (antithesis of blameless culture), or issue is trivial with no learning value.
Workflow
Copy this checklist and track your progress:
Postmortem Progress:
- [ ] Step 1: Assemble timeline and quantify impact
- [ ] Step 2: Conduct root cause analysis
- [ ] Step 3: Define corrective and preventive actions
- [ ] Step 4: Document and share postmortem
- [ ] Step 5: Track action items to completion
Step 1: Assemble timeline and quantify impact
Gather facts: when detected, when started, key events, when resolved. Quantify impact: users affected, duration, revenue/SLA impact, customer complaints. For straightforward incidents use resources/template.md. For complex incidents with multiple causes or cascading failures, study resources/methodology.md for advanced timeline reconstruction techniques.
Step 2: Conduct root cause analysis
Ask "Why?" 5 times to get from symptom to root cause, or use fishbone diagram for complex incidents with multiple contributing factors. See Root Cause Analysis Techniques for guidance. Focus on system failures (process gaps, missing safeguards) not human errors.
Step 3: Define corrective and preventive actions
For each root cause, identify actions to prevent recurrence. Must be specific (not "improve testing"), owned (named person), and time-bound (deadline). Categorize as immediate fixes vs. long-term improvements. See Corrective Actions for framework.
Step 4: Document and share postmortem
Create postmortem document using template. Include timeline, impact, root cause, actions, what went well. Share widely (engineering, product, leadership) to enable learning. Present in team meeting for discussion. Archive in knowledge base.
Step 5: Track action items to completion
Assign owners, set deadlines, add to project tracker. Review progress in standups or weekly meetings. Close postmortem only when all actions complete. Self-assess quality using resources/evaluators/rubric_postmortem.json. Minimum standard: ≥3.5 average score.
Common Patterns
By Incident Type
Production Outages (system failures, downtime):
- Timeline: Detection → Investigation → Mitigation → Resolution
- Impact: Users affected, duration, SLA breach, revenue loss
- Root cause: Often config errors, deployment issues, infrastructure limits
- Actions: Improve monitoring, runbooks, rollback procedures, capacity planning
Security Incidents (breaches, vulnerabilities):
- Timeline: Breach occurrence → Detection (often delayed) → Containment → Remediation
- Impact: Data exposed, compliance risk, reputation damage
- Root cause: Missing security controls, access management gaps, unpatched vulnerabilities
- Actions: Security audits, access reviews, patch management, training
Product/Project Failures (launches, deadlines):
- Timeline: Planning → Execution → Launch/Deadline → Outcome vs. Expectations
- Impact: Revenue miss, user churn, wasted effort, opportunity cost
- Root cause: Poor requirements, unrealistic estimates, misalignment, inadequate testing
- Actions: Improve discovery, estimation, stakeholder alignment, validation processes
Process Failures (operational, procedural):
- Timeline: Process initiation → Breakdown point → Impact realization
- Impact: Delays, quality issues, rework, team frustration
- Root cause: Unclear process, missing steps, handoff failures, tooling gaps
- Actions: Document processes, automate workflows, improve communication, training
By Root Cause Category
Human Error (surface cause, dig deeper):
- Don't stop at "person made mistake"
- Ask: Why was mistake possible? Why not caught? Why no safeguard?
- Actions: Reduce error likelihood (checklists, automation), increase error detection (testing, reviews), mitigate error impact (rollback, redundancy)
Process Gap (missing or unclear procedures):
- Symptoms: "Didn't know to do X", "Not in runbook", "First time"
- Actions: Document process, create checklist, formalize approval gates, onboarding
Technical Debt (deferred maintenance):
- Symptoms: "Known issue", "Fragile system", "Workaround failed"
- Actions: Prioritize tech debt, allocate 20% capacity, refactor, replace legacy systems
External Dependencies (third-party failures):
- Symptoms: "Vendor down", "API failed", "Partner issue"
- Actions: Add redundancy, circuit breakers, graceful degradation, SLA monitoring, vendor diversification
Systemic Issues (organizational, cultural):
- Symptoms: "Always rushed", "No time to test", "Pressure to ship"
- Actions: Address root organizational issues (unrealistic deadlines, resource constraints, incentive misalignment)
Root Cause Analysis Techniques
5 Whys:
- Start with problem statement
- Ask "Why did this happen?" → Answer
- Ask "Why did that happen?" → Answer
- Repeat 5 times (or until root cause found)
- Root cause: Fixable at organizational/system level
Example: Database outage → Why? Bad config → Why? Wrong value → Why? Template error → Why? New team member unfamiliar → Why? No config review in onboarding
Fishbone Diagram (Ishikawa):
- Categories: People, Process, Technology, Environment
- Brainstorm causes in each category
- Identify most likely root causes for investigation
- Useful for complex incidents with multiple contributing factors
Fault Tree Analysis:
- Top: Failure event (e.g., "System down")
- Gates: AND (all required) vs OR (any sufficient)
- Leaves: Base causes (e.g., "Config error" OR "Network failure")
- Trace path from failure to root causes
Corrective Actions Framework
Types of Actions:
- Immediate Fixes: Deployed within days (hotfix, manual process, workaround)
- Short-term Improvements: Completed within weeks (better monitoring, updated runbook, process change)
- Long-term Investments: Completed within months (architecture changes, new systems, cultural shifts)
SMART Actions:
- Specific: "Add config validation" not "Improve deploys"
- Measurable: "Reduce MTTR from 2hr to 30min" not "Faster response"
- Assignable: Named owner, not "team"
- Realistic: Given capacity and constraints
- Time-bound: Explicit deadline
Prioritization:
- High impact, low effort: Do immediately
- High impact, high effort: Schedule as strategic project
- Low impact, low effort: Do if spare capacity
- Low impact, high effort: Consider skipping (cost > benefit)
Prevention Hierarchy (from most to least effective):
- Eliminate: Remove hazard entirely (e.g., deprecate risky feature)
- Substitute: Replace with safer alternative (e.g., use managed service vs self-host)
- Engineering controls: Add safeguards (e.g., rate limits, circuit breakers, automated testing)
- Administrative controls: Improve processes (e.g., runbooks, checklists, reviews)
- Training: Educate people (least effective alone, combine with others)
Guardrails
Blameless Culture:
- ❌ "Engineer caused outage by deploying bad config" → ✓ "Deployment pipeline allowed bad config to reach production"
- ❌ "PM didn't validate requirements" → ✓ "Requirements validation process missing"
- ❌ "Designer made mistake" → ✓ "Design review process didn't catch issue"
- Focus: What system/process failed? Not who made error.
Root Cause Depth:
- ❌ Stopping at surface: "Bug caused outage" → ✓ Deep analysis: "Bug deployed because testing gap, no staging env, rushed release pressure"
- ❌ Single cause: "Database failure" → ✓ Multiple causes: "Database + no failover + alerting delay + unclear runbook"
- Rule: Keep asking "Why?" until you reach actionable systemic improvements
Actionability:
- ❌ Vague: "Improve testing", "Better communication", "More careful" → ✓ Specific: "Add E2E test suite covering top 10 user flows by Apr 1 (Owner: Alex)"
- ❌ No owner: "Team should document" → ✓ Owned: "Sam documents incident response runbook by Mar 15"
- ❌ No deadline: "Eventually migrate" → ✓ Time-bound: "Complete migration by Q2 end"
Impact Quantification:
- ❌ Qualitative: "Many users affected", "Significant downtime" → ✓ Quantitative: "50K users (20% of base), 2-hour outage, $20K revenue loss"
- ❌ No metrics: "Bad customer experience" → ✓ Metrics: "NPS dropped from 50 to 30, 100 support tickets, 5 churned customers ($50K ARR)"
Timeliness:
- ❌ Wait 2 weeks → Memory fades, urgency lost → ✓ Conduct within 48 hours while fresh
- ❌ Never follow up → Actions forgotten → ✓ Track actions, review weekly, close when complete
Quick Reference
Resources:
Success Criteria:
- ✓ Timeline clear with timestamps and key events
- ✓ Impact quantified (users, duration, revenue, metrics)
- ✓ Root cause identified (systemic, not individual blame)
- ✓ Corrective actions SMART (specific, measurable, assigned, realistic, time-bound)
- ✓ Blameless tone (focus on systems/processes)
- ✓ Documented and shared within 48 hours
- ✓ Action items tracked to completion
Common Mistakes:
- ❌ Blame individuals → culture of fear, hide future issues
- ❌ Superficial root cause → doesn't prevent recurrence
- ❌ Vague actions → nothing actually improves
- ❌ No follow-through → actions never completed, same incident repeats
- ❌ Delayed postmortem → details forgotten, less useful
- ❌ Not sharing → no organizational learning
- ❌ Defensive tone → misses opportunity to improve