Generates blameless post-mortem templates for incidents. Helps teams conduct structured root cause analysis, document timelines, and create actionable prevention strategies.

Installation:

`/plugin marketplace add anton-abyzov/specweave`
`/plugin install sw-infra@specweave`

**Date of Incident**: YYYY-MM-DD
**Date of Post-Mortem**: YYYY-MM-DD
**Author**: [Name]
**Reviewers**: [Names]
**Severity**: SEV1 / SEV2 / SEV3
What Happened: [One-paragraph summary of incident]
Impact: [Brief impact summary - users, duration, business]
Root Cause: [Root cause in one sentence]
Resolution: [How it was fixed]
Example:
What Happened: On October 26, 2025, the application became unavailable for 30 minutes due to database connection pool exhaustion.
Impact: All users were unable to access the application from 14:00-14:30 UTC. Approximately 10,000 users affected.
Root Cause: Payment service had a connection leak (connections not properly closed in error handling path), which exhausted the database connection pool during high traffic.
Resolution: Application was restarted to release connections (immediate fix), and the connection leak was fixed in code (permanent fix).
| Time (UTC) | Event | Actor |
|---|---|---|
| 14:00 | Alert: "Database Connection Pool Exhausted" | Monitoring |
| 14:02 | On-call engineer paged | PagerDuty |
| 14:02 | Jane acknowledged alert | SRE (Jane) |
| 14:05 | Confirmed database connections at max (100/100) | SRE (Jane) |
| 14:08 | Checked application logs for connection usage | SRE (Jane) |
| 14:10 | Identified connection leak in payment service | SRE (Jane) |
| 14:12 | Decision: Restart payment service to free connections | SRE (Jane) |
| 14:15 | Payment service restarted | SRE (Jane) |
| 14:17 | Database connections dropped to 20/100 | SRE (Jane) |
| 14:20 | Health checks passing, traffic restored | SRE (Jane) |
| 14:25 | Monitoring for stability | SRE (Jane) |
| 14:30 | Incident declared resolved | SRE (Jane) |
| 15:00 | Developer identified code fix | Dev (Mike) |
| 16:00 | Code fix deployed to production | Dev (Mike) |
| 16:30 | Verified no recurrence after 1 hour | SRE (Jane) |
Total Duration: 30 minutes (outage) + 2.5 hours (full resolution)
Users Affected:
Services Affected:
Business Impact:
1. Why did the application become unavailable? → Database connection pool was exhausted (100/100 connections in use)
2. Why was the connection pool exhausted? → Payment service had a connection leak (connections not being released)
3. Why were connections not being released? → Error handling path in payment service was missing `conn.close()` in a `finally` block
4. Why was the error path missing `conn.close()`? → Developer oversight during code review
5. Why didn't code review catch this? → No automated test or linter to check connection cleanup
Root Cause: Connection leak in payment service error handling path, compounded by lack of automated testing for connection cleanup.
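To make the root cause concrete, here is a minimal sketch of the leaky error-handling path and the `finally`-block fix described above. The service language, the `psycopg2` driver, the DSN, and the function names are assumptions for illustration, not the payment service's actual code.

```python
import psycopg2

DSN = "dbname=payments"  # assumed connection string, for illustration only

# Before (leaky): if execute() or commit() raises, the function exits without
# closing the connection, so it stays checked out until the pool is exhausted.
def record_payment_leaky(payment_id):
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("UPDATE payments SET status = 'captured' WHERE id = %s", (payment_id,))
    conn.commit()
    conn.close()  # never reached on the error path

# After (fixed): the finally block releases the connection on both the
# success path and the error path.
def record_payment_fixed(payment_id):
    conn = psycopg2.connect(DSN)
    try:
        cur = conn.cursor()
        cur.execute("UPDATE payments SET status = 'captured' WHERE id = %s", (payment_id,))
        conn.commit()
    finally:
        conn.close()
```

A context manager (for example `contextlib.closing(conn)`) gives the same guarantee with less boilerplate.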
Technical Factors:
Process Factors:
Human Factors:
How Detected: Automated monitoring alert
Alert: "Database Connection Pool Exhausted"
Alert condition: `SELECT count(*) FROM pg_stat_activity` >= 100
Detection Quality:
Response Timeline:
What Worked Well:
What Could Be Improved:
Immediate (Restore service):
Restarted payment service to release connections
`systemctl restart payment-service`
Monitored connection pool for 30 minutes
Short-term (Prevent immediate recurrence):
Fixed connection leak in payment service code (added a `finally` block with `conn.close()`)
Increased connection pool size (`max_connections` from 100 to 200)
Added connection pool monitoring alert
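As a sketch of what the added connection pool monitoring alert could check, the snippet below runs the same `pg_stat_activity` query used during detection and flags usage above a threshold. The threshold, DSN, and alerting hook are assumptions; the actual alert would live in the team's monitoring system.

```python
import psycopg2

DSN = "dbname=payments"   # assumed connection string
ALERT_THRESHOLD = 160     # assumed: 80% of the new max_connections of 200

def connection_pool_usage():
    """Return the number of connections currently reported by Postgres."""
    conn = psycopg2.connect(DSN)
    try:
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM pg_stat_activity")
        return cur.fetchone()[0]
    finally:
        conn.close()

def check_connection_pool():
    used = connection_pool_usage()
    if used >= ALERT_THRESHOLD:
        # Wire this into the real alerting path (PagerDuty, Slack, etc.).
        print(f"ALERT: database connections at {used} (threshold {ALERT_THRESHOLD})")

if __name__ == "__main__":
    check_connection_pool()
```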
Action Items (with owners and deadlines):
| # | Action | Priority | Owner | Due Date | Status |
|---|---|---|---|---|---|
| 1 | Add automated test for connection cleanup (see sketch below) | P1 | Lisa (QA) | 2025-10-27 | ✅ Done |
| 2 | Add linter rule to check connection cleanup | P1 | Mike (Dev) | 2025-10-27 | ✅ Done |
| 3 | Add connection timeout (30s) | P2 | Tom (DBA) | 2025-10-28 | ⏳ In Progress |
| 4 | Review all DB queries for connection leaks | P2 | Mike (Dev) | 2025-11-02 | 📅 Planned |
| 5 | Load test before high-traffic events | P3 | John (DevOps) | 2025-11-10 | 📅 Planned |
| 6 | Create runbook: Connection Pool Issues | P3 | Jane (SRE) | 2025-10-28 | ✅ Done |
| 7 | Add circuit breaker to prevent cascades | P3 | Mike (Dev) | 2025-11-15 | 📅 Planned |
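For action item 1, a sketch of an automated connection-cleanup test: it drives the error path and asserts that the number of open connections returns to the baseline. The imports refer to the illustrative helpers sketched earlier in this document and are assumptions, not the team's real modules.

```python
import psycopg2
import pytest

# Illustrative helpers sketched earlier in this post-mortem, not real modules.
from payment_service import record_payment_fixed
from monitoring import connection_pool_usage

def test_connection_released_on_error_path():
    baseline = connection_pool_usage()

    # Passing a value psycopg2 cannot adapt makes execute() raise after the
    # connection is already open -- the same path that leaked during the incident.
    with pytest.raises(psycopg2.Error):
        record_payment_fixed(payment_id=object())

    # The connection must be released even though the call failed.
    assert connection_pool_usage() <= baseline
```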
Monitoring was effective
Response was fast
Communication was clear
Team collaboration
Connection leak in production
No early warning
Capacity planning gap
No runbook
No circuit breaker
YES - This incident was preventable.
How it could have been prevented:
Error handling: release connections in a `finally` block
Automated Testing: linter rule `require-connection-cleanup`
Monitoring & Alerting
Capacity Planning
Resilience Patterns
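One resilience pattern flagged in the action items (item 7) is a circuit breaker, so that a failing dependency fails fast instead of tying up more pooled connections. A minimal sketch of the idea; the thresholds, timing, and class name are assumptions, not the team's implementation.

```python
import time

class CircuitBreaker:
    """After repeated failures, reject calls immediately for a cool-down period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, func, *args, **kwargs):
        # While open, fail fast instead of occupying another DB connection.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None   # half-open: allow one trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapped around the payment service's database calls, a breaker like this would have started failing fast once errors piled up, instead of letting every retry hold another connection from the shared pool.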
Code Review
Runbooks
Training
Capacity Planning
Blameless Culture
Psychological Safety
Continuous Learning
2025-09-15: Database connection pool exhausted (similar issue)
2025-08-10: Payment service OOM crash
Availability:
MTTR (Mean Time To Resolution):
Thanks to:
This post-mortem has been reviewed and approved:
Next Review: [Date] - Check action item progress
Remember: Incidents are learning opportunities. The goal is not to find fault, but to improve our systems and processes.