From grimoire
Documents incidents, outages, or production failures with blameless post-mortems. Includes timeline, root cause analysis, and action items.
Slash command
/grimoire:write-post-mortem
Write a blameless post-mortem that captures what happened, why it happened, and what will prevent recurrence — without blaming individuals.
Adopted by: Amazon Web Services (internal COE process), Google SRE teams, Etsy (pioneered blameless post-mortems in 2012), PagerDuty, Atlassian, Netflix.
Impact: Google's SRE book reports that teams with structured post-mortems increase mean time between incidents (MTBI) by 20–40%. Etsy's blameless culture is credited with enabling 50+ deploys per day without an increased incident rate. A 2023 Puppet State of DevOps report found that high-performing teams are 2.4× more likely to conduct blameless post-mortems.
Why best: Blameless post-mortems surface systemic failures rather than hiding them behind individual blame. When engineers fear punishment, they under-report near-misses and route around broken systems. The blameless model assumes engineers are competent and acted rationally given their information at the time — the fault lies in the system, not the person.
Sources: Google SRE Book (Chapter 15); John Allspaw, "Blameless Post-Mortems and a Just Culture" (Etsy Engineering Blog, 2012); Puppet State of DevOps Report 2023.
Open a draft within 24–48 hours of resolution while details are fresh. Assign a single author; others contribute via comments.
Write the incident summary (3–5 sentences): what failed, what the user-visible impact was, when it started and ended, and the severity level (P0/P1/P2 or equivalent).
Build the timeline in chronological order with UTC timestamps. Include: first alert fired, who was paged, each diagnostic action taken, each mitigation attempted, resolution time, and all-clear time. Be factual — no editorializing.
State the root cause in one sentence using the "five whys" technique: ask "why did this happen?" iteratively until you reach a systemic cause, not a human action. Example: not "an engineer deleted the table" but "a migration script had no dry-run mode and no confirmation prompt in production."
List contributing factors — conditions that allowed the root cause to manifest. Examples: missing monitoring, inadequate runbooks, insufficient test coverage, unclear ownership, alert fatigue.
Write action items. Each must be specific (not "improve monitoring"), be assigned to a named owner, and have a due date. Categorize each as preventive (stops recurrence), detective (catches it sooner), or corrective (reduces blast radius). Aim for 3–7 actionable items, not 20 aspirational ones.
State what went well — tools that worked, responders who acted effectively, communication that helped. This reinforces good practices and is not sycophancy.
Publish to a shared, searchable incident log (Confluence, Notion, internal wiki). Notify stakeholders. Schedule a 30-minute review meeting if the incident was P0/P1.
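The steps above can be sketched as a small data model. This is a minimal Python illustration, not part of the skill itself: the class names (`PostMortem`, `TimelineEntry`, `ActionItem`), the field names, and the sample incident are all assumptions made for the example. It shows two mechanical checks the steps imply: ordering the timeline by UTC time, and flagging action items that lack an owner or due date or fall outside the 3–7 range. Whether an item is specific enough still needs human judgment.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional


class Category(Enum):
    PREVENTIVE = "preventive"  # stops recurrence
    DETECTIVE = "detective"    # catches it sooner
    CORRECTIVE = "corrective"  # reduces blast radius


@dataclass
class TimelineEntry:
    at: datetime  # must be timezone-aware; rendered in UTC
    event: str


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]
    category: Category


@dataclass
class PostMortem:
    summary: str
    severity: str
    root_cause: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def sorted_timeline(self) -> List[str]:
        """Return timeline lines in chronological order with UTC timestamps."""
        for entry in self.timeline:
            if entry.at.tzinfo is None:
                raise ValueError(f"naive timestamp for event: {entry.event}")
        ordered = sorted(self.timeline, key=lambda e: e.at)
        return [
            f"{e.at.astimezone(timezone.utc):%Y-%m-%d %H:%M} UTC  {e.event}"
            for e in ordered
        ]

    def problems(self) -> List[str]:
        """List mechanical violations of the action-item rules."""
        found = []
        count = len(self.action_items)
        if not 3 <= count <= 7:
            found.append(f"expected 3-7 action items, found {count}")
        for item in self.action_items:
            if not item.owner.strip():
                found.append(f"'{item.description}' has no owner")
            if item.due is None:
                found.append(f"'{item.description}' has no due date")
        return found


# Hypothetical incident data, entered out of order and in mixed time zones.
pm = PostMortem(
    summary="Checkout requests hung for 15 minutes.",
    severity="P1",
    root_cause="No automated validation checked for missing timeout settings.",
    timeline=[
        TimelineEntry(
            datetime(2024, 1, 10, 16, 20, tzinfo=timezone(timedelta(hours=2))),
            "Rollback started",
        ),
        TimelineEntry(
            datetime(2024, 1, 10, 14, 5, tzinfo=timezone.utc),
            "First alert fired",
        ),
    ],
    action_items=[ActionItem("Improve monitoring", "", None, Category.DETECTIVE)],
)

for line in pm.sorted_timeline():
    print(line)
for problem in pm.problems():
    print("PROBLEM:", problem)
```

Storing timezone-aware timestamps and converting only when rendering keeps entries from responders in different time zones comparable.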
Root cause (bad): "Engineer forgot to set the timeout flag."
Root cause (good): "The deployment checklist did not include a timeout configuration step, and no automated validation checked for missing timeout settings before deployment to production."
Action item (bad): "Be more careful with production deployments."
Action item (good): "Add timeout validation to the pre-deploy CI check. Owner: @platform-team. Due: 2024-02-15."
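The difference between the bad and good action items can be partly checked by a script. The sketch below is a rough heuristic in Python, assuming action items are written as free text with `Owner:` and `Due: YYYY-MM-DD` markers as in the good example. The function name `lint_action_item` and the list of vague phrases are assumptions for illustration, and a clean result does not prove an item is specific enough.

```python
import re
from datetime import date
from typing import List

# Assumed text conventions, modeled on the good example above.
OWNER_RE = re.compile(r"Owner:\s*(@?[\w-]+)")
DUE_RE = re.compile(r"Due:\s*(\d{4}-\d{2}-\d{2})")

# Phrases taken from the bad examples in this document; extend as needed.
VAGUE_PHRASES = ("be more careful", "improve monitoring")


def lint_action_item(text: str) -> List[str]:
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    if OWNER_RE.search(text) is None:
        findings.append("missing owner")
    due = DUE_RE.search(text)
    if due is None:
        findings.append("missing due date")
    else:
        try:
            date.fromisoformat(due.group(1))
        except ValueError:
            findings.append("invalid due date")
    if any(phrase in text.lower() for phrase in VAGUE_PHRASES):
        findings.append("vague wording")
    return findings


bad = "Be more careful with production deployments."
good = (
    "Add timeout validation to the pre-deploy CI check. "
    "Owner: @platform-team. Due: 2024-02-15."
)
print(lint_action_item(bad))
print(lint_action_item(good))
```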
Five-whys example:
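One way to picture a five-whys chain is as an ordered list of question and answer pairs whose last answer is the root cause. The Python sketch below is hypothetical: the chain is built around the timeout example above, the first question and the intermediate answers are illustrative assumptions rather than details of a real incident, and the `blames_person` keyword check is only a crude reminder to keep asking "why" when an answer still points at a human action.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Why:
    question: str
    answer: str


# Hypothetical chain; only the final systemic cause mirrors the example above.
chain: List[Why] = [
    Why("Why did requests hang?",
        "The service was deployed without a timeout setting."),
    Why("Why was the timeout setting missing?",
        "The engineer deploying did not set the timeout flag."),
    Why("Why was the flag not set?",
        "The deployment checklist has no timeout configuration step."),
    Why("Why did nothing catch the omission?",
        "No automated validation checks for missing timeout settings "
        "before production."),
]

# Crude keyword list (an assumption); a real review relies on human judgment.
PERSON_WORDS = ("engineer", "operator", "forgot", "someone")


def root_cause(whys: List[Why]) -> str:
    """The root cause is the answer to the last 'why' in the chain."""
    return whys[-1].answer


def blames_person(answer: str) -> bool:
    """True if the answer still reads as a human action, so keep asking why."""
    lowered = answer.lower()
    return any(word in lowered for word in PERSON_WORDS)


for step in chain:
    marker = "keep asking" if blames_person(step.answer) else "ok"
    print(f"{step.question} -> {step.answer} [{marker}]")
print("Root cause:", root_cause(chain))
```

Stopping at the second answer would produce the "bad" root cause; the chain only ends once the answer describes a gap in the checklist or tooling.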
npx claudepluginhub jeffreytse/grimoire --plugin grimoire