Skill

write-post-mortem

Documents incidents, outages, or production failures with blameless post-mortems. Includes timeline, root cause analysis, and action items.

documentation

devops

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/grimoire:write-post-mortem

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Write a blameless post-mortem that captures what happened, why it happened, and what will prevent recurrence — without blaming individuals.

SKILL.md

74 lines · ~1.5k tokens

Stats

Language: Shell

Stars: 12

Forks: 1

Maintenance: Excellent

Last Commit: Jun 15, 2026

Write Post-Mortem

Write a blameless post-mortem that captures what happened, why it happened, and what will prevent recurrence — without blaming individuals.

Why This Is Best Practice

Adopted by: Amazon Web Services (internal COE process), Google SRE teams, Etsy (pioneered blameless post-mortems in 2012), PagerDuty, Atlassian, Netflix.

Impact: Google's SRE book reports that teams with structured post-mortems increase mean time between incidents (MTBI) by 20–40%. Etsy's blameless culture is credited with enabling 50+ deploys per day without an increased incident rate. A 2023 Puppet State of DevOps report found that high-performing teams are 2.4× more likely to conduct blameless post-mortems.

Why best: Blameless post-mortems surface systemic failures rather than hiding them behind individual blame. When engineers fear punishment, they under-report near-misses and route around broken systems. The blameless model assumes engineers are competent and acted rationally given their information at the time — the fault lies in the system, not the person.

Sources: Google SRE Book (Chapter 15); John Allspaw, "Blameless Post-Mortems and a Just Culture" (Etsy Engineering Blog, 2012); Puppet State of DevOps Report 2023.

Steps

1. Open a draft within 24–48 hours of resolution, while details are fresh. Assign a single author; others contribute via comments.
2. Write the incident summary (3–5 sentences): what failed, what the user-visible impact was, when it started and ended, and the severity level (P0/P1/P2 or equivalent).
3. Build the timeline in chronological order with UTC timestamps. Include: first alert fired, who was paged, each diagnostic action taken, each mitigation attempted, resolution time, and all-clear time. Be factual; do not editorialize.
4. State the root cause in one sentence using the "five whys" technique: ask "why did this happen?" iteratively until you reach a systemic cause, not a human action. Example: not "an engineer deleted the table" but "a migration script had no dry-run mode and no confirmation prompt in production."
5. List contributing factors: conditions that allowed the root cause to manifest. Examples: missing monitoring, inadequate runbooks, insufficient test coverage, unclear ownership, alert fatigue.
6. Write action items. Each must be specific (not "improve monitoring"), assigned to a named owner, and given a due date. Categorize each as preventive (stops recurrence), detective (catches it sooner), or corrective (reduces blast radius). Aim for 3–7 actionable items, not 20 aspirational ones.
7. State what went well: tools that worked, responders who acted effectively, communication that helped. Recording this reinforces good practices; it is not flattery.
8. Publish to a shared, searchable incident log (Confluence, Notion, internal wiki). Notify stakeholders. Schedule a 30-minute review meeting if the incident was P0/P1.
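
The timeline step above (chronological order, UTC timestamps) can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `TimelineEvent` type, the `build_timeline` helper, and the sample events are all hypothetical, and it assumes every timestamp pulled from logs or paging tools carries a time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEvent:
    """Hypothetical record for one timeline entry."""
    at: datetime      # timezone-aware timestamp taken from a log or paging tool
    description: str  # factual statement of what happened


def build_timeline(events):
    """Return one line per event, converted to UTC and sorted chronologically."""
    for event in events:
        if event.at.tzinfo is None:
            # A naive timestamp cannot be converted to UTC reliably, so reject it.
            raise ValueError(f"timestamp has no time zone: {event.description}")
    ordered = sorted(events, key=lambda e: e.at)
    return [
        f"{e.at.astimezone(timezone.utc):%Y-%m-%d %H:%M} UTC  {e.description}"
        for e in ordered
    ]


# Illustrative data only: two events recorded in different time zones.
pacific = timezone(timedelta(hours=-8))
sample_events = [
    TimelineEvent(datetime(2024, 1, 10, 14, 12, tzinfo=timezone.utc),
                  "Mitigation applied: long-running query killed"),
    TimelineEvent(datetime(2024, 1, 10, 5, 47, tzinfo=pacific),
                  "First alert fired"),
]

for line in build_timeline(sample_events):
    print(line)
```

Rejecting naive timestamps, rather than guessing a zone, keeps the timeline tied to what the records actually say.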

Rules

Never name individuals as the cause. Write "a configuration change was deployed" not "Alice deployed a bad config."
Do not use "human error" as a root cause — it is always a symptom. Ask why the human was in a position to cause that error.
Action items without an owner and a date are not action items — they are wishes. Strike them or assign them before publishing.
The timeline must be factual. Do not reconstruct it from memory alone — use logs, PagerDuty history, Slack threads, and deployment records.
Writing effort should match incident severity. A P2 degradation warrants a concise 1-page doc. A P0 global outage warrants a thorough multi-section analysis.
Do not delay publishing to make it look better. A rough doc published in 48 hours is more useful than a polished doc published in 2 weeks.
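
The rule that an action item needs an owner and a due date lends itself to a pre-publish check. The sketch below is one possible shape for it, under stated assumptions: the `ActionItem` type, the `lint_action_item` helper, and the short list of vague phrases are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Categories named in the Steps section.
CATEGORIES = {"preventive", "detective", "corrective"}

# Illustrative phrases only; a real list would be tuned by the team.
VAGUE_PHRASES = ("be more careful", "improve monitoring", "improve alerting")


@dataclass
class ActionItem:
    """Hypothetical record for one post-mortem action item."""
    description: str
    category: str
    owner: Optional[str] = None
    due: Optional[date] = None


def lint_action_item(item):
    """Return a list of problems; an empty list means the item can be published."""
    problems = []
    if not item.owner:
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    if item.category not in CATEGORIES:
        problems.append("unknown category")
    if any(phrase in item.description.lower() for phrase in VAGUE_PHRASES):
        problems.append("description is vague")
    return problems


good = ActionItem(
    "Add timeout validation to the pre-deploy CI check.",
    "preventive",
    owner="@platform-team",
    due=date(2024, 2, 15),
)
bad = ActionItem("Be more careful with production deployments.", "preventive")

print(lint_action_item(good))
print(lint_action_item(bad))
```

A phrase list catches only the most obvious vagueness; whether an item is specific enough still needs a human reviewer.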

Examples

Root cause (bad): "Engineer forgot to set the timeout flag."

Root cause (good): "The deployment checklist did not include a timeout configuration step, and no automated validation checked for missing timeout settings before deployment to production."

Action item (bad): "Be more careful with production deployments."

Action item (good): "Add timeout validation to the pre-deploy CI check. Owner: @platform-team. Due: 2024-02-15."

Five-whys example:

Why did the service go down? → Database connections were exhausted.
Why were connections exhausted? → A query ran for 45 minutes without a timeout.
Why was there no timeout? → The ORM default was unlimited and the config template didn't set one.
Why didn't the template set it? → The template was written before the timeout policy existed.
Why wasn't the template updated? → No process exists to audit config templates against policy changes. ← systemic root cause

Common Mistakes

Blame disguised as fact: "The on-call engineer missed the alert" — instead: "The alert threshold was too high to fire during the initial degradation window."
Vague action items: "Improve alerting" — instead: "Add a p99 latency alert at 800ms for the checkout service. Owner: observability team. Due: next sprint."
Timeline gaps: Reconstructed timelines with unexplained gaps of 30 minutes or more make it impossible to understand the incident arc.
Skipping contributing factors: Root cause alone rarely tells the full story. Without contributing factors, the same conditions will produce the next incident.
Publishing and forgetting: Post-mortems only prevent recurrence if action items are tracked to completion. Add them to a sprint or project tracker immediately.
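
Timeline gaps of the kind described above can be flagged mechanically before review. The following is a rough sketch rather than a prescribed tool: the `find_gaps` helper and the sample timestamps are hypothetical, and the 30-minute threshold is simply the figure used in this list.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_gap=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive entries are max_gap or more apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= max_gap]


def utc(hour, minute):
    """Build an illustrative UTC timestamp on a made-up incident date."""
    return datetime(2024, 1, 10, hour, minute, tzinfo=timezone.utc)


# Illustrative timeline entries only.
entries = [utc(13, 47), utc(13, 52), utc(14, 30), utc(14, 41)]

for start, end in find_gaps(entries):
    print(f"No entries between {start:%H:%M} and {end:%H:%M} UTC")
```

A flagged gap is a prompt to go back to logs, paging history, and chat threads, not proof that nothing happened in that window.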
