Execute structured live incident response: declare severity, assign roles, mitigate, communicate, resolve, and run blameless postmortems for production incidents.
Slash command: /grimoire:plan-incident-response
Adopted by: Google's SRE practice, documented in the SRE book (O'Reilly, 2016), is the canonical reference. Atlassian, PagerDuty, Slack, and Stripe have each published incident response processes derived from the same model. The PagerDuty Incident Response Guide (response.pagerduty.com) is openly licensed and used by thousands of engineering organizations.

Impact: Google SRE data (SRE book, Ch. 14) shows that unstructured incident response, where responders debug, communicate, and make decisions simultaneously without role separation, extends mean time to resolution (MTTR) by 2–4× compared to role-separated response. PagerDuty's State of Digital Operations report (2022) found that organizations with a documented incident response process resolve incidents 23% faster and have 30% lower escalation rates.

Why best: Under stress, cognitive load spills from the Incident Commander to the debuggers and degrades the work of both. Role separation (IC, responder, communicator) is borrowed from aviation crew resource management (CRM), the same principle that reduced fatal aviation accidents by 65% between 1978 and 2000 after CRM was mandated (FAA, 1995). Blameless postmortems (from the SRE book) produce systemic fixes rather than scapegoating, which Google credits for its ability to publish postmortems publicly and drive industry-wide learning.
Sources: Beyer, Jones, Petoff, Murphy, "Site Reliability Engineering" (O'Reilly 2016, Ch. 14–15); PagerDuty Incident Response Guide (response.pagerduty.com); FAA Advisory Circular 120-51E (Crew Resource Management Training, 2004)
Anyone can declare an incident. Do not wait for certainty — declare early and downgrade later if wrong. The cost of a false positive is minutes; the cost of a delayed declaration is compounded user impact.
| Severity | Definition | Response time | War room |
|---|---|---|---|
| SEV-1 | Complete outage or data loss affecting all users or production data | Immediate (< 5 min) | Required |
| SEV-2 | Significant degradation: major feature down, >20% of users affected | < 15 min | Required |
| SEV-3 | Partial degradation: minor feature, <20% of users, workaround exists | < 1 hour | Optional |
| SEV-4 | Non-urgent: cosmetic, low-traffic path, no user impact reported | Next business day | No |
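The table can be read as a small decision procedure. The following is a minimal sketch of that reading, not part of the skill itself: the `Impact` fields and the `classify_severity` name are hypothetical, a major feature outage or more than 20% of users affected is each treated as enough for SEV-2, and the case of exactly 20% (which the table leaves open) falls to SEV-3.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical description of observed impact; field names are illustrative."""
    complete_outage: bool = False
    data_loss: bool = False
    major_feature_down: bool = False
    fraction_users_affected: float = 0.0  # 0.0 to 1.0
    user_impact_reported: bool = False


def classify_severity(impact: Impact) -> str:
    """Map observed impact to a severity label, following the table above."""
    if impact.complete_outage or impact.data_loss:
        return "SEV-1"
    # Assumption: either condition alone is enough for SEV-2, which errs
    # toward declaring high and downgrading later.
    if impact.major_feature_down or impact.fraction_users_affected > 0.20:
        return "SEV-2"
    if impact.user_impact_reported:
        return "SEV-3"
    return "SEV-4"


# The checkout example used later in this document: ~30% of users affected.
checkout = Impact(fraction_users_affected=0.30, user_impact_reported=True)
print(classify_severity(checkout))
```

A helper like this is only a starting point for the declaration; the responder's judgment still decides, and the severity can be downgraded once the picture is clearer.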
Post severity in the incident channel immediately:
```
🚨 SEV-2 DECLARED — [timestamp UTC]
Summary: Payment checkout returning 500 errors for ~30% of users
IC: @alice | Responder: @bob | Comms: @carol
War room: #inc-2024-11-14-checkout-500s | Status page: updating
```
Incident Commander (IC):
Responder(s):
Communications Lead (Comms):
Create a dedicated Slack/Teams channel: #inc-YYYY-MM-DD-short-description.
Pin to the channel:
Start a video call for SEV-1/2. Mute everyone except active speakers. Use the channel for async updates so the video call stays uncluttered.
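The channel naming convention above (`#inc-YYYY-MM-DD-short-description`) is easy to apply inconsistently under pressure, so it may help to generate the name. This is a sketch under assumptions: `incident_channel_name` is a hypothetical helper, and the 80-character cap reflects Slack's channel name limit as I understand it; adjust it for other chat tools.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str, max_len: int = 80) -> str:
    """Build a channel name of the form inc-YYYY-MM-DD-short-description."""
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    # Truncate to the assumed length limit without leaving a trailing hyphen.
    return name[:max_len].rstrip("-")


print("#" + incident_channel_name(date(2024, 11, 14), "Checkout 500s"))
```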
Before anyone proposes a fix, the IC asks for a timeline of recent changes:
```shell
# What changed in the last 2 hours?
git log --oneline --since="2 hours ago"
kubectl rollout history deployment/payment-api

# Check for infrastructure drift or pending changes (or consult the change log)
terraform plan -out=tfplan && terraform show tfplan
```
Post findings to the incident channel with timestamps:
```
13:42 UTC — payment-api v2.14.1 deployed (contains DB schema migration)
13:47 UTC — first 500 errors appear in Datadog
13:52 UTC — error rate climbs to 32%
```
A timeline of changes usually points to the root cause faster than log spelunking: in the example above, the deploy landed five minutes before the first errors.
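The reasoning behind the timeline is simply "which changes landed shortly before the first symptom?". A minimal sketch of that filter follows. The `Change` type and `suspect_changes` function are hypothetical names, the two-hour window mirrors the `--since="2 hours ago"` lookback used above, and the 09:10 config change is invented purely to show an entry being filtered out.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple


class Change(NamedTuple):
    at: datetime
    description: str


def suspect_changes(
    changes: List[Change],
    first_symptom_at: datetime,
    window: timedelta = timedelta(hours=2),
) -> List[Change]:
    """Return changes that landed within `window` before the first symptom, newest first."""
    earliest = first_symptom_at - window
    candidates = [c for c in changes if earliest <= c.at <= first_symptom_at]
    return sorted(candidates, key=lambda c: c.at, reverse=True)


def utc(hour: int, minute: int) -> datetime:
    # Date taken from the example incident in this document.
    return datetime(2024, 11, 14, hour, minute, tzinfo=timezone.utc)


changes = [
    Change(utc(9, 10), "feature-flag config change (hypothetical)"),
    Change(utc(13, 42), "payment-api v2.14.1 deployed (contains DB schema migration)"),
]
for change in suspect_changes(changes, first_symptom_at=utc(13, 47)):
    print(change.at.strftime("%H:%M UTC"), change.description)
```

Sorting newest first puts the most recent change at the top, which is usually the first candidate to examine or roll back.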
Mitigation = stopping the bleeding. Fix = preventing recurrence.
Common mitigations (fastest options first):
The IC approves every mitigation before execution. The responder announces: "About to roll back payment-api to v2.13.9. Standing by for IC go."
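If mitigations are triggered through tooling, the same approval rule can be enforced in code. The sketch below is illustrative only: `run_mitigation` and `NotApproved` are hypothetical names, and the rollback action is a stand-in rather than a real deployment call.

```python
from typing import Callable, Optional


class NotApproved(Exception):
    """Raised when a mitigation is attempted without the IC's go."""


def run_mitigation(
    description: str,
    action: Callable[[], str],
    incident_commander: str,
    approved_by: Optional[str] = None,
) -> str:
    """Run `action` only if the incident commander has approved it."""
    if approved_by != incident_commander:
        raise NotApproved(f"{description}: standing by for go from {incident_commander}")
    return action()


def rollback_payment_api() -> str:
    # Stand-in for the real rollback; no deployment tooling is called here.
    return "payment-api rolled back to v2.13.9"


print(run_mitigation(
    "Roll back payment-api to v2.13.9",
    rollback_payment_api,
    incident_commander="@alice",
    approved_by="@alice",
))
```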
Update the status page within 10 minutes of declaration. Use this language:
Never write: "Due to [technical cause], users experienced..." — this is speculation before root cause is confirmed. Never promise an ETA unless you have high confidence.
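Both rules above (a first update within 10 minutes, and no causal claims before root cause is confirmed) can be checked mechanically. This is a rough sketch with hypothetical function names; the phrase check is a deliberately crude heuristic that only looks for "due to", and the 13:50 declaration time is an assumed value for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STATUS_PAGE_WINDOW = timedelta(minutes=10)


def status_page_deadline(declared_at: datetime) -> datetime:
    """Latest time for the first status page update."""
    return declared_at + STATUS_PAGE_WINDOW


def is_update_overdue(
    declared_at: datetime,
    now: datetime,
    first_update_at: Optional[datetime] = None,
) -> bool:
    """True if the first update was late, or is still missing past the deadline."""
    deadline = status_page_deadline(declared_at)
    if first_update_at is not None:
        return first_update_at > deadline
    return now > deadline


def mentions_unconfirmed_cause(draft: str) -> bool:
    """Crude check for the 'Due to [technical cause]' phrasing the text warns against."""
    return "due to" in draft.lower()


declared = datetime(2024, 11, 14, 13, 50, tzinfo=timezone.utc)  # assumed declaration time
print(status_page_deadline(declared).strftime("%H:%M UTC"))
print(mentions_unconfirmed_cause("We are investigating elevated checkout errors."))
```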
Resolution requires:
IC announces in channel:
```
✅ RESOLVED — [timestamp UTC]
Duration: 47 min
Impact: ~30% checkout failure rate for 47 min
Mitigation: Rolled back payment-api to v2.13.9
Next: Postmortem scheduled for [date], owner: @alice
```
Schedule the postmortem within 5 business days for SEV-1/2 and within 2 weeks for SEV-3.
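The scheduling rule can be turned into a due date when the incident is resolved. The sketch below makes several assumptions: `postmortem_due` and `add_business_days` are hypothetical helpers, business days are Monday to Friday with no holiday calendar, "2 weeks" is read as 14 calendar days, and SEV-4 gets no deadline because the text does not set one.

```python
from datetime import date, timedelta
from typing import Optional


def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days (Mon-Fri) from `start`, ignoring holidays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_due(resolved_on: date, severity: str) -> Optional[date]:
    """Latest date to hold the postmortem, or None if no deadline applies."""
    if severity in ("SEV-1", "SEV-2"):
        return add_business_days(resolved_on, 5)
    if severity == "SEV-3":
        return resolved_on + timedelta(weeks=2)
    return None  # the text sets no deadline for SEV-4


# The example incident in this document was declared on 2024-11-14, a Thursday.
print(postmortem_due(date(2024, 11, 14), "SEV-2"))
```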
Postmortem document structure:
Blameless means: the goal is systemic improvement, not assigning fault. People follow the processes and tools available to them. If a person made a mistake, ask why the system made the mistake easy to make.