Skill

incident

Classifies incidents by severity (SEV1-4), constructs timelines, assesses impact, performs 5 Whys root cause analysis, and generates blameless post-mortems for production issues.

Git

npx claudepluginhub arbazkhan971/godmode

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/godmode:incident

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- User invokes `/godmode:incident`

SKILL.md

258 lines · ~1.7k tokens

Similar Skills

incident

Executes structured production incident response: triages P1-P3 severity, contains blast radius (rollback, mitigation), root-causes after stabilization, logs timeline, generates postmortem. Triggers on outages or 'incident'.

rune

Incident Response Workflow

Handles production incidents with structured triage, severity classification, mitigation, root cause analysis, and post-mortem writing.

hatch3r

nw-post-mortem-framework

500

Structures blameless post-mortems with incident timelines, impact assessment, root cause analysis, response evaluation, action items, and lessons learned. Useful after production incidents or outages.

Stats

LanguageShell

Stars18

Forks8

MaintenanceExcellent

Last CommitApr 25, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Incident — Incident Response & Post-Mortem

Activate When

User invokes /godmode:incident
User reports production outage or degradation
User says "production is down", "write a post-mortem"
Monitoring alerts or PagerDuty notifications shared

Workflow

Step 1: Incident Classification

# Check recent deployments (common root cause)
git log --oneline --since="2 hours ago" | head -10

# Check error rates if monitoring accessible
curl -s "http://localhost:9090/api/v1/query?\
query=rate(http_requests_total{code=~'5..'}[5m])" \
  2>/dev/null | head -5

INCIDENT CLASSIFICATION:
ID: INC-<YYYY-MM-DD>-<NNN>
Title: <concise impact description>
Severity: <SEV1 | SEV2 | SEV3 | SEV4>
Status: INVESTIGATING | IDENTIFIED | MONITORING | RESOLVED

SEVERITY MATRIX:
| Level | Impact           | Response Time |
|-------|-----------------|---------------|
| SEV1  | Complete outage | < 15 min      |
| SEV2  | Major degradation| < 30 min     |
| SEV3  | Partial degradation| < 2 hours  |
| SEV4  | Minimal impact  | Next business day|

IF error rate > 50%: SEV1
IF error rate 10-50% or major feature broken: SEV2
IF error rate 1-10% or workaround exists: SEV3
IF cosmetic or < 1% impact: SEV4

Step 2: Timeline Construction

INCIDENT TIMELINE — INC-<ID>:
| Timestamp (UTC) | Event                    |
|-----------------|--------------------------|
| HH:MM:SS        | First alert triggered    |
| HH:MM:SS        | On-call acknowledged     |
| HH:MM:SS        | Root cause identified    |
| HH:MM:SS        | Mitigation applied       |
| HH:MM:SS        | Service restored         |
| HH:MM:SS        | Incident resolved        |

EVIDENCE per entry:
  - Monitoring dashboards (screenshots/links)
  - Log snippets with timestamps
  - Deploy records (commit SHA, timestamp)
  - Customer reports / support tickets

Step 3: Impact Assessment

IMPACT:
Duration: <start> to <end> (<total minutes>)
Users affected: <number or percentage>
Requests failed: <number or error rate %>
Revenue impact: <estimated $ or unknown>
SLA consumed: <budget used, remaining>
Data impact: <lost, corrupted, exposed, or NONE>

THRESHOLDS:
  MTTD target: < 5 minutes (symptom to alert)
  MTTA target: < 15 minutes (alert to response)
  MTTR target: < 60 minutes (detection to resolution)
  IF MTTR > 120 min for SEV1: escalate process review

Step 4: Root Cause Analysis (5 Whys)

1. Why did <symptom> happen?
   → Because <immediate cause>
2. Why did <immediate cause> happen?
   → Because <deeper cause>
3. Why did <deeper cause> happen?
   → Because <process gap>
4. Why did <process gap> exist?
   → Because <organizational factor>
5. Why did <organizational factor> persist?
   → Because <root cause>

ROOT CAUSE: <single sentence>
Contributing factors:
  - <missing monitoring>
  - <insufficient testing>
  - <unclear runbook>

Step 5: Blameless Post-Mortem

# Post-Mortem: INC-<ID> — <Title>
Severity: <SEV>, Duration: <total>

## Summary
<2-3 sentences: what, impact, resolution>

## Timeline
<from Step 2>

## Impact
<from Step 3>

## Root Cause
<from Step 4>

## What Went Well
- <detection speed, team response, tooling>

## What Went Wrong
- <slow detection, missing alerts, unclear ownership>

## Where We Got Lucky
- <things that could have been worse>

## Action Items
<from Step 6>

Blameless principles:

Name systems, not people.
Focus on process gaps.
Assume good intent.

Step 6: Action Item Tracking

| # | Action      | Type    | Priority | Owner | Due  |
|---|-------------|---------|----------|-------|------|
| 1 | <action>    | PREVENT | P0       | <team>| <date>|
| 2 | <action>    | DETECT  | P0       | <team>| <date>|
| 3 | <action>    | MITIGATE| P1       | <team>| <date>|

Types: PREVENT (stop recurrence), DETECT (faster),
  MITIGATE (reduce impact), PROCESS (improve response)
Priority: P0 = 1 week, P1 = 2 weeks, P2 = 1 month

IF action item > 30 days old: escalate weekly
IF no owner: action item is invalid — assign now
IF vague ("be more careful"): rewrite as specific action

Step 7: Metrics

MTTD: <min>, MTTA: <min>, MTTR: <min>
Frequency (30d): SEV1=<N> SEV2=<N> SEV3=<N>
Action items: <completed>/<total>
Repeat incidents: <count>

Commit: "incident: INC-<ID> — <severity> — <title>"

Key Behaviors

Never ask to continue. Loop autonomously until done.

Severity drives response. SEV1 = drop everything.
Timeline is sacred. Timestamp every action.
Blameless or useless. Blame systems, not humans.
Actionable items only. Not "be more careful."
Follow up. Track completion weekly.

HARD RULES

Never assign blame to individuals.
Never skip post-mortem for SEV1 or SEV2.
Never guess at timeline — use logs and git history.
Never inflate or deflate severity.
Never create action items without owner and due date.
Never let action items exceed 30 days without follow-up.
Always timestamp in UTC.
Always attach evidence to timeline entries.
Always include "Where We Got Lucky" section.
Always update status page within severity timeframe.

Auto-Detection

1. Active alerts: error messages, 5xx codes
2. Existing docs: docs/incidents/, postmortems/
3. Monitoring: terraform, docker-compose, k8s
4. Environment: deployment configs, on-call tools

Investigation Loop

WHILE status != "RESOLVED":
  1. GATHER: logs (15min window), error rates, deploys
  2. UPDATE timeline
  3. FORM hypothesis: "caused by X because Y"
  4. TEST: check logs/metrics that confirm or deny
  5. IF confirmed: apply mitigation, monitor 5-10 min
  6. IF denied: record, form new hypothesis
  7. IF monitoring stable 15+ min: status = RESOLVED
  8. IF > 10 iterations: escalate severity

Output Format

Print: Incident: SEV{N} — {title}. MTTR: {min}m. Action items: {count}. Status: {status}.

TSV Logging

timestamp	severity	title	mttr_min	action_items	status

Keep/Discard Discipline

KEEP if: evidence confirms hypothesis AND mitigation
  reduces error rate
DISCARD if: evidence contradicts OR no effect

Stop Conditions

STOP investigation when ANY of:
  - Root cause identified with evidence
  - Service stable 15+ min (shift to post-mortem)
  - User requests stop

STOP post-mortem when:
  - All sections complete
  - All action items have owners and deadlines

Error Recovery

Cannot find root cause: widen to 24h before incident.
Timeline gaps: cross-reference logs, metrics, Slack.
Post-mortem blame-oriented: redirect to systems.
Recurring root cause: escalate to leadership.

incident

Popularity

Invocation

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

incident

Popularity

Invocation

Context Preview

SKILL.md

Incident — Incident Response & Post-Mortem

Activate When

Workflow

Step 1: Incident Classification

Step 2: Timeline Construction

Step 3: Impact Assessment

Step 4: Root Cause Analysis (5 Whys)

Step 5: Blameless Post-Mortem

Step 6: Action Item Tracking

Step 7: Metrics

Key Behaviors

HARD RULES

Auto-Detection

Investigation Loop

Output Format

TSV Logging

Keep/Discard Discipline

Stop Conditions

Error Recovery

Similar Skills

Help us improve

Incident — Incident Response & Post-Mortem

Activate When

Workflow

Step 1: Incident Classification

Step 2: Timeline Construction

Step 3: Impact Assessment

Step 4: Root Cause Analysis (5 Whys)

Step 5: Blameless Post-Mortem

Step 6: Action Item Tracking

Step 7: Metrics

Key Behaviors

HARD RULES

Auto-Detection

Investigation Loop

Output Format

TSV Logging

Keep/Discard Discipline

Stop Conditions

Error Recovery