From godmode
Classifies incidents by severity (SEV1-4), constructs timelines, assesses impact, performs 5 Whys root cause analysis, and generates blameless post-mortems for production issues.
npx claudepluginhub arbazkhan971/godmode
Slash command: /godmode:incident
- User invokes `/godmode:incident`
/godmode:incident
# Check recent deployments (common root cause)
git log --oneline --since="2 hours ago" | head -10
# Check error rates if monitoring accessible
# -G with --data-urlencode stops curl from treating {} and [] as URL globs
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode "query=rate(http_requests_total{code=~'5..'}[5m])" \
  2>/dev/null | head -5
INCIDENT CLASSIFICATION:
ID: INC-<YYYY-MM-DD>-<NNN>
Title: <concise impact description>
Severity: <SEV1 | SEV2 | SEV3 | SEV4>
Status: INVESTIGATING | IDENTIFIED | MONITORING | RESOLVED
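A minimal sketch of building an ID in the `INC-<YYYY-MM-DD>-<NNN>` format. It assumes `<NNN>` is a zero-padded sequence number counted per day; the source does not say how the number is assigned, and `incident_id` is a hypothetical helper name.

```python
from datetime import date


def incident_id(day: date, sequence: int) -> str:
    """Format an incident ID as INC-<YYYY-MM-DD>-<NNN>.

    Assumption: <NNN> is a per-day sequence number, zero-padded to 3 digits.
    """
    return f"INC-{day.isoformat()}-{sequence:03d}"


print(incident_id(date(2024, 5, 1), 7))  # INC-2024-05-01-007
```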
SEVERITY MATRIX:
| Level | Impact              | Response Time     |
|-------|---------------------|-------------------|
| SEV1  | Complete outage     | < 15 min          |
| SEV2  | Major degradation   | < 30 min          |
| SEV3  | Partial degradation | < 2 hours         |
| SEV4  | Minimal impact      | Next business day |
IF error rate > 50%: SEV1
IF error rate 10-50% or major feature broken: SEV2
IF error rate 1-10% or workaround exists: SEV3
IF cosmetic or < 1% impact: SEV4
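The four rules above can be sketched as a single function. This is an illustration, not part of the skill itself. The function name and flags are hypothetical, and the boundary handling is an assumption: exactly 50% falls into SEV2 and exactly 10% into SEV2 rather than SEV3, since the rules leave the edges open.

```python
def classify_severity(
    error_rate: float,
    major_feature_broken: bool = False,
    workaround_exists: bool = False,
) -> str:
    """Map an error rate (0.0 to 1.0) and two flags to SEV1..SEV4.

    Rules are checked from most to least severe, so the first match wins.
    """
    if error_rate > 0.50:
        return "SEV1"
    if error_rate >= 0.10 or major_feature_broken:
        return "SEV2"
    if error_rate >= 0.01 or workaround_exists:
        return "SEV3"
    # Cosmetic issues or < 1% impact
    return "SEV4"


print(classify_severity(0.62))                             # SEV1
print(classify_severity(0.02, major_feature_broken=True))  # SEV2
print(classify_severity(0.003))                            # SEV4
```

Checking the most severe rule first means a broken major feature still rates SEV2 even when a workaround exists.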
INCIDENT TIMELINE — INC-<ID>:
| Timestamp (UTC) | Event |
|-----------------|--------------------------|
| HH:MM:SS | First alert triggered |
| HH:MM:SS | On-call acknowledged |
| HH:MM:SS | Root cause identified |
| HH:MM:SS | Mitigation applied |
| HH:MM:SS | Service restored |
| HH:MM:SS | Incident resolved |
EVIDENCE per entry:
- Monitoring dashboards (screenshots/links)
- Log snippets with timestamps
- Deploy records (commit SHA, timestamp)
- Customer reports / support tickets
IMPACT:
Duration: <start> to <end> (<total minutes>)
Users affected: <number or percentage>
Requests failed: <number or error rate %>
Revenue impact: <estimated $ or unknown>
SLA consumed: <budget used, remaining>
Data impact: <lost, corrupted, exposed, or NONE>
THRESHOLDS:
MTTD target: < 5 minutes (symptom to alert)
MTTA target: < 15 minutes (alert to response)
MTTR target: < 60 minutes (detection to resolution)
IF MTTR > 120 min for SEV1: escalate process review
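As a rough sketch, the three metrics can be derived from four timeline timestamps and compared against the targets above. The helper names, the timestamp format, and the sample times are all hypothetical. "Detection" is taken to mean the first alert.

```python
from datetime import datetime

# Targets from the thresholds above, in minutes (each is a strict "<" target).
TARGETS = {"mttd_min": 5, "mtta_min": 15, "mttr_min": 60}


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two ISO-like timestamps (assumed format)."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def response_metrics(symptom_start, first_alert, acknowledged, resolved):
    return {
        "mttd_min": minutes_between(symptom_start, first_alert),  # symptom to alert
        "mtta_min": minutes_between(first_alert, acknowledged),   # alert to response
        "mttr_min": minutes_between(first_alert, resolved),       # detection to resolution
    }


def check_thresholds(metrics, severity):
    """Return a list of findings; an empty list means all targets were met."""
    findings = [
        f"{name} {value:.0f} min misses target of < {TARGETS[name]} min"
        for name, value in metrics.items()
        if value >= TARGETS[name]
    ]
    if severity == "SEV1" and metrics["mttr_min"] > 120:
        findings.append("SEV1 MTTR above 120 min: escalate process review")
    return findings


metrics = response_metrics(
    "2024-05-01T14:00:00", "2024-05-01T14:03:00",
    "2024-05-01T14:10:00", "2024-05-01T16:33:00",
)
for finding in check_thresholds(metrics, "SEV1"):
    print(finding)
```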
1. Why did <symptom> happen?
→ Because <immediate cause>
2. Why did <immediate cause> happen?
→ Because <deeper cause>
3. Why did <deeper cause> happen?
→ Because <process gap>
4. Why did <process gap> exist?
→ Because <organizational factor>
5. Why did <organizational factor> persist?
→ Because <root cause>
ROOT CAUSE: <single sentence>
Contributing factors:
- <missing monitoring>
- <insufficient testing>
- <unclear runbook>
# Post-Mortem: INC-<ID> — <Title>
Severity: <SEV>, Duration: <total>
## Summary
<2-3 sentences: what, impact, resolution>
## Timeline
<from Step 2>
## Impact
<from Step 3>
## Root Cause
<from Step 4>
## What Went Well
- <detection speed, team response, tooling>
## What Went Wrong
- <slow detection, missing alerts, unclear ownership>
## Where We Got Lucky
- <things that could have been worse>
## Action Items
<from Step 6>
Blameless principles:
| # | Action   | Type     | Priority | Owner  | Due    |
|---|----------|----------|----------|--------|--------|
| 1 | <action> | PREVENT  | P0       | <team> | <date> |
| 2 | <action> | DETECT   | P0       | <team> | <date> |
| 3 | <action> | MITIGATE | P1       | <team> | <date> |
Types: PREVENT (stop recurrence), DETECT (catch it faster),
MITIGATE (reduce impact), PROCESS (improve response)
Priority (due within): P0 = 1 week, P1 = 2 weeks, P2 = 1 month
IF action item > 30 days old: escalate weekly
IF no owner: action item is invalid — assign now
IF vague ("be more careful"): rewrite as specific action
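The three checks above could be automated along these lines. This is a sketch under assumptions: the `ActionItem` fields mirror the table columns, `created` is a hypothetical field used to measure age, and the vague-phrase list is illustrative (only "be more careful" comes from the text).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Illustrative list; only the first phrase appears in the source.
VAGUE_PHRASES = ("be more careful", "pay more attention", "try harder")


@dataclass
class ActionItem:
    action: str
    type: str              # PREVENT | DETECT | MITIGATE | PROCESS
    priority: str          # P0 | P1 | P2
    owner: Optional[str]
    created: date          # hypothetical field, used for the age check


def review(item: ActionItem, today: date) -> List[str]:
    """Return the problems found for one action item (empty list = OK)."""
    problems = []
    if not item.owner:
        problems.append("invalid: no owner, assign now")
    if (today - item.created).days > 30:
        problems.append("older than 30 days: escalate weekly")
    if any(phrase in item.action.lower() for phrase in VAGUE_PHRASES):
        problems.append("vague: rewrite as specific action")
    return problems


stale = ActionItem("Be more careful with deploys", "PREVENT", "P0", None, date(2024, 1, 1))
for problem in review(stale, today=date(2024, 2, 15)):
    print(problem)
```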
MTTD: <min>, MTTA: <min>, MTTR: <min>
Frequency (30d): SEV1=<N> SEV2=<N> SEV3=<N>
Action items: <completed>/<total>
Repeat incidents: <count>
Commit: "incident: INC-<ID> — <severity> — <title>"
Never ask to continue. Loop autonomously until done.
1. Active alerts: error messages, 5xx codes
2. Existing docs: docs/incidents/, postmortems/
3. Monitoring: terraform, docker-compose, k8s
4. Environment: deployment configs, on-call tools
WHILE status != "RESOLVED":
1. GATHER: logs (15min window), error rates, deploys
2. UPDATE timeline
3. FORM hypothesis: "caused by X because Y"
4. TEST: check logs/metrics that confirm or deny
5. IF confirmed: apply mitigation, monitor 5-10 min
6. IF denied: record, form new hypothesis
7. IF monitoring stable 15+ min: status = RESOLVED
8. IF > 10 iterations: escalate severity
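The loop above can be approximated in code. This sketch simplifies it in one respect: instead of forming hypotheses from freshly gathered logs on each pass, it walks a pre-supplied list. The three callbacks (`test_hypothesis`, `apply_mitigation`, `stable_for_15_min`) are hypothetical stand-ins for whatever log, metric, and deploy tooling is available.

```python
from typing import Callable, List, Tuple


def investigate(
    hypotheses: List[str],
    test_hypothesis: Callable[[str], bool],
    apply_mitigation: Callable[[str], None],
    stable_for_15_min: Callable[[], bool],
    max_iterations: int = 10,
) -> Tuple[str, List[str]]:
    """Try hypotheses in order; return the final status and the timeline."""
    status = "INVESTIGATING"
    timeline: List[str] = []
    for iteration, hypothesis in enumerate(hypotheses, start=1):
        if iteration == max_iterations + 1:
            # Recorded once, on the first pass beyond the limit.
            timeline.append(f"over {max_iterations} iterations: escalate severity")
        timeline.append(f"hypothesis: {hypothesis}")
        if test_hypothesis(hypothesis):
            apply_mitigation(hypothesis)
            status = "MONITORING"
            timeline.append(f"confirmed, mitigation applied: {hypothesis}")
            if stable_for_15_min():
                status = "RESOLVED"
                timeline.append("stable 15+ min: resolved")
                break
        else:
            timeline.append(f"denied: {hypothesis}")
    return status, timeline


applied: List[str] = []
status, timeline = investigate(
    ["bad deploy", "db pool exhausted"],
    test_hypothesis=lambda h: h == "db pool exhausted",
    apply_mitigation=applied.append,
    stable_for_15_min=lambda: True,
)
print(status)  # RESOLVED
```

Denied hypotheses stay in the timeline, which keeps the record the post-mortem needs.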
Print: Incident: SEV{N} — {title}. MTTR: {min}m. Action items: {count}. Status: {status}.
timestamp severity title mttr_min action_items status
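The column list above reads like the header of a tab-separated log. Assuming that is the intent (the source does not name a file or delimiter), one row could be produced like this. `tsv_row` and the sample values are hypothetical.

```python
COLUMNS = ("timestamp", "severity", "title", "mttr_min", "action_items", "status")


def tsv_row(timestamp, severity, title, mttr_min, action_items, status) -> str:
    """Join the six fields with tabs, in the column order listed above."""
    fields = (timestamp, severity, title, mttr_min, action_items, status)
    return "\t".join(str(field) for field in fields)


print("\t".join(COLUMNS))
print(tsv_row("2024-05-01T16:33:00Z", "SEV2", "Checkout errors", 42, 3, "RESOLVED"))
```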
KEEP if: evidence confirms hypothesis AND mitigation
reduces error rate
DISCARD if: evidence contradicts OR no effect
STOP investigation when ANY of:
- Root cause identified with evidence
- Service stable 15+ min (shift to post-mortem)
- User requests stop
STOP post-mortem when:
- All sections complete
- All action items have owners and deadlines