Incident response and post-mortem skill. Severity classification (SEV1-4), timeline construction, blameless post-mortems, action item tracking. Triggers on: /godmode:incident, "production is down", "post-mortem", "incident report".
From godmodenpx claudepluginhub arbazkhan971/godmodeThis skill uses the workspace's default tool permissions.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
/godmode:incident# Check recent deployments (common root cause)
git log --oneline --since="2 hours ago" | head -10
# Check error rates if monitoring accessible
curl -s "http://localhost:9090/api/v1/query?\
query=rate(http_requests_total{code=~'5..'}[5m])" \
2>/dev/null | head -5
INCIDENT CLASSIFICATION:
ID: INC-<YYYY-MM-DD>-<NNN>
Title: <concise impact description>
Severity: <SEV1 | SEV2 | SEV3 | SEV4>
Status: INVESTIGATING | IDENTIFIED | MONITORING | RESOLVED
SEVERITY MATRIX:
| Level | Impact | Response Time |
|-------|-----------------|---------------|
| SEV1 | Complete outage | < 15 min |
| SEV2 | Major degradation| < 30 min |
| SEV3 | Partial degradation| < 2 hours |
| SEV4 | Minimal impact | Next business day|
IF error rate > 50%: SEV1
IF error rate 10-50% or major feature broken: SEV2
IF error rate 1-10% or workaround exists: SEV3
IF cosmetic or < 1% impact: SEV4
INCIDENT TIMELINE — INC-<ID>:
| Timestamp (UTC) | Event |
|-----------------|--------------------------|
| HH:MM:SS | First alert triggered |
| HH:MM:SS | On-call acknowledged |
| HH:MM:SS | Root cause identified |
| HH:MM:SS | Mitigation applied |
| HH:MM:SS | Service restored |
| HH:MM:SS | Incident resolved |
EVIDENCE per entry:
- Monitoring dashboards (screenshots/links)
- Log snippets with timestamps
- Deploy records (commit SHA, timestamp)
- Customer reports / support tickets
IMPACT:
Duration: <start> to <end> (<total minutes>)
Users affected: <number or percentage>
Requests failed: <number or error rate %>
Revenue impact: <estimated $ or unknown>
SLA consumed: <budget used, remaining>
Data impact: <lost, corrupted, exposed, or NONE>
THRESHOLDS:
MTTD target: < 5 minutes (symptom to alert)
MTTA target: < 15 minutes (alert to response)
MTTR target: < 60 minutes (detection to resolution)
IF MTTR > 120 min for SEV1: escalate process review
1. Why did <symptom> happen?
→ Because <immediate cause>
2. Why did <immediate cause> happen?
→ Because <deeper cause>
3. Why did <deeper cause> happen?
→ Because <process gap>
4. Why did <process gap> exist?
→ Because <organizational factor>
5. Why did <organizational factor> persist?
→ Because <root cause>
ROOT CAUSE: <single sentence>
Contributing factors:
- <missing monitoring>
- <insufficient testing>
- <unclear runbook>
# Post-Mortem: INC-<ID> — <Title>
Severity: <SEV>, Duration: <total>
## Summary
<2-3 sentences: what, impact, resolution>
## Timeline
<from Step 2>
## Impact
<from Step 3>
## Root Cause
<from Step 4>
## What Went Well
- <detection speed, team response, tooling>
## What Went Wrong
- <slow detection, missing alerts, unclear ownership>
## Where We Got Lucky
- <things that could have been worse>
## Action Items
<from Step 6>
Blameless principles:
| # | Action | Type | Priority | Owner | Due |
|---|-------------|---------|----------|-------|------|
| 1 | <action> | PREVENT | P0 | <team>| <date>|
| 2 | <action> | DETECT | P0 | <team>| <date>|
| 3 | <action> | MITIGATE| P1 | <team>| <date>|
Types: PREVENT (stop recurrence), DETECT (faster),
MITIGATE (reduce impact), PROCESS (improve response)
Priority: P0 = 1 week, P1 = 2 weeks, P2 = 1 month
IF action item > 30 days old: escalate weekly
IF no owner: action item is invalid — assign now
IF vague ("be more careful"): rewrite as specific action
MTTD: <min>, MTTA: <min>, MTTR: <min>
Frequency (30d): SEV1=<N> SEV2=<N> SEV3=<N>
Action items: <completed>/<total>
Repeat incidents: <count>
Commit: "incident: INC-<ID> — <severity> — <title>"
Never ask to continue. Loop autonomously until done.
1. Active alerts: error messages, 5xx codes
2. Existing docs: docs/incidents/, postmortems/
3. Monitoring: terraform, docker-compose, k8s
4. Environment: deployment configs, on-call tools
WHILE status != "RESOLVED":
1. GATHER: logs (15min window), error rates, deploys
2. UPDATE timeline
3. FORM hypothesis: "caused by X because Y"
4. TEST: check logs/metrics that confirm or deny
5. IF confirmed: apply mitigation, monitor 5-10 min
6. IF denied: record, form new hypothesis
7. IF monitoring stable 15+ min: status = RESOLVED
8. IF > 10 iterations: escalate severity
Print: Incident: SEV{N} — {title}. MTTR: {min}m. Action items: {count}. Status: {status}.
timestamp severity title mttr_min action_items status
KEEP if: evidence confirms hypothesis AND mitigation
reduces error rate
DISCARD if: evidence contradicts OR no effect
STOP investigation when ANY of:
- Root cause identified with evidence
- Service stable 15+ min (shift to post-mortem)
- User requests stop
STOP post-mortem when:
- All sections complete
- All action items have owners and deadlines