From rune
Triages production incidents by severity (P1-P3), contains blast radius via rollback or other strategies, identifies root causes, documents the timeline, and generates postmortems. Triggers on outages, errors, or 'incident' mentions.
npx claudepluginhub rune-kit/rune --plugin @rune/analytics

This skill uses the workspace's default tool permissions.
Structured incident response for production issues. Follows a strict order: triage first, contain before investigating, root-cause after stable, postmortem last. Prevents the most common incident anti-pattern — developers debugging while the system is still on fire. Covers P1 outages, P2 degraded service, and P3 minor issues with appropriate urgency at each level.
Triggers:
- `/rune incident "description of what's broken"`: direct user invocation
- launch (L1): watchdog alerts during Phase 3 VERIFY
- deploy (L2): health check fails post-deploy
- watchdog emits `incident.detected`: no manual invocation needed

Calls:
- watchdog (L3): current system state (which endpoints are down, response times)
- autopsy (L2): root cause analysis after containment
- journal (L3): record incident timeline and decisions
- sentinel (L2): check for security dimension (data exposure, unauthorized access)
- neural-memory (ext): after resolution, capture incident root cause + fix pattern cross-session so the same failure mode is never diagnosed twice

Called by:
- launch (L1): monitoring alert during production verification
- deploy (L2): post-deploy health check failure
- `/rune incident` direct invocation

Classify severity using this matrix:
| Severity | Definition | Contain Within |
|---|---|---|
| P1 | Full outage — core feature unavailable for all users | 15 minutes |
| P2 | Partial degradation — feature broken for subset of users or degraded for all | 1 hour |
| P3 | Minor issue — cosmetic, edge case, or non-blocking degradation | 4 hours |
P1 indicators: 5xx on root /, auth endpoint down, payment flow broken, data loss detected
P2 indicators: elevated error rate (>1%) on key flow, 1+ regions down, response time >5x baseline
P3 indicators: UI glitch, non-critical feature broken, low error rate (<0.1%)
Emit: TRIAGE: [P1|P2|P3] — [one-line impact description]
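The matrix and indicators above can be sketched as a small classifier. This is an illustration under stated assumptions, not the skill's own implementation: `Signals`, `classify`, and `triage_line` are hypothetical names, "performance >5x baseline" is read as latency, and the handling of error rates between 0.1% and 1% (which the indicator lists leave open) is a guess marked in the comments.

```python
from dataclasses import dataclass

# Containment deadlines in minutes, taken from the severity matrix.
CONTAIN_WITHIN_MIN = {"P1": 15, "P2": 60, "P3": 240}

EM_DASH = "\u2014"  # separator used in the Emit format


@dataclass
class Signals:
    """Hypothetical summary of what monitoring reports at detection time."""
    core_flow_down: bool = False       # 5xx on root, auth down, payment broken
    data_loss: bool = False
    key_flow_error_rate: float = 0.0   # fraction: 0.02 means 2%
    regions_down: int = 0
    latency_vs_baseline: float = 1.0   # 6.0 means six times baseline


def classify(s: Signals) -> str:
    """Apply the P1 indicators first, then P2, and fall back to P3."""
    if s.core_flow_down or s.data_loss:
        return "P1"
    if s.key_flow_error_rate > 0.01 or s.regions_down >= 1 or s.latency_vs_baseline > 5:
        return "P2"
    # Assumption: anything matching no P1/P2 indicator is P3, including
    # error rates between 0.1% and 1%.
    return "P3"


def triage_line(s: Signals, impact: str) -> str:
    """Build the TRIAGE line in the Emit format."""
    return f"TRIAGE: {classify(s)} {EM_DASH} {impact}"
```

For example, `triage_line(Signals(key_flow_error_rate=0.3), "Login service returning 503 for ~30% of users")` produces a P2 triage line, and `CONTAIN_WITHIN_MIN["P2"]` gives the 60-minute containment window.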
Choose containment strategy based on what's available and severity:
| Strategy | When to Use |
|---|---|
| Rollback | Last deploy caused regression (check git log vs incident start time) |
| Feature flag off | Feature-gated code — disable without deploy |
| Traffic shift | Multi-region: route away from affected region |
| Scale up | Resource exhaustion (CPU/memory/connection pool) |
| Rate limit | Abuse pattern or traffic spike |
| Manual intervention | DB locked record, stuck job, cache corruption |
Execute containment action. Then invoke watchdog to verify system is stable before proceeding.
Emit: CONTAINED: [strategy used] — [timestamp] or CONTAINMENT_FAILED: [what was tried] — escalate
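As a rough sketch, the strategy table can be treated as a lookup from an observed symptom to a containment action, with the two Emit lines built from the result. The symptom keys and function names here are hypothetical labels invented for the example; the skill itself may choose strategies differently.

```python
from datetime import datetime
from typing import Optional

EM_DASH = "\u2014"  # separator used in the Emit format

# Symptom -> strategy, following the containment table.
# The symptom keys are hypothetical labels chosen for this sketch.
STRATEGY_FOR = {
    "regression_after_deploy": "Rollback",
    "feature_gated_code": "Feature flag off",
    "single_region_affected": "Traffic shift",
    "resource_exhaustion": "Scale up",
    "traffic_spike_or_abuse": "Rate limit",
    "stuck_state": "Manual intervention",
}


def pick_strategy(symptom: str, already_tried: list[str]) -> Optional[str]:
    """Return the table's strategy for a symptom, unless it already failed."""
    strategy = STRATEGY_FOR.get(symptom)
    if strategy is None or strategy in already_tried:
        return None
    return strategy


def containment_line(strategy: Optional[str], tried: list[str], now: datetime) -> str:
    """Build the CONTAINED or CONTAINMENT_FAILED line.

    Call this with a strategy only after watchdog has confirmed stability.
    """
    if strategy is None:
        attempted = ", ".join(tried) or "nothing applicable"
        return f"CONTAINMENT_FAILED: {attempted} {EM_DASH} escalate"
    return f"CONTAINED: {strategy} {EM_DASH} {now:%H:%M}"
```

Tracking `already_tried` keeps a failed strategy from being selected again when containment has to be retried.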
Invoke watchdog with current base_url and critical endpoints.
Proceed to Step 4 only if watchdog returns ALL_HEALTHY or DEGRADED with upward trend.
If watchdog returns DOWN — return to Step 2 with a different containment strategy.
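The verification rule above reduces to a small decision function. This sketch assumes watchdog's result can be read as a status string plus a trend; the `trend` parameter is hypothetical, and sending a non-improving DEGRADED result back to Step 2 is an assumption, since the text only specifies the DOWN case.

```python
def next_step(status: str, trend: str = "flat") -> int:
    """Map a watchdog result to the step number to run next.

    `status` uses the values named in the text (ALL_HEALTHY, DEGRADED, DOWN).
    `trend` ("up", "flat", "down") is a hypothetical field for this sketch.
    """
    if status == "ALL_HEALTHY":
        return 4
    if status == "DEGRADED" and trend == "up":
        return 4
    # DOWN returns to containment. Assumption: DEGRADED without an upward
    # trend is treated the same way.
    return 2
```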
Invoke sentinel to check whether the incident has a security dimension (data exposure, unauthorized access).
If sentinel returns BLOCK: escalate to security incident — different protocol (notify security team, preserve logs, document access chain).
If sentinel returns PASS or WARN: continue to root cause.
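The branching on sentinel's verdict might look like the following. The function name and return shape are hypothetical; only the three verdicts and the security protocol items come from the text.

```python
# Actions listed for the security-incident protocol.
SECURITY_PROTOCOL = (
    "notify security team",
    "preserve logs",
    "document access chain",
)


def route_after_sentinel(verdict: str) -> tuple[str, tuple[str, ...]]:
    """Return the next phase and any required actions for a sentinel verdict."""
    if verdict == "BLOCK":
        return ("security_incident", SECURITY_PROTOCOL)
    if verdict in ("PASS", "WARN"):
        return ("root_cause", ())
    # Fail loudly on anything else rather than silently continuing.
    raise ValueError(f"unexpected sentinel verdict: {verdict!r}")
```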
Invoke autopsy with the incident context.
autopsy returns: root cause hypothesis with evidence, affected code paths, contributing factors.
Do not attempt fixes — incident only investigates. Any code changes are a separate task.
Construct the incident timeline in this format:
[HH:MM] Incident detected — [who/what detected it]
[HH:MM] Triage: [P1/P2/P3] — [impact]
[HH:MM] Containment started — [strategy]
[HH:MM] CONTAINED — [watchdog confirms stable]
[HH:MM] RCA: [root cause summary]
[HH:MM] Resolution: [what was done]
Invoke journal to record the timeline and decisions in .rune/adr/ as an incident ADR.
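A minimal sketch of rendering the `[HH:MM]` timeline format from a list of timestamped events. The `render_timeline` name and the tuple-based event shape are assumptions for illustration; where the events come from is not specified here.

```python
from datetime import datetime


def render_timeline(events: list[tuple[datetime, str]]) -> str:
    """Sort events chronologically and format each as `[HH:MM] text`."""
    ordered = sorted(events, key=lambda event: event[0])
    return "\n".join(f"[{ts:%H:%M}] {text}" for ts, text in ordered)
```

Sorting before rendering means events can be recorded in whatever order they are discovered, which is common when detection details surface only after containment.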
Generate postmortem report and save as .rune/incidents/INCIDENT-[YYYY-MM-DD]-[slug].md:
# Incident Report: [title]
**Severity**: [P1|P2|P3]
**Date**: [YYYY-MM-DD]
**Duration**: [time from detection to resolution]
**Impact**: [users affected, data affected, revenue impact if known]
## Timeline
[from Step 6]
## Root Cause
[from autopsy — specific, not vague]
## Contributing Factors
[from autopsy — what made this worse]
## What Went Well
[containment speed, detection, communication]
## What Went Wrong
[detection lag, failed first containment, etc.]
## Prevention Actions
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| [specific action] | [team/person] | [date] | P1/P2/P3 |
## Lessons Learned
[3-5 bullet points]
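The postmortem save path follows the pattern `.rune/incidents/INCIDENT-[YYYY-MM-DD]-[slug].md`. One way to build it is sketched below; the slug rule (lowercase, runs of non-alphanumeric characters collapsed to a hyphen) is an assumption, as the source does not define how the slug is derived.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def incident_path(day: date, title: str) -> PurePosixPath:
    """Build the postmortem file path for an incident."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return PurePosixPath(".rune/incidents") / f"INCIDENT-{day.isoformat()}-{slug}.md"
```

With `date(2026, 2, 24)` and the title `"Login 503"`, this yields `.rune/incidents/INCIDENT-2026-02-24-login-503.md`.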
## Incident Response: [title]
### Triage
P2 — Login service returning 503 for ~30% of users
### Containment
Strategy: Rollback to commit abc123 (pre-deploy from 14:32)
Status: CONTAINED at 15:07 — watchdog confirms ALL_HEALTHY
### Security Check
sentinel: PASS — no data exposure detected
### Root Cause (from autopsy)
Connection pool exhausted — new feature added synchronous DB call in middleware,
reducing available connections from 20 to 3 under load
File: src/middleware/auth.ts:47
### Timeline
14:32 Deploy completed
14:45 Alerts fired — 503 rate >1%
14:47 TRIAGE: P2
14:52 Containment: rollback initiated
15:07 CONTAINED
15:20 RCA complete
15:35 Postmortem drafted
### Postmortem saved
.rune/incidents/INCIDENT-2026-02-24-login-503.md
| Gate | Requires | If Missing |
|---|---|---|
| Triage Gate | Severity classified (P1/P2/P3) before any other step | Classify before proceeding |
| Containment Gate | watchdog confirms ALL_HEALTHY or DEGRADED with upward trend before RCA | Return to containment if still DOWN |
| Security Gate | sentinel ran before closing incident | Run sentinel — do not skip |
| Postmortem Gate | All sections populated (Timeline, RCA, Prevention Actions) before status = Resolved | Complete or note as DRAFT |
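The Postmortem Gate can be approximated by checking that the required sections of the report contain real content. This is a sketch: the section names are taken from the report template ("Root Cause" standing in for RCA), and treating a lone `[placeholder]` line as unpopulated is an assumption about what "populated" means.

```python
import re

REQUIRED_SECTIONS = ("Timeline", "Root Cause", "Prevention Actions")


def postmortem_status(markdown: str) -> str:
    """Return "Resolved" if every required section has content, else "DRAFT"."""
    # re.split with a capture group yields [preamble, heading, body, heading, body, ...].
    parts = re.split(r"^## +(.+)$", markdown, flags=re.MULTILINE)
    sections = {
        parts[i].strip(): parts[i + 1].strip()
        for i in range(1, len(parts) - 1, 2)
    }
    for name in REQUIRED_SECTIONS:
        body = sections.get(name, "")
        # Assumption: an empty body or a single unfilled "[...]" placeholder
        # counts as missing.
        if not body or re.fullmatch(r"\[[^\[\]]*\]", body):
            return "DRAFT"
    return "Resolved"
```

Note that this does not inspect the rows of the Prevention Actions table, so placeholder owners or dates inside the table would still pass.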
| Artifact | Format | Location |
|---|---|---|
| Incident response report | Markdown | inline (chat output) |
| Incident timeline | Text (HH:MM format) | inline + postmortem |
| Postmortem document | Markdown | .rune/incidents/INCIDENT-<date>-<slug>.md |
| Prevention actions table | Markdown table | postmortem |
| Journal entry (incident ADR) | Text | .rune/adr/ (via rune:journal) |
Known failure modes for this skill. Check these before declaring done.
| Failure Mode | Severity | Mitigation |
|---|---|---|
| Starting RCA before containment confirmed | CRITICAL | HARD-GATE: check CONTAINED status before calling autopsy |
| Declaring incident resolved without watchdog verification | HIGH | MUST call watchdog after containment — not just assume |
| Postmortem Prevention Actions without owners or dates | MEDIUM | Every action needs owner + due date — otherwise it never happens |
| Skipping sentinel because "looks like a performance issue" | HIGH | Security dimension is not always obvious — always run sentinel |
| P1 triage without 15-minute containment urgency | HIGH | P1 SLA = 15 min to contain — flag if containment exceeds threshold |
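The last mitigation, flagging containment that exceeds its threshold, comes down to simple time arithmetic. The sketch below assumes the clock starts at detection, which the source does not state explicitly; `containment_overrun` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# "Contain Within" limits from the severity matrix.
SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
}


def containment_overrun(severity: str, detected: datetime, contained: datetime) -> timedelta:
    """Return how far containment ran past the limit (zero if it was met)."""
    elapsed = contained - detected
    return max(elapsed - SLA[severity], timedelta(0))
```

Using the example timeline (alerts at 14:45, contained at 15:07), containment took 22 minutes: within the P2 limit, but 7 minutes over had the incident been triaged as P1.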
~3000-8000 tokens input, ~1000-2500 tokens output. Sonnet for response coordination.