From rampstack-skills
Manages active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and severity levels. Triggers on outage, P0/P1, downtime, on-call, service down.
npx claudepluginhub rampstackco/claude-skills --plugin rampstack-skills
Slash command
/rampstack-skills:incident-response
Manage active production incidents from detection to resolution. Stack-agnostic. Tool-agnostic.
This skill is for active incidents and incident process. For after-the-fact analysis, use after-action-report. For planned launches, use launch-runbook.
Related skills: after-action-report, launch-runbook, qa-testing.
How the incident becomes known.
Detection sources:
On detection:
Establish severity and impact.
Severity rubric:
| Severity | Definition | Response |
|---|---|---|
| SEV-1 (Critical) | Major customer-facing functionality broken. Data integrity at risk. Security breach. | All-hands. Incident commander. Active war room. Public communication required. |
| SEV-2 (Major) | Significant degradation. Some customers affected. Revenue impact. | Incident commander assigned. Active response. Internal communication. May or may not need public communication. |
| SEV-3 (Minor) | Limited impact. Workaround available. Affecting a small group of users. | Standard on-call response. Single owner. |
| SEV-4 (Low) | Cosmetic, edge-case, or low-frequency. No urgent action needed. | Tracked as bug. Addressed in normal queue. |
Severity can change. Re-evaluate as more info emerges.
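The rubric above can be sketched as a small classifier. This is a minimal illustration, not part of the skill itself: the `Impact` fields and the `classify` function are hypothetical names chosen to mirror the table's wording, and a real team would tune the checks to its own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # Major
    SEV3 = 3  # Minor
    SEV4 = 4  # Low


@dataclass
class Impact:
    # Hypothetical triage inputs, named after the rubric's definitions.
    major_functionality_broken: bool = False
    data_integrity_at_risk: bool = False
    security_breach: bool = False
    significant_degradation: bool = False
    revenue_impact: bool = False
    workaround_available: bool = False
    small_user_group_only: bool = False


def classify(impact: Impact) -> Severity:
    """Return the most severe level whose definition matches the impact."""
    if (impact.major_functionality_broken
            or impact.data_integrity_at_risk
            or impact.security_breach):
        return Severity.SEV1
    if impact.significant_degradation or impact.revenue_impact:
        return Severity.SEV2
    if impact.workaround_available or impact.small_user_group_only:
        return Severity.SEV3
    return Severity.SEV4
```

Because severity can change, the same function can be called again whenever the known impact changes. For example, an incident first classified from `Impact(workaround_available=True)` would move from SEV-3 to SEV-1 once `data_integrity_at_risk=True` is established.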
Stop the bleeding before fixing the cause.
Mitigation patterns (faster than full fix):
Mitigation principle: Stop user impact first. Cause analysis second.
Three audiences during an incident:
Internal team:
Internal stakeholders:
External / customers:
Communication principles:
Verified fix, customers restored, incident closed.
Resolution criteria:
After closure:
| Role | Responsibility |
|---|---|
| Incident commander (IC) | Owns the response. Calls decisions. Assigns work. Not necessarily the most technical person; needs to coordinate. |
| Communications lead | Owns internal and external messaging. Reduces IC's communication burden. |
| Operations lead | Drives the technical investigation and mitigation. Often the most senior on-call engineer. |
| Scribe | Captures the timeline as the incident unfolds. Critical for AAR. |
| Subject matter experts | Pulled in as needed. Service owners, database experts, security experts. |
For small teams or low-severity incidents, one person can hold multiple roles. Each role's responsibilities should still be explicit.
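One way to keep roles explicit on a small team is to check that every core role has a named holder, even when one person holds several. The sketch below assumes a plain dictionary of role to person; the role keys and the `unassigned_roles` helper are hypothetical. Subject matter experts are left out because they are pulled in as needed.

```python
# Core roles from the table above; the key names are illustrative.
CORE_ROLES = (
    "incident_commander",
    "communications_lead",
    "operations_lead",
    "scribe",
)


def unassigned_roles(assignments: dict[str, str]) -> list[str]:
    """Return the core roles that have no named holder yet."""
    return [role for role in CORE_ROLES if not assignments.get(role)]


# One person may hold several roles; each role still needs a name next to it.
assignments = {
    "incident_commander": "alex",
    "communications_lead": "alex",
    "operations_lead": "sam",
}
missing = unassigned_roles(assignments)  # the scribe role is still open here
```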
The IC's authority:
Non-decisions to avoid:
When in doubt: act. A wrong action that can be rolled back beats inaction while users suffer.
Initial:
"We are investigating reports of [issue]. Updates to follow."
Identified:
"We have identified the issue affecting [scope]. Engineers are working on a fix. Next update by [time]."
Monitoring:
"A fix has been applied. We are monitoring to confirm resolution. Next update by [time]."
Resolved:
"This incident has been resolved. Service has been restored. A full incident report will be posted within [timeframe]."
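The four status messages above lend themselves to simple string templates. The following is a rough sketch under stated assumptions: the stage keys, the placeholder names, and the `render_status` function are hypothetical, and the bracketed slots from the templates are mapped to Python format fields.

```python
# Status page templates, with the bracketed slots turned into format fields.
STATUS_TEMPLATES = {
    "initial": "We are investigating reports of {issue}. Updates to follow.",
    "identified": (
        "We have identified the issue affecting {scope}. "
        "Engineers are working on a fix. Next update by {time}."
    ),
    "monitoring": (
        "A fix has been applied. We are monitoring to confirm resolution. "
        "Next update by {time}."
    ),
    "resolved": (
        "This incident has been resolved. Service has been restored. "
        "A full incident report will be posted within {timeframe}."
    ),
}


def render_status(stage: str, **fields: str) -> str:
    """Fill in the template for a stage; fail loudly if a slot is missing."""
    template = STATUS_TEMPLATES[stage]
    try:
        return template.format(**fields)
    except KeyError as missing:
        raise ValueError(f"stage {stage!r} needs field {missing}") from None
```

Calling `render_status("monitoring", time="14:30 UTC")` fills the "Next update by" slot. Raising on a missing field keeps a half-filled message such as "Next update by [time]" from reaching the status page.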
Patterns to avoid:
During an active incident: incident channel updates and status page updates as per the framework above.
After incident close: a brief incident summary feeding into the AAR.
# Incident: [Brief title]
**Date:** [YYYY-MM-DD]
**Severity:** [SEV-1 / 2 / 3 / 4]
**Duration:** [Detection to resolution]
**Customer impact:** [Who, how many, how]
## Summary
[1 to 2 paragraphs]
## Timeline
[Timestamped events]
## Mitigation
[What was done]
## Action items
[Follow-ups, with owners]
## AAR scheduled for
[Date]
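If the scribe's notes are captured as structured data, the summary above can be generated rather than typed by hand. This is a minimal sketch: the `IncidentSummary` class and its field names are hypothetical, and the bullet formatting of timeline entries and action items is an assumption, since the template only specifies the headings.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentSummary:
    title: str
    date: str             # YYYY-MM-DD
    severity: str         # e.g. "SEV-2"
    duration: str         # detection to resolution
    customer_impact: str  # who, how many, how
    summary: str
    mitigation: str
    aar_date: str
    timeline: list[tuple[str, str]] = field(default_factory=list)      # (timestamp, event)
    action_items: list[tuple[str, str]] = field(default_factory=list)  # (follow-up, owner)

    def to_markdown(self) -> str:
        """Render the summary using the headings from the template above."""
        timeline = "\n".join(f"- {ts}: {event}" for ts, event in self.timeline)
        actions = "\n".join(
            f"- {item} (owner: {owner})" for item, owner in self.action_items
        )
        lines = [
            f"# Incident: {self.title}",
            f"**Date:** {self.date}",
            f"**Severity:** {self.severity}",
            f"**Duration:** {self.duration}",
            f"**Customer impact:** {self.customer_impact}",
            "## Summary",
            self.summary,
            "## Timeline",
            timeline,
            "## Mitigation",
            self.mitigation,
            "## Action items",
            actions,
            "## AAR scheduled for",
            self.aar_date,
        ]
        return "\n".join(lines)
```

Keeping the timeline as timestamped pairs means the scribe's running log can feed the summary and, later, the AAR without retyping.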
references/incident-playbook.md - Severity definitions, roles, status page templates, decision rubrics.