From devops
Guide incident response — detect, assess, mitigate, root cause, prevent recurrence.
Install: `npx claudepluginhub hpsgd/turtlestack --plugin devops`
Respond to $ARGUMENTS.
The cardinal rule: mitigate first, root-cause second. The goal is to stop the bleeding before you diagnose the disease. Never spend 30 minutes investigating while users are down.
Gather the symptoms (2 minutes max):
Classify severity:
| Severity | Criteria | Response time | Communication cadence |
|---|---|---|---|
| SEV-1 (Critical) | Service down, data loss, security breach, revenue impact | Immediate | Every 15 minutes |
| SEV-2 (High) | Major feature degraded, affecting many users, no workaround | < 30 min | Every 30 minutes |
| SEV-3 (Medium) | Feature degraded, workaround exists, limited user impact | < 2 hours | Every 2 hours |
| SEV-4 (Low) | Minor issue, cosmetic, single user affected | Next business day | Resolution only |
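As a sketch, the response-time and cadence rules in the table can be encoded in a small lookup helper (a hypothetical shell function; the `response=`/`cadence=` strings are illustrative, not a real tool's output):

```shell
#!/bin/sh
# Hypothetical helper: look up the required response time and
# communication cadence for a severity level, per the table above.
sev_policy() {
  case "$1" in
    SEV-1) echo "response=immediate cadence=15m" ;;
    SEV-2) echo "response=30m cadence=30m" ;;
    SEV-3) echo "response=2h cadence=2h" ;;
    SEV-4) echo "response=next-business-day cadence=resolution-only" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

sev_policy SEV-2   # → response=30m cadence=30m
```

Wiring something like this into paging or chat tooling keeps the cadence from depending on anyone's memory mid-incident.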
Rules:
Answer these questions before taking action:
Build a timeline (MANDATORY):
HH:MM UTC — [Event] — [Source of information]
14:23 UTC — Error rate spike to 15% (normal: <1%) — Datadog alert
14:25 UTC — Deployment abc123 completed — GitHub Actions
14:27 UTC — First customer report via support — Zendesk ticket #4521
The timeline is the single most important artifact. Update it continuously.
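Continuous updates are easiest with a tiny append helper. A minimal sketch, assuming the entry format shown above; the `incident-timeline.md` file name is an assumption, not a convention:

```shell
#!/bin/sh
# Sketch: append a UTC-stamped entry to the incident timeline file.
# The file name is an assumption; point it at your incident doc.
TIMELINE="${TIMELINE:-incident-timeline.md}"

log_event() {
  # $1 = event description, $2 = source of information
  printf '%s UTC — %s — %s\n' "$(date -u +%H:%M)" "$1" "$2" >> "$TIMELINE"
}

log_event "Error rate spike to 15%" "Datadog alert"
```

Logging the source alongside each event matters later: the post-mortem's Detection section depends on knowing how you knew.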
Choose the fastest path to reduce user impact. Speed over elegance.
Mitigation options (in order of preference):
| Option | Speed | Risk | When to use |
|---|---|---|---|
| Feature flag off | Seconds | Low | Feature is behind a flag |
| Rollback deployment | 1-5 min | Low | Recent deployment is the likely cause |
| Scale up/out | 1-5 min | Low | Load-related, capacity issue |
| Traffic redirect | 1-5 min | Medium | Regional issue, failover available |
| Configuration change | 1-10 min | Medium | Bad config deployed |
| Hotfix deploy | 10-30 min | Higher | Root cause identified and fix is small |
| Service isolation | 1-5 min | Medium | Cascade prevention, circuit breaker |
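The preference order can be sketched as a decision helper. This is a deliberately oversimplified illustration (the yes/no inputs and output strings are hypothetical), but it captures the idea: check the fastest, lowest-risk options first:

```shell
#!/bin/sh
# Sketch: pick the fastest plausible mitigation from a few yes/no facts,
# following the preference order in the table above.
pick_mitigation() {
  behind_flag="$1"; recent_deploy="$2"; load_related="$3"
  if   [ "$behind_flag" = yes ];   then echo "feature flag off"
  elif [ "$recent_deploy" = yes ]; then echo "rollback deployment"
  elif [ "$load_related" = yes ];  then echo "scale up/out"
  else echo "assess: traffic redirect, config change, hotfix, or isolation"
  fi
}

pick_mitigation no yes no   # → rollback deployment
```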
Rules:
Only after mitigation is confirmed effective:
1. **Check the change log** — recent deployments, config changes, dependency updates, infrastructure changes: `git log --oneline --since="2 hours ago"`
2. **Correlate with the timeline** — what changed within 30 minutes before the incident started?
3. **Trace the failure path:**
4. **Form a hypothesis** — specific and falsifiable:
5. **Test the hypothesis with one change** — confirm or refute before moving to the next.
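The change-window correlation above can be scripted: take the incident start from the timeline, compute the 30-minute lookback, and build the matching `git log` query. A sketch, assuming GNU `date`; the start time is an example value:

```shell
#!/bin/sh
# Sketch: build a `git log` query for the 30 minutes before the incident.
# INCIDENT_START comes from your timeline (example value; GNU date assumed).
INCIDENT_START="2024-05-01T14:23:00Z"
WINDOW_START=$(date -u -d "$INCIDENT_START 30 minutes ago" +%Y-%m-%dT%H:%M:%SZ)

echo "git log --oneline --since=$WINDOW_START --until=$INCIDENT_START"
```

Run the printed command in the affected service's repository; anything that landed inside that window is a prime suspect.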
Identify contributing factors beyond the root cause:
For every root cause, define concrete prevention measures:
| Prevention type | Example | Timeline |
|---|---|---|
| Immediate | Add missing validation, fix the bug | This sprint |
| Short-term | Add test, add monitoring alert, add circuit breaker | Next sprint |
| Long-term | Architecture change, process improvement, training | Next quarter |
Rules:
Who to notify (by severity):
| Severity | Notify | Channel |
|---|---|---|
| SEV-1 | Engineering lead, product lead, support lead, affected customers | Incident Slack channel + status page |
| SEV-2 | Engineering lead, product lead | Incident Slack channel |
| SEV-3 | Team lead | Team Slack channel |
| SEV-4 | Log for next standup | None |
Status update template:
**Incident: [title]**
**Severity:** SEV-[1/2/3/4]
**Status:** [Investigating / Mitigating / Monitoring / Resolved]
**Impact:** [who is affected and how]
**Current action:** [what is being done right now]
**Next update:** [time]
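For repeatable updates under time pressure, the template can be filled from the shell. A sketch; the field order and wording mirror the template above, and the example values are illustrative:

```shell
#!/bin/sh
# Sketch: render the status-update template from positional fields.
status_update() {
  # $1 title  $2 sev level  $3 status  $4 impact  $5 current action  $6 next update
  printf '**Incident: %s**\n**Severity:** SEV-%s\n**Status:** %s\n**Impact:** %s\n**Current action:** %s\n**Next update:** %s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

status_update "Checkout errors" 1 "Mitigating" \
  "15% of checkout attempts failing" "Rolling back the 14:25 deploy" "15:00 UTC"
```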
Rules:
# Post-Mortem: [Incident Title]
**Date:** [date]
**Duration:** [start time] — [end time] ([total duration])
**Severity:** SEV-[level]
**Author:** [name]
**Reviewers:** [names]
## Summary
[2-3 sentences: what happened, who was affected, what was the impact]
## Timeline
| Time (UTC) | Event | Source |
|---|---|---|
| [HH:MM] | [what happened] | [how we know] |
## Impact
- **Users affected:** [number or percentage]
- **Duration of impact:** [time]
- **Data impact:** [none / lost / corrupted / exposed]
- **Financial impact:** [if any]
- **SLA impact:** [if any]
## Root Cause
[Detailed technical explanation. Not "human error" — what system allowed this to happen?]
## Contributing Factors
- [Factor 1 — why it made the incident worse or harder to detect]
- [Factor 2]
## Resolution
[What was done to resolve the incident — both mitigation and permanent fix]
## Detection
- **How was it detected?** [alert / customer report / internal discovery]
- **Time to detect:** [minutes from start to detection]
- **Could we have detected it sooner?** [yes/no — how?]
## Action Items
| # | Action | Type | Owner | Deadline | Status |
|---|---|---|---|---|---|
| 1 | [action] | Prevent / Detect / Mitigate | [name] | [date] | TODO |
| 2 | [action] | Prevent / Detect / Mitigate | [name] | [date] | TODO |
## Lessons Learned
- **What went well:** [things that helped during the response]
- **What went poorly:** [things that hindered the response]
- **Where we got lucky:** [things that could have made it worse]
Deliver:
Use the runbook template (templates/runbook.md) for creating runbooks referenced during incidents.