From vibeworks-library
Incident response procedures — triage, communication, investigation, mitigation, and post-incident review. Use when handling production incidents or writing runbooks.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vibeworks-library:skills/incident-responseThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.
Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders |
|---|---|---|---|---|---|---|
| P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification |
| P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification |
| P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer |
| P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |
When an incident is detected (alert, user report, or monitoring), follow this triage sequence:
Answer these questions within the first 5 minutes:
Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.
| Severity | Who to Notify |
|---|---|
| P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications |
| P1 | On-call engineer, engineering manager, customer support lead |
| P2 | On-call engineer, relevant team lead |
| P3 | Assigned engineer (via ticket) |
For P0 and P1 incidents, explicitly assign these roles:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. |
| Technical Lead | Leads investigation and mitigation. Communicates findings to IC. |
| Communications Lead | Drafts and sends status updates to stakeholders and customers. |
| Scribe | Documents timeline, actions taken, and decisions in real time. |
Create a dedicated communication channel (Slack channel, incident bridge call):
#incident-2024-03-15-payments-downINCIDENT DECLARED: [P0/P1/P2] - [Brief description]
Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]
Next update in [15/30] minutes.
INCIDENT UPDATE: [P0/P1/P2] - [Brief description]
Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]
Next update in [15/30] minutes.
INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].
Monitoring for recurrence.
[Service Name] Status Update
We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.
Started: [Time, timezone]
Current status: [Brief, non-technical description]
We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.
Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.
Start with the service overview dashboard. Look for:
Most incidents are caused by recent changes. Check:
# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main
# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
Questions to answer:
Search logs filtered to the incident timeframe:
# Example queries (adapt to your log aggregation platform)
# ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog: service:order-service status:error
# CloudWatch: filter @message like /ERROR/ | sort @timestamp desc
Look for:
Based on data gathered, form a hypothesis and test it:
| Hypothesis | How to Test |
|---|---|
| Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. |
| Database is overloaded | Check connection pool, slow query log, lock contention. |
| External dependency is down | Check dependency status page, test connectivity, check timeout rates. |
| Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. |
| DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. |
| Memory leak | Check memory usage trend, look for OOM kills in system logs. |
| Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
After applying a fix:
When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:
| Action | When to Use | How | Risk |
|---|---|---|---|
| Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush application cache (Redis FLUSHDB, CDN purge) | Temporary increase in origin load after flush |
| Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |
# Verify the last known-good version
git log --oneline -10 origin/main
# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"
# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service
# Verify rollback is deployed
kubectl rollout status deployment/order-service
# Monitor error rate and confirm reduction
Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.
# Post-Incident Review: [Incident Title]
**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]
## Summary
[2-3 sentences: what happened, the impact, and the resolution.]
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |
## Impact
Users affected, duration, revenue/SLA impact, data impact.
## Root Cause
[Detailed technical explanation of what went wrong and why.]
## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.
## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |
## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:
| Principle | Practice |
|---|---|
| Assume good intent | People made the best decisions they could with the information they had at the time. |
| Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" |
| Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. |
| Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. |
| Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. |
| Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |
A runbook is a step-by-step guide for responding to a specific alert or operational scenario.
Every runbook follows this structure:
# Runbook: [Alert or Scenario Name]
## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]
## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]
## Steps
### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
--data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.
# Option A: Restart
kubectl rollout restart deployment/order-service
# Option B: Rollback
kubectl rollout undo deployment/order-service
Expected: Error rate drops below 0.01 within 5 minutes.
If unresolved: escalate to [team], contact [channel/phone], provide [context].
### Runbook Best Practices
| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |
## Incident Response Checklist
Quick reference during an active incident:
- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.
npx claudepluginhub Claude-Code-Community-Ireland/claude-code-resources --plugin vibeworks-libraryExecute structured live incident response: declare severity, assign roles, mitigate, communicate, resolve, and run blameless postmortems for production incidents.
Manages active production incidents through detection, triage, mitigation, communication, and resolution with structured roles and severity levels. Triggers on outage, P0/P1, downtime, on-call, service down.
Guides incident management with lifecycle stages, severity levels, roles, metrics like MTTR. Use for runbooks, on-call rotations, postmortems.