Structured workflow for production incident management following SRE best practices. Covers incident declaration, triage, coordination, resolution, and post-mortem.
`/plugin marketplace add lerianstudio/ring`
`/plugin install ring-ops-team@ring`
This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill defines the structured process for handling production incidents. It MUST be followed for all SEV1, SEV2, and SEV3 incidents.
See shared-patterns/incident-severity.md for severity definitions.
| Phase | Focus | Owner |
|---|---|---|
| 1. Detection | Identify and confirm incident | Monitoring/On-call |
| 2. Declaration | Assess severity, declare incident | Incident Commander |
| 3. Triage | Identify impact and initial hypothesis | Response Team |
| 4. Mitigation | Restore service, implement workaround | Engineering Team |
| 5. Resolution | Permanent fix, verification | Engineering Team |
| 6. Post-Incident | RCA, action items, documentation | Incident Commander |
Trigger: Alert fires or user report received.
Owner: First responder declares incident, assigns severity.
| Criteria | SEV1 | SEV2 | SEV3 |
|---|---|---|---|
| Complete outage | X | | |
| Data loss risk | X | | |
| >50% users affected | X | | |
| <50% users affected | | X | |
| Workaround available | | | X |
See shared-patterns/incident-severity.md for complete definitions.
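The criteria table can be read as a small decision function. The sketch below is illustrative only: `IncidentSignals` and `classify_severity` are hypothetical names, and the row-to-severity mapping (outage, data-loss risk, or more than 50% of users means SEV1; partial impact means SEV2; partial impact with a workaround means SEV3) is an assumed reading of the table. The authoritative definitions remain in shared-patterns/incident-severity.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignals:
    """Hypothetical container for the triage inputs in the criteria table."""
    complete_outage: bool = False
    data_loss_risk: bool = False
    users_affected_pct: float = 0.0  # percentage of users, 0-100
    workaround_available: bool = False


def classify_severity(signals: IncidentSignals) -> str:
    """Return "SEV1", "SEV2", or "SEV3" for the given signals."""
    # Any single SEV1 criterion is enough to declare SEV1.
    if signals.complete_outage or signals.data_loss_risk:
        return "SEV1"
    # Assumption: exactly 50% is treated as below the SEV1 threshold,
    # because the table only lists ">50%" and "<50%".
    if signals.users_affected_pct > 50:
        return "SEV1"
    # Assumption: partial impact with a workaround is SEV3, otherwise SEV2.
    if signals.workaround_available:
        return "SEV3"
    return "SEV2"
```

When signals conflict, this sketch lets the most severe criterion win, so a workaround never downgrades an outage.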
Create incident channel (if SEV1/SEV2):
`#incident-YYYY-MM-DD-brief-description`
Assign Incident Commander (IC):
Update status page (if customer-facing):
**INCIDENT DECLARED**
**Severity:** SEV[1/2/3]
**Title:** [Brief description]
**Incident Commander:** @[name]
**Channel:** #incident-[date]-[slug]
**Impact:**
- Services affected: [list]
- Users affected: [count/percentage]
- Started: [timestamp UTC]
**Current Status:**
[Brief description of current state]
**Next Update:** [timestamp]
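The channel naming convention used in the declaration (`#incident-YYYY-MM-DD-brief-description`) is easy to generate consistently. A minimal sketch follows; the helper name `incident_channel` and the slug rules (lowercase, runs of non-alphanumeric characters collapsed to a hyphen) are assumptions, not something the skill specifies.

```python
import re
from datetime import datetime, timezone


def incident_channel(started: datetime, description: str) -> str:
    """Build a channel name of the form #incident-YYYY-MM-DD-brief-description."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{started:%Y-%m-%d}-{slug}"


if __name__ == "__main__":
    started = datetime(2024, 3, 5, 14, 2, tzinfo=timezone.utc)
    print(incident_channel(started, "Checkout API 5xx spike"))
```

Deriving the name from the incident start time and title keeps the **Channel** field of the declaration message reproducible across responders.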
Owner: Incident Commander coordinates, engineering investigates.
Update frequency by severity:
| Severity | Internal Update | External Update |
|---|---|---|
| SEV1 | Every 10 min | Every 15 min |
| SEV2 | Every 15 min | Every 30 min |
| SEV3 | Every 30 min | As needed |
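To fill in the **Next Update** field, the cadence table can be encoded directly. This is a sketch under stated assumptions: `UPDATE_INTERVALS` and `next_update` are hypothetical names, and "As needed" is represented as `None` (no fixed deadline).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minutes between updates, copied from the table above.
# None stands for "As needed" (no fixed cadence).
UPDATE_INTERVALS = {
    "SEV1": {"internal": 10, "external": 15},
    "SEV2": {"internal": 15, "external": 30},
    "SEV3": {"internal": 30, "external": None},
}


def next_update(severity: str, audience: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if there is no fixed cadence."""
    minutes = UPDATE_INTERVALS[severity][audience]
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


if __name__ == "__main__":
    last = datetime(2024, 3, 5, 14, 10, tzinfo=timezone.utc)
    print(next_update("SEV1", "internal", last))
```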
Owner: Engineering implements fix, IC coordinates.
**MITIGATION IN PROGRESS**
**Action:** [description]
**Owner:** @[name]
**Started:** [timestamp]
**Verification:**
- [ ] [criterion 1]
- [ ] [criterion 2]
**Rollback Plan:**
[If mitigation fails, do X]
Owner: Engineering confirms fix, IC verifies resolution.
ALL must be true before marking resolved:
**INCIDENT RESOLVED**
**Duration:** [X hours Y minutes]
**Resolution Time:** [timestamp UTC]
**Root Cause:**
[Brief description of what caused the incident]
**Fix Applied:**
[What was done to resolve]
**Next Steps:**
- [ ] RCA scheduled for [date]
- [ ] Action items tracked in [location]
**Retrospective:** [date/time]
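The **Duration** field uses an "X hours Y minutes" format, which can be derived from the start and resolution timestamps. The helper below is a sketch; `format_duration` is a hypothetical name, and truncating leftover seconds (rather than rounding) is an assumption.

```python
from datetime import datetime, timezone


def format_duration(started: datetime, resolved: datetime) -> str:
    """Format an incident duration as 'X hours Y minutes'."""
    if resolved < started:
        raise ValueError("resolved must not be earlier than started")
    # Assumption: leftover seconds are dropped, not rounded up.
    total_minutes = int((resolved - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


if __name__ == "__main__":
    started = datetime(2024, 3, 5, 14, 2, tzinfo=timezone.utc)
    resolved = datetime(2024, 3, 5, 16, 47, tzinfo=timezone.utc)
    print(format_duration(started, resolved))
```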
Owner: Incident Commander schedules RCA, tracks action items.
| Severity | RCA Required | Timeline |
|---|---|---|
| SEV1 | MANDATORY | 48 hours |
| SEV2 | MANDATORY | 1 week |
| SEV3 | Optional | 2 weeks |
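The RCA timelines can be turned into concrete due dates. In this sketch, `RCA_POLICY` and `rca_deadline` are hypothetical names, and the clock is assumed to start at the resolution timestamp; the table does not say whether the window is measured from declaration or from resolution.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# (RCA mandatory?, time allowed), copied from the table above.
RCA_POLICY = {
    "SEV1": (True, timedelta(hours=48)),
    "SEV2": (True, timedelta(weeks=1)),
    "SEV3": (False, timedelta(weeks=2)),
}


def rca_deadline(severity: str, resolved_at: datetime) -> Tuple[bool, datetime]:
    """Return (mandatory, due_by) for the RCA of an incident."""
    mandatory, window = RCA_POLICY[severity]
    # Assumption: the window is counted from the resolution time.
    return mandatory, resolved_at + window


if __name__ == "__main__":
    resolved = datetime(2024, 3, 5, 16, 47, tzinfo=timezone.utc)
    print(rca_deadline("SEV1", resolved))
```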
# Incident Post-Mortem: [Title]
**Incident ID:** INC-YYYY-NNNN
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV[1/2/3]
**Author:** @[incident commander]
## Summary
[2-3 sentence summary of what happened]
## Impact
- **Users Affected:** [count/percentage]
- **Revenue Impact:** [if applicable]
- **SLA Impact:** [if applicable]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | [event] |
## Root Cause
[Technical description of the root cause]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## What Went Well
1. [Item 1]
2. [Item 2]
## What Could Be Improved
1. [Item 1]
2. [Item 2]
## Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|
| [action] | @[name] | YYYY-MM-DD | Open |
## Lessons Learned
[Key takeaways for the team]
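If events are logged with timestamps while the incident is in progress, the Timeline table of the post-mortem can be generated rather than reconstructed from memory. A minimal sketch, assuming events are kept as `(timezone-aware datetime, text)` pairs; `render_timeline` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def render_timeline(events: Iterable[Tuple[datetime, str]]) -> str:
    """Render events as the post-mortem Timeline table, sorted by time."""
    lines = ["| Time (UTC) | Event |", "|------------|-------|"]
    for when, what in sorted(events, key=lambda e: e[0]):
        # Timestamps must be timezone-aware so they convert to UTC correctly.
        utc = when.astimezone(timezone.utc)
        lines.append(f"| {utc:%H:%M} | {what} |")
    return "\n".join(lines)


if __name__ == "__main__":
    brt = timezone(timedelta(hours=-3))  # example non-UTC source timezone
    print(render_timeline([
        (datetime(2024, 3, 5, 14, 10, tzinfo=timezone.utc), "Incident declared"),
        (datetime(2024, 3, 5, 11, 2, tzinfo=brt), "Alert fired"),
    ]))
```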
| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "Document later, fix first" | Memory fades in hours | Document AS you fix |
| "Small incident, skip RCA" | Small incidents reveal systemic issues | RCA for SEV1/SEV2 minimum |
| "Root cause is obvious" | Obvious != correct | Investigate with data |
| "Skip verification period" | Premature resolution = reopen | Wait full verification period |
| User Says | Your Response |
|---|---|
| "Mark resolved now, verify later" | "Cannot mark resolved until verification complete. This prevents reopened incidents." |
| "Skip the RCA, we know what happened" | "RCA is mandatory for this severity. Schedule within required timeline." |
| "No time for documentation" | "Real-time documentation takes 30 seconds per event. Memory loss causes worse rework." |
For complex incidents, dispatch the incident-responder agent:
Task tool:
  subagent_type: "incident-responder"
  model: "opus"
  prompt: |
    INCIDENT: [description]
    SEVERITY: SEV[X]
    CURRENT STATUS: [state]
    REQUEST: [specific assistance needed]
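The four prompt fields can be assembled programmatically so that none is forgotten under pressure. This sketch only builds the prompt text; `build_responder_prompt` is a hypothetical helper and does not call the Task tool itself.

```python
VALID_SEVERITIES = {"SEV1", "SEV2", "SEV3"}


def build_responder_prompt(description: str, severity: str,
                           status: str, request: str) -> str:
    """Fill in the incident-responder prompt template shown above."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return "\n".join([
        f"INCIDENT: {description}",
        f"SEVERITY: {severity}",
        f"CURRENT STATUS: {status}",
        f"REQUEST: {request}",
    ])


if __name__ == "__main__":
    print(build_responder_prompt(
        "Checkout API returning 5xx",
        "SEV2",
        "Mitigation in progress",
        "Suggest next diagnostic steps",
    ))
```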