Rapid incident classification, severity assessment, and response coordination.
Coordinates rapid incident response by assessing severity, assembling teams, and managing communication workflows.
/plugin marketplace add jmagly/aiwg
/plugin install sdlc@aiwg
This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides rapid incident response coordination. When triggered, it:
1. Gathers incident details
2. Classifies severity
3. Assembles the response team
4. Initiates the response
5. Manages communication
6. Tracks resolution
sev1:
  name: Critical
  alias: [P0, SEV1, Critical]
  criteria:
    - Complete service outage
    - Data loss or corruption
    - Security breach
    - ">50% customers affected"
    - Revenue-impacting
  response:
    response_time: 15 minutes
    update_frequency: 15 minutes
    executive_notification: immediate
    customer_communication: within 30 minutes
  escalation:
    - incident_commander: required
    - engineering_manager: required
    - vp_engineering: within 30 minutes
    - cto: within 1 hour (if unresolved)
  target_resolution: 4 hours
sev2:
  name: High
  alias: [P1, SEV2, High]
  criteria:
    - Major feature unavailable
    - Significant degradation
    - 10-50% customers affected
    - Workaround exists but painful
  response:
    response_time: 30 minutes
    update_frequency: 30 minutes
    executive_notification: within 1 hour
    customer_communication: within 2 hours (if extended)
  escalation:
    - incident_commander: required
    - engineering_manager: within 1 hour
  target_resolution: 8 hours
sev3:
  name: Medium
  alias: [P2, SEV3, Medium]
  criteria:
    - Feature partially degraded
    - "<10% customers affected"
    - Workaround available
    - Non-critical path affected
  response:
    response_time: 2 hours
    update_frequency: 2 hours
    executive_notification: daily summary
    customer_communication: as needed
  escalation:
    - team_lead: within 4 hours
  target_resolution: 24 hours
sev4:
  name: Low
  alias: [P3, SEV4, Low]
  criteria:
    - Minor issue
    - Cosmetic problem
    - Edge case affected
    - Easy workaround
  response:
    response_time: next business day
    update_frequency: daily
    executive_notification: weekly summary
  escalation: standard ticket flow
  target_resolution: 1 week
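The criteria above can be turned into a rough first-pass classifier. The following is a minimal Python sketch, not the skill's actual logic: `IncidentSignals` and `classify_severity` are hypothetical names introduced for illustration. The percentage thresholds come from the criteria above; the way the boolean signals are combined (any single match is enough, highest severity wins) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical triage inputs (illustrative, not part of the skill)."""
    percent_customers_affected: float
    complete_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    revenue_impacting: bool = False
    major_feature_unavailable: bool = False
    partial_degradation: bool = False


def classify_severity(s: IncidentSignals) -> str:
    """Return the highest severity level for which any criterion matches."""
    if (s.complete_outage or s.data_loss or s.security_breach
            or s.revenue_impacting or s.percent_customers_affected > 50):
        return "sev1"
    if s.major_feature_unavailable or s.percent_customers_affected >= 10:
        return "sev2"
    # Assumption: any measurable customer impact below 10% counts as sev3.
    if s.partial_degradation or s.percent_customers_affected > 0:
        return "sev3"
    return "sev4"
```

A human should still confirm the result: criteria such as "Workaround exists but painful" are judgment calls that a boolean flag cannot capture.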
┌──────────────────────────────────────────────┐
│ 1. DETECTION & TRIAGE                        │
│    • Alert received or issue reported        │
│    • Gather initial details                  │
│    • Classify severity                       │
│    • Create incident record                  │
│    • Time: <15 minutes                       │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 2. MOBILIZATION                              │
│    • Page on-call responders                 │
│    • Establish incident commander            │
│    • Create communication channel            │
│    • Notify stakeholders per severity        │
│    • Time: <5 minutes after triage           │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 3. INVESTIGATION                             │
│    • Review recent changes                   │
│    • Check monitoring/logs                   │
│    • Identify affected components            │
│    • Form hypothesis                         │
│    • Time: ongoing, status updates per SLA   │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 4. MITIGATION                                │
│    • Implement workaround if available       │
│    • Rollback if change-related              │
│    • Scale resources if capacity issue       │
│    • Isolate affected components             │
│    • Goal: Reduce customer impact            │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 5. RESOLUTION                                │
│    • Implement permanent fix                 │
│    • Verify fix is effective                 │
│    • Monitor for recurrence                  │
│    • Update status to resolved               │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│ 6. POST-INCIDENT                             │
│    • Schedule post-incident review           │
│    • Document timeline and actions           │
│    • Identify root cause                     │
│    • Create follow-up action items           │
│    • Update runbooks/documentation           │
└──────────────────────────────────────────────┘
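The six stages above can be modelled as an ordered set of phases. This is a minimal Python sketch under the assumption that an incident moves strictly forward through the stages; `Phase` and `next_phase` are hypothetical names, and a real incident may loop between investigation and mitigation.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Workflow stages, numbered as in the diagram above."""
    DETECTION_TRIAGE = 1
    MOBILIZATION = 2
    INVESTIGATION = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_INCIDENT = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.POST_INCIDENT:
        return None
    return Phase(current.value + 1)
```

Keeping the phase as an explicit value makes it straightforward to record each transition in the incident timeline.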
# Incident Report: INC-2025-001234
## Summary
| Field | Value |
|-------|-------|
| Title | Database connection pool exhaustion |
| Severity | SEV1 (Critical) |
| Status | Resolved |
| Start Time | 2025-12-08 14:32 UTC |
| Detected | 2025-12-08 14:35 UTC |
| Resolved | 2025-12-08 15:47 UTC |
| Duration | 1h 15m |
| Impact | 100% of API requests failing |
| Customers Affected | ~45,000 |
## Incident Commander
**Name**: Sarah Chen
**Role**: Senior SRE
## Response Team
| Role | Name | Joined |
|------|------|--------|
| Incident Commander | Sarah Chen | 14:38 |
| Backend Lead | David Kim | 14:40 |
| DBA | Elena Rodriguez | 14:45 |
| Comms Lead | James Wilson | 14:50 |
## Impact Assessment
### Customer Impact
- **Scope**: All customers using web and mobile apps
- **Severity**: Complete service outage
- **Duration**: 1h 15m
- **Affected Features**: All authenticated features
### Business Impact
- **Revenue Loss**: Estimated $XX,XXX
- **SLA Breach**: Yes (99.9% monthly target affected)
- **Customer Complaints**: 127 support tickets
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:32 | First customer reports of errors |
| 14:35 | PagerDuty alert for 5xx spike |
| 14:38 | Incident declared, Sarah Chen IC |
| 14:40 | Investigation begins |
| 14:45 | Identified: DB connection pool exhausted |
| 14:52 | Root cause: Runaway query from batch job |
| 15:00 | Mitigation: Batch job killed |
| 15:10 | Connection pool recovering |
| 15:30 | 50% traffic restored |
| 15:47 | Full service restored |
| 15:50 | Monitoring confirms stable |
| 16:00 | Incident closed |
## Root Cause
**Summary**: A scheduled batch job contained an inefficient query that held database connections indefinitely, exhausting the connection pool.
**Details**:
- Batch job deployed at 14:00 with new query
- Query had missing index, causing full table scan
- Each scan held connection for 30+ seconds
- 100 concurrent requests × 30s = pool exhausted
- New requests could not get connections → 5xx errors
**Contributing Factors**:
1. Missing index migration in batch job deploy
2. No query timeout configured
3. Connection pool size not tuned for load
4. Batch job ran during peak hours
## Resolution
**Immediate Actions**:
1. Killed runaway batch job
2. Restarted application servers to reset connections
3. Verified service restoration
**Permanent Fixes** (follow-ups):
- [ ] Add missing index (INC-001-01)
- [ ] Configure query timeouts (INC-001-02)
- [ ] Increase connection pool size (INC-001-03)
- [ ] Move batch jobs to off-peak hours (INC-001-04)
- [ ] Add connection pool monitoring alerts (INC-001-05)
## Communication Log
| Time | Channel | Message |
|------|---------|---------|
| 14:45 | #incident-2025-001234 | Incident declared, investigating API failures |
| 15:00 | Status Page | Investigating service disruption |
| 15:15 | Status Page | Identified cause, implementing fix |
| 15:30 | #incident-2025-001234 | Service recovering, 50% restored |
| 15:50 | Status Page | Service fully restored |
| 16:00 | Email to customers | Incident resolved, apology + explanation |
## Post-Incident Review
**Scheduled**: 2025-12-10 10:00 UTC
**Attendees**: Response team + Engineering Manager
**Document**: .aiwg/incidents/INC-2025-001234-pir.md
## Lessons Learned
### What Went Well
- Fast detection (3 minutes)
- Clear incident commander
- Good team coordination
### What Could Improve
- Batch job should have been tested with prod-like data
- Missing connection pool alerts
- No query timeout safety net
## Action Items
| ID | Action | Owner | Due | Status |
|----|--------|-------|-----|--------|
| INC-001-01 | Add missing index | Elena | 2025-12-09 | Done |
| INC-001-02 | Configure query timeouts | David | 2025-12-10 | In Progress |
| INC-001-03 | Tune connection pool | Sarah | 2025-12-12 | Planned |
| INC-001-04 | Reschedule batch jobs | David | 2025-12-15 | Planned |
| INC-001-05 | Add pool monitoring | Sarah | 2025-12-12 | Planned |
User: "Production is down, API returning 500 errors"
Skill initiates:
1. Gather details
2. Classify severity
3. Create incident record
4. Notify responders
Output:
"🚨 INCIDENT DECLARED: INC-2025-001234
Severity: SEV1 (Critical) - Complete API failure
Impact: All customers affected
Immediate Actions:
1. ✅ Incident record created
2. ✅ Paging on-call: Sarah Chen (SRE)
3. ✅ Slack channel: #incident-2025-001234
4. ⏳ Assembling response team
You are Incident Commander until relieved.
Quick Checklist:
- [ ] What changed in last 2 hours?
- [ ] Check: logs, metrics, recent deploys
- [ ] Status page update needed?
Update command: 'incident update <message>'"
User: "Is this SEV1 or SEV2? Users can't checkout but browsing works"
Skill evaluates:
- Core function affected: Yes (checkout = revenue)
- Workaround: No (can't complete purchase)
- Customer %: All attempting checkout
- Revenue impact: Direct
Output:
"Severity Assessment: SEV1 (Critical)
Rationale:
✓ Revenue-impacting feature down
✓ No workaround for affected flow
✓ Checkout is business-critical path
✓ Affects all customers attempting purchase
SEV1 Criteria Met:
- Complete feature outage: checkout
- Revenue impact: direct
- No workaround: cannot complete purchases
Response Requirements:
- 15-minute update cadence
- Executive notification: immediate
- Customer communication: within 30 minutes
- Target resolution: 4 hours"
User: "Incident update: found the issue, deploying fix"
Skill logs:
1. Add to timeline
2. Update status
3. Notify stakeholders
Output:
"Incident Update: INC-2025-001234
Status: Mitigating
Time: 15:15 UTC
Update Logged:
'Found root cause, deploying fix'
Next Actions:
- [ ] Update status page
- [ ] Notify executive stakeholders
- [ ] Continue timeline documentation
Time Since Start: 43 minutes
Next Update Due: 15:30 UTC"
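The "Time Since Start" and "Next Update Due" values in this output follow from the incident start time and the update cadence for the severity level. A minimal Python sketch of that arithmetic, assuming a hypothetical `update_timing` helper and using the start time from the example incident report (14:32 UTC); the intervals are the `update_frequency` values from the severity definitions, with sev4's "daily" cadence left out for brevity:

```python
from datetime import datetime, timedelta

# Update cadence per severity, from the severity definitions in this document.
UPDATE_INTERVAL = {
    "sev1": timedelta(minutes=15),
    "sev2": timedelta(minutes=30),
    "sev3": timedelta(hours=2),
}


def update_timing(started, now, severity):
    """Return (whole minutes since start, time the next update is due)."""
    elapsed_minutes = int((now - started).total_seconds() // 60)
    return elapsed_minutes, now + UPDATE_INTERVAL[severity]


started = datetime(2025, 12, 8, 14, 32)  # incident start (UTC)
now = datetime(2025, 12, 8, 15, 15)      # time of this update (UTC)
elapsed, next_due = update_timing(started, now, "sev1")
print(elapsed, next_due.strftime("%H:%M"))  # 43 15:30
```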
This skill uses:
- project-awareness: Context for system topology
- artifact-metadata: Track incident artifacts
agents:
  incident_commander:
    agent: incident-responder
    focus: Overall coordination and decisions
  technical_lead:
    agent: debugger
    focus: Root cause investigation
  reliability:
    agent: reliability-engineer
    focus: System stability and monitoring
  communications:
    agent: support-lead
    focus: Customer and stakeholder communication
notifications:
  sev1:
    pagerduty: true
    slack: "#incidents-critical"
    email: [engineering-leads, on-call-manager]
    sms: [incident-commander, vp-engineering]
  sev2:
    pagerduty: true
    slack: "#incidents"
    email: [engineering-leads]
  sev3:
    slack: "#incidents"
    email: [team-lead]
  sev4:
    slack: "#incidents-low"
escalation:
  sev1:
    - {time: 0, to: on-call-engineer}
    - {time: 15m, to: engineering-manager}
    - {time: 30m, to: vp-engineering}
    - {time: 1h, to: cto}
  sev2:
    - {time: 0, to: on-call-engineer}
    - {time: 1h, to: engineering-manager}
    - {time: 4h, to: vp-engineering}
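One way to read the escalation configuration above: each entry names who should be brought in once the incident has been open for at least the given time. A minimal Python sketch under that reading; `ESCALATION` and `escalation_targets` are hypothetical names, and the table is hand-copied from the configuration rather than parsed from it.

```python
from datetime import timedelta

# Mirrors the escalation configuration above, with times as timedeltas.
ESCALATION = {
    "sev1": [
        (timedelta(0), "on-call-engineer"),
        (timedelta(minutes=15), "engineering-manager"),
        (timedelta(minutes=30), "vp-engineering"),
        (timedelta(hours=1), "cto"),
    ],
    "sev2": [
        (timedelta(0), "on-call-engineer"),
        (timedelta(hours=1), "engineering-manager"),
        (timedelta(hours=4), "vp-engineering"),
    ],
}


def escalation_targets(severity, elapsed):
    """Return everyone who should have been notified after `elapsed` time."""
    steps = ESCALATION.get(severity, [])  # no timed escalation for other levels
    return [to for after, to in steps if elapsed >= after]
```

For example, a SEV1 that has been open for 20 minutes yields `["on-call-engineer", "engineering-manager"]`.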
- .aiwg/incidents/INC-{year}-{id}.md
- .aiwg/incidents/INC-{year}-{id}-pir.md
- .aiwg/incidents/action-items.md
- .aiwg/incidents/metrics/