Use when responding to production incidents following SRE principles and best practices.
Provides SRE incident response guidance with severity levels, roles, communication templates, and blameless postmortem frameworks. Use this when handling production outages to follow structured triage, mitigation, and resolution processes.
/plugin marketplace add TheBushidoCollective/han
/plugin install do-observability-engineering@han
This skill cannot use any tools. It operates in read-only mode without the ability to modify files or execute commands.
Managing incidents and conducting effective postmortems.
Alert fires → On-call acknowledges → Initial assessment
- Assess severity
- Page additional responders if needed
- Establish incident channel
- Assign incident commander
- Identify mitigation options
- Execute fastest safe mitigation
- Monitor for improvement
- Escalate if not improving
- Verify service health
- Communicate resolution
- Document actions taken
- Schedule postmortem
- Conduct postmortem
- Identify action items
- Track completion
- Update runbooks
🚨 INCIDENT DECLARED - P0
Service: API Gateway
Impact: All API requests failing
Started: 2024-01-15 14:23 UTC
IC: @alice
Status Channel: #incident-001
Current Status: Investigating
Next Update: 30 minutes
INCIDENT UPDATE #2 - P0
Service: API Gateway
Elapsed: 45 minutes
Progress: Identified root cause as database connection pool exhaustion.
Mitigation: Increasing pool size and restarting services.
ETA to Resolution: 15 minutes
Next Update: 15 minutes or when resolved
✅ INCIDENT RESOLVED - P0
Service: API Gateway
Duration: 1h 12m
Impact: 100% of API requests failed
Resolution: Increased database connection pool and restarted services.
Next Steps:
- Postmortem scheduled for tomorrow 10am
- Monitoring for recurrence
- Action items being tracked in #incident-001
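The update messages above share a fixed field order, so they lend themselves to being generated rather than typed by hand during an incident. The following is a minimal sketch of that idea, not part of the original skill: the `IncidentUpdate` class, its field names, and `render_update` are all hypothetical and simply mirror the lines of the update template.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    """Hypothetical container; fields mirror the update template's lines."""
    number: int
    severity: str
    service: str
    elapsed_minutes: int
    progress: str
    mitigation: str
    eta_minutes: int
    next_update: str


def render_update(u: IncidentUpdate) -> str:
    """Format an update in the same line order as the template."""
    return "\n".join([
        f"INCIDENT UPDATE #{u.number} - {u.severity}",
        f"Service: {u.service}",
        f"Elapsed: {u.elapsed_minutes} minutes",
        f"Progress: {u.progress}",
        f"Mitigation: {u.mitigation}",
        f"ETA to Resolution: {u.eta_minutes} minutes",
        f"Next Update: {u.next_update}",
    ])


# Reproduces the example update using the values from the template.
example = IncidentUpdate(
    number=2,
    severity="P0",
    service="API Gateway",
    elapsed_minutes=45,
    progress="Identified root cause as database connection pool exhaustion.",
    mitigation="Increasing pool size and restarting services.",
    eta_minutes=15,
    next_update="15 minutes or when resolved",
)
print(render_update(example))
```

Keeping the fields in a structured object also makes it straightforward to post the same update to more than one channel, if a team wants that.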
# Incident Postmortem: API Outage 2024-01-15
## Summary
On January 15th, our API was completely unavailable for 72 minutes due to
database connection pool exhaustion.
## Impact
- Duration: 72 minutes (14:23 - 15:35 UTC)
- Severity: P0
- Users Affected: 100% of API users (~50,000 requests failed)
- Revenue Impact: ~$5,000 in SLA credits
## Timeline
**14:23** - Alerts fire for elevated error rate
**14:25** - IC paged, incident channel created
**14:30** - Identified all database connections exhausted
**14:45** - Decided to increase pool size
**15:00** - Configuration deployed
**15:15** - Services restarted
**15:35** - Error rate returned to normal, incident resolved
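As a cross-check on the timeline, the gaps between entries can be computed directly from the timestamps. This is an illustrative sketch only: the helper names are made up, and it assumes all entries fall on the same UTC day, as they do in this example.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on 2024-01-15, UTC).
TIMELINE = [
    ("14:23", "Alerts fire for elevated error rate"),
    ("14:25", "IC paged, incident channel created"),
    ("14:30", "Identified all database connections exhausted"),
    ("14:45", "Decided to increase pool size"),
    ("15:00", "Configuration deployed"),
    ("15:15", "Services restarted"),
    ("15:35", "Error rate returned to normal, incident resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end; assumes both are on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Minutes elapsed before each entry, measured from the previous entry."""
    return [
        (timeline[i + 1][1], minutes_between(timeline[i][0], timeline[i + 1][0]))
        for i in range(len(timeline) - 1)
    ]


total_minutes = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
print(f"Total duration: {total_minutes} minutes")  # 72, matching the Impact section
for event, minutes in phase_durations(TIMELINE):
    print(f"{minutes:>3} min -> {event}")
```

Breaking the total into per-step gaps makes it easier to see where time went: in this timeline, 50 of the 72 minutes fall between the decision to resize the pool and the return to normal.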
## Root Cause
Database connection pool was sized for normal load (100 connections).
Traffic spike from new feature launch (3x normal) exhausted connections.
No alerting existed for connection pool utilization.
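The sizing arithmetic behind this root cause can be sketched as follows. The functions are hypothetical, and the sketch rests on two stated assumptions: connection demand scales linearly with traffic, and an alert threshold of 80% utilization, which is an illustrative value rather than one taken from the incident.

```python
import math


def required_pool_size(baseline_connections: int, traffic_multiplier: float) -> int:
    """Estimate pool size for a traffic multiple (assumes linear scaling)."""
    return math.ceil(baseline_connections * traffic_multiplier)


def pool_utilization(in_use: int, pool_size: int) -> float:
    """Fraction of the pool currently checked out."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return in_use / pool_size


def should_alert(in_use: int, pool_size: int, threshold: float = 0.8) -> bool:
    """True when utilization reaches the (assumed) alert threshold."""
    return pool_utilization(in_use, pool_size) >= threshold


# A pool sized for normal load (100) under a 3x spike needs roughly 300.
print(required_pool_size(100, 3.0))
# With the original pool of 100, 85 connections in use would have alerted.
print(should_alert(in_use=85, pool_size=100))
```

Under the linear-scaling assumption, the estimate lines up with the action item to raise the pool to 300 connections.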
## What Went Well
- Detection was quick (2 minutes from issue start)
- Team assembled rapidly
- Clear communication maintained
## What Didn't Go Well
- No capacity testing before feature launch
- Connection pool metrics not monitored
- No automated rollback capability
## Action Items
1. [P0] Add connection pool utilization monitoring (@bob, 1/17)
2. [P0] Implement automated rollback for deploys (@charlie, 1/20)
3. [P1] Establish capacity testing process (@diana, 1/25)
4. [P1] Increase connection pool to 300 (@bob, 1/16)
5. [P2] Update deployment runbook with load testing (@eve, 1/30)
## Lessons Learned
- Always load test before launching features
- Monitor resource utilization at all layers
- Have rollback mechanisms ready
# Runbook: High Database Latency
## Symptoms
- Database query times > 500ms
- Elevated API latency
- Alert: DatabaseLatencyHigh
## Impact
Users experience slow page loads. P1 severity if p95 > 1s.
## Investigation
1. Check database metrics in Grafana
https://grafana.example.com/d/db-overview
2. Identify slow queries:
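```sql
-- Top 10 statements by cumulative execution time.
-- On PostgreSQL 13 and later this column is named total_exec_time.
SELECT * FROM pg_stat_statements
ORDER BY total_time DESC LIMIT 10;
```
3. Check for locks:
```sql
-- Active sessions; rows with wait_event_type = 'Lock' are blocked on a lock.
SELECT * FROM pg_stat_activity
WHERE state = 'active';
```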
Quick fixes:
Escalation: If latency > 2s for > 15 minutes, page DBA team.
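The escalation rule above ("latency > 2s for > 15 minutes") is a sustained-threshold check, which can be sketched as below. The function name and the sample format are assumptions made for illustration; a real setup would more likely express this as an alerting rule in the monitoring system.

```python
from datetime import datetime, timedelta


def should_page_dba(samples, threshold_s=2.0, window=timedelta(minutes=15)):
    """Return True if latency has stayed above threshold_s for longer than window.

    samples: list of (timestamp, latency_seconds) tuples, sorted oldest first.
    Any sample at or below the threshold resets the breach clock.
    """
    breach_start = None
    for ts, latency in samples:
        if latency > threshold_s:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None
    if breach_start is None:
        return False
    # Compare the current breach's length, measured up to the newest sample.
    return samples[-1][0] - breach_start > window


# Usage: one sample per minute, all at 2.5s, starting at an arbitrary time.
start = datetime(2024, 1, 15, 14, 0)
sustained = [(start + timedelta(minutes=m), 2.5) for m in range(17)]
print(should_page_dba(sustained))
```

Resetting on any healthy sample keeps brief spikes from paging the DBA team; whether that is the right behavior depends on how noisy the latency signal is.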
## Best Practices
### Blameless Culture
- Focus on systems, not individuals
- Assume good intentions
- Learn from mistakes
- Reward transparency
### Clear Severity Definitions
- Severity should be based on user impact
- Document response time expectations
- Update definitions based on learnings
### Practice Incident Response
- Run "game days" quarterly
- Practice different scenarios
- Test on-call handoffs
- Review and improve runbooks
### Track Action Items
- Assign owners and due dates
- Review in team meetings
- Close loop on completion
- Measure time to completion
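Tracking owners, due dates, and time to completion can start as something very small. The sketch below is one hypothetical way to model it: the `ActionItem` class and helper names are invented for illustration, the year 2024 is inferred from the incident date, and the opened and closed dates are assumptions, not values from the postmortem.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical record for one postmortem action item."""
    title: str
    priority: str
    owner: str
    opened: date
    due: date
    closed: Optional[date] = None


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if i.closed is None and i.due < today]


def mean_days_to_completion(items: List[ActionItem]) -> Optional[float]:
    """Average days from opened to closed; None if nothing is closed yet."""
    days = [(i.closed - i.opened).days for i in items if i.closed is not None]
    return sum(days) / len(days) if days else None


# Titles, owners, and due dates from the example postmortem; the opened
# and closed dates below are assumed for the sake of the example.
opened = date(2024, 1, 16)
items = [
    ActionItem("Add connection pool utilization monitoring", "P0", "@bob",
               opened, date(2024, 1, 17), closed=date(2024, 1, 17)),
    ActionItem("Implement automated rollback for deploys", "P0", "@charlie",
               opened, date(2024, 1, 20)),
    ActionItem("Establish capacity testing process", "P1", "@diana",
               opened, date(2024, 1, 25), closed=date(2024, 1, 24)),
]

print([i.title for i in overdue(items, today=date(2024, 1, 22))])
print(mean_days_to_completion(items))
```

Reviewing the output of `overdue` in team meetings is one way to close the loop on items that would otherwise drift.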