Plan incident management processes, postmortems, and blameless culture
Plans incident management processes, postmortems, and blameless culture workflows. Use when creating incident response procedures, defining severity levels, or conducting blameless postmortems after incidents.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install observability-planning@melodic-software
Use this skill when:
- Creating incident response procedures
- Defining severity levels
- Conducting blameless postmortems after incidents

Before planning incident response, consult the docs-management skill for incident management patterns.

INCIDENT LIFECYCLE:
┌─────────────────────────────────────────────────────────────────┐
│ INCIDENT LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DETECTION │
│ ├── Alert fires │
│ ├── Customer report │
│ └── Internal discovery │
│ │ │
│ ▼ │
│ 2. TRIAGE │
│ ├── Assess severity │
│ ├── Assign incident commander │
│ └── Open incident channel │
│ │ │
│ ▼ │
│ 3. RESPONSE │
│ ├── Diagnose root cause │
│ ├── Implement fix │
│ ├── Communicate status │
│ └── Coordinate resources │
│ │ │
│ ▼ │
│ 4. RESOLUTION │
│ ├── Verify fix │
│ ├── Close incident │
│ └── Initial timeline │
│ │ │
│ ▼ │
│ 5. POSTMORTEM │
│ ├── Blameless analysis │
│ ├── Identify contributing factors │
│ ├── Define action items │
│ └── Share learnings │
│ │
└─────────────────────────────────────────────────────────────────┘
SEVERITY DEFINITIONS:
┌─────────┬────────────────────────────────────────────────────────┐
│ SEV1 │ CRITICAL - Major user/business impact │
│ │ │
│ │ Examples: │
│ │ - Complete service outage │
│ │ - Data breach/security incident │
│ │ - Revenue-impacting failure │
│ │ - SLA breach imminent │
│ │ │
│ │ Response: │
│ │ - Page immediately │
│ │ - All hands on deck │
│ │ - Exec communication │
│ │ - Status page update │
├─────────┼────────────────────────────────────────────────────────┤
│ SEV2 │ HIGH - Significant impact, workaround possible │
│ │ │
│ │ Examples: │
│ │ - Partial outage │
│ │ - Major feature unavailable │
│ │ - Performance severely degraded │
│ │ │
│ │ Response: │
│ │ - Page on-call │
│ │ - Incident commander assigned │
│ │ - Customer communication │
├─────────┼────────────────────────────────────────────────────────┤
│ SEV3 │ MEDIUM - Limited impact, non-critical │
│ │ │
│ │ Examples: │
│ │ - Minor feature broken │
│ │ - Small subset of users affected │
│ │ - Non-urgent degradation │
│ │ │
│ │ Response: │
│ │ - Business hours response │
│ │ - Track in ticket │
├─────────┼────────────────────────────────────────────────────────┤
│ SEV4 │ LOW - Minimal impact, cosmetic issues │
│ │ │
│ │ Examples: │
│ │ - UI glitch │
│ │ - Non-critical bug │
│ │ │
│ │ Response: │
│ │ - Normal ticket workflow │
└─────────┴────────────────────────────────────────────────────────┘
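The severity definitions above can be encoded as a small triage helper. This is a hypothetical sketch: the function name, parameters, and thresholds are assumptions chosen to mirror the SEV1–SEV4 table, not an official mapping.

```python
def classify_severity(full_outage: bool, users_affected_pct: float,
                      revenue_impacting: bool, workaround_exists: bool) -> str:
    """Map coarse impact signals to a SEV1-SEV4 label.

    Thresholds are illustrative assumptions; tune them to your own
    severity definitions.
    """
    # SEV1: complete outage or revenue-impacting failure
    if full_outage or revenue_impacting:
        return "SEV1"
    # Broad impact with no workaround escalates to SEV1
    if users_affected_pct >= 25 and not workaround_exists:
        return "SEV1"
    # SEV2: significant impact, workaround possible
    if users_affected_pct >= 25:
        return "SEV2"
    # SEV3: limited impact, small subset of users
    if users_affected_pct >= 1:
        return "SEV3"
    # SEV4: minimal/cosmetic impact
    return "SEV4"
```

A helper like this is most useful as a starting point for the triage conversation, not a replacement for human judgment: the incident commander can always upgrade or downgrade severity as new information arrives.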
INCIDENT ROLES:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ INCIDENT COMMANDER (IC) │
│ ├── Owns the incident │
│ ├── Coordinates response │
│ ├── Makes decisions │
│ ├── Delegates tasks │
│ └── Keeps everyone informed │
│ │
│ TECHNICAL LEAD │
│ ├── Leads technical investigation │
│ ├── Diagnoses root cause │
│ ├── Implements fixes │
│ └── Advises IC on technical matters │
│ │
│ COMMUNICATIONS LEAD │
│ ├── Manages external communication │
│ ├── Updates status page │
│ ├── Drafts customer notifications │
│ └── Handles stakeholder updates │
│ │
│ SCRIBE │
│ ├── Documents timeline │
│ ├── Records decisions │
│ ├── Captures actions taken │
│ └── Prepares postmortem draft │
│ │
│ SUBJECT MATTER EXPERTS (SMEs) │
│ ├── Provide domain expertise │
│ ├── Execute specific tasks │
│ └── Advise on their area │
│ │
└─────────────────────────────────────────────────────────────────┘
# Incident Response Checklist
## Detection (T+0)
- [ ] Alert acknowledged
- [ ] Initial assessment of severity
- [ ] Incident channel created (#incident-YYYY-MM-DD-{name})
## Triage (T+5 min)
- [ ] Severity assigned (SEV1/2/3/4)
- [ ] Incident Commander identified
- [ ] Technical Lead identified
- [ ] Initial status posted to channel
## Response (T+10 min)
### For SEV1/SEV2:
- [ ] Communications Lead assigned
- [ ] Status page updated (Investigating)
- [ ] Stakeholders notified (Slack #incidents)
- [ ] Customer-facing communication drafted (if needed)
### Technical Response:
- [ ] Scope identified (which services/users affected)
- [ ] Root cause hypothesis formed
- [ ] Fix being implemented or workaround in place
- [ ] Monitoring for improvement
## Communication Cadence
| Severity | Internal Update | External Update |
|----------|-----------------|-----------------|
| SEV1 | Every 15 min | Every 30 min |
| SEV2 | Every 30 min | Every hour |
| SEV3 | Every hour | As needed |
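The cadence table can back a simple reminder check, e.g. in a bot that nags the Communications Lead. A sketch under the assumption that `None` means "as needed" (no scheduled reminder):

```python
# Update intervals in minutes, mirroring the cadence table above.
# None means "as needed" (no scheduled reminder).
CADENCE = {
    "SEV1": {"internal": 15, "external": 30},
    "SEV2": {"internal": 30, "external": 60},
    "SEV3": {"internal": 60, "external": None},
}

def update_overdue(severity: str, audience: str, minutes_since_last: int) -> bool:
    """Return True if the next update for this audience is due."""
    interval = CADENCE[severity][audience]
    return interval is not None and minutes_since_last >= interval
```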
## Resolution
- [ ] Fix deployed and verified
- [ ] Metrics returning to normal
- [ ] Status page updated (Resolved)
- [ ] Customer notification sent (if applicable)
- [ ] Incident channel archived
## Follow-up
- [ ] Postmortem scheduled (within 48h for SEV1/2)
- [ ] Timeline documented
- [ ] Action items created
- [ ] Postmortem shared
BLAMELESS CULTURE PRINCIPLES:
┌─────────────────────────────────────────────────────────────────┐
│ BLAMELESS POSTMORTEMS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CORE BELIEF: │
│ People don't come to work to do a bad job. When incidents │
│ happen, the system failed, not the person. │
│ │
│ FOCUS ON: │
│ ✓ What happened (facts, not fault) │
│ ✓ Why the system allowed it (systemic issues) │
│ ✓ How to prevent recurrence (improvements) │
│ ✓ What we learned (knowledge sharing) │
│ │
│ AVOID: │
│ ✗ "Who made the mistake?" │
│ ✗ "Why didn't they check?" │
│ ✗ "They should have known better" │
│ ✗ Assigning personal blame │
│ │
│ REFRAME TO: │
│ "Why did the system allow this?" │
│ "What safeguards were missing?" │
│ "How can we make this impossible to happen again?" │
│ │
└─────────────────────────────────────────────────────────────────┘
BLAMELESS LANGUAGE:
Instead of... Say...
─────────────────────────────────────────────────────────────────
"John broke production" "A configuration change caused..."
"She didn't test it" "Testing didn't catch this because..."
"They should have known" "The system didn't make this obvious"
"Human error" "The process allowed this to happen"
# Postmortem: {Incident Title}
**Date:** {YYYY-MM-DD}
**Severity:** {SEV1/2/3/4}
**Duration:** {X hours Y minutes}
**Authors:** {Names}
**Status:** {Draft/Reviewed/Published}
---
## Executive Summary
{2-3 sentence summary of what happened, impact, and key learnings}
---
## Impact
| Metric | Value |
|--------|-------|
| Duration | {X hours Y minutes} |
| Users affected | {Number or percentage} |
| Revenue impact | {$X or "Minimal"} |
| SLO impact | {Error budget consumed} |
| Support tickets | {Number} |
---
## Timeline
All times in UTC.
| Time | Event |
|------|-------|
| 14:00 | Deployment of orders-api v2.3.1 started |
| 14:05 | Deployment completed |
| 14:12 | Error rate alert fired |
| 14:15 | On-call acknowledged, began investigation |
| 14:22 | Root cause identified (database migration issue) |
| 14:25 | Rollback initiated |
| 14:32 | Rollback completed, service recovering |
| 14:45 | Error rate returned to normal |
| 14:50 | Incident resolved |
---
## Root Cause Analysis
### What happened
{Detailed technical explanation of what failed and why}
### Contributing factors
1. **{Factor 1}**
- {Explanation}
- {Why it contributed}
2. **{Factor 2}**
- {Explanation}
- {Why it contributed}
3. **{Factor 3}**
- {Explanation}
- {Why it contributed}
### 5 Whys Analysis
1. **Why did users see errors?**
- Because the API was returning 500 errors
2. **Why was the API returning 500 errors?**
- Because database queries were failing
3. **Why were database queries failing?**
- Because the migration added a NOT NULL column without a default
4. **Why didn't we catch this before production?**
- Because staging didn't have representative data
5. **Why didn't staging have representative data?**
- Because we don't have a data anonymization pipeline
---
## What Went Well
- {Positive thing 1}
- {Positive thing 2}
- {Positive thing 3}
## What Could Have Gone Better
- {Improvement area 1}
- {Improvement area 2}
- {Improvement area 3}
---
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | Add database migration validation to CI | @engineer | P1 | 2024-01-20 | Open |
| 2 | Create staging data pipeline | @data-team | P2 | 2024-02-01 | Open |
| 3 | Add rollback automation | @platform | P2 | 2024-01-25 | Open |
| 4 | Update deployment runbook | @oncall | P3 | 2024-01-22 | Open |
---
## Lessons Learned
### Technical
- {Technical lesson 1}
- {Technical lesson 2}
### Process
- {Process lesson 1}
- {Process lesson 2}
### Communication
- {Communication lesson 1}
---
## Appendix
### Related Links
- [Incident Slack Channel](#)
- [Deployment Dashboard](#)
- [Error Logs Query](#)
### Supporting Data
{Graphs, screenshots, log snippets}
---
## Sign-off
| Role | Name | Date |
|------|------|------|
| Author | {Name} | {Date} |
| Reviewer | {Name} | {Date} |
| Approved | {Name} | {Date} |
INCIDENT METRICS TO TRACK:
MTTR (Mean Time To Recovery):
┌─────────────────────────────────────────────────────────────────┐
│ Time from incident start to resolution │
│ │
│ Breakdown: │
│ - Time to detect (TTD) │
│ - Time to acknowledge (TTA) │
│ - Time to diagnose (TTDiag) │
│ - Time to fix (TTF) │
│ │
│ MTTR = TTD + TTA + TTDiag + TTF │
└─────────────────────────────────────────────────────────────────┘
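The MTTR breakdown above falls out directly from the incident timestamps. A minimal sketch computing each phase in minutes (function and field names are assumptions):

```python
from datetime import datetime

def mttr_breakdown(started: datetime, detected: datetime,
                   acknowledged: datetime, diagnosed: datetime,
                   resolved: datetime) -> dict:
    """Split recovery time into TTD + TTA + TTDiag + TTF (minutes).

    MTTR for a single incident is the sum of the four phases; averaging
    across incidents gives the mean.
    """
    minutes = lambda a, b: (b - a).total_seconds() / 60
    phases = {
        "ttd": minutes(started, detected),        # time to detect
        "tta": minutes(detected, acknowledged),   # time to acknowledge
        "ttdiag": minutes(acknowledged, diagnosed),  # time to diagnose
        "ttf": minutes(diagnosed, resolved),      # time to fix
    }
    phases["mttr"] = sum(phases.values())
    return phases
```

Tracking the phases separately matters because they have different remedies: a long TTD points at alerting gaps, a long TTA at paging/on-call issues, and a long TTF at rollback and deployment tooling.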
Other Key Metrics:
┌─────────────────────────────────────────────────────────────────┐
│ Incident frequency │ Incidents per week/month │
│ Severity distribution │ % SEV1 vs SEV2 vs SEV3 │
│ Time in incident │ Engineer hours spent on incidents │
│ Repeat incidents │ Same root cause recurring │
│ Action item completion │ % of postmortem items completed │
│ Customer impact │ Users affected, revenue lost │
└─────────────────────────────────────────────────────────────────┘
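Two of the metrics above, severity distribution and repeat incidents, are easy to compute from an incident log. A sketch assuming each incident record carries `severity` and `root_cause` fields (the schema is an assumption):

```python
from collections import Counter

def incident_summary(incidents: list[dict]) -> dict:
    """Summarize severity distribution and recurring root causes.

    Assumes each incident dict has 'severity' and 'root_cause' keys.
    """
    by_severity = Counter(i["severity"] for i in incidents)
    causes = Counter(i["root_cause"] for i in incidents)
    # A root cause seen more than once is a repeat incident signal
    repeats = {cause: n for cause, n in causes.items() if n > 1}
    return {"by_severity": dict(by_severity), "repeat_root_causes": repeats}
```

Repeat root causes are the strongest signal that postmortem action items are not being completed, so this pairs naturally with tracking action item completion rate.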
PROACTIVE INCIDENT PREVENTION:
PRE-PRODUCTION:
┌─────────────────────────────────────────────────────────────────┐
│ - Code review with reliability focus │
│ - Automated testing (unit, integration, E2E) │
│ - Chaos engineering in staging │
│ - Load testing before major releases │
│ - Feature flags for gradual rollout │
│ - Deployment checklists │
└─────────────────────────────────────────────────────────────────┘
PRODUCTION SAFEGUARDS:
┌─────────────────────────────────────────────────────────────────┐
│ - Canary deployments │
│ - Progressive rollouts │
│ - Automated rollback on error spike │
│ - Circuit breakers │
│ - Rate limiting │
│ - Redundancy and failover │
└─────────────────────────────────────────────────────────────────┘
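"Automated rollback on error spike" usually reduces to comparing the post-deploy error rate against a pre-deploy baseline. A minimal sketch of the decision logic; the thresholds and the absolute floor are assumptions, and a real system would also require a minimum sample size before deciding:

```python
def should_rollback(baseline_error_rate: float,
                    current_error_rate: float,
                    multiplier: float = 3.0,
                    floor: float = 0.01) -> bool:
    """Decide whether to trigger an automated rollback.

    Rolls back only when the error rate exceeds BOTH an absolute floor
    (to ignore noise when the baseline is near zero) and a multiple of
    the pre-deploy baseline.
    """
    return (current_error_rate >= floor
            and current_error_rate >= baseline_error_rate * multiplier)
```

The absolute floor matters: with a near-zero baseline, a pure ratio check would roll back on a single stray error.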
DETECTION:
┌─────────────────────────────────────────────────────────────────┐
│ - SLO-based alerting │
│ - Anomaly detection │
│ - Synthetic monitoring │
│ - Real user monitoring (RUM) │
│ - Error tracking │
└─────────────────────────────────────────────────────────────────┘
LEARNING:
┌─────────────────────────────────────────────────────────────────┐
│ - Blameless postmortems │
│ - Action item follow-through │
│ - Incident pattern analysis │
│ - GameDays and chaos experiments │
│ - Cross-team incident reviews │
└─────────────────────────────────────────────────────────────────┘
Last Updated: 2025-12-26