Provides patterns for incident management, from detection through postmortem. Use when creating runbooks, establishing on-call rotations, or designing incident response processes with severity levels and role definitions.
/plugin marketplace add melodic-software/claude-code-plugins
/plugin install systems-design@melodic-software
Patterns and practices for effective incident management, from detection through postmortem.
┌────────────────────────────────────────────────────────────┐
│                     INCIDENT LIFECYCLE                      │
│                                                             │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│  │ Detect  │──►│ Respond │──►│ Recover │──►│  Learn  │      │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│       │             │             │             │           │
│       ▼             ▼             ▼             ▼           │
│   Alerting      Triage &     Mitigation    Postmortem       │
│   Monitoring    Diagnosis    Remediation   Action Items     │
└────────────────────────────────────────────────────────────┘
MTTD - Mean Time to Detect
└── Time from incident start to detection
MTTA - Mean Time to Acknowledge
└── Time from alert to human acknowledgment
MTTR - Mean Time to Recover
└── Time from detection to resolution
MTBF - Mean Time Between Failures
└── Average time between incidents (reliability)
Goal: Minimize MTTD + MTTA + MTTR
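A rough sketch of how these could be computed from incident timestamps. The record fields below (started_at, detected_at, acknowledged_at, resolved_at) are assumed names, not tied to any particular incident tracker:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    # Assumed field names; adapt to whatever your tooling actually records.
    started_at: datetime       # when the failure actually began
    detected_at: datetime      # when monitoring/alerting caught it
    acknowledged_at: datetime  # when a human acknowledged the page
    resolved_at: datetime      # when user-facing impact ended


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def response_metrics(incidents: list[IncidentRecord]) -> dict[str, timedelta]:
    """Average MTTD, MTTA, and MTTR across a set of incidents."""
    return {
        "MTTD": _mean([i.detected_at - i.started_at for i in incidents]),
        "MTTA": _mean([i.acknowledged_at - i.detected_at for i in incidents]),
        "MTTR": _mean([i.resolved_at - i.detected_at for i in incidents]),
    }
```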
SEV 1 - Critical
├── Complete outage
├── Data loss or security breach
├── All/most users affected
├── Response: Immediate (24/7)
└── Example: Production database down
SEV 2 - High
├── Major functionality impaired
├── Significant user impact
├── Workaround may exist
├── Response: Urgent (business hours and beyond)
└── Example: Payment processing degraded
SEV 3 - Medium
├── Partial functionality affected
├── Limited user impact
├── Workaround available
├── Response: Normal priority
└── Example: Report generation slow
SEV 4 - Low
├── Minor issue
├── Minimal user impact
├── Response: Best effort
└── Example: UI cosmetic bug
             User Impact
          Low    Medium  High
Scope   ├─────────────────────┤
Wide    │ SEV3   SEV2   SEV1  │
Medium  │ SEV4   SEV3   SEV2  │
Limited │ SEV4   SEV4   SEV3  │
        └─────────────────────┘
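If severity assignment is scripted, the matrix above can be expressed as a simple lookup. This assumes scope and user impact have already been bucketed into the three levels shown:

```python
# Severity matrix from above: rows are scope, columns are user impact.
SEVERITY_MATRIX = {
    #           impact: low      medium       high
    "wide":    {"low": 3, "medium": 2, "high": 1},
    "medium":  {"low": 4, "medium": 3, "high": 2},
    "limited": {"low": 4, "medium": 4, "high": 3},
}


def classify_severity(scope: str, impact: str) -> str:
    """Map (scope, user impact) to a SEV level using the matrix."""
    return f"SEV{SEVERITY_MATRIX[scope][impact]}"


# Example: wide scope + high user impact => "SEV1"
assert classify_severity("wide", "high") == "SEV1"
```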
Incident Commander (IC)
├── Owns the incident end-to-end
├── Makes decisions and delegates
├── Controls incident channel
├── Does NOT debug (coordinates)
└── Focus: Big picture, communication
Tech Lead
├── Leads technical investigation
├── Coordinates technical responders
├── Makes technical decisions
├── Reports to IC
└── Focus: Root cause, fix
Communications Lead
├── Handles external communication
├── Updates status page
├── Manages customer notifications
├── Reports to IC
└── Focus: Stakeholder updates
Scribe
├── Documents timeline
├── Records decisions and actions
├── Captures important information
├── Reports to IC
└── Focus: Documentation
Handoff protocol:
1. Outgoing IC: "I'm handing IC to [Name]"
2. Incoming IC: "I'm taking IC. Current status is..."
3. Outgoing IC: "Confirmed, [Name] is now IC"
Handoff when:
- Shift ends
- Fatigue sets in
- Expertise needed
- Escalation required
Internal:
- Dedicated channel: #incident-YYYY-MM-DD-topic
- All incident communication happens there
- Pinned: current status, timeline, roles
- Bridge call link for voice
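If channel creation is automated, the naming convention above can be generated from the date and a short topic. The slug rules here are an assumption, not part of the convention itself:

```python
import re
from datetime import date


def incident_channel_name(topic: str, opened_on: date | None = None) -> str:
    """Build a channel name like #incident-2024-03-18-payments-latency."""
    opened_on = opened_on or date.today()
    # Assumed slug rules: lowercase, runs of non-alphanumerics become dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"#incident-{opened_on.isoformat()}-{slug}"


# Example: incident_channel_name("Payments latency", date(2024, 3, 18))
# -> "#incident-2024-03-18-payments-latency"
```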
External:
- Status page (status.example.com)
- Customer emails
- Social media (if needed)
- Support channels
[TIME] Incident Update - [TITLE]
Current Status: [Investigating|Identified|Monitoring|Resolved]
Impact: [What users are experiencing]
What we know:
- [Key facts]
What we're doing:
- [Current actions]
Next update: [Time or "as soon as we learn more"]
SEV 1: Every 15-30 minutes
SEV 2: Every 30-60 minutes
SEV 3: Every 1-2 hours
SEV 4: As needed
Also update when:
- Status changes
- Major new information
- Actions taken
- Resolution achieved
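A sketch tying the update template and cadence together; the interval table uses the upper bound of each range above, and the rendering details are assumptions:

```python
def _bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


# Update cadence in minutes, taking the upper bound of each range above.
UPDATE_INTERVAL_MINUTES = {1: 30, 2: 60, 3: 120, 4: None}  # SEV 4: as needed

UPDATE_TEMPLATE = """\
[{time}] Incident Update - {title}
Current Status: {status}
Impact: {impact}
What we know:
{known}
What we're doing:
{doing}
Next update: {next_update}"""


def render_update(time: str, title: str, status: str, impact: str,
                  known: list[str], doing: list[str], severity: int) -> str:
    interval = UPDATE_INTERVAL_MINUTES[severity]
    next_update = f"within {interval} minutes" if interval else "as soon as we learn more"
    return UPDATE_TEMPLATE.format(
        time=time, title=title, status=status, impact=impact,
        known=_bullets(known), doing=_bullets(doing), next_update=next_update,
    )
```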
Triggers:
- Automated alerting
- User reports
- Internal discovery
First responder actions:
1. Acknowledge alert
2. Assess initial impact
3. Declare incident if needed
4. Page additional responders
5. Open incident channel
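The same steps can be sketched as an orchestration stub; `pager` and `chat` below are hypothetical stand-ins for whatever paging and chat clients you use, not real APIs:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    title: str
    severity: int
    detected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    channel: str | None = None
    timeline: list[str] = field(default_factory=list)


def declare_incident(alert_title: str, severity: int, pager, chat) -> Incident:
    """Hypothetical first-responder flow: acknowledge, declare, page, open channel."""
    incident = Incident(title=alert_title, severity=severity)
    pager.acknowledge(alert_title)                     # 1. acknowledge the alert
    incident.timeline.append(f"{incident.detected_at:%H:%M} SEV{severity} declared")
    if severity <= 2:                                  # 4. page additional responders
        pager.page_oncall("secondary")
    incident.channel = chat.create_channel(            # 5. open the incident channel
        f"incident-{incident.detected_at:%Y-%m-%d}-{alert_title.lower().replace(' ', '-')}"
    )
    return incident
```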
Triage questions:
- What is the user impact?
- How many users affected?
- Is there a workaround?
- What's the severity?
Mobilize:
1. Page appropriate responders
2. Establish roles (IC, Tech Lead, Comms)
3. Start incident channel
4. Begin timeline documentation
Investigation approach:
1. What changed recently?
└── Deployments, config, infrastructure
2. What do metrics/logs show?
└── Error rates, latency, traces
3. What's the blast radius?
└── Which services, which users
4. What are the hypotheses?
└── List, prioritize, test
Common callouts in the incident channel:
- "Checking [system] now"
- "Theory: [hypothesis]"
- "Found: [discovery]"
- "Need: [resource/access]"
Mitigation strategies:
1. Rollback
└── Revert recent changes
2. Failover
└── Switch to backup/replica
3. Scale
└── Add capacity
4. Disable
└── Turn off affected feature
5. Hotfix
└── Deploy targeted fix
Priority: Restore service first, root cause later
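As one concrete instance of the "disable" strategy, a kill switch checked in the request path turns a feature off without a deploy. Everything here is a made-up sketch; in practice the switch would live in a feature-flag service or config store, not a module-level dict:

```python
# Hypothetical kill switch; imagine this backed by a flag service or config store.
KILL_SWITCHES: dict[str, bool] = {"recommendations": False}


def feature_enabled(name: str) -> bool:
    return not KILL_SWITCHES.get(name, False)


def load_core_items(user_id: str) -> list[str]:
    return ["item-1", "item-2"]   # placeholder data for the critical path


def load_recommendations(user_id: str) -> list[str]:
    return ["rec-1"]              # the part we can switch off during an incident


def handle_request(user_id: str) -> dict:
    payload = {"items": load_core_items(user_id)}
    if feature_enabled("recommendations"):
        payload["recommendations"] = load_recommendations(user_id)
    # When the feature is killed mid-incident, the core flow keeps working.
    return payload
```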
Resolution checklist:
□ Service restored
□ Metrics normalized
□ User-facing impact ended
□ Monitoring in place for recurrence
□ Temporary mitigations documented
Verification:
- Check key SLIs
- Test user flows
- Monitor for 15-30 minutes
- Confirm with affected teams
Closure:
1. Declare incident resolved
2. Final status update
3. Schedule postmortem
4. Assign postmortem owner
5. Close incident channel (archive)
Timeline:
- Postmortem doc: Within 24-48 hours
- Postmortem meeting: Within 5 business days
- Action items: Tracked to completion
Primary: First responder to alerts
Secondary: Backup if primary unavailable
Escalation: Manager/senior for major incidents
Rotation options:
- Weekly rotation
- Follow-the-sun (multiple timezones)
- Split shifts (day/night)
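For a basic weekly rotation, the current primary and secondary can be derived from the week number. The roster and anchor date below are made-up examples:

```python
from datetime import date

# Hypothetical roster and rotation anchor (the Monday the rotation started).
ROSTER = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2024, 1, 1)  # a Monday


def on_call_for(day: date, offset: int = 0) -> str:
    """Primary on-call for `day`; offset=1 gives the secondary (next in line)."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROSTER[(weeks_elapsed + offset) % len(ROSTER)]


# Example:
# on_call_for(date(2024, 1, 10))            -> "bob"
# on_call_for(date(2024, 1, 10), offset=1)  -> "carol"
```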
Off-hours policy:
- What's page-worthy?
- What can wait?
- Compensation for off-hours
During shift:
- Respond to alerts within SLA
- Triage and resolve or escalate
- Document actions taken
- Hand off to next shift
Handoff includes:
- Open alerts/incidents
- Recent incidents
- Known issues
- Scheduled changes
Good alert:
- Actionable
- Urgent
- User-impacting
- Clear resolution path
Alert anti-patterns:
- "Somebody should look at this"
- Duplicate alerts
- Non-actionable information
- Crying wolf (frequent false positives)
Regular review:
- Which alerts fired?
- Which were actionable?
- Which were noise?
- What's missing?
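These review questions can be answered from alert history. The record shape below is an assumption:

```python
from collections import Counter


def review_alerts(history: list[dict]) -> dict:
    """Summarize an on-call period's alerts.
    Each record is assumed to look like:
      {"name": "HighErrorRate", "actionable": True}
    """
    by_name = Counter(a["name"] for a in history)
    actionable = sum(1 for a in history if a["actionable"])
    return {
        "total": len(history),
        "actionable": actionable,
        "noise_ratio": round(1 - actionable / len(history), 2) if history else 0.0,
        "noisiest": by_name.most_common(3),  # candidates for tuning or deletion
    }
```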
# Runbook: [Alert Name]
## Overview
What this alert means and why it matters.
## Impact
What users experience when this fires.
## Investigation Steps
1. Check [metric/log/dashboard]
2. Look for [specific pattern]
3. Verify [component status]
## Mitigation Steps
1. If [condition], do [action]
2. If [condition], do [action]
3. Escalate if [condition]
## Rollback Procedure
How to undo changes if needed.
## Contacts
- Service owner: [name]
- Escalation: [name/team]
## Related Links
- Dashboard: [link]
- Logs: [query]
- Service docs: [link]
Keep runbooks current:
- Update after incidents
- Review quarterly
- Test procedures
- Remove stale content
Runbook location:
- Linked from alert
- Searchable/discoverable
- Version controlled
Blameless postmortem:
- Focus on systems, not individuals
- Assume people acted rationally
- Look for contributing factors
- Improve systems and processes
Not blameless: "John should have..."
Blameless: "The system allowed..."
# Incident Postmortem: [Title]
Date: [Date]
Duration: [Start - End]
Severity: [SEV level]
Authors: [Names]
## Summary
One paragraph summary of what happened.
## Impact
- Users affected: [Number/percentage]
- Duration: [Time]
- Revenue impact: [If applicable]
## Timeline
[Time] - Event
[Time] - Action taken
[Time] - Resolution
## Root Cause
What actually caused the incident.
## Contributing Factors
What made it worse or harder to resolve.
## What Went Well
- [Positive observations]
## What Could Be Improved
- [Improvement areas]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action] | [Name] | [Date] | [Status] |
## Lessons Learned
Key takeaways for the organization.
Prevent: Stop this from happening again
Detect: Find it faster next time
Mitigate: Reduce impact when it happens
Document: Improve runbooks/documentation
Priority:
1. High-impact, low-effort
2. Required for safety
3. Reduces toil
4. Nice to have
1. Declare incidents early
When in doubt, declare
2. Focus on mitigation first
Root cause analysis later
3. Communicate frequently
Silence breeds anxiety
4. Document as you go
Don't rely on memory
5. Practice with game days
Drill before real incidents
6. Blameless postmortems
Systems fail, not people
7. Track action items
Complete what you commit
8. Regular on-call review
Improve the on-call experience
Related skills:
- slo-sli-error-budget - SLOs and alerting
- observability-patterns - Using observability in incidents
- chaos-engineering-fundamentals - Proactive resilience testing