From process-engineering
Design incident response procedures that prioritize quick resolution and learning over blame. Use when establishing on-call practices or improving incident response effectiveness.
npx claudepluginhub sethdford/claude-skills --plugin tech-lead-process-engineering
How this skill is triggered: by the user, by Claude, or both
Slash command
/process-engineering:incident-management-process
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Build processes that get systems back to healthy fast while capturing learnings to prevent recurrence.
You are a senior tech lead designing incident response for $ARGUMENTS. Incidents are inevitable. How you respond determines damage, team stress, and organizational learning. Good incident processes feel calm and coordinated, not chaotic.
Define severity levels: P1 is critical (production down, users affected, no workaround): all hands, escalate immediately. P2 is major (degraded performance or a feature unavailable, workaround exists): the team investigates. P3 is minor (cosmetic or internal issue). Document the impact criteria for each level so responders classify incidents the same way.
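The severity criteria above can be written down as a small triage function so that classification is consistent across responders. This is a minimal sketch, not a prescribed implementation: the `Impact` fields and the `classify` function are hypothetical names, and the handling of cases the criteria do not cover (for example, an outage that does have a workaround) is an assumption marked in the comments.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "major"
    P3 = "minor"


@dataclass(frozen=True)
class Impact:
    """Hypothetical impact facts gathered during triage."""
    production_down: bool
    users_affected: bool
    workaround_exists: bool
    degraded_or_feature_unavailable: bool = False


def classify(impact: Impact) -> Severity:
    """Map impact facts to a severity level using the documented criteria."""
    # P1: production down, users affected, and no workaround.
    if (impact.production_down and impact.users_affected
            and not impact.workaround_exists):
        return Severity.P1
    # P2: degraded performance or an unavailable feature.
    # Assumption: any other production outage or degradation is treated
    # as P2, even if the workaround criterion is not strictly met.
    if impact.production_down or impact.degraded_or_feature_unavailable:
        return Severity.P2
    # P3: cosmetic or internal issues.
    return Severity.P3
```

Encoding the criteria this way forces the team to decide the edge cases up front rather than during an incident.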
Establish incident roles: Incident Commander (coordinates response, makes decisions), Communications Lead (updates customers/team), Operations Lead (executes fixes). In small incidents, one person wears multiple hats. In major incidents, separate roles prevent chaos.
Create runbook templates: For each common incident type (database down, cache full, API unresponsive), create a one-page runbook covering what to check, who to escalate to, and common mitigations. A ready runbook can save on the order of 30 minutes when responders are under stress.
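One way to keep runbooks uniform is to generate them from a shared structure with the three sections listed above. The sketch below is illustrative only: the `Runbook` class, its field names, and the rendered layout are assumptions, not a format required by this skill.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical one-page runbook with the three sections named above."""
    incident_type: str
    checks: list[str] = field(default_factory=list)
    escalation_contacts: list[str] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the runbook as plain text with numbered steps."""
        def section(title: str, items: list[str]) -> list[str]:
            lines = [f"{title}:"]
            lines += [f"  {n}. {item}" for n, item in enumerate(items, start=1)]
            return lines

        out = [f"RUNBOOK: {self.incident_type}"]
        out += section("What to check", self.checks)
        out += section("Who to escalate to", self.escalation_contacts)
        out += section("Common mitigations", self.mitigations)
        return "\n".join(out)


# Example content is made up for illustration.
database_down = Runbook(
    incident_type="database down",
    checks=["Is the primary reachable?", "Is disk space exhausted?"],
    escalation_contacts=["On-call database engineer"],
    mitigations=["Fail over to the replica"],
)
```

Calling `print(database_down.render())` produces a short text page that can be pasted into a wiki or chat channel.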
Design postmortem process: After the incident resolves, run a postmortem within 24-48 hours. Include a facilitator, a scribe, and all responders. Format: what happened, why the systems failed, what we will change to prevent recurrence, and action items. No blame.
Track and improve: Maintain incident log (date, severity, time-to-detect, time-to-resolution, action items). Review monthly. Trends show systemic issues. If "database restarts fix most incidents," you have a reliability problem needing root cause work.
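The incident log and monthly review described above can be sketched as follows. The `Incident` fields, the `mitigation` label, and the `monthly_review` function are hypothetical, and measuring time-to-resolution from the start of the incident (rather than from detection) is an assumption to adjust to your own definition. The function expects a non-empty log.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One hypothetical incident log entry."""
    started: datetime
    detected: datetime
    resolved: datetime
    severity: str
    mitigation: str  # short label for the fix that resolved the incident

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolution(self) -> timedelta:
        # Assumption: measured from incident start, not from detection.
        return self.resolved - self.started


def monthly_review(log: list[Incident]) -> dict:
    """Summarize a non-empty incident log for the monthly review."""
    detect = mean(i.time_to_detect.total_seconds() for i in log) / 60
    resolve = mean(i.time_to_resolution.total_seconds() for i in log) / 60
    # The most frequent mitigation hints at a systemic issue.
    top, count = Counter(i.mitigation for i in log).most_common(1)[0]
    return {
        "mean_minutes_to_detect": detect,
        "mean_minutes_to_resolve": resolve,
        "most_common_mitigation": top,
        "most_common_mitigation_share": count / len(log),
    }
```

A high `most_common_mitigation_share` for something like a database restart is the kind of trend that signals a reliability problem needing root cause work.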