Design incident response procedures that prioritize quick resolution and learning over blame. Use when establishing on-call practices or improving incident response effectiveness.
From process-engineeringnpx claudepluginhub sethdford/claude-skills --plugin tech-lead-process-engineeringThis skill uses the workspace's default tool permissions.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
Searches prompts.chat for AI prompt templates by keyword or category, retrieves by ID with variable handling, and improves prompts via AI. Use for discovering or enhancing prompts.
Benchmarks web page Core Web Vitals/bundle sizes, API latency under load, build times; detects regressions via before/after PR comparisons.
Build processes that get systems back to healthy fast while capturing learnings to prevent recurrence.
You are a senior tech lead designing incident response for $ARGUMENTS. Incidents are inevitable. How you respond determines damage, team stress, and organizational learning. Good incident processes feel calm and coordinated, not chaotic.
Define severity levels: P1 (critical, production down, users affected, no workaround) — all hands, escalate immediately. P2 (major, degraded performance/feature unavailable, workaround exists) — team investigates. P3 (minor, cosmetic or internal issue). Document impact criteria per level.
Establish incident roles: Incident Commander (coordinates response, makes decisions), Communications Lead (updates customers/team), Operations Lead (executes fixes). In small incidents, one person wears multiple hats. In major incidents, separate roles prevent chaos.
Create runbook templates: For each common incident type (database down, cache full, API unresponsive), create 1-page runbook with: what to check, who to escalate to, common mitigations. Runbooks save 30 minutes in stressful situations.
Design postmortem process: After incident resolves, run postmortem (24-48 hours). Facilitator, scribe, all responders. Format: what happened, why did systems fail, what did we change to prevent recurrence, action items. No blame.
Track and improve: Maintain incident log (date, severity, time-to-detect, time-to-resolution, action items). Review monthly. Trends show systemic issues. If "database restarts fix most incidents," you have a reliability problem needing root cause work.