From agent-almanac
Implements regenerative recovery for damaged systems using triage, stabilization, scaffolding, progressive rebuild, and scar management. For incidents, failed migrations, technical debt.
npx claudepluginhub pjt222/agent-almanac
This skill uses the workspace's default tool permissions.
---
Implement regenerative recovery for systems that have sustained structural damage — whether from incidents, failed migrations, accumulated neglect, or external disruption. Uses biological wound-healing as a framework: triage, stabilization, scaffolding, progressive rebuild, and scar tissue management.
[…] adapt-architecture) left the system in a damaged intermediate state
[…] defend-colony) when the colony sustained damage
Rapidly assess all damage and classify by severity and urgency.
Wound Classification:
┌──────────┬──────────────────────┬────────────────────────────────────┐
│ Class    │ Severity             │ Response                           │
├──────────┼──────────────────────┼────────────────────────────────────┤
│ Critical │ Core function lost,  │ Immediate: stop bleeding, activate │
│          │ data at risk,        │ backup, redirect traffic, page     │
│          │ actively spreading   │ on-call team                       │
├──────────┼──────────────────────┼────────────────────────────────────┤
│ Serious  │ Important function   │ Urgent: fix within hours/days,     │
│          │ degraded, no spread  │ workarounds acceptable short-term  │
├──────────┼──────────────────────┼────────────────────────────────────┤
│ Moderate │ Non-critical function│ Scheduled: fix within sprint,      │
│          │ affected, contained  │ prioritize against other work      │
├──────────┼──────────────────────┼────────────────────────────────────┤
│ Minor    │ Cosmetic or edge     │ Backlog: fix when convenient,      │
│          │ case, no user impact │ may self-resolve                   │
└──────────┴──────────────────────┴────────────────────────────────────┘
Expected: A complete wound inventory classified by severity, with a prioritized repair order that accounts for wound interactions.
On failure: If triage takes too long (the system is actively degrading), skip detailed classification and focus on: "What is the single most critical thing to stabilize?" Fix that first, then return to full triage.
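The wound classification above can be sketched as a small data model. This is a minimal illustration, not part of the skill itself: the `Wound` fields, the `classify` rules, and the example wound names are hypothetical, and the sketch assumes that any one of the Critical criteria is enough to classify a wound as Critical.

```python
from dataclasses import dataclass
from enum import IntEnum


class WoundClass(IntEnum):
    # Lower value means repaired earlier.
    CRITICAL = 0
    SERIOUS = 1
    MODERATE = 2
    MINOR = 3


@dataclass
class Wound:
    # Hypothetical fields mirroring the Severity column of the table.
    name: str
    core_function_lost: bool = False
    data_at_risk: bool = False
    spreading: bool = False
    important_function_degraded: bool = False
    user_impact: bool = True


def classify(w: Wound) -> WoundClass:
    # Assumption: a single Critical criterion is sufficient.
    if w.core_function_lost or w.data_at_risk or w.spreading:
        return WoundClass.CRITICAL
    if w.important_function_degraded:
        return WoundClass.SERIOUS
    if w.user_impact:
        return WoundClass.MODERATE
    return WoundClass.MINOR


def repair_order(wounds):
    # sorted() is stable, so wounds of equal class keep their inventory order.
    return sorted(wounds, key=classify)


inventory = [
    Wound("typo in footer", user_impact=False),
    Wound("checkout down", core_function_lost=True),
    Wound("slow search", important_function_degraded=True),
]
for wound in repair_order(inventory):
    print(classify(wound).name, wound.name)
```

Sorting by class alone does not capture wound interactions; a fuller version would also need dependency information between wounds.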
Stop the damage from spreading before beginning repair.
Expected: The system is stable (not actively degrading) even if degraded. Damage is contained and not spreading. Evidence is preserved for root cause analysis.
On failure: If stabilization fails (damage continues spreading despite containment), escalate to full system fallback: activate disaster recovery, switch to backup system, or gracefully degrade to minimal viable operation. Stabilization that takes too long becomes the disaster.
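One way to make "stabilization that takes too long becomes the disaster" concrete is to put a deadline on containment and escalate automatically when it passes. The sketch below is only an illustration: `contain`, `is_spreading`, and `fallback` are hypothetical callables that the team would supply, and the default deadline and polling interval are arbitrary.

```python
import time


def stabilize(contain, is_spreading, fallback,
              deadline_s=300.0, poll_s=5.0,
              clock=time.monotonic, sleep=time.sleep):
    """Apply containment, then escalate to fallback if damage is still
    spreading once the deadline has passed.

    Returns "contained" or "fallback".
    """
    start = clock()
    contain()
    while is_spreading():
        if clock() - start >= deadline_s:
            # Containment did not hold in time: switch to disaster recovery,
            # a backup system, or minimal viable operation.
            fallback()
            return "fallback"
        sleep(poll_s)
    return "contained"
```

The `clock` and `sleep` parameters exist so the escalation path can be exercised in a drill without waiting for real time to pass.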
Construct the temporary structures that support the repair process.
Expected: A repair environment with diagnostic capability, a sequenced repair plan, and awareness of scar tissue risk.
On failure: If setting up a proper repair environment is too slow (system urgency demands immediate production changes), apply changes directly but with extreme discipline: one change at a time, tested by the available means, rolled back if it doesn't help.
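The "one change at a time" discipline for direct production changes could look roughly like the following. It assumes a hypothetical `health_check` callable that returns a numeric score where higher is healthier, and that every change comes with its own rollback; neither is specified by the skill.

```python
def apply_one_at_a_time(changes, health_check):
    """Apply changes sequentially, keeping only those that improve health.

    changes: list of (name, apply, rollback) tuples.
    health_check: hypothetical callable returning a number, higher is better.
    Returns the names of the changes that were kept.
    """
    kept = []
    for name, apply, rollback in changes:
        before = health_check()
        apply()
        after = health_check()
        if after > before:
            kept.append(name)
        else:
            # The change did not help, so undo it before trying the next one.
            rollback()
    return kept
```

Because only one change is in flight at any moment, any shift in the health score can be attributed to that change alone.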
Repair damage systematically, verifying each fix before proceeding.
Expected: Critical and serious wounds are repaired with verified fixes. Emergency patches are removed. The system is restored to functional operation.
On failure: If a repair attempt fails or causes regression, roll back to the previous state and reassess. If multiple repair attempts fail for the same wound, the damage may be too deep for local repair — consider whether the affected component needs full replacement rather than repair (see dissolve-form).
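The failure path above (roll back, reassess, and consider replacement after repeated failures) can be sketched as a bounded retry loop. The `verify` callable, the `(apply, rollback)` pairs, and the `max_attempts` threshold are assumptions made for illustration.

```python
from itertools import islice


def repair_wound(attempts, verify, max_attempts=3):
    """Try candidate fixes for a single wound, verifying each one.

    attempts: iterable of (apply, rollback) pairs, one per candidate fix.
    verify: hypothetical callable returning True once the wound is healed.
    Returns "repaired", or "consider_replacement" when every attempt within
    the limit failed and was rolled back.
    """
    for apply, rollback in islice(attempts, max_attempts):
        apply()
        if verify():
            return "repaired"
        # Restore the previous state before reassessing.
        rollback()
    return "consider_replacement"
```

A "consider_replacement" result is a prompt for a human decision, for example whether dissolve-form applies, rather than an automatic action.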
Address the workarounds and shortcuts introduced during emergency repair, and strengthen against recurrence.
[…] defend-colony immune memory)
Expected: Scar tissue is managed (removed, replaced, or accepted with documentation). The system is not only repaired but more resilient than before the damage. Learnings are captured for future incidents.
On failure: If scar tissue management is deprioritized ("it works, don't touch it"), schedule it explicitly. Unmanaged scar tissue accumulates and eventually contributes to the next incident. If the root cause can't be identified, strengthen detection and recovery speed as compensating controls.
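To keep scar tissue from silently accumulating, each emergency workaround can be recorded with an explicit disposition. The following is a hypothetical sketch: the `Scar` fields, the example descriptions, and the documentation path are invented for illustration. A scar counts as managed only if it was removed, replaced, accepted with documentation, or has cleanup scheduled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Scar:
    description: str
    disposition: Optional[str] = None    # "removed", "replaced", or "accepted"
    documentation: Optional[str] = None  # expected when disposition is "accepted"
    scheduled_for: Optional[date] = None


def unmanaged(scars):
    """Return the scars that have no resolution, documentation, or schedule."""
    result = []
    for scar in scars:
        if scar.disposition in ("removed", "replaced"):
            continue
        if scar.disposition == "accepted" and scar.documentation:
            continue
        if scar.scheduled_for is not None:
            continue
        result.append(scar)
    return result


scars = [
    Scar("hard-coded retry limit", "accepted", "docs/retry-limit.md"),  # hypothetical path
    Scar("manual cron restart"),
    Scar("feature flag forced off", scheduled_for=date(2030, 1, 1)),
    Scar("temporary index", "accepted"),  # accepted, but never documented
]
for scar in unmanaged(scars):
    print("unmanaged:", scar.description)
```

Running a check like this in a regular review is one way to schedule scar tissue management explicitly instead of leaving it to "it works, don't touch it".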
assess-form: damage assessment shares methodology with form assessment
adapt-architecture: architectural adaptation may be needed if damage reveals structural weakness
dissolve-form: for components too damaged to repair; dissolve and rebuild
defend-colony: defense triggers repair; post-incident recovery feeds back into defense
shift-camouflage: surface adaptation can mask damage while repair proceeds (with caution)
conduct-post-mortem: structured post-incident analysis complements root cause identification
write-incident-runbook: repair procedures should be captured as runbooks for future incidents