From bmad-skills
Guides evidence collection for production incidents and generates reproducible Root Cause Analysis reports using checklists and templates.
npx claudepluginhub bmad-labs/skills --plugin bmad-skills
This skill uses the workspace's default tool permissions.
Produce post-mortems that are **reproducible, layered, and operationally useful** — not just narrative. A good RCA lets a future engineer (or future you) understand the incident, verify the fix held, and avoid repeating it. This skill covers both the investigation flow (what to gather while the incident is fresh) and the report itself.
If the incident is still actively burning and the user just wants help fixing it, skip this skill — fix first, document after.
Save the report to <topic>-rca-<YYYY-MM-DD>.md in the current working directory, where:
<topic> is a short kebab-case identifier of the failing system (e.g. debezium, auth-service, kafka-consumer-lag)
<YYYY-MM-DD> is the incident date (when it occurred), not necessarily today
Example: debezium-rca-2026-05-05.md, auth-500s-rca-2026-04-12.md
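The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function name `rca_filename` and the slug rule (lowercase, runs of non-alphanumerics collapsed into one hyphen) are assumptions about what "short kebab-case identifier" means in practice.

```python
import re
from datetime import date


def rca_filename(topic: str, incident_date: date) -> str:
    """Build '<topic>-rca-<YYYY-MM-DD>.md' from a free-form topic and the incident date."""
    # Assumed slug rule: lowercase, then collapse each run of
    # non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    # date.isoformat() yields YYYY-MM-DD.
    return f"{slug}-rca-{incident_date.isoformat()}.md"


print(rca_filename("Debezium", date(2026, 5, 5)))    # debezium-rca-2026-05-05.md
print(rca_filename("auth 500s", date(2026, 4, 12)))  # auth-500s-rca-2026-04-12.md
```

Passing the incident date explicitly, rather than defaulting to today, keeps the filename tied to when the incident occurred.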
Before writing anything, walk through references/investigation-checklist.md with the user. The goal is to lock in concrete, reproducible facts — timestamps, version numbers, exact LSNs/IDs/error strings, command outputs — while the system state is still observable. Memory degrades fast; logs rotate; replication slots advance. Capture now, write later.
Do not skip this phase even if the user says "I already fixed it" — fixed-state evidence (the healthy confirmed_flush_lsn advancing, the test row flowing through Kafka, the new container log showing "streaming from latest xlogpos") is what proves the resolution actually held. That proof is what separates a real RCA from a story.
If the user already has notes, transcripts, or scrollback from the live incident, mine those before asking questions. Don't make them re-type what's already in the conversation.
Use templates/rca-report.md as the structural skeleton. Fill it section by section using the evidence from Phase 1. Then validate against references/quality-rubric.md before declaring done.
The Debezium RCA that this skill is modeled on worked because it had:
A timeline with UTC timestamps for every observable event — "the connector was wedged for ~18h" is narrative; "2026-05-04 09:54:16 Postgres terminated replication connection" is evidence. Always prefer the precise version.
An infrastructure table that fully identifies the system — versions, hostnames, zones, connector names, topic names, slot names. Someone reading this six months later should be able to find the exact resources without ambiguity.
Quantified impact across user, system, data, and SLA dimensions — vague impact ("some customers were affected") is worthless for severity calibration. State user-visible effect, internal system degradation, data integrity status, and SLA / financial cost as concrete numbers. If a number is unknown, say so explicitly rather than skipping the dimension.
Layered root cause analysis — not just what broke, but:
State snapshots with actual values — the contrast between expected state and observed state is what makes the diagnosis click. confirmed_flush_lsn = 1/AD5B16C0 (pre-restore stale value) next to pg_current_wal_lsn = 1/ADC4B740 (current) tells the whole story in two lines. Capture similar contrasts for whatever domain you're in (queue depth, error rate, version mismatch, schema drift).
Workaround / temporary mitigation captured separately from the resolution — the fast, low-risk action that stopped the bleeding before the root cause was fully understood. Workarounds and resolutions answer different questions: workaround = what does on-call do at 3am next time this fires; resolution = what permanently closes the case. Document the workaround's effect, its risks, and the trigger condition for applying it.
Resolution with ordering rationale — not just "I ran these commands", but why this order. If step 4 must come after step 3 because of in-memory state, say so. The next person hitting this will try the obvious order first and fail; document why obvious-order doesn't work.
A Five Whys chain that lands on a systemic gap — the chain is only useful if it stops at a missing guardrail (alert / review / test / knowledge), not at the technical trigger. Each "why" should narrow on a different mechanism — synonyms across adjacent steps mean you're padding. The final answer should map directly to a Recommendation below it.
A "What did NOT work" section — capture the dead ends. Future-you will be tempted to try the same thing. The Debezium RCA's "drop slot + recreate connector without offset reset" entry is gold — it's the most intuitive fix and it silently fails.
Diagnostic commands as a copy-paste block — the next incident in this domain will reuse 80% of these. Make them runnable, not pseudocode.
Verification evidence — proof the fix held. Test data flowing end-to-end. Slot lag stabilizing. Error rate returning to baseline. With actual values from the post-fix state.
Recommendations binned by urgency — Immediate (alerting/monitoring), Process (runbooks, comms), Configuration (settings changes). Bins force the user to think about timeline, not just "stuff to do".
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) > 100MB for 5+ minutes" is.
Run the investigation checklist (references/investigation-checklist.md). Fill gaps by asking targeted questions or running diagnostic commands. Do this even for "small" incidents: the structure forces depth.
Draft the report from templates/rca-report.md. Fill every section; if a section truly doesn't apply, write "N/A — [reason]" rather than deleting it.
Validate against the quality rubric (references/quality-rubric.md). Fix any rubric failures before presenting.
Save the report to <topic>-rca-<YYYY-MM-DD>.md in CWD.
Match the operator's voice: technical, concise, evidence-led. Lead each section with the answer, then the reasoning. No corporate hedging ("there may have been some impact"): state what happened. No blame language: focus on system gaps, not individuals. The Debezium RCA is the reference; mirror its directness.
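The state-snapshot contrast and the lag-based alert condition quoted earlier can both be checked offline with a short script. The sketch below is illustrative only: the function names are hypothetical, "100MB" is read as 100 MiB (an assumption), and the sample series is made up to exercise the "for 5+ minutes" window. It relies on the documented Postgres LSN text format, two hexadecimal halves separated by a slash.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "100MB" in the alert condition is read as 100 MiB.
THRESHOLD_BYTES = 100 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '1/AD5B16C0' into an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two hex halves: high 32 bits / low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_lag_bytes(current_lsn: str, confirmed_flush_lsn: str) -> int:
    """Byte distance between two LSNs, mirroring pg_wal_lsn_diff(current, confirmed)."""
    return parse_lsn(current_lsn) - parse_lsn(confirmed_flush_lsn)


def lag_alert_fires(samples, threshold_bytes, window):
    """Return True if lag stayed above threshold_bytes for at least `window`.

    `samples` is a list of (timestamp, lag_bytes) tuples sorted by timestamp.
    """
    breach_start = None
    for ts, lag in samples:
        if lag > threshold_bytes:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start >= window:
                return True
        else:
            breach_start = None  # lag recovered, so the window restarts
    return False


# The two values from the state-snapshot example.
print(lsn_lag_bytes("1/ADC4B740", "1/AD5B16C0"))  # 6922368

# Hypothetical samples (not from the incident): one reading per minute, all above threshold.
t0 = datetime(2026, 5, 4, 10, 0, tzinfo=timezone.utc)
samples = [(t0 + timedelta(minutes=i), THRESHOLD_BYTES + 1) for i in range(6)]
print(lag_alert_fires(samples, THRESHOLD_BYTES, timedelta(minutes=5)))  # True
```

The gap between the two snapshot LSNs is about 6.6 MiB, well under the 100MB threshold. Recording the raw values alongside the computed difference lets a reader re-derive the number later.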