From 9arm-skills
Writes an engineering post-mortem for a fixed bug: root cause, mechanism, fix, validation, and how it slipped through. Refuses to draft if root cause or fix is unconfirmed.
npx claudepluginhub thananon/9arm-skills
How this skill is triggered (by the user, by Claude, or both):
Slash command
/9arm-skills:post-mortem
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
The canonical engineering record of a bug fix. Written **after** debugging lands a real fix, **for** other engineers (and future-you, who will have forgotten everything in 6 months). Code identifiers are welcome here — this is the artifact that lets the next person recover the mental model fast.
The canonical engineering record of a bug fix. Written after debugging lands a real fix, for other engineers (and future-you, who will have forgotten everything in 6 months). Code identifiers are welcome here — this is the artifact that lets the next person recover the mental model fast.
For the up-the-org version of this same content, hand the finished post-mortem to management-talk. They compose: post-mortem owns the engineering truth, management-talk reframes it for leadership.
Before writing a single line, confirm all four. If any are missing, list what's missing and stop:
These map directly to debug-mantra steps 1–4. If you came in via debug-mantra, the breadcrumb ledger from step 4 is your raw material — pull from it.
Use these blocks in this order. Summary, Root cause, Fix, and Validation are mandatory. The rest are conditional but usually present.
Summary. One paragraph. What broke, in user/workload terms. What fixed it, in one sentence. JIRA key, PR number, owner. A reader who stops here should have the right answer.
Symptom. What was actually observed: test output, error message, log line, perf number, customer report. Use concrete identifiers; don't paraphrase the failure mode.
Root cause. The actual bug mechanism. Code identifiers are welcome and expected: function names, file paths, struct fields, branch conditions, commit SHAs of the offending change. Walk the cause chain end-to-end. This is the most expensive section and the reason the post-mortem exists at all. Future-you will live or die by how clearly you write this.
Why it produced the symptom. Link the root cause to the symptom. The link is often non-obvious: the bug is in tadaLaunchPrepare but the visible failure is a customer training run hanging hours later. Walk the chain so a reader who only knows the symptom can connect it back to the cause without re-deriving it.
Fix. What changed and why this change addresses the root cause rather than hiding the symptom. Link to PR / commit. If a previous fix attempt papered over the symptom, name it and explain what was wrong with it; that history is part of the cause.
How it was found. Short. The debugging path:
debug-mantra step 2 cascade).
This section is for the next debugger, so make it learnable.
Why it slipped through. What allowed this bug to reach the branch / release / customer. Pick the real reason:
If the honest answer is "no good reason, we should have caught this," say so. Stay blameless: describe the gap, not the person.
Validation. How we know the fix works. Concrete:
If you only validated one configuration, say so explicitly: "validated on Llama-2-70B / 8 GPUs / DeepSpeed; not retested on other workloads." Don't imply broader coverage than you actually have.
Action items. Concrete next steps that aren't in the fix PR itself. Each item: what + owner + tracking artifact.
If there are no action items, write "None — the fix is sufficient and no class-of-bug follow-up is warranted." Don't manufacture action items to look thorough.
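The block order and the mandatory set described above can be checked mechanically before a draft is shared. The following is a minimal sketch, assuming the block labels are the ones the worked example in this document uses; `check_blocks`, `BLOCK_ORDER`, and `MANDATORY` are hypothetical names for illustration and are not part of the skill itself.

```python
from typing import List

# Labels in the order the worked example uses them (assumed to be canonical).
BLOCK_ORDER = (
    "Summary",
    "Symptom",
    "Root cause",
    "Why it produced the symptom",
    "Fix",
    "How it was found",
    "Why it slipped through",
    "Validation",
    "Action items",
)
# The four blocks the text calls mandatory.
MANDATORY = ("Summary", "Root cause", "Fix", "Validation")


def check_blocks(present: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the draft's shape is fine."""
    problems = []
    for name in MANDATORY:
        if name not in present:
            problems.append(f"missing mandatory block: {name}")
    for name in present:
        if name not in BLOCK_ORDER:
            problems.append(f"unknown block: {name}")
    # Compare the known blocks against the same blocks sorted into canonical order.
    known = [name for name in present if name in BLOCK_ORDER]
    expected = sorted(known, key=BLOCK_ORDER.index)
    if known != expected:
        problems.append("blocks out of order; expected: " + ", ".join(expected))
    return problems


if __name__ == "__main__":
    draft = ["Summary", "Fix", "Root cause", "Action items"]
    for problem in check_blocks(draft):
        print(problem)
```

A check like this only covers shape. It cannot tell whether the root cause is actually confirmed, which remains a judgment call made before drafting.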
This is engineer-to-engineer. Different from management-talk:
tadaLaunchPrepare, tada/prim.h::syncWaitPeer, scratchBuf, commit SHAs, line numbers: keep them. The whole point is that future engineers can grep their way back to the change.
docs/postmortems/<ticket>.md, internal wiki page. The shape is the same; only the wrapping changes.
POST /rest/api/3/issue/<KEY>/comment. Print-only output needs no approval.
management-talk." Don't do it automatically.
Summary. Tada's single-stream fast-path skipped a required cross-stream synchronization, causing kernels to launch before scratch-buffer writes were visible. Triggered reliably by dumbModel on LLM-7B fine-tuning, hanging the workload at every eval step. Fixed by removing the unsafe fast-path and tightening a device-side check. JIRA-12345, PR org/platform#5751, owner Alex (Tada team).
Symptom. 8-GPU LLM-7B fine-tuning under dumbModel hung indefinitely at the first eval step. No error, no timeout: a busy-spin in tadaKernel_AllReduce_f32_RING. Reproduced on every run.
Root cause. The single-stream fast-path in tadaLaunchPrepare/tadaLaunchKernel/tadaLaunchFinish (gated on scheduler->numStreams == 1 && !plan->persistent) skipped the cross-stream event between launchStream and handle->shared->deviceStream. dumbModel hits this gate exactly. The kernel was launched before the IPC publish / scratch-buffer writes on deviceStream (which populate scratchBuf) were visible to launchStream. In the kernel: scratchBuf == NULL → stray pointer dereference → ring ready-flag read from garbage memory → thread spins forever waiting for a ready signal that will never arrive.
Why it produced the symptom. The hang lives in the all-reduce ring wait loop, which is the last visible thing in the call stack, but the actual bug is at launch-prep, several frames earlier. The skipped sync is silent until a workload triggers the exact gate (single-stream, non-persistent), and dumbModel's reduce-scatter pattern hits it at every eval step.
Fix. PR #5751 removes the single-stream fast-path entirely (the saving was negligible vs. the safety it bypassed) and adds a device-side null check on scratchBuf before dereference, so the same class of bug fails loudly instead of silently spinning. A previous attempt (PR #5612) added a host-side defensive check after IPC publish that hid the symptom in some paths but left the underlying race in place; that change is also reverted.
How it was found. Reproducer narrowed from "8-GPU LLM-7B hangs sometimes" to a deterministic 30s repro by pinning to a single eval step on a 2-GPU subset. Initial hypothesis: kernel launch ordering on launchStream. Disproved by the debugger: the kernel was correctly enqueued. Second hypothesis: scratch-buffer init race. Confirmed by adding [DBG-7af3] instrumentation in tadaLaunchPrepare printing scratchBuf and a deviceStream event-record timestamp; the launch happened before the publish completed. Single experiment that nailed it: forcing numStreams = 2 made the bug disappear, isolating the gate.
Why it slipped through. Latent code path. The single-stream fast-path was added in March under the assumption that dumbModel paths always took the multi-stream route. That assumption was true at the time. A May change to dumbModel's launcher began collapsing eval steps to a single stream, at which point the gate flipped. Tada's CI did not exercise the single-stream + IPC + scratch-buffer combination; the customer workload was the first to hit it.
Validation. Original LLM-7B / 8-GPU / dumbModel workload now completes a full eval pass cleanly (3 consecutive 2-hour runs). tada-tests all_reduce_perf regression suite green. Soak run: 6 hours on 8 GPUs, no hang. Not retested on other model sizes or non-dumbModel workloads; both go through the multi-stream path and were never affected.
Action items.
- Regression test added: tests/single_stream_ipc_publish_test.cpp exercising the previously-uncovered gate. (Alex, merged in PR #5751.)
- CI gap: add a single-stream + IPC matrix entry to nightly. (Alex, JIRA-12346.)
- Doc update: Tada launch-fast-path invariants documented in docs/launch_synchronization.md. (Alex, PR #5752.)
- Related: audit other numStreams == 1 fast-paths for the same class of bug. (Filed as JIRA-12347.)
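The ordering bug in the worked example can be modeled on the host to make the mechanism concrete. This is only an analogy in Python, assuming a `threading.Event` stands in for the cross-stream event; `Device`, `device_stream`, `launch_stream`, and the two driver functions are hypothetical names and do not reflect Tada's actual code.

```python
import threading
from typing import List, Optional


class Device:
    """Toy stand-in for shared device state; not Tada's real data structure."""

    def __init__(self) -> None:
        self.scratch_buf: Optional[List[int]] = None
        self.published = threading.Event()  # plays the role of the cross-stream event


def device_stream(dev: Device) -> None:
    # Publish step: populate the scratch buffer, then record the event.
    dev.scratch_buf = [0] * 4
    dev.published.set()


def launch_stream(dev: Device, wait_for_publish: bool) -> str:
    # wait_for_publish=False corresponds to the removed single-stream fast-path.
    if wait_for_publish:
        dev.published.wait()
    if dev.scratch_buf is None:
        # Analogue of the device-side null check: fail loudly instead of spinning.
        return "error: scratch_buf not published"
    return "ok"


def launch_without_sync() -> str:
    dev = Device()  # the publish has not run yet
    return launch_stream(dev, wait_for_publish=False)


def launch_after_publish() -> str:
    dev = Device()
    result: List[str] = []
    launcher = threading.Thread(
        target=lambda: result.append(launch_stream(dev, wait_for_publish=True))
    )
    launcher.start()    # blocks on the event until the publish is recorded
    device_stream(dev)  # the publish arrives later, from the other "stream"
    launcher.join()
    return result[0]


if __name__ == "__main__":
    print(launch_without_sync())
    print(launch_after_publish())
```

The skipped wait is what lets the consumer observe an unpopulated buffer; with the wait in place, the consumer cannot proceed until the publish has been recorded, regardless of thread scheduling.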
What this post-mortem does that the management-talk version didn't:
tadaLaunchPrepare, scratchBuf, numStreams, handle->shared->deviceStream).
numStreams = 2 made it disappear).
management-talk's job, not yours.
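For the `POST /rest/api/3/issue/<KEY>/comment` route mentioned earlier, the request can be built and printed without being sent. The sketch below assumes Jira Cloud's REST API v3, where the comment body is expressed in Atlassian Document Format; `build_comment_request` is a hypothetical helper, and authentication, the base URL, and any approval step are deliberately left out.

```python
import json
from typing import Any, Dict, Tuple


def build_comment_request(issue_key: str, text: str) -> Tuple[str, str, Dict[str, Any]]:
    """Build (method, path, JSON body) for a Jira comment without sending it."""
    # One ADF paragraph per blank-line-separated block of the post-mortem text.
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]
    body = {
        "body": {
            "type": "doc",
            "version": 1,
            "content": [
                {"type": "paragraph", "content": [{"type": "text", "text": p}]}
                for p in paragraphs
            ],
        }
    }
    return "POST", f"/rest/api/3/issue/{issue_key}/comment", body


if __name__ == "__main__":
    method, path, body = build_comment_request(
        "JIRA-12345", "Root cause confirmed.\n\nFix merged."
    )
    # Print-only: nothing is sent over the network here.
    print(method, path)
    print(json.dumps(body, indent=2))
```

Keeping construction separate from sending means the printed request can be reviewed first, and the actual POST can be added later behind whatever confirmation the workflow requires.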