Provides a structured root-cause investigation protocol for complex technical problems, enforcing explicit diagnosis cycles and blocking fixes until cause is verified.
Install: npx claudepluginhub bmad-labs/skills --plugin bmad-skills
Complex technical problems fail in a specific pattern: pattern-match to a plausible fix, apply it, observe it didn't work, apply a variation, repeat. Each iteration feels productive. None produce understanding. The loop can run for hours.
The protocol breaks the loop by separating two modes that must not be mixed: diagnosis mode, in which you build a verified model of the cause, and implementation mode, in which you write the fix.
You cannot enter implementation mode until diagnosis mode has produced a verified cause — meaning you hold direct evidence (a log line, a port number, a source function, a network trace, a config value read from the running process) that explains the symptom. "It might be X" is a hypothesis, not verified evidence. "The running process reads config from path Y, not path Z, because the startup script overwrites the env var before exec" is verified.
Every step in a complex investigation follows this three-part structure. Write it out explicitly — do not compress it into a single paragraph.
THOUGHT: What hypothesis am I testing? What do I expect to find?
What would this result mean for my current model of the problem?
ACTION: The single most informative thing I can do right now —
a command, a source read, a log grep, a process inspection.
One action per cycle. Pick the action that would most change
your model if the result is unexpected.
OBSERVATION: What actually happened. Quote the relevant output directly.
Does this confirm or refute the hypothesis?
How does it change the model?
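A single written-out cycle might look like this; the service, port, and bind-address hypothesis are invented for illustration:

THOUGHT: Remote clients get "connection refused" on 8080 while local
         curl works. I suspect the service bound to 127.0.0.1 rather
         than 0.0.0.0. If the bound address turns out to be 0.0.0.0,
         I drop this hypothesis and look at firewall or routing instead.
ACTION:  ss -tlnp | grep ':8080'
OBSERVATION: "LISTEN 0 128 127.0.0.1:8080" confirms the hypothesis and
         rules out firewall theories entirely; the next question is
         which config value set the bind address.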
The OBSERVATION step is where understanding is built. An observation that says "that confirms my theory" without explaining why is a red flag — it means you may be fitting evidence to a pre-formed conclusion rather than updating your model.
The diagnosis ladder below runs from cheap surface checks down to controlled experiments. Work it top to bottom and stop at the level where you find verified evidence. Going further than necessary wastes time; stopping too early produces wrong diagnoses.
Before any action, restate the symptom in the most concrete observable terms: what exactly is observed, and what was expected instead.
The gap between what is observed and what is expected defines the shape of the problem. Every diagnostic action should be aimed at explaining that gap specifically — not at exploring adjacent possibilities.
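For example, with an invented service and endpoint:

Observed:  curl http://localhost:8080/health returns "connection refused"
Expected:  HTTP 200 from the health endpoint
Gap:       nothing is accepting connections on port 8080; every
           diagnostic action should aim at explaining that, and only that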
Check what is free to check and would explain everything if wrong:
The process environment trap. A very common failure mode across all stacks: you configure a variable in a launcher, compose file, or wrapper script, but the process overwrites it at startup before it matters. Always verify what the running process sees, not what you told the launcher to pass. On Linux: /proc/<pid>/environ. On other platforms: equivalent process inspection tools.
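A minimal check of both sides, assuming a compose-launched service and an illustrative variable name MY_VAR:

# What the launcher was told to pass:
grep -n 'MY_VAR' docker-compose.yml
# What the running process actually sees (Linux):
tr '\0' '\n' < /proc/<pid>/environ | grep MY_VAR
# If the two disagree, something in the startup path rewrites the variable.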
Identify the intended flow, then verify each step in the actual running system.
The root cause is almost always at exactly one point of divergence — a layer boundary, a config value that was silently overridden, an interface that the code bound to differently than expected. The gap between "last correct" and "first wrong" is where to look.
Concretely: rather than theorising about what could go wrong, inspect running state — open file descriptors, bound addresses, active connections, actual values being processed — and compare them to what you expect.
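On Linux, for instance, two standard commands cover most of this:

# All listening TCP sockets with owning processes (iproute2):
ss -tlnp
# Every open file, socket, and pipe held by one process:
lsof -p <pid>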
When a component does not behave as documented or configured, read the code that processes the config or handles the relevant path. This sounds slow. It is reliably fast compared to guessing at configuration variations.
Source access priority: the local checkout or installed package source first, then the upstream repository (readable without cloning; see the toolkit below), then string extraction from the binary when no source is available.
The pattern that makes source reading so valuable: a function that reads CONFIG_VAR_B instead of the CONFIG_VAR_A you've been setting terminates the investigation immediately. No configuration change to CONFIG_VAR_A would ever work, regardless of how many variations you tried.
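One quick way to locate every environment read in a checkout; the src/ path and the patterns are illustrative and should match the project's language:

# Where does this codebase actually read its environment?
grep -rn 'getenv\|os\.environ\|process\.env' src/ | grep -i config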
When source is unavailable or the code path is too complex to trace statically, run a controlled experiment that isolates exactly one variable:
A good experiment is one that could prove you wrong. If it can only confirm what you already believe, it is not diagnostic — it is confirmation bias with extra steps.
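A sketch of such an experiment, reusing the hypothetical CONFIG_VAR_A from above. Start from an empty environment so that exactly one variable differs between runs:

# Control run: empty environment except PATH.
env -i PATH=/usr/bin:/bin /path/to/binary
# Experiment run: identical except for the single variable under test.
env -i PATH=/usr/bin:/bin CONFIG_VAR_A=/tmp/experiment.conf /path/to/binary
# Identical behaviour means CONFIG_VAR_A is not read; stop varying it.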
You are stuck when fix variants are being applied without producing new diagnostic information, or when your current model cannot explain a key observation. When stuck: stop. Do not apply another fix variant.
Instead:
State the situation explicitly:
"I am stuck. My current model of the problem is [X]. The evidence I have is [Y]. The part I cannot explain is [Z]."
Descend the diagnosis ladder — you have not gone far enough. The most common reason for being stuck is that the actual behaviour at some layer has not been inspected directly; there is an assumption standing in for observation.
If all available diagnostic tools have been exhausted, escalate to external research (see below).
The phrase "stop doing trial and error without knowing the root cause" is a hard stop signal from the user. It means the protocol was violated — return immediately to diagnosis mode, regardless of how close the current fix attempt feels.
Escalate to web search, documentation, or source code research only once diagnosis has produced a specific, researchable question.
A researchable question is specific enough that a search could answer it directly:
"What environment variable does [library X]'s [function Y] actually read at runtime?"
A non-researchable question is what you write when stuck and hoping research will rescue you:
"How does [technology A] work with [technology B]?"
If you cannot write a specific question, you have not diagnosed far enough. More diagnosis, not more research, is the right move.
"Root cause confirmed: [component A] uses [value X] because [mechanism Y] overrides the configured [value Z] at startup. Evidence: [direct quote from log/source/process state]."
When a system has multiple layers (any stack: hardware → driver → OS → runtime → middleware → application → config), failures almost always occur at exactly one layer boundary. The strategy for finding it:
Find the last layer that works correctly. Where in the chain does behaviour match expectation? Start from the input end and work forward.
Find the first layer that fails. Where does behaviour first diverge from expectation?
The root cause is at that boundary. You now have a precise, one-layer question instead of a whole-system question. Investigate only that boundary.
This decomposition converts "nothing works end-to-end" into a single-layer question that can be answered with one or two diagnostic actions.
Do not theorise about layers you have not inspected. It is tempting to hypothesise about a deep layer when the surface layers have not been fully checked; the actual divergence point is almost always shallower than expected.
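A hypothetical walk for a web service, outside in; the names, port, and config path are invented. Stop at the first command whose output diverges from expectation:

curl -sv http://localhost:8080/health      # application: does the endpoint answer?
ss -tlnp | grep ':8080'                    # transport: is the port bound, and by which pid?
ps -fp <pid>                               # process: is it the binary you expect?
grep -n 'listen' /etc/myservice/conf.yaml  # config: what was it told to bind?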
| Anti-pattern | Signal | Correct response |
|---|---|---|
| Config iteration | Trying the third variation of the same config change | Stop. Read what config the running process actually loads. |
| Restart loop | Rebuild → restart → check, without new diagnostic information | Stop. The code did not change. Inspect state before restarting again. |
| Assumption drift | Fix is written for cause X before X has been verified | Treat X as a hypothesis. Find direct evidence before writing a fix. |
| Complexity escalation | Each failed fix attempt adds more layers or indirection | Apply Occam's razor. The simpler explanation is right more often. A 3-line change to the right place beats a 50-line workaround around the wrong place. |
| Confirmation reading | Reading diagnostic output to confirm existing belief rather than test it | Ask: what would I see if my hypothesis is wrong? Look for that specifically. |
| Shell ≠ process env | Assuming the process sees what the launcher was told to pass | Verify the process's own environment at runtime, not the launch config. |
| Layer skipping | Theorising about a deep layer without inspecting the surface layers first | Walk the chain from input to output. The first divergence is the root cause. |
# What environment does the running process actually see?
# Linux:
tr '\0' '\n' < /proc/<pid>/environ
# Or filter for a specific variable:
tr '\0' '\n' < /proc/<pid>/environ | grep VAR_NAME
# Which network sockets does a process own? (Linux)
# Map process file descriptors to UDP/TCP ports:
python3 -c "
import os, re
pid = <pid>
# Map socket inode -> file descriptor number for this process.
inodes = {}
for fd in os.listdir(f'/proc/{pid}/fd'):
    try:
        m = re.match(r'socket:\[(\d+)\]', os.readlink(f'/proc/{pid}/fd/{fd}'))
        if m:
            inodes[int(m.group(1))] = fd
    except OSError:
        pass  # fd closed between listdir and readlink
# Cross-reference against the kernel's socket tables.
for proto in ['udp', 'tcp']:
    with open(f'/proc/net/{proto}') as f:
        for line in f:
            p = line.split()
            if len(p) < 10:
                continue
            try:
                inode = int(p[9])  # column 10 is the socket inode
            except ValueError:
                continue  # header line
            if inode in inodes:
                port = int(p[1].split(':')[1], 16)  # local_address is hex ip:port
                print(f'{proto.upper()} fd{inodes[inode]} port={port}')
"
# What string literals (env var names, config keys) does a binary contain?
grep -oa '[A-Z_][A-Z0-9_]\{3,\}' /path/to/binary | sort -u | head -50
# Read source from a public GitHub repo without cloning:
gh api repos/<org>/<repo>/contents/<path/to/file> --jq '.content' | base64 -d
# Which file is a process actually reading? (Linux, requires strace)
strace -p <pid> -e trace=openat 2>&1 | grep -v ENOENT
# What is the process's working directory and open files?
ls -la /proc/<pid>/fd
readlink /proc/<pid>/cwd
Do not declare a fix done until every item is checked: the root cause is stated with direct evidence in the form given above; the fix targets that verified cause rather than a symptom; and the original symptom, restated in observed/expected terms, has been re-tested and is gone.