Root Cause Analyzer
Trace observable symptoms through causal chains to identify actual root causes, distinguishing symptoms from contributing factors and true origins.
Guiding Principle
"Treating symptoms is maintenance. Finding root causes is engineering."
Procedure
Step 1 — Symptom Documentation
- Catalog all observable symptoms: errors, performance degradation, unexpected behavior.
- Classify symptoms by severity and frequency.
- Establish a timeline: when did symptoms first appear?
- Correlate symptoms with recent changes (deployments, config changes, dependency updates).
- Document each symptom with evidence [HECHO].
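The documentation step above can be sketched as a small record type plus a timeline lookup. This is a minimal illustration, not a prescribed schema: the `Symptom` and `Change` classes, the severity labels, the sample data, and the 24-hour window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Symptom:
    description: str
    severity: str         # hypothetical scale: "critical", "major", "minor"
    occurrences: int      # frequency within the observation period
    first_seen: datetime  # anchors the symptom on the timeline
    evidence: str         # pointer to a log line, metric, or stack trace


@dataclass(frozen=True)
class Change:
    description: str      # deployment, config change, dependency update
    applied_at: datetime


def changes_preceding(symptom, changes, window=timedelta(hours=24)):
    """Return changes applied within `window` before the symptom first
    appeared, most recent first. These are candidates only: timing alone
    does not establish causation."""
    candidates = [
        c for c in changes
        if timedelta(0) <= symptom.first_seen - c.applied_at <= window
    ]
    return sorted(candidates, key=lambda c: c.applied_at, reverse=True)


# Hypothetical sample data, invented for illustration.
symptom = Symptom(
    description="HTTP 500 on checkout",
    severity="critical",
    occurrences=120,
    first_seen=datetime(2024, 1, 10, 14, 0),
    evidence="application error log",
)
recent_changes = [
    Change("deploy v2", datetime(2024, 1, 10, 13, 30)),
    Change("config change", datetime(2024, 1, 8, 9, 0)),
    Change("dependency update", datetime(2024, 1, 10, 15, 0)),
]
suspects = changes_preceding(symptom, recent_changes)
```

Changes applied after the symptom first appeared, or long before it, are filtered out, so only the remaining candidates need to be traced in Step 2.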
Step 2 — Causal Chain Tracing
- For each symptom, apply the "5 Whys" technique to trace the causal chain.
- Follow the code path from the symptom location backward through call chains.
- Check for environmental factors: configuration, data, external dependencies.
- Identify branching causes: multiple independent causes for the same symptom.
- Document each level of the causal chain with an evidence tag: [HECHO] (verified fact), [INFERENCIA] (inference), or [SUPUESTO] (assumption).
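One way to keep branching causes visible while tracing backward is to record each "why" answer as an edge from an effect to its causes and then enumerate every chain. The sketch below assumes a plain dictionary named `causes_of` and invented example statements; it is an illustration of the bookkeeping, not a required tool.

```python
def trace_chains(symptom, causes_of):
    """Enumerate every causal chain from `symptom` back to a statement
    that has no further recorded cause.

    `causes_of` maps an effect to the list of causes found by asking
    "why?" about it. A cause already on the current path is skipped so
    that a cyclic record cannot loop forever.
    """
    chains = []

    def walk(node, path):
        path = path + [node]
        parents = [p for p in causes_of.get(node, []) if p not in path]
        if not parents:
            chains.append(path)  # deepest point reached on this branch
            return
        for parent in parents:
            walk(parent, path)

    walk(symptom, [])
    return chains


# Hypothetical findings, invented for illustration.
causes_of = {
    "checkout returns 500": ["payment client times out"],
    "payment client times out": [
        "connection pool exhausted",
        "upstream latency spike",
    ],
    "connection pool exhausted": ["connections not released on error path"],
}
chains = trace_chains("checkout returns 500", causes_of)
```

Each returned chain ends at a statement with no recorded cause, which makes it a candidate for the root cause checks in Step 3 rather than a confirmed root cause.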
Step 3 — Root Cause Identification
- Distinguish symptoms (what you see), contributing factors (what made it worse), and root causes (what started it).
- Verify the proposed root cause: would removing it prevent all symptoms?
- Check for multiple root causes that combine to produce the observed behavior.
- Validate against the timeline: does the root cause precede all symptoms?
- Tag findings: [HECHO] (fact) for verified, [INFERENCIA] (inference) for likely, [SUPUESTO] (assumption) for hypothetical.
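The verification checks in this step (timeline order, coverage of all symptoms, and a valid evidence tag) can be expressed as a small function. This is a hedged sketch: `validate_root_cause`, its parameters, and the sample timestamps are hypothetical, and passing these checks does not by itself prove causation.

```python
from datetime import datetime

VALID_TAGS = {"[HECHO]", "[INFERENCIA]", "[SUPUESTO]"}


def validate_root_cause(cause_time, symptom_times, explained, tag):
    """Return a list of problems with a proposed root cause.

    cause_time:    when the proposed cause was introduced
    symptom_times: mapping of symptom name -> first appearance
    explained:     symptom names the proposed cause accounts for
    tag:           evidence tag attached to the finding
    An empty list means the proposal passed these checks.
    """
    if tag not in VALID_TAGS:
        raise ValueError(f"unknown evidence tag: {tag}")
    problems = []
    # Timeline check: a root cause must precede every symptom.
    early = sorted(n for n, t in symptom_times.items() if t < cause_time)
    if early:
        problems.append(f"symptoms predate the cause: {early}")
    # Coverage check: unexplained symptoms hint at another root cause.
    unexplained = sorted(set(symptom_times) - set(explained))
    if unexplained:
        problems.append(f"symptoms not explained: {unexplained}")
    return problems


# Hypothetical timeline, invented for illustration.
symptom_times = {
    "500 errors": datetime(2024, 1, 10, 14, 0),
    "slow queries": datetime(2024, 1, 10, 12, 0),
}
problems = validate_root_cause(
    cause_time=datetime(2024, 1, 10, 13, 30),
    symptom_times=symptom_times,
    explained={"500 errors"},
    tag="[INFERENCIA]",
)
```

In this example the proposed cause fails both checks for "slow queries", which suggests a second, earlier root cause rather than a single origin.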
Step 4 — Root Cause Report
- Present each root cause with its full causal chain.
- Recommend corrective actions (which fix the root cause) separately from mitigations (which only reduce impact).
- Suggest preventive measures to avoid recurrence.
- Produce a causal chain diagram in Mermaid.
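A causal chain diagram can be generated from the recorded cause-and-effect pairs rather than drawn by hand. The helper below is a minimal sketch that emits Mermaid `flowchart TD` source; the function name `to_mermaid`, the generated node ids, and the example statements are assumptions made for illustration.

```python
def to_mermaid(edges):
    """Build Mermaid flowchart source from (cause, effect) pairs.

    Each distinct statement gets a generated id (n0, n1, ...) so that
    free-text labels never have to be valid Mermaid identifiers.
    """
    ids = {}
    lines = ["flowchart TD"]

    def node_id(label):
        if label not in ids:
            ids[label] = f"n{len(ids)}"
            # Double quotes would end the label early, so swap them out.
            safe = label.replace('"', "'")
            lines.append(f'    {ids[label]}["{safe}"]')
        return ids[label]

    for cause, effect in edges:
        source = node_id(cause)
        target = node_id(effect)
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


# Hypothetical chain, invented for illustration.
diagram = to_mermaid([
    ("connections not released on error path", "connection pool exhausted"),
    ("connection pool exhausted", "checkout returns 500"),
])
print(diagram)
```

Arrows point from cause to effect, so the root cause appears at the top of the rendered diagram and the observed symptom at the bottom.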
Quality Criteria
- Clear distinction between symptoms, contributing factors, and root causes
- Each causal link supported by code or data evidence [HECHO]
- Causal chain validated against timeline
- Corrective vs. mitigation actions clearly separated
Anti-Patterns
- Stopping at the first plausible cause without tracing deeper
- Confusing correlation with causation (timing coincidence)
- Blaming "human error" as a root cause (trace further to the system design issue that allowed the error)
- Proposing fixes without verifying they address the actual root cause