Guides systematic root cause analysis for bugs, test failures, unexpected behaviors, and performance issues before proposing fixes.
```shell
npx claudepluginhub ganyuanran/aegis --plugin aegis
```

This skill uses the workspace's default tool permissions.
→ Bug? Test failure? Unexpected behavior? → **Find root cause first. No fixes without evidence.**
Random fixes waste time and create new bugs. Symptom fixes are failure.
This skill is the canonical debugging workflow. Use it to move from symptom to root cause, then to the smallest justified fix and retirement plan.
Any technical issue: test failures, bugs, unexpected behavior, performance problems, build/integration failures.
Especially under time pressure, when "just one quick fix" seems obvious, after multiple failed fixes, or when duplicate owners / fallback chains may be involved.
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
2. Reproduce Consistently — use feedback-loop-construction.md to build an automated reproduction loop; don't guess
3. Check Recent Changes
4. Gather Evidence in Multi-Component Systems
5. Trace Data Flow (when the error is deep in the call stack) — see root-cause-tracing.md
6. Drill Down Through Diagnostic Layers
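The "Reproduce Consistently" step above can be sketched as a small measurement loop (a minimal sketch; the command you pass in is your own failing case, not anything this skill ships):

```python
import subprocess

def run_repro(cmd: list[str], attempts: int = 10) -> float:
    """Run a candidate reproduction command repeatedly and report its
    failure rate, so "it sometimes breaks" becomes a measured number."""
    failures = 0
    for _ in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
    return failures / attempts
```

A reproduction is only trustworthy once this rate is stable — ideally 1.0 before you attempt any fix.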
Start at L1. Exhaust all "why" questions at each layer before descending. The chain is open-ended — architecture is not the endpoint.
L1 Symptom: what failed? where? exact reproduction?
L2 Logic: which branch, invariant, or state transition is wrong?
L3 System: which component boundary, dependency, or ownership seam?
L4 Architecture: what design choice, duplicated owner, or fallback chain?
L5 Cross-system: which API / SLA / timing contract between systems?
L6 Platform: what runtime / OS / framework constraint?
L7 Spec gap: who never defined correct behavior for this case?
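The layer chain above becomes concrete when each "why" is logged with the evidence that answers it. A sketch (the example entries are hypothetical, not from a real investigation):

```python
# Each entry: (layer, question asked, evidence-backed answer).
drill_chain = [
    ("L1 Symptom", "what failed?", "POST /orders returns 500 under load"),
    ("L2 Logic", "which branch is wrong?", "retry path re-submits the order"),
    ("L3 System", "which boundary?", "queue consumer and HTTP handler both dedup"),
    ("L4 Architecture", "what design choice?", "two owners for idempotency keys"),
]

def deepest_layer(chain):
    """The investigation's current depth is the last answered layer."""
    return chain[-1][0] if chain else None
```

Stopping is only valid when no unanswered "why" remains at the deepest recorded layer.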
Hard signal definitions (H/T/D) are in the Quality Gate — apply them there, not during initial investigation.
Fix the root cause, not the symptom:
1. Create Failing Test Case
2. Implement Single Fix
3. Verify Fix
4. If Fix Doesn't Work
4bis. Post-Fix Differential Diagnosis
After applying a fix, if ANY symptom persists:
STOP. Do NOT attempt another fix without diagnosis.
| Residual pattern | Diagnosis | Action |
|---|---|---|
| Same reproduction conditions as fixed symptom | Fix is incomplete | Continue drilling from same source |
| Different reproduction conditions, chains converge to same source | Fix was at wrong depth | Re-drill from the shared source |
| Different reproduction conditions, chains diverge | Compound root cause (≥2 independent roots) | Each root needs its own fix |
| Same symptom, reduced but not eliminated | Fix was a downstream patch | Re-drill from source |
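The table above can be expressed as a small decision function (a sketch; the boolean inputs are judgments you make from the evidence, not values a tool can measure):

```python
def diagnose_residual(same_conditions: bool, chains_converge: bool,
                      symptom_reduced: bool) -> str:
    """Map a residual symptom after a fix to the diagnosis table."""
    if same_conditions:
        if symptom_reduced:
            return "downstream patch: re-drill from source"
        return "fix incomplete: continue drilling from same source"
    if chains_converge:
        return "fix at wrong depth: re-drill from the shared source"
    return "compound root cause: each root needs its own fix"
```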
Compound root causes (two or more independent roots) can take several forms; when diagnosis shows diverging chains, treat each root as its own investigation with its own fix.
5. If 3+ Fixes Failed: Question Architecture
Pattern indicating architectural problem:
STOP and question fundamentals. Discuss with your human partner before attempting more fixes. This is NOT a failed hypothesis — this is a wrong architecture.
Deliver Dual-Track Closure
For bug fixes, refactors, contract changes, or governance cleanup, always produce:
- Repair track — root cause, canonical owner, smallest necessary change, compatibility boundary, verification method.
- Retirement track — old owner / fallback / patch, whether it is still active on the main path, the only reason to keep it (if any), trigger for deletion, verification needed before removal.
Never add a new owner, fallback, prompt branch, or adapter path without stating what happens to the old one.
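One way to keep both tracks honest is to record them together, so a repair cannot be closed without naming what it retires. A sketch (the field names are illustrative, not part of the skill):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RepairTrack:
    root_cause: str
    canonical_owner: str
    smallest_change: str
    verification: str

@dataclass
class RetirementTrack:
    old_owner: str
    still_on_main_path: bool
    reason_to_keep: Optional[str]   # None means: delete when trigger fires
    deletion_trigger: str

@dataclass
class Closure:
    # Both tracks are mandatory: a repair without a retirement plan
    # silently grows the system's owner / fallback surface.
    repair: RepairTrack
    retirement: RetirementTrack
```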
Before you claim debugging is complete:
Workspace record for non-trivial debugging — if this is medium+ complexity or it writes docs/aegis/ records, initialize/check through the helper when available:

```shell
python scripts/aegis-workspace.py init --root <target-project-root>
python scripts/aegis-workspace.py new-work --root <target-project-root> ...
python scripts/aegis-workspace.py add-evidence --root <target-project-root> --work <YYYY-MM-DD-slug> ...
python scripts/aegis-workspace.py check --root <target-project-root>
```
These records are method-pack evidence trails only. They do not grant authoritative completion.
Stop-when review — re-read the diagnostic layer where you stopped. Did you reach "no deeper why remains" or a T-class terminal boundary? If the chain ended at L1-L2 and the evidence is conclusive, that is a valid endpoint. If there are still unexplained "why" questions, continue drilling before claiming done.
Hard signal check — apply these countable facts, not judgments:
Must continue drilling (H-class — ANY hit = NOT done):
- The fix adds a new conditional guard (if / switch / catch / try) around the symptom.
- git log --grep shows this symptom was "fixed" before → read that commit's diff and understand why it failed; do not repeat the same patch pattern.

Terminal unactionable (T-class — any hit = stop drilling, switch to mitigation):
Depth sufficient (D-class — ALL must pass before claiming done):
Reflection — re-run Goal / DeeperCause / Evidence / Risk/Unknown / Decision
Confirm the fix addressed the source, not just the sample
Retirement surface — did it shrink, stay, or grow?
Confidence:
- A = direct evidence and regression coverage support the root-cause conclusion
- B = strong evidence; limited coverage or some bounded unknowns remain
- C = partial evidence only; do not present as fully resolved

If confidence is not at least B, do not speak as if the issue is fully closed.
If you catch yourself thinking:
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4 Step 5)
If symptoms persist after fix: Run differential diagnosis (see Phase 4 Step 4bis)
If you hear "Is that not happening?", "Will it show us...?", "Stop guessing", "Ultrathink this" → STOP. Return to Phase 1.
If investigation reveals the issue is truly environmental, timing-dependent, or external: document what you investigated, implement appropriate handling (retry, timeout, error message), add monitoring.
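For genuinely environmental causes, the "appropriate handling" above usually means bounded retries that surface the final error rather than hide it. A minimal sketch:

```python
import time

def with_retry(fn, attempts=3, delay=0.1):
    """Retry a flaky external call a bounded number of times,
    then raise with the last error instead of swallowing it."""
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except OSError as exc:  # retry only errors you believe are transient
            last_exc = exc
            time.sleep(delay * (2 ** i))  # exponential backoff between tries
    raise RuntimeError(f"gave up after {attempts} attempts") from last_exc
```

The bounded attempt count and the re-raise are the point: unbounded retries turn an environmental issue into an invisible one.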
See root-cause-tracing.md, defense-in-depth.md, condition-based-waiting.md, feedback-loop-construction.md in this directory for deeper guidance on specific diagnostic scenarios.