Systematic Debugging
A strict debugging workflow. Use when dealing with bugs, test failures, or unexpected behavior.
Three core purposes:
- Fix the cause, not the symptom.
- Prevent guess-based fixes.
- Lock the failure with a test before fixing.
Hard Gates
These rules have no exceptions.
- Do not fix until you have a reproducible or observable state.
- Do not fix until you have stated a root-cause hypothesis.
- Do not fix until you have a failing test or equivalent reproduction mechanism.
- Verify only one hypothesis at a time.
- No "while I'm here" refactoring during a fix.
- If three fix attempts fail, suspect a structural issue before applying another patch.
Violating this process is considered a debugging failure.
When To Use
Use this skill in the following situations:
- When a test fails
- When a bug occurs in production or locally
- When a response, state, rendering, or query result differs from expectations
- When investigating performance degradation, timeouts, race conditions, or intermittent failures
- When something breaks again after being fixed at least once
The following excuses are not accepted:
- "It looks simple, I'll just fix it directly"
- "No time, let's patch it and move on"
- "It's probably this, let me just change it"
Required Output Contract
When using this skill, the following items must be locked internally:
- Problem statement: Define what went wrong in one sentence
- Reproduction path: How to reproduce or observe the failure
- Evidence: Actual observed results
- Root-cause hypothesis: Why you believe this problem occurs
- Failing guard: One of: failing test, reproduction script, or log verification
- Fix: A single fix targeting the root cause
- Verification: Reproduction path and related test results after the fix
If any of these seven items are missing, the work is not done.
Workflow
Follow the steps below in order.
Phase 1. Define The Problem
First, condense the problem.
- What is the expected behavior
- What is the observed behavior
- What is the scope of impact
- Is it always reproducible or intermittent
Output format:
Problem: <expected> but got <actual> under <condition>
Do not mix symptoms with speculation.
Good: Product detail API returns 500 when brand is null.
Bad: Serializer is broken because brand mapping seems wrong.
Phase 2. Reproduce Or Instrument
You must be able to see the failure again before fixing it.
Priority:
- Reproduce with existing tests
- Reproduce with a minimal integration test
- Reproduce with a unit test
- Observe via reproduction script or command
- Observe after adding logs/instrumentation
Rules:
- Make the reproduction path as small as possible.
- Even if the bug is only visible in the UI, prefer reproducing at a lower layer if possible.
- For intermittent failures, increase observability by adding logs, capturing inputs, timestamps, and concurrency conditions.
- If reproduction fails, do not proceed to fixing — increase observability instead.
What to do when reproduction is not possible:
- Record input values
- Check for environment differences
- Check recent changes
- Add logs at boundary points
- Search for smaller conditions that produce the same symptom
Phase 3. Gather Evidence
Collect only observable facts.
Always check:
- Full error messages and stack traces
- Failing input values
- Recently changed files or commits
- Environment/configuration differences
- Call paths and data flow
For multi-component problems, check at each boundary.
Examples:
- controller -> application -> service -> repository
- client -> API -> external service
- scheduler -> batch service -> database
At each boundary, check:
- What came in
- What went out
- What values were transformed
- Under what conditions it breaks
Do not fix until you have pinpointed the problem location.
Phase 4. Isolate Root Cause
Formulate exactly one cause candidate.
Format:
Hypothesis: <root cause> because <evidence>
Qualities of a good hypothesis:
- Points to a single cause
- Connects to observed evidence
- Can be disproved with a small experiment
Examples of bad hypotheses:
- "There seems to be some async issue somewhere"
- "The whole serialization layer seems unstable"
Trace the cause back to the source. If the error appears deep in the stack, trace the origin of the input, not the symptom.
Phase 5. Lock The Failure
Lock the failure before fixing.
Priority:
- Automated failing test
- Add a regression case to existing tests
- Minimal reproduction script
- Temporary verification via logs/assertions
Rules:
- Create an automated test whenever possible.
- It must fail before the fix.
- It must pass on the same path after the fix.
- The test name must reveal what broke.
If an automated test is feasible, use the test-driven-development skill alongside this one.
Phase 6. Implement A Single Fix
The fix addresses only one hypothesis.
Allowed:
- Minimal code change that directly addresses the cause
- Minimal supporting changes needed for verification
Forbidden:
- Bundling multiple seemingly related fixes
- Combining refactoring with the fix
- Sneaking in formatting/cleanup/renaming
- Adding null-guards without evidence
- Swallowing exceptions
If the fix fails, immediately return to Phase 1 or Phase 3. The previous hypothesis was wrong.
Phase 7. Verify And Close
All of the following must be satisfied before closing:
- The original reproduction path no longer fails.
- The new failing guard passes.
- Related tests are not broken.
- You can explain that the fix blocks the cause, not the symptom.
For intermittent bugs, a single pass is not enough. Verification under repeated runs or varying conditions is required.
Stop Conditions
Stop and reframe in the following situations.
1. Reproduction Failed
If reproduction fails after multiple attempts:
- Check if observability is insufficient.
- Check if there are environment differences.
- Check if the problem definition is wrong.
Changing code without reproduction is forbidden.
2. Three Failed Fixes
If three consecutive fixes miss the mark, conclude:
- The current understanding is wrong, or
- The problem is likely structural — shared state, boundary design, responsibility separation
From this point, a "fourth patch" is not the answer — a structural discussion is needed.
3. No Failing Guard
If you cannot create a failing test or equivalent reproduction mechanism, do not declare completion. At minimum, leave behind the reproduction command and observed results.
Red Flags
If any of the following thoughts arise, stop immediately and return to an earlier phase.
- "I'll just change this one line and it should work"
- "I'll check the logs later, let me fix it first"
- "I'll add the test later"
- "Let me fix this and that together at once"
- "The error is gone, so I don't need to know the cause"
Minimal Checklist
Use this checklist for self-verification during execution.
Completion Standard
The completion criterion for this skill is not "the code changed."
Completion criteria:
- The problem definition is clear
- The failure was locked before fixing
- The fix is connected to the root cause
- Verification results remain
Without these four, debugging is not finished.