Systematic 4-phase debugging methodology for finding and fixing root causes
From sdlc. Install: `npx claudepluginhub jwilger/claude-code-plugins --plugin sdlc`. This skill uses the workspace's default tool permissions.
Version: 1.0.0
Portability: Universal
Defines a systematic 4-phase investigation process for debugging any bug, test failure, or unexpected behavior. Enforces root cause analysis before attempting fixes.
Purpose: Prevent symptom fixes (which hide bugs) and ensure deep understanding of problems before implementing solutions.
Scope: Any bug, test failure, or unexpected behavior, in any language or framework.
The Iron Law: Never attempt a fix until you complete root cause investigation.
Why this matters: Symptom fixes hide bugs rather than solving them. They create technical debt, mask deeper issues, and often cause new bugs elsewhere.
How to apply: When an error appears, resist the reflex to patch the symptom. First answer "why is this happening?" with evidence before writing any fix.
Example:
❌ Bad: "Error says null pointer. Let me add a null check."
✓ Good: "Error says null pointer. Why is this null? (investigate)"
Result:
- Bad approach: Symptom fixed, root cause remains, bug appears elsewhere
- Good approach: Found initialization bug, fixed at source, entire class prevented
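The contrast above can be sketched in Rust. This is a minimal illustration with hypothetical types (`User` and `RegisteredUser` are not from any real codebase): the symptom fix forces every caller to repeat a null check, while the root-cause fix changes the construction so the invalid state cannot reach downstream code at all.

```rust
#[derive(Debug)]
struct User {
    id: u32,
    email: Option<String>, // the "null pointer" in this sketch
}

// ❌ Symptom fix: papers over the missing value; every caller must
// repeat this check, and the real bug (why is it None?) survives.
fn notify_symptom_fix(user: &User) -> Option<String> {
    user.email.as_ref().map(|e| format!("sent to {e}"))
}

// ✓ Root-cause fix: validate at the source, so the invariant
// "email is always present" holds for the whole class of callers.
struct RegisteredUser {
    id: u32,
    email: String, // invariant: always present
}

fn register(user: User) -> Result<RegisteredUser, String> {
    match user.email {
        Some(email) => Ok(RegisteredUser { id: user.id, email }),
        None => Err(format!("user {} has no email", user.id)),
    }
}

fn notify(user: &RegisteredUser) -> String {
    format!("sent to {}", user.email) // no null check needed here
}
```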
The Principle: Follow a structured investigation: Root Cause → Pattern Analysis → Hypothesis Testing → Implementation.
Why this matters: Structured investigation prevents random debugging and ensures you understand the problem completely before attempting solutions.
The Four Phases:
1. Root Cause Investigation - read the complete error, reproduce it reliably, identify what changed recently, trace the data flow
2. Pattern Analysis - find a working example, compare it against the failing case, identify the differences and dependencies
3. Hypothesis Testing - form ONE hypothesis, test it with a minimal change, confirm or refute it with evidence
4. Implementation - write a failing test, fix the root cause, verify the fix
How to apply: Complete each phase before moving to the next. Document your findings at each phase.
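One way to make the phase ordering concrete is a small state machine. A minimal Rust sketch (the names are illustrative, not part of any real tooling) that refuses to advance until findings for the current phase are documented:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    RootCauseInvestigation,
    PatternAnalysis,
    HypothesisTesting,
    Implementation,
}

impl Phase {
    /// The next phase in order, or None once Implementation is reached.
    fn next(self) -> Option<Phase> {
        match self {
            Phase::RootCauseInvestigation => Some(Phase::PatternAnalysis),
            Phase::PatternAnalysis => Some(Phase::HypothesisTesting),
            Phase::HypothesisTesting => Some(Phase::Implementation),
            Phase::Implementation => None,
        }
    }
}

struct Investigation {
    phase: Phase,
    findings: Vec<String>, // one entry per completed phase
}

impl Investigation {
    fn new() -> Self {
        Investigation { phase: Phase::RootCauseInvestigation, findings: Vec::new() }
    }

    /// Advance only after documenting findings for the current phase.
    fn complete_phase(&mut self, findings: &str) -> Result<(), String> {
        if findings.trim().is_empty() {
            return Err("document your findings before moving on".to_string());
        }
        self.findings.push(format!("{:?}: {findings}", self.phase));
        if let Some(next) = self.phase.next() {
            self.phase = next;
        }
        Ok(())
    }
}
```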
The Principle: Test a single hypothesis with minimal changes. If it fails, undo and try a different theory.
Why this matters: Changing multiple things simultaneously makes it impossible to know which change had which effect. This wastes time and compounds confusion.
How to apply: Form ONE explicit hypothesis, make the smallest change that tests it, observe the result, and undo the change if the hypothesis is refuted before forming the next one.
Example:
❌ Bad: "Let me change the import, add a null check, and update the type signature"
✓ Good:
- Hypothesis 1: "The import is wrong" → Test → Refuted → Undo
- Hypothesis 2: "The type is incorrect" → Test → Confirmed → Fix
Result: Clear understanding of what actually solved the problem
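The one-hypothesis-at-a-time loop can be sketched as data: an ordered list of theories, each tested in isolation. A minimal Rust illustration (the types are hypothetical; in real debugging the "undo" step would be `git stash` or `git checkout`, modeled here by simply not keeping a refuted change):

```rust
struct Hypothesis<'a> {
    description: &'a str,
    // The minimal, reversible change is expressed as a pure check here,
    // so a refuted hypothesis is "undone" by not keeping its result.
    test: fn() -> bool,
}

/// Test hypotheses one at a time, in order; return the first one
/// the evidence confirms, or None if all are refuted.
fn investigate<'a>(hypotheses: &[Hypothesis<'a>]) -> Option<&'a str> {
    for h in hypotheses {
        if (h.test)() {
            return Some(h.description); // confirmed: keep this fix
        }
        // refuted: undo and move on to the next single theory
    }
    None
}
```

Usage mirrors the example above: "the import is wrong" is tested and refuted before "the type is incorrect" is ever touched, so it is always clear which change solved the problem.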
The Principle: If three fix attempts fail, stop trying fixes. The problem is deeper than you think.
Why this matters: Repeated failures signal architectural problems or domain modeling issues, not simple bugs. Continuing to try fixes wastes time.
How to apply: Count your fix attempts. After the third failure, stop. Do not attempt a fourth fix; step back and question the architecture and the problem framing instead.
Example:
Attempt 1: Add validation → Still fails
Attempt 2: Change order → Still fails
Attempt 3: Different algorithm → Still fails
STOP. Question the architecture:
- Are we solving the wrong problem?
- Is the domain model incorrect?
- Is this a fundamental design issue?
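The three-strikes rule is mechanical enough to encode. A minimal Rust sketch (illustrative only, not a real tool) of a counter that refuses a fourth attempt:

```rust
struct FixAttempts {
    failures: u32,
}

impl FixAttempts {
    fn new() -> Self {
        FixAttempts { failures: 0 }
    }

    /// Record a failed fix. Returns the attempt count while attempts are
    /// still allowed; after the third failure, returns an error telling
    /// you to stop fixing and question the architecture.
    fn record_failure(&mut self) -> Result<u32, &'static str> {
        self.failures += 1;
        if self.failures >= 3 {
            Err("3 failed fixes: stop, question the architecture")
        } else {
            Ok(self.failures)
        }
    }
}
```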
Rationale: Disciplined investigation finds root causes. Random debugging wastes time and hides problems.
Scenario: Test that passed before now fails after code changes.
Approach:
Phase 1: Root Cause Investigation
```bash
git diff HEAD~5       # What changed?
git log --oneline -10 # Recent commits
```
Phase 2: Pattern Analysis
Phase 3: Hypothesis Testing
Phase 4: Implementation
Scenario: Error occurs in distributed system (frontend → API → database).
Approach:
Phase 1: Root Cause Investigation
Phase 2: Pattern Analysis
Phase 3: Hypothesis Testing
Phase 4: Implementation
Scenario: Three fix attempts have failed.
Approach:
After 3rd failure:
Example:
Problem: "User authentication fails intermittently"
Attempt 1: Add retry logic → Still fails
Attempt 2: Increase timeout → Still fails
Attempt 3: Better error handling → Still fails
STOP. Architectural questions:
- Is session management the right approach?
- Should this be stateless with tokens instead?
- Is the database schema correct?
Result: Discovered fundamental session model flaw, redesigned auth flow
Works well with:
Prerequisites:
Problem: "I know what this is, let me just fix it" (skipping investigation)
Solution: Resist the urge. Do Phase 1 investigation FIRST, even if you think you know the answer. You're often wrong.
Problem: Changing multiple things without hypothesis, hoping something works
Solution: Form explicit hypothesis. Test ONE thing. Observe result. Learn from it.
Problem: "Let me add this check to prevent the error" without understanding why it occurs
Solution: Ask "why is this happening?" not "how do I hide this?" Fix the source, not the symptom.
Problem: Skipping working examples, trying to fix in isolation
Solution: Always find working code. Understanding why something works is as important as understanding why something fails.
Problem: "Fourth time's the charm" (continuing after 3+ failed fixes)
Solution: 3 failures = architectural problem signal. Stop fixing, start redesigning.
Phase 1: Root Cause Investigation
Error: "NullPointerException at user_service.rs:42"
File: user_service.rs
Line: 42
Code: let email = user.email.unwrap();
Reproduction: Always fails for user_id = 123, never fails for user_id = 456
Recent changes: Added email validation to registration (3 days ago)
Data flow: Database → UserService.load() → user.email → unwrap()
Phase 2: Pattern Analysis
Working example: User 456 has email in database
Failing example: User 123 has NULL email in database
Difference: User 123 was created BEFORE email validation was added
(email field nullable in DB for backward compatibility)
Dependencies: Database migration didn't backfill existing users
Phase 3: Hypothesis Testing
Hypothesis: "User 123 has NULL email because created before validation"
Test: Check database directly
```sql
SELECT id, email FROM users WHERE id = 123;
```
Result: email = NULL
Result: CONFIRMED - old users have NULL emails
Phase 4: Implementation
1. Create failing test:
```rust
#[test]
fn test_user_service_handles_missing_email() {
    let user = User { id: 123, email: None };
    let result = service.load(user); // service: the UserService under test
    assert!(result.is_ok()); // Should handle gracefully, not panic
}
```
2. Fix (two parts):
a. Root cause: Backfill database (migration)
```sql
UPDATE users SET email = 'placeholder@example.com' WHERE email IS NULL;
```
b. Defense: Handle None case in code
```rust
let email = user.email.unwrap_or_default();
```
3. Verify:
- Test passes
- All users now have emails
- No more NullPointerException
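Putting the defensive half of the fix together, a self-contained sketch (the types here are illustrative stand-ins for the real `user_service.rs`, not its actual API):

```rust
struct User {
    id: u32,
    email: Option<String>, // nullable in the DB for backward compatibility
}

struct Profile {
    id: u32,
    email: String,
}

// Before: `let email = user.email.unwrap();` panicked for user 123.
// After: a missing email falls back to the default ("") instead of
// crashing; the migration above handles the root cause in the data.
fn load(user: User) -> Result<Profile, String> {
    let email = user.email.unwrap_or_default();
    Ok(Profile { id: user.id, email })
}
```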
Phase 1: Root Cause Investigation
Error: "Expected 200 OK, got 500 Internal Server Error"
Test: test_user_registration
Reproduction: Fails consistently in CI, passes locally
Recent changes: Updated authentication library (yesterday)
Environment difference:
- Local: SQLite in-memory database
- CI: PostgreSQL 14
Phase 2: Pattern Analysis
Working example: Local test with SQLite
Failing example: CI test with PostgreSQL
Difference investigation:
- Read auth library changelog
- Found: Library 2.0 uses PostgreSQL-specific JSON operators
- SQLite doesn't have these operators; the library's SQLite code path simply never exercises them
Dependencies:
- Auth library assumes PostgreSQL JSON support
- Library works with SQLite by accident (doesn't exercise JSON paths)
Phase 3: Hypothesis Testing
Hypothesis: "Auth library 2.0 uses PostgreSQL JSON operators incompatible with SQLite"
Test: Run local tests with PostgreSQL instead of SQLite
```bash
docker run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/test cargo test
```
Result: CONFIRMED - local tests now fail with same error as CI
Phase 4: Implementation
1. Failing test already exists (test_user_registration)
2. Fix options:
a. Pin auth library to 1.x (workaround)
b. Migrate to PostgreSQL everywhere (align environments)
c. Use database-agnostic JSON library (portable)
Choice: (b) - Align local and CI environments
3. Implementation:
- Update local dev setup to use PostgreSQL
- Document in README
- Update .env.example
4. Verify:
- Local tests pass with PostgreSQL
- CI tests pass
- No environment discrepancies remain
Scenario: Performance bug (API response time > 5 seconds)
Attempt 1: Add caching
Hypothesis: "Database queries are slow, need caching"
Implementation: Add Redis cache for user queries
Result: FAILED - Still slow (5.2 seconds)
Attempt 2: Index database
Hypothesis: "Missing database indexes"
Implementation: Add index on users.email
Result: FAILED - Still slow (5.1 seconds, marginal improvement)
Attempt 3: Optimize query
Hypothesis: "N+1 query problem"
Implementation: Add eager loading for relationships
Result: FAILED - Still slow (4.8 seconds, still over limit)
After 3rd failure - STOP AND ESCALATE:
Question: "Why do performance fixes keep failing?"
Deeper investigation:
- Profile API with flamegraph
- Found: 90% of time spent in external service call (not database!)
- Root cause: Synchronous call to email validation API (3rd party)
Architectural problem:
- Wrong assumption: Database was the bottleneck
- Actual problem: Blocking I/O to external service
- Solution: Move email validation to async background job
Result: Response time < 200ms after architectural change
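The architectural change - moving the blocking call off the request path - can be sketched with a plain thread and channel. This is a hypothetical stand-in for a real background job queue; the 50 ms sleep simulates the slow third-party validation call:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for the slow third-party email validation call.
fn validate_email_remote(email: &str) -> bool {
    thread::sleep(Duration::from_millis(50)); // simulated network latency
    email.contains('@')
}

// Request handler: enqueue the validation and respond immediately,
// instead of blocking the response on the remote call.
fn handle_request(queue: &mpsc::Sender<String>, email: &str) -> &'static str {
    queue.send(email.to_string()).expect("worker should be running");
    "202 Accepted" // result delivered later (e.g. webhook or polling)
}

// Background worker: drains the queue and performs the slow calls,
// returning (email, valid) pairs once the queue is closed.
fn spawn_worker(rx: mpsc::Receiver<String>) -> thread::JoinHandle<Vec<(String, bool)>> {
    thread::spawn(move || {
        rx.into_iter()
            .map(|email| {
                let ok = validate_email_remote(&email);
                (email, ok)
            })
            .collect()
    })
}
```

The handler now returns in microseconds regardless of third-party latency, which is the property the three failed database-side fixes could never deliver.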
Use this checklist to verify you're following the debugging protocol:
- [ ] Read the complete error message, not just the first line
- [ ] Reproduced the problem reliably
- [ ] Identified what changed recently (git diff, git log)
- [ ] Found a working example to compare against the failing case
- [ ] Formed ONE explicit hypothesis before changing anything
- [ ] Tested the hypothesis with a minimal change, undoing it if refuted
- [ ] Wrote a failing test before implementing the fix
- [ ] Fixed the root cause, not the symptom
- [ ] Stopped and questioned the architecture after 3 failed attempts
If you can't check all boxes, you're not following the protocol.
Watch for these thoughts - they indicate you're about to skip the protocol:
| Thought | Reality | Correct Action |
|---|---|---|
| "I know what this is, let me just fix it" | You're skipping investigation | Do Phase 1 first |
| "Quick fix, then investigate if needed" | You'll never investigate after | Do Phase 1 FIRST |
| "Let me try a few things" | Random debugging hides bugs | ONE hypothesis at a time |
| "This worked before, must be environment" | Assumptions without evidence | Verify with evidence |
| "I'll add a check to prevent the error" | Symptom fix, not root cause | Find WHY it happens |
| "Fourth time's the charm" | 3+ failures = architecture problem | STOP. Escalate. |
When you catch yourself thinking these things, STOP and return to the protocol.
Source Documentation:
Related Skills:
External Resources:
Extraction Source: sdlc/commands/shared/debugging-protocol.md
Extraction Date: 2026-02-04
Last Updated: 2026-02-04
Compatibility: Universal (all languages and frameworks)
License: MIT