Disciplined methodology to isolate bugs through hypothesis-driven testing. Form hypotheses, test them with minimal changes, narrow scope systematically. Use when bugs are unclear or reproduce intermittently.
Disciplined approach to finding root cause by treating debugging as a scientific investigation, not a guessing game.
You are guiding a systematic bug investigation: help form falsifiable hypotheses, design minimal experiments, and narrow the search space until the root cause is found.
Debugging is hypothesis-driven experimentation, not random code inspection or prayer-driven programming.
Based on Zeller's *Why Programs Fail* and Robert C. Martin's discipline of debugging:
Before systematic debugging:
This is 80% of debugging. If you can't reproduce it, you can't debug it.
Approach:
Example:
Bug: Payment processing fails intermittently
Steps to reproduce:
1. Create order with 3+ items
2. Apply discount code "SAVE10"
3. Click "Process Payment"
4. Observe: 50% of time, get "Transaction timeout error"
If you achieve 100% reproduction, you're done with Step 1. If it's still intermittent, explore what varies between runs: timing, concurrency, external services, and data state.
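The reproduction attempts above can be automated. A minimal, hypothetical harness might look like this (`process_payment` here is a stub simulating the real intermittent call, not the actual implementation):

```python
import random

def process_payment(order_id, discount_code):
    """Stub standing in for the real, intermittently failing call."""
    if random.random() < 0.5:  # simulate ~50% gateway timeouts
        raise TimeoutError("Transaction timeout error")
    return {"status": "success"}

def measure_failure_rate(attempts=10):
    failures = 0
    for i in range(attempts):
        try:
            process_payment(12345, "SAVE10")
            print(f"Attempt {i + 1}: OK")
        except TimeoutError as e:
            failures += 1
            print(f"Attempt {i + 1}: FAILED ({e})")
    return failures / attempts

rate = measure_failure_rate()
print(f"Failure rate: {rate:.0%}")
```

A stable failure rate under fixed inputs is itself evidence: it points at nondeterminism (timing, external services) rather than input-dependent logic.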
Collect evidence: stack traces, logs, database state at the moment of failure, and the exact request that triggered the bug.
Tools: whatever exposes that state, e.g., debuggers, log aggregators, database clients, request captures.
Example output:
Stack trace:
TypeError: Cannot read property 'apply_discount' of undefined
at PaymentProcessor.js:42 (apply_discount())
at Order.js:88 (process())
Logs:
[2026-03-11 14:23:45] INFO: Order created, id=12345
[2026-03-11 14:23:46] DEBUG: Discount code "SAVE10" applied
[2026-03-11 14:23:47] ERROR: Cannot read property 'apply_discount' of undefined
Database state at failure:
SELECT * FROM orders WHERE id=12345;
id | discount_code | processed_at
12345 | SAVE10 | NULL
Request:
POST /orders/12345/process
Content-Type: application/json
{ "payment_method": "credit_card" }
Based on the data, make a testable guess:
Bad hypotheses (not falsifiable):
- "Something is wrong with the payment code"
- "It's probably a weird timing thing"

Good hypotheses (falsifiable, specific):
- "discount_rate is undefined when apply_discount() is called"

Hypothesis template:
"If [condition], then [observable result]. When I check [specific place], I expect to find [specific data]."
Example:
"If the discount code is not being parsed correctly, then the discount_rate variable should be undefined or have an unexpected value. When I inspect the Order object in the debugger after applying discount code, I expect discount_rate to be undefined."
Test the hypothesis with the smallest possible change:
Before (hypothesis: discount_rate is undefined):
```python
# Suspected code in order.py
def apply_discount(self, code):
    discount_rate = DISCOUNT_CODES.get(code)  # What if code is wrong?
    self.subtotal *= (1 - discount_rate)
```
Experiment 1 (check hypothesis):
```python
# Add logging to inspect state
def apply_discount(self, code):
    discount_rate = DISCOUNT_CODES.get(code)
    print(f"DEBUG: code={code}, discount_rate={discount_rate}")  # Minimal change
    self.subtotal *= (1 - discount_rate)
```
Run with same inputs that triggered bug. If output shows discount_rate=None, hypothesis confirmed.
Experiment 2 (test if that's the cause):
```python
# Fix the problem
def apply_discount(self, code):
    discount_rate = DISCOUNT_CODES.get(code, 0)  # Default to 0 if not found
    self.subtotal *= (1 - discount_rate)
```
Run same inputs. Does bug disappear?
Run the experiment:
```bash
# Original code with bug-triggering input
python order_processor.py --order-id 12345 --discount "SAVE10"
# Output:
# DEBUG: code=SAVE10, discount_rate=None
# TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'
```
Evaluation:
If experiment shows something unexpected:
Experiment result: discount_rate=0.1 (correctly found!)
Conclusion: discount_rate is NOT the problem. Hypothesis was wrong.
New hypothesis: Maybe self.subtotal is undefined?
```python
def apply_discount(self, code):
    discount_rate = DISCOUNT_CODES.get(code, 0)
    print(f"DEBUG: subtotal={self.subtotal}, discount_rate={discount_rate}")  # Check subtotal
    self.subtotal *= (1 - discount_rate)
```
Run again. If output shows subtotal=None, hypothesis 2 is confirmed. If subtotal is correct, form hypothesis 3.
Keep narrowing: Each failed hypothesis eliminates possibilities and narrows the search space.
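One way to keep the narrowing disciplined is to encode each hypothesis as an assertion, so the first failure names the hypothesis that holds. This is a sketch reusing the names from the example above; the minimal `Order` class and `DISCOUNT_CODES` dict here are stand-ins, not the real implementation:

```python
DISCOUNT_CODES = {"SAVE10": 0.10}

class Order:
    def __init__(self, subtotal):
        self.subtotal = subtotal

    def apply_discount(self, code):
        # Hypothesis 1: the code is missing from DISCOUNT_CODES.
        assert code in DISCOUNT_CODES, f"hypothesis 1 holds: unknown code {code!r}"
        discount_rate = DISCOUNT_CODES[code]
        # Hypothesis 2: subtotal was never initialized.
        assert self.subtotal is not None, "hypothesis 2 holds: subtotal is None"
        self.subtotal *= (1 - discount_rate)

order = Order(subtotal=100.00)
order.apply_discount("SAVE10")
print(order.subtotal)  # 90.0: both hypotheses eliminated for this input
```

Running the bug-triggering input through this instrumented version makes the first confirmed hypothesis announce itself, instead of requiring a new print statement per guess.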
Once root cause is found, implement the fix:
```python
# Before (buggy)
def apply_discount(self, code):
    discount_rate = DISCOUNT_CODES.get(code)  # Returns None if not found
    self.subtotal *= (1 - discount_rate)  # TypeError if discount_rate is None

# After (fixed)
def apply_discount(self, code):
    if code not in DISCOUNT_CODES:
        raise ValueError(f"Invalid discount code: {code}")
    discount_rate = DISCOUNT_CODES[code]
    self.subtotal *= (1 - discount_rate)
```
Verify:
Ensure the bug doesn't reappear:
```python
import pytest

def test_apply_discount_raises_on_invalid_code():
    order = Order(subtotal=100.00)
    with pytest.raises(ValueError, match="Invalid discount code"):
        order.apply_discount("INVALID_CODE")

def test_apply_discount_reduces_subtotal():
    order = Order(subtotal=100.00)
    order.apply_discount("SAVE10")  # 10% off
    assert order.subtotal == 90.00
```
Run tests. Both pass. Bug is fixed and regression-tested.
When using this skill, deliver:
Example output:
```markdown
## Bug: Payment processing fails intermittently

### Reproduction
Steps: Create order, apply discount "SAVE10", process payment

### Hypothesis
Discount code parsing is broken; discount_rate is undefined.

### Experiment
Add debug logging to apply_discount() to inspect state:
[code showing experiment]

### Result
discount_rate=None when code is "SAVE10"

### Root Cause
DISCOUNT_CODES.get(code) returns None if code not found.
apply_discount() doesn't handle None.

### Fix
[fixed code]

### Verification
[tests passing]

### Commit
fix: handle invalid discount codes in apply_discount
```
Bug Report: "Payment processing times out randomly. Sometimes it works, sometimes it doesn't."
Step 1: Reproduce
Try 10 times with same input:
Attempt 1: Success (2.3s)
Attempt 2: Timeout (30s)
Attempt 3: Success (2.1s)
Attempt 4: Timeout (30s)
Attempt 5: Success (2.4s)
Observation: ~50% failure rate with bimodal timing (~2s on success, a hard stop at 30s on failure). This suggests either variable latency in an external dependency or a timeout threshold being hit, not a deterministic logic error.
Step 2: Gather Data
Logs during successful attempt:
[14:23:45.123] DEBUG: PaymentProcessor.process() called
[14:23:45.234] DEBUG: Contacting payment gateway...
[14:23:45.456] DEBUG: Gateway response received
[14:23:45.567] INFO: Payment processed successfully
Logs during timeout:
[14:24:01.123] DEBUG: PaymentProcessor.process() called
[14:24:01.234] DEBUG: Contacting payment gateway...
[14:24:31.456] ERROR: Timeout waiting for gateway response
Observation: Gateway response is sometimes slow (>30 seconds). Timeout is set to 30 seconds.
Step 3: Form Hypothesis
"If the payment gateway is slow, the timeout will trigger. When I check the gateway response time, I expect it to sometimes exceed 30 seconds."
Step 4: Design Experiment
Add instrumentation to measure gateway response time:
```python
def process_payment(self, order):
    start_time = time.time()
    response = self.gateway.request(order)  # Might be slow
    elapsed = time.time() - start_time
    self.logger.info(f"Gateway response time: {elapsed:.2f}s")
    if elapsed > 25:  # Warn if close to timeout
        self.logger.warning(f"Slow gateway response: {elapsed:.2f}s")
    return response
```
Step 5: Execute & Evaluate
Run 10 times, collect data:
[14:25:01] Gateway response time: 2.34s
[14:25:02] Gateway response time: 28.56s ← Close to timeout!
[14:25:03] Gateway response time: 2.12s
[14:25:04] Gateway response time: 31.23s ← Timeout! (>30s)
[14:25:05] Gateway response time: 2.45s
Conclusion: Gateway is slow sometimes (2-30+ seconds). Current timeout of 30s is too tight.
Step 6: Root Cause
Payment gateway has variable latency (SLA is "up to 30 seconds"). Our timeout of 30 seconds is too aggressive; network jitter causes timeouts.
Step 7: Fix
Increase timeout and add retry logic:
```python
class PaymentGateway:
    TIMEOUT = 45  # Was 30; increased to 45s
    MAX_RETRIES = 3

    def process_payment_with_retry(self, order):
        for attempt in range(self.MAX_RETRIES):
            try:
                return self.process_payment(order)
            except TimeoutError:
                self.logger.warning(f"Attempt {attempt + 1} timed out, retrying...")
                time.sleep(2 ** attempt)  # Exponential backoff
        raise TimeoutError("Payment gateway unreachable after 3 attempts")
```
Step 8: Verify
Run 10 attempts:
✓ All 10 successful
Response times: 2.3s, 28.5s, 2.1s, 45.2s (retry 1) → success, 2.4s, ...
Success! No more timeouts.
Step 9: Test
```python
def test_payment_retries_on_timeout():
    gateway = MockGateway(responses=[
        TimeoutError(),         # Attempt 1 fails
        TimeoutError(),         # Attempt 2 fails
        {"status": "success"},  # Attempt 3 succeeds
    ])
    processor = PaymentProcessor(gateway)
    order = Order(subtotal=100.00)
    result = processor.process_payment_with_retry(order)
    assert result["status"] == "success"
```
Commit:

```
fix: increase payment gateway timeout and add retry logic

Payment gateway has variable latency (up to 30s). Previous timeout of 30s
was too aggressive; network jitter caused intermittent failures.

Changes:
- Increase timeout from 30s to 45s
- Add exponential backoff retry (max 3 attempts)
- Log retry attempts for monitoring

Fixes intermittent payment processing timeouts.
Verified: 10 consecutive successful payment attempts.
```
When debugging, watch for these common mistakes:
Mistake: Make random changes hoping one fixes it.
Why LLMs make this: Code generation is fast; testing changes feels like progress.
Guard: Form hypothesis first. Change one thing. Test. Evaluate. Repeat.
Example: Rewriting apply_discount() three different ways until the error stops, without ever learning why DISCOUNT_CODES.get() returned None.
Mistake: Rely solely on logs; never use a debugger.
Why LLMs make this: Debuggers require understanding tool usage; print feels simpler.
Guard: Use a debugger. Set breakpoints. Inspect variables. Evaluate expressions.
Example: Spending five edit-and-rerun cycles adding print statements, when one breakpoint inside apply_discount() would have shown code, discount_rate, and subtotal at once.
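As a sketch of debugger-first inspection with Python's built-in pdb (the function signature here mirrors the apply_discount example and is illustrative, not the real code; `set_trace()` pauses execution when the function runs):

```python
import pdb

def apply_discount(order, code, discount_codes):
    pdb.set_trace()  # execution pauses here; at the (Pdb) prompt you can:
    # (Pdb) p code                       # inspect the raw input
    # (Pdb) p discount_codes.get(code)   # is the lookup returning None?
    # (Pdb) p order.subtotal             # is subtotal what we expect?
    order.subtotal *= (1 - discount_codes[code])
```

Since Python 3.7, a bare `breakpoint()` call does the same without the import and honors the `PYTHONBREAKPOINT` environment variable.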
Mistake: See error report, make a guess, deploy "fix" without verifying bug first.
Why LLMs make this: Seems efficient; avoids setup time.
Guard: Reproduce bug locally first. Verify fix resolves it. Test edge cases.
Example: Shipping a null check for the timeout report without ever reproducing it; the real cause (gateway latency) is untouched and the reports keep coming.
Mistake: Make 5 changes thinking one will help; can't tell which one fixed it.
Why LLMs make this: Feels productive to batch changes.
Guard: Change one thing per experiment. Test. If it helps, keep it. If not, revert.
Example: Raising the timeout, adding a retry, and swapping the HTTP client in one commit; the timeouts stop, but nobody knows which change fixed it or which one added risk.
Mistake: Make a change; assume it's fixed; don't test thoroughly.
Why LLMs make this: Pressure to move on; testing is tedious.
Guard: Before closing bug, verify with same steps that triggered it. Test edge cases.
Example: Changing DISCOUNT_CODES.get(code) to .get(code, 0) and closing the bug without re-running the original repro steps or trying an invalid code.
Mistake: Hide the error instead of fixing the underlying problem.
Why LLMs make this: Symptom-fixing is faster; feels complete.
Guard: Ask "why did this happen?" Keep asking "why?" until you reach root cause.
Example: Wrapping apply_discount() in try/except to swallow the TypeError; invalid codes now fail silently instead of loudly.
Before considering a bug fixed: