Guides observability-first debugging: instrument code with logs to reveal exact values, paths, and failures before guessing at fixes. Use for errors, test failures, and unexpected behavior.
Stop guessing. Add observability. Understand what's actually happening.
Measure before you act. When something isn't working, the solution is almost never to guess and try random fixes. The solution is to add instrumentation that produces the specific information needed to fully explain the issue.
Agents (and developers) fall into a guess-and-check trap: they speculate about the cause, change some code, rerun, and repeat when it still fails.
Why this happens: Insufficient data. You don't know what's actually happening, so you're shooting in the dark.
Make the invisible visible. Add logging, print statements, assertions, or debugging output that shows you:
What exactly is failing?
What are the input values at the point of failure?
Which path is the code actually taking?
Don't guess at fixes, change code you don't understand, or assume a function returns what you think it does.
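Assertions are an easy way to make an assumption visible: they fail loudly and report the offending value instead of letting a silent wrong value flow downstream. A minimal sketch (the function and data shape are illustrative):

def total_price(items):
    # Fail loudly, with the offending value, instead of silently producing a wrong result.
    assert isinstance(items, list), f"expected a list of items, got {type(items).__name__}: {items!r}"
    return sum(item["price"] for item in items)

print(total_price([{"price": 3}, {"price": 4}]))  # prints 7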
Before forming hypotheses, instrument the system:
Add logging/print statements to show input values, intermediate results, which branches are taken, and what each step returns.
Example:
def process_request(data):
    print(f"[DEBUG] Received data: {data}")
    print(f"[DEBUG] Data type: {type(data)}")

    result = transform(data)
    print(f"[DEBUG] After transform: {result}")

    if validate(result):
        print(f"[DEBUG] Validation passed")
        return save(result)
    else:
        print(f"[DEBUG] Validation FAILED")
        print(f"[DEBUG] Validation errors: {get_validation_errors(result)}")
        return None
The goal: Produce output that definitively shows what's happening at each step.
Execute with instrumentation active. Capture the output.
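If print output is easy to lose, the standard logging module can send the same instrumentation to the console and to a file for later inspection. A minimal sketch (the file name and the sample data are illustrative):

import logging

# Send instrumentation to the console and to debug.log so the run can be inspected afterwards.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.StreamHandler(), logging.FileHandler("debug.log")],
)
log = logging.getLogger(__name__)

data = {"name": "example"}  # stand-in input for the sketch
log.debug("Received data: %r", data)
log.debug("Data type: %s", type(data))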
Look for: unexpected values, branches you did not expect to be taken, and the exact point where actual behavior diverges from what you expected.
Now that you have data, form a hypothesis about the root cause.
Your hypothesis must explain the evidence you collected and be specific enough to confirm or refute with further instrumentation.
Add targeted instrumentation or experiments that would confirm or refute the hypothesis.
If the hypothesis is wrong, the instrumentation will show why. Add more observability and repeat.
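For example, if the hypothesis is that a cache is returning stale data, one targeted comparison settles it. A sketch (the cache, fetch_fresh, and key are stand-ins, not a real API):

# Hypothesis: the cache is returning stale data for this key.
cache = {"user:42": {"name": "old"}}  # stand-in cache for the sketch

def fetch_fresh(key):  # stand-in for a lookup that bypasses the cache
    return {"name": "new"}

key = "user:42"
cached = cache.get(key)
fresh = fetch_fresh(key)
print(f"[DEBUG] cached={cached!r} fresh={fresh!r} match={cached == fresh}")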
"Maybe it's a race condition" "It might be a caching issue" "Could be the API timeout"
Fix: Add logging that would confirm or deny each theory.
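For instance, the timeout theory can be confirmed or denied with a single timing log. A sketch (the endpoint and the 30-second limit are illustrative):

import time
import requests

url = "https://example.com/api/users"  # stand-in endpoint for the sketch
start = time.monotonic()
response = requests.get(url, timeout=30)
elapsed = time.monotonic() - start
print(f"[DEBUG] request took {elapsed:.2f}s, status {response.status_code}")

A fast, successful call rules the timeout theory out; a raised requests.exceptions.Timeout confirms it.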
Changing code hoping it fixes things without understanding why it broke.
Fix: First understand the bug via observability, then fix the root cause.
Making 3 changes simultaneously so you don't know which fixed it (or if it's actually fixed).
Fix: One change at a time. Verify each with instrumentation.
"This function should return user data" → doesn't mean it actually does.
Fix: Print what it actually returns. Verify assumptions.
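A quick way to verify the assumption, sketched with a hypothetical get_user_data function:

def get_user_data(user_id):  # hypothetical function under suspicion
    ...  # body omitted in the sketch

value = get_user_data(42)
print(f"[DEBUG] get_user_data returned {value!r} (type: {type(value).__name__})")

The same principle applies in shell scripts: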
set -x # Print each command before executing
command -v foo # Check if command exists
echo "Value: $VAR" # Print variable values
Problem occurs
  ↓
Can you see the exact failure point?
  NO  → Add logging/prints to trace execution flow
  YES ↓
Do you know the input values at failure?
  NO  → Print input values and parameters
  YES ↓
Do you know what the code is actually doing?
  NO  → Print intermediate results, branches taken
  YES ↓
Do you know why it's doing the wrong thing?
  NO  → Print state, compare to expected state
  YES ↓
Fix the bug
Symptom: Test fails with "Expected 3, got undefined"
❌ Speculation: "Maybe the mock isn't working." "Could be an async timing issue." [tries random fixes]
✅ Observability-First:
test('calculates total', () => {
  const items = [1, 2, 3];
  console.log('Input items:', items);

  const result = calculateTotal(items);
  console.log('Result:', result);
  console.log('Result type:', typeof result);

  expect(result).toBe(6);
});
Output shows: Result: undefined
Evidence-based action: Check what calculateTotal actually returns. Add logging inside that function to see where it fails to compute/return.
Symptom: API returns 400 error
❌ Speculation: "Maybe the endpoint changed." "Could be an expired auth token." [tries different endpoints randomly]
✅ Observability-First:
url = f"{BASE_URL}/api/users"
headers = {"Authorization": f"Bearer {token}"}
payload = {"name": name, "email": email}
print(f"[DEBUG] URL: {url}")
print(f"[DEBUG] Headers: {headers}")
print(f"[DEBUG] Payload: {payload}")
response = requests.post(url, headers=headers, json=payload)
print(f"[DEBUG] Status: {response.status_code}")
print(f"[DEBUG] Response: {response.text}")
Output shows: Response: {"error": "email field is required"}
Evidence-based action: The payload construction is wrong. Check where the email variable is set.
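For example, add a print at the point where the value is assigned. A sketch (request_form and the assignment are hypothetical):

request_form = {"name": "Ada"}  # stand-in input that is missing the email key
email = request_form.get("email")  # hypothetical assignment under suspicion
print(f"[DEBUG] email at assignment: {email!r}")  # prints None, which explains the 400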
Symptom: FileNotFoundError: foo.txt
❌ Speculation: "Maybe the path is wrong" [tries different path variations randomly]
✅ Observability-First:
import os

file_path = "foo.txt"
print(f"[DEBUG] Looking for: {file_path}")
print(f"[DEBUG] Current directory: {os.getcwd()}")
print(f"[DEBUG] Directory contents: {os.listdir('.')}")
print(f"[DEBUG] File exists: {os.path.exists(file_path)}")

if not os.path.exists(file_path):
    abs_path = os.path.abspath(file_path)
    print(f"[DEBUG] Absolute path would be: {abs_path}")
Output shows: Current directory is /app/src, file is in /app/data
Evidence-based action: Use the correct path (../data/foo.txt) or fix the working directory.
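One way to make the path independent of the working directory is to resolve it relative to the source file. A sketch, assuming the data file lives in ../data relative to this module:

from pathlib import Path

# Resolve foo.txt relative to this file instead of the current working directory.
file_path = Path(__file__).resolve().parent.parent / "data" / "foo.txt"
print(f"[DEBUG] Resolved path: {file_path} (exists: {file_path.exists()})")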
When the user says you're going down the wrong path, listen. The user knows their system: when they suggest simple or obvious solutions, they're usually right. Don't overthink it.
The goal: Produce specific data that fully explains the issue, then the fix becomes obvious.