Enforces systematic root cause analysis for bugs, test failures, unexpected behavior, and regressions via a five-phase workflow: Understand, Reproduce, Isolate, Fix, Verify.
Find the root cause. Prove it. Then fix it.
Core principle: If you can't explain why the bug happens, you don't know if your fix is correct.
Violating the letter of the rules is violating the spirit of the rules.
Your current effort level is {{effort_level}}.
Skip this step silently if effort is high or higher AND you are Opus (1M context).
If effort is below high, you MUST show the recommendation prompt — regardless of model.
If you are not Opus (1M context), you MUST show the recommendation prompt - regardless of effort level.
Otherwise → use AskUserQuestion:
```json
{
  "questions": [{
    "question": "Do you want to switch? Falsifiable hypothesis formation and resisting the temptation to jump to fixes benefit from deeper reasoning.\n\nTo switch: cancel, run `/model opus[1m]` and `/effort high`, then re-invoke this skill.",
    "header": "Recommended: Opus (1M context) at high effort",
    "options": [
      { "label": "Continue" },
      { "label": "Cancel — I'll switch first" }
    ],
    "multiSelect": false
  }]
}
```
If the user selects "Cancel — I'll switch first": output the switching commands above and stop. Do not proceed with the skill.
Always:
Exceptions (ask your human partner):
Thinking "I already know the fix"? Stop. That's guessing, not debugging.
NO CODE CHANGES UNTIL YOU CAN EXPLAIN THE ROOT CAUSE
Think you know the fix? Prove the root cause first. Then fix.
No exceptions:
Understand first. Fix second. Period.
```dot
digraph debugging {
rankdir=LR;
understand [label="UNDERSTAND\nRead the system", shape=box, style=filled, fillcolor="#cce5ff"];
reproduce [label="REPRODUCE\nMake it fail", shape=box, style=filled, fillcolor="#ffcccc"];
isolate [label="ISOLATE\nDivide and\nconquer", shape=box, style=filled, fillcolor="#fff3cc"];
fix [label="FIX\nMinimal change\n+ failing test", shape=box, style=filled, fillcolor="#ccffcc"];
verify [label="VERIFY\nProve it's fixed", shape=box, style=filled, fillcolor="#ccccff"];
escalate [label="ESCALATE\nReport to human", shape=box, style=filled, fillcolor="#ffccee"];
journal [label="DEBUG JOURNAL\n.debug/{slug}.md", shape=note, style=filled, fillcolor="#f0f0f0"];
understand -> reproduce;
reproduce -> isolate;
isolate -> fix;
fix -> verify;
verify -> fix [label="fix failed\n(max 3)"];
verify -> escalate [label="3 fixes\nfailed"];
understand -> journal [style=dashed];
reproduce -> journal [style=dashed];
isolate -> journal [style=dashed];
fix -> journal [style=dashed];
verify -> journal [style=dashed];
}
```
## Phase 1: UNDERSTAND

Read the system. Check your assumptions. Don't touch anything yet.
Before investigating the bug, understand the code it lives in. Your mental model is a hypothesis — treat it as one.
Required steps:
Example — multi-layer instrumentation:
```
System: Frontend → API Gateway → Auth Service → Database
Step 1: Add logging at each boundary (request in, response out)
Step 2: Trigger the failure once
Step 3: Read the logs layer by layer:
- Frontend sent correct request ✓
- API Gateway forwarded correctly ✓
- Auth Service received request ✓
- Auth Service → Database query returned empty result ✗
- Auth Service returned 500 to Gateway
Conclusion: Problem is between Auth Service and Database.
Now investigate that specific boundary.
```
Don't guess which layer is broken. Let the evidence from all layers tell you.
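In practice, boundary instrumentation can be one small logging middleware per service. A minimal sketch, assuming an Express-style Node service; the service name and log format are illustrative only:

```javascript
// Log every request in and response out at one service boundary, so the
// failing layer identifies itself when the logs are read end to end.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const started = Date.now();
  console.log(`[auth-service] -> ${req.method} ${req.url}`);
  res.on('finish', () => {
    console.log(`[auth-service] <- ${res.statusCode} in ${Date.now() - started}ms`);
  });
  next();
});

app.listen(3000); // port is arbitrary for the sketch
```

Repeat the same pattern at each boundary, trigger the failure once, and read the logs in order.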
<Good>
```
Read the full function. Trace the call chain from the API endpoint through the service layer to the database query. Found that `getUserById` is called with a string ID but the query expects an integer. The type mismatch happens at the controller layer.
```
Reads the system, traces the path, identifies the discrepancy
</Good>

<Bad>
```
The error says "invalid input" so I'll add input validation to the endpoint.
```
Didn't read the code, didn't trace the path, guessed at a fix
</Bad>

Check the plug:
Don't assume the obvious is fine. Verify it.
## Phase 2: REPRODUCE

Make it fail. Consistently. On demand.
If you can't reproduce it, you can't prove you fixed it. Observation before theory — watch what actually happens, don't theorize about what might happen.
Required steps:
Reproduction checklist:
Before proceeding to Phase 3, your debug journal MUST contain:
Static code reading is NOT reproduction. If you have only read source files and formed theories about what might happen, you are still in Phase 1.
Simulation is NOT reproduction. Writing scripts that mimic what the code would do (e.g., Node.js scripts simulating CoffeeScript logic, Python scripts simulating import resolution) is theorizing, not reproducing. Run the actual code path instead.
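A reproduction script can be as small as importing the real module and calling it with the failing input. A sketch, assuming a Node codebase; the module path is hypothetical and the failing user ID is borrowed from the worked example at the end of this document:

```javascript
// repro.js: exercise the actual code path, not a re-implementation of it.
// If this doesn't fail the way the user reported, the bug is not reproduced yet.
const { getUserProfile } = require('./src/services/userProfile'); // hypothetical path

getUserProfile(107)
  .then((profile) => {
    console.log('Unexpected success:', profile);
    process.exitCode = 1; // the bug did not appear, so this is not a reproduction
  })
  .catch((err) => {
    console.error('Reproduced failure:', err.stack);
  });
```

Run it with `node repro.js` and paste the exact output into the journal's Reproduction section.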
Can't reproduce?
Can't execute directly? (GUI app, external service, no local environment)
## Phase 3: ISOLATE

Narrow the search space. One variable at a time.
Form hypotheses. Make them falsifiable. Test them. Eliminate possibilities systematically — don't jump to conclusions.
Required steps:
| Technique | When to Use | How |
|---|---|---|
| Binary search | Large code path, unknown location | Comment out half the code path. Bug gone? It's in that half. Repeat. |
| Differential debugging | Worked before, broken now | Compare working vs. broken state. git bisect, diff configs, diff dependencies. |
| Minimal reproduction | Complex system, many variables | Strip away components until the simplest case that still fails. |
| Working backwards | Clear error, unclear cause | Start at the error. Trace backwards through the call chain. |
| Rubber duck | Stuck, circular thinking | Explain the bug out loud, step by step. The gap in your explanation is the gap in your understanding. |
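Binary search applies to inputs as well as code paths. A sketch of the idea, assuming the failure is triggered by a single record in a larger input set; the helper name and the `failsFor` callback are hypothetical:

```javascript
// Bisect the input set: if half the records still trigger the failure, the
// offending record is in that half. Repeat until a single record remains.
async function bisectFailingInput(records, failsFor) {
  let candidates = records;
  while (candidates.length > 1) {
    const firstHalf = candidates.slice(0, Math.ceil(candidates.length / 2));
    candidates = (await failsFor(firstHalf))
      ? firstHalf
      : candidates.slice(firstHalf.length);
  }
  return candidates[0]; // the one record that still reproduces the bug
}
```

The same halving discipline applies to commits (`git bisect` automates it), config keys, and feature flags: change one variable per step and let the result cut the search space in half.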
When dispatching subagents for parallel investigation:

- Give each subagent a concrete, runnable task: "run `pip install --target /tmp/test` and return the error output", not "explore the dependency setup".
- Have each subagent read `.debug/{slug}.md` first so it doesn't repeat prior work.

One thing at a time:
## Phase 4: FIX

State the root cause. Write a failing test. Make the minimal fix.
You've isolated the root cause. Now prove it with a test, then fix it.
Required steps:
Invoke `Skill(skill="groundwork:test-driven-development")` and follow the TDD cycle. The test must reproduce the exact bug.

<Good>
Fix: Parse ID to integer at the controller layer before passing to the service.
Root cause stated, explains the symptom chain, minimal targeted fix
</Good>
<Bad>
Fixed the timeout by:
Shotgun fix, no root cause stated, changes way beyond the bug
</Bad>
## Phase 5: VERIFY
Prove the fix works. Prove nothing else broke. Get a fresh perspective.
**Required steps:**
1. Run the failing test — it must pass now
2. Run the full test suite — nothing else broke
3. Re-run the original reproduction — the bug is gone
4. Review your fix with fresh eyes — does it actually address the root cause, or mask it?
5. Check for related code paths — could the same bug exist elsewhere?
**Verification checklist:**
- [ ] Failing test now passes
- [ ] Full test suite passes
- [ ] Original reproduction no longer triggers the bug
- [ ] Fix addresses root cause (not symptoms)
- [ ] No unrelated changes included
- [ ] Related code paths checked for same bug
**Fix didn't work?** Go back to Phase 3. Your root cause analysis was wrong. Don't patch the patch — re-investigate.
## Escalation Rules
**3-fix limit.** If your third fix attempt fails, stop and escalate.
Three failed fixes means your root cause analysis is wrong. More attempts won't help — you need a different perspective.
**Question the architecture.** Three failed fixes often signal something deeper than an incomplete investigation. The bug may exist because the design is fundamentally unsound — a race condition baked into the concurrency model, a data integrity problem caused by a missing constraint, a failure mode that the architecture doesn't account for. When you escalate, shift the question from "help me fix this bug" to "is this design sound?"
**Escalation output format:**
```markdown
## Escalation: [Bug Description]
### What I tried
| # | Hypothesis | Fix Attempted | Result |
|---|-----------|---------------|--------|
| 1 | [hypothesis] | [what you changed] | [what happened] |
| 2 | [hypothesis] | [what you changed] | [what happened] |
| 3 | [hypothesis] | [what you changed] | [what happened] |
### Eliminated hypotheses
- [hypothesis]: eliminated because [evidence]
### What I know
- [confirmed fact with evidence]
### Could this be architectural?
[If 3 fixes failed, consider whether the bug reflects a design problem, not just a code problem. What assumption does the current architecture make that might be wrong?]
### Recommended next step
[specific suggestion for the human partner]
```
Revert all failed fix attempts. Leave the codebase clean.
## Debug Journal

Maintain a persistent debug file at `.debug/{slug}.md`. This survives context compaction and provides continuity across sessions.

Create the `.debug/` directory and file at the start of Phase 1.

```markdown
# Debug: {slug}
## Status
[Active | Root Cause Found | Fixed | Escalated]
## Symptoms
- [What the user or test reported]
- [Exact error messages]
- [When it started / what changed]
## Reproduction
### Command Executed
[The exact command you ran — NOT theoretical steps, NOT code you read]
### Actual Output
[Paste the exact error/output you observed when you ran it]
### Expected Output
[What should have happened instead]
### Matches User Report?
[Yes/No — does this reproduce the same symptoms the user described?]
## Hypotheses
### Active
- [ ] [Hypothesis]: [How to test it]
### Eliminated
- [x] [Hypothesis]: Eliminated because [evidence]
## Evidence Log
| # | Action | Observation | Conclusion |
|---|--------|-------------|------------|
| 1 | [what you did] | [what happened] | [what it means] |
## Root Cause
**Verification level:** [Verified | Corroborated | Unverified]
[Root cause statement]
[Evidence that confirms it]
## Fix
[What was changed and why]
## Resolution
[Final status, test references, PR link]
```
Update the journal as you work. Every hypothesis, every test, every result goes in the journal. When context compacts, the journal remains.
Root cause claims require a confidence level. This prevents false confidence and forces honest assessment.
| Level | Meaning | Required Before Fix? |
|---|---|---|
| Verified | Root cause proven by direct evidence (test reproduces it, fix resolves it, revert re-introduces it) | Yes — proceed to fix |
| Corroborated | Multiple independent observations support it, but no single definitive proof | Yes — proceed to fix with extra caution |
| Unverified | Single observation or logical inference only | No — gather more evidence first |
| Contradicted | Evidence conflicts with the hypothesis | No — eliminate this hypothesis |
Rule: Never fix based on an unverified root cause. If your confidence is "unverified," you're guessing. Go back to Phase 3.
| Excuse | Reality |
|---|---|
| "I already know the fix" | Then prove the root cause first. Takes 2 minutes if you're right. |
| "Let me just try this quick fix" | Quick fixes become permanent. Understand first. |
| "The fix is obvious" | Obvious fixes for wrong root causes create new bugs. |
| "I'll investigate after I fix it" | You won't. And you can't verify a fix you don't understand. |
| "It's just a typo / off-by-one" | Verify. "Just a typo" has been wrong more times than right. |
| "I changed 3 things and it works now" | You don't know which one fixed it. Revert two. |
| "Reproducing is too hard, let me just fix it" | If you can't reproduce it, you can't verify the fix. |
| "I don't need a journal for something this simple" | Simple bugs that need journals are the ones that spiral. |
| "The tests pass, so it's fixed" | Tests pass ≠ root cause addressed. The symptom might be masked. |
| "Let me add a retry / increase the timeout" | That's masking, not fixing. Find why it fails. |
| "It works on my machine" | Then your machine isn't the one with the bug. Reproduce there. |
| "I've been debugging this for hours, let me just ship it" | Sunk cost. Escalate instead of shipping a guess. |
| "I'll simulate the logic to understand it" | Run the actual code. Simulation scripts are theories, not evidence. |
| "The subagent explored the codebase thoroughly" | Did it run the failing operation? Reading files is Phase 1, not Phase 3. |
During interactive debugging, your human partner's reactions are a signal channel. Frustration and redirection mean your process is off — not that the human is being difficult.
| What they say | What it means | What to do |
|---|---|---|
| "Is that not happening?" | You assumed something without verifying | Go back and verify with evidence |
| "Stop guessing" | You're proposing fixes without understanding | Return to Phase 1 — read the system |
| "We're going in circles" / frustrated tone | Your approach isn't working | Step back. Re-read from scratch. Your mental model is wrong. |
| "Did you actually check?" | You claimed something without evidence | Gather evidence first, then report what you found |
| "That's not the problem" | You're investigating the wrong thing | Ask what they're seeing. Their context is broader than yours. |
| "Just revert" | Your fixes are making things worse | Revert everything. Return to Phase 2 with a clean slate. |
Rule: Treat pushback as data, not criticism. If your human partner redirects you, the most productive response is to follow the redirect — not to defend your hypothesis.
All of these mean: Stop. Go back to Phase 1. Understand the system.
| Problem | Solution |
|---|---|
| Can't reproduce | Check environment, timing, data. Add logging. Wait for recurrence. |
| Hypothesis keeps being wrong | Step back. Re-read the code from scratch. Your mental model is wrong. |
| Fix works but don't understand why | Revert. That's not a fix, it's an accident. Investigate why it works. |
| Too many variables | Simplify. Remove components until you find the minimal failing case. |
| Intermittent failure | Add instrumentation. Look for race conditions, timing, external state. |
| Stuck in a loop | Rubber duck. Explain the bug step by step. The gap is the clue. |
| Bug disappeared | It's hiding. Check if something else changed. Don't declare victory. |
| Fix breaks other tests | Your fix is wrong, or the other tests reveal a deeper issue. Investigate. |
Sometimes systematic investigation legitimately concludes without a clear root cause — environmental issues, timing-dependent failures, external service flakiness, or bugs that disappear under observation. This is rare, but it happens.
When you genuinely cannot identify the root cause:
Caveat: 95% of "no root cause found" conclusions are actually incomplete investigations. Before declaring inconclusive, honestly assess:
If there are gaps, go back and fill them. "Inconclusive" is a legitimate outcome — but only after exhausting the process, not instead of it.
Bug report: "API returns 500 on user profile page for some users"
Read the profile endpoint handler, the user service, and the database query. The endpoint calls getUserProfile(id) which joins users and preferences tables.
Check the plug: database connected, service running, other endpoints working fine.
```bash
curl -s localhost:3000/api/profile/42   # 200 OK
curl -s localhost:3000/api/profile/107  # 500 Internal Server Error
```
Consistent. User 42 works, user 107 fails. Minimal reproduction found.
Error log: TypeError: Cannot read property 'theme' of null
Hypothesis 1: User 107 has a null preferences record.
Test: Query the database directly.
```sql
SELECT * FROM preferences WHERE user_id = 107;
-- Empty result set
```
Result: Confirmed — user 107 has no preferences row.
Hypothesis 2: The JOIN returns null for preferences, and the code doesn't handle it.
Test: Read getUserProfile — it does user.preferences.theme with no null check.
Result: Confirmed. The LEFT JOIN returns null preferences, and the code assumes it exists.
Root cause: getUserProfile accesses preferences.theme without null-checking the join result. Users without a preferences row trigger a TypeError.
Verification level: Verified (direct evidence: null row + no null check + exact error match).
Failing test:
```javascript
test('returns default theme when user has no preferences', async () => {
  const user = await createUser({ id: 999 }); // no preferences row
  const profile = await getUserProfile(999);
  expect(profile.theme).toBe('default');
});
```
Test fails with TypeError: Cannot read property 'theme' of null. Correct failure.
Fix:

```javascript
const theme = user.preferences?.theme ?? 'default';
```
Test passes.
- `curl localhost:3000/api/profile/107`: 200 OK, theme is "default"
- `preferences.language` has same bug — filed separate issue

Done.
Root cause identified → Fix with failing test → Verify fix
Otherwise → not debugging, just guessing
No exceptions without your human partner's permission.