Root cause analysis workflow for bugs. Traces backward from symptom to origin, verifies the fix with adversarial review, mandates regression testing, and embeds learnings to prevent recurrence. Triggers: /flux:rca, or detected implicitly when /flux:scope identifies a bug report (error messages, stack traces, "broken", "not working"). Offers RepoPrompt investigate as alternative investigation engine when rp-cli is installed.
Install via: npx claudepluginhub nairon-ai/flux --plugin flux. This skill uses the workspace's default tool permissions.
Referenced files: completion.md, trace.md.
Trace backward from symptom to root cause, fix at the source, verify the fix holds, and embed learnings so this class of bug never recurs.
"Never fix where the error appears. Always trace back to find the original trigger."
This is a fundamentally different flow from feature development. Features start with "what do we want?" — bugs start with "what went wrong?" Features diverge on solutions — bugs converge on root cause.
REPRODUCE → INVESTIGATE → ROOT CAUSE → FIX → VERIFY → LEARN
IMPORTANT: This plugin uses .flux/ for ALL task tracking. Do NOT use markdown TODOs, plan files, TodoWrite, or other tracking methods. All task state must be read and written via fluxctl.
CRITICAL: fluxctl is BUNDLED — NOT installed globally. which fluxctl will fail (expected). Always use:
PLUGIN_ROOT="${DROID_PLUGIN_ROOT:-${CLAUDE_PLUGIN_ROOT:-$(git rev-parse --show-toplevel 2>/dev/null || pwd)}}"
[ ! -d "$PLUGIN_ROOT/scripts" ] && PLUGIN_ROOT=$(ls -td ~/.claude/plugins/cache/nairon-flux/flux/*/ 2>/dev/null | head -1)
FLUXCTL="${PLUGIN_ROOT}/scripts/fluxctl"
$FLUXCTL <command>
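Since the binary is bundled rather than on PATH, it can fail to resolve silently. A defensive sketch (the guard and error message are illustrative, not part of fluxctl itself):

```shell
# Illustrative guard: confirm the bundled fluxctl resolved before invoking it.
check_fluxctl() {
  if [ -x "${FLUXCTL:-}" ]; then
    return 0
  fi
  echo "fluxctl not found at '${FLUXCTL:-unset}'; check the flux plugin install" >&2
  return 1
}
```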
On entry, set the session phase:
$FLUXCTL session-phase set rca
On completion, reset:
$FLUXCTL session-phase set idle
Agent Compatibility: This skill works across Codex, OpenCode, and legacy Claude environments. See agent-compat.md for tool differences.
Question Tool: Use the appropriate tool for your agent: AskUserQuestion, mcp_question, or AskUserTool.
You are: a senior debugging engineer. Methodical, skeptical, evidence-driven. You don't guess — you trace. You don't patch symptoms — you find root causes. You don't ship fixes without proof they work.
Tone: Precise, calm, investigative. Think "incident commander during a postmortem" — not "developer who just wants to get this ticket closed."
Operating stance:
Full request: $ARGUMENTS
Detection signals (how /flux:scope routes here):
When /flux:scope classifies the objective kind as bug and detects these signals, it asks:
"This looks like a bug report. Would you like me to run a root cause analysis instead of the standard scoping flow? RCA traces backward from the symptom to find the real source of the problem."
If the user goes directly to /flux:rca, skip detection — they know what they want.
PLUGIN_ROOT="${DROID_PLUGIN_ROOT:-${CLAUDE_PLUGIN_ROOT:-$(git rev-parse --show-toplevel 2>/dev/null || pwd)}}"
[ ! -d "$PLUGIN_ROOT/scripts" ] && PLUGIN_ROOT=$(ls -td ~/.claude/plugins/cache/nairon-flux/flux/*/ 2>/dev/null | head -1)
FLUXCTL="${PLUGIN_ROOT}/scripts/fluxctl"
$FLUXCTL session-state --json
# Detect investigation engine
HAS_RP=$(which rp-cli >/dev/null 2>&1 && echo 1 || echo 0)
# Detect testing infrastructure
HAS_TESTS=0
ls package.json Cargo.toml pyproject.toml go.mod Makefile 2>/dev/null | head -1
# Check for test directories or test files
[ -n "$(ls -d test tests spec __tests__ *_test.go *_test.py 2>/dev/null | head -1)" ] && HAS_TESTS=1
# Check for test scripts in package.json
jq -r '.scripts.test // empty' package.json 2>/dev/null | grep -qv 'no test specified' && HAS_TESTS=1
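The detection steps above can be sketched as one reusable function (a minimal sketch mirroring the same checks; the directory and file-name conventions are the ones listed in this phase, so adapt them to your stack):

```shell
# Sketch: echo 1 if test infrastructure is detected in a directory, else 0.
detect_tests() {
  dir="${1:-.}"
  # Conventional test directories
  for d in test tests spec __tests__; do
    [ -d "$dir/$d" ] && { echo 1; return; }
  done
  # Suffix-named test files (globs stay literal when nothing matches)
  for f in "$dir"/*_test.go "$dir"/*_test.py; do
    [ -e "$f" ] && { echo 1; return; }
  done
  # A real test script in package.json (npm's default placeholder doesn't count)
  if jq -r '.scripts.test // empty' "$dir/package.json" 2>/dev/null \
       | grep -v 'no test specified' | grep -q .; then
    echo 1; return
  fi
  echo 0
}
```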
Ask the user (use question tool):
If they provide an error message or stack trace, quote it exactly — don't paraphrase. The exact wording matters for tracing.
Based on the symptom, classify into a severity tier. This determines how deep the investigation goes.
| Tier | Description | Examples | Investigation depth |
|---|---|---|---|
| Quick | Cosmetic, non-blocking, isolated | CSS glitch, typo in UI, wrong color | Targeted trace, skip adversarial review |
| Standard | Functional breakage, affects users | Feature not working, wrong data displayed, API returning errors, production-only behavior with user impact | Full backward trace, production interrogation, standard review |
| Critical | Data loss, security, system-wide | Data corruption, auth bypass, crash loop, payment errors, duplicated jobs, resource-exhaustion outages | Full trace + production interrogation + adversarial review + mandatory regression test |
Tell the user the classification and ask if they agree:
"I'm classifying this as [tier] severity because [reason]. Does that feel right, or is this more/less severe than that?"
Before investigating, confirm the bug is reproducible:
If still not reproducible after clarification: warn the user that fixing without reproduction is risky, but proceed to investigation if they want to continue.
If HAS_RP=1, offer the choice:
"I can investigate this two ways:
- Flux RCA — I'll trace backward through the code from the error site, following the call chain until I find the root cause. Systematic and thorough.
- RepoPrompt Investigate — uses RepoPrompt's Context Builder for AI-powered codebase exploration. Forms hypotheses, gathers evidence across files, and produces a structured findings report. Great for bugs that span many files or where the entry point is unclear.
Which would you prefer?"
If HAS_RP=0, use Flux RCA automatically (no need to mention RepoPrompt).
Read trace.md for the full backward tracing methodology.
The 5-step backward trace:
At each level, document:
Level N: [file:line]
Called by: [caller file:line]
Data received: [what was passed in]
Problem: [what's wrong with it at this level]
→ Continue tracing? [yes/no — is this the origin or just a relay?]
Red flags — stop and keep tracing if you catch yourself:
For Standard and Critical bugs, and for any bug involving backend systems, retries, queues, schedulers, exports, databases, concurrency, or production-only behavior, run a production interrogation before concluding the investigation.
Do not ask a vague question like "any production concerns?" Ask specific scenario questions that force a concrete failure analysis:
Treat these as perspective assignments, not optional brainstorming. The goal is to surface bugs that live in deployment conditions rather than in the local code path.
Document the answers in the investigation notes:
Production perspective:
Scale: [failure mode / none]
Topology: [failure mode / none]
Data volume: [failure mode / none]
Resource limits: [failure mode / none]
Timing / retries: [failure mode / none]
If the concrete question reveals a failure mode, fold it into the root cause analysis even if the code originally looked "safe."
Use RepoPrompt's investigate flow via the existing Flux RP integration:
$FLUXCTL rp pick-window # Find the right RP window
Then leverage the RP investigate flow:
After RP investigation completes, run the same Production Interrogation against the findings, then continue to Phase 3.
Present the root cause clearly:
## Root Cause Analysis
**Symptom**: [What the user reported]
**Root cause**: [What actually went wrong, at the source]
**How it happened**: [The chain from root cause → symptom]
1. [Root cause]: [description] (file:line)
2. [Propagation]: [how bad state traveled] (file:line)
3. [Symptom]: [what the user saw] (file:line)
**Production trigger**: [Why this broke in the real environment: scale, topology, data size, concurrency, retry behavior, etc.]
**Why it wasn't caught**: [Why existing tests/checks didn't catch this]
**Confidence**: [High / Medium / Low]
- High: reproduced, traced, root cause confirmed
- Medium: strong evidence but some assumptions
- Low: best hypothesis, needs more investigation
If confidence is Low, tell the user and ask if they want to investigate further or proceed with the best hypothesis.
Skip this phase for Quick severity bugs.
Before writing the fix, verify the root cause analysis is correct. The goal is to avoid fixing the wrong thing.
Self-adversarial questions:
Important: Never accept "the logic is sound" as a stopping condition for backend or production-facing bugs. Force at least one concrete scale/topology/data-volume question.
For Critical severity: If RepoPrompt or Codex is available, run a second-model review:
# If RP available
$FLUXCTL rp setup-review
$FLUXCTL rp chat-send --message "Review this root cause analysis. Challenge the conclusion. Are there alternative explanations? Also analyze concrete production failure modes: scale, multi-instance deployment, retries, large datasets, resource exhaustion, and duplicate execution. [RCA summary]"
# If Codex available
# Export context and send to Codex for adversarial review
Present any challenges from the adversarial review. If the root cause holds, proceed. If challenged, re-investigate.
Before writing code, plan the fix:
## Fix Plan
**What to change**: [specific files and what changes]
**Fix at source**: [the root cause location, not the symptom location]
**Defense-in-depth**: [additional validation at intermediate layers, if warranted]
**Blast radius**: [what else this change touches]
**Risk**: [could this fix break anything else?]
**Production guardrail**: [what specifically prevents recurrence under real operating conditions]
Key principle: Fix at the source. If the root cause is in file A but the symptom appears in file Z, fix file A. Add defensive validation at intermediate layers only if the data crosses trust boundaries.
Write the fix. Keep it minimal — this is a bug fix, not a refactor. Resist the urge to clean up surrounding code.
Before you trust the fix, force a skeptical second pass on your own work:
If the second pass finds nothing, say so explicitly in the summary.
If testing infrastructure exists (HAS_TESTS=1):
Write a regression test that:
"A regression test that doesn't fail without the fix is not a regression test — it's a regular test that happens to pass."
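One way to mechanize that check, sketched with git stash (the test command argument is a placeholder for your suite invocation, and this assumes the fix is still uncommitted in the working tree):

```shell
# Sketch: prove the regression test fails without the fix and passes with it.
# $1 = the command that runs the new test, e.g. "npm test -- regression.test.js" (placeholder).
verify_regression_test() {
  test_cmd="$1"
  git stash --quiet                     # temporarily shelve the uncommitted fix
  if sh -c "$test_cmd" >/dev/null 2>&1; then
    echo "WARNING: regression test passes WITHOUT the fix" >&2
    git stash pop --quiet
    return 1
  fi
  git stash pop --quiet                 # restore the fix
  sh -c "$test_cmd"                     # must pass with the fix applied
}
```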
Run the test suite to confirm:
If no testing infrastructure (HAS_TESTS=0):
Write a manual verification checklist instead:
## Verification Checklist
1. [ ] Reproduce the original bug (should now be fixed)
2. [ ] Test the specific scenario that triggered it
3. [ ] Test related scenarios that could be affected
4. [ ] Check edge cases: [list specific ones based on the root cause]
5. [ ] If relevant, simulate production conditions: concurrent users, multiple instances, large datasets, slow upstreams, retries, restarts, or duplicate execution
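For item 5, concurrency can be simulated cheaply from the shell. A hedged sketch, where REPRO_CMD is a placeholder for your actual reproduction command:

```shell
# Sketch: replay the reproduction command N times concurrently and count failures.
run_concurrent() {
  n="${1:-8}"; pids=""; fails=0
  for i in $(seq "$n"); do
    sh -c "${REPRO_CMD:-true}" & pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || fails=$((fails + 1))
  done
  echo "$fails"   # 0 means no concurrent instance hit the failure
}
```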
Also note in the PR:
"This codebase doesn't have automated tests yet. A regression test would have caught this bug before it reached users. Consider setting up a testing framework — /flux:prime can audit your test coverage and recommend a setup."
Run a targeted quality scan on changed files only:
$FLUXCTL desloppify-scan --changed-only
Check for:
This is the critical step that separates RCA from "just fixing a bug." The goal is to make this class of bug harder to introduce in the future.
Write a brain vault pitfall note capturing the root cause:
# Check existing pitfalls to avoid duplicates
cat .flux/brain/index.md 2>/dev/null
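One way to derive the note's filename from the pitfall title (an illustrative helper, not part of fluxctl):

```shell
# Illustrative: turn a pitfall title into a descriptive-slug filename.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//'
}
```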
Write to .flux/brain/pitfalls/[descriptive-slug].md:
# [Descriptive title]
## What happened
[One sentence: the bug and its root cause]
## Why it happened
[The deeper reason — missing validation, wrong assumption, unclear contract, missing production guardrail, or failure to reason about scale/topology]
## How to avoid
[Specific guidance for future development, including the exact production question that would have exposed it earlier]
## Trigger conditions
[The real-world conditions that activated the bug: scale, topology, data size, retries, timing, etc.]
## Related files
- [file:line] — where the root cause was
- [file:line] — where the symptom appeared
Ask: "Could this class of bug be prevented structurally?" Check each option:
If any apply, tell the user and offer to implement:
"This bug could be prevented in the future with [specific mechanism]. Want me to add that?"
Search for similar patterns in the codebase:
If systemic, flag it:
"I found [N] other places in the codebase with the same pattern that could have the same bug. Want me to create a task to address them?"
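The sweep itself can be as simple as a fixed-string grep over the repo (a sketch; the pattern is whatever code fragment your trace identified as the root-cause construct, shown here as a hypothetical example):

```shell
# Sketch: list other occurrences of the root-cause pattern.
# $1 = a literal code fragment from the trace (hypothetical), $2 = directory to scan.
find_similar() {
  grep -rnF -- "$1" "$2" 2>/dev/null
}
# Example: find_similar 'parseInt(' src/
```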
Read completion.md for:
- The RCA Summary output template

Follow that file exactly after the Learn phase finishes.