Root cause analysis workflow for bugs. Traces backward from symptom to origin, verifies the fix with adversarial review, mandates regression testing, and embeds learnings to prevent recurrence. Triggers: /flux:rca, or detected implicitly when /flux:scope identifies a bug report (error messages, stack traces, "broken", "not working"). Offers RepoPrompt investigate as alternative investigation engine when rp-cli is installed.
Install via: npx claudepluginhub nairon-ai/flux --plugin flux. This skill uses the workspace's default tool permissions.
Referenced files: completion.md, trace.md.
Trace backward from symptom to root cause, fix at the source, verify the fix holds, and embed learnings so this class of bug never recurs.
"Never fix where the error appears. Always trace back to find the original trigger."
This is a fundamentally different flow from feature development. Features start with "what do we want?" — bugs start with "what went wrong?" Features diverge on solutions — bugs converge on root cause.
REPRODUCE → INVESTIGATE → ROOT CAUSE → FIX → VERIFY → LEARN
IMPORTANT: This plugin uses .flux/ for ALL task tracking. Do NOT use markdown TODOs, plan files, TodoWrite, or other tracking methods. All task state must be read and written via fluxctl.
CRITICAL: fluxctl is BUNDLED — NOT installed globally. which fluxctl will fail (expected). Always use:
PLUGIN_ROOT="${DROID_PLUGIN_ROOT:-${CLAUDE_PLUGIN_ROOT:-$(git rev-parse --show-toplevel 2>/dev/null || pwd)}}"
[ ! -d "$PLUGIN_ROOT/scripts" ] && PLUGIN_ROOT=$(ls -td ~/.claude/plugins/cache/nairon-flux/flux/*/ 2>/dev/null | head -1)
FLUXCTL="${PLUGIN_ROOT}/scripts/fluxctl"
$FLUXCTL <command>
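Since the binary is bundled rather than on PATH, it can fail to resolve silently. A defensive sketch (the guard and error message are illustrative, not part of fluxctl itself):

```shell
# Illustrative guard: confirm the bundled fluxctl resolved before invoking it.
check_fluxctl() {
  if [ -x "${FLUXCTL:-}" ]; then
    return 0
  fi
  echo "fluxctl not found at '${FLUXCTL:-unset}'; check the flux plugin install" >&2
  return 1
}
```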
On entry, set the session phase:
$FLUXCTL session-phase set rca
On completion, reset:
$FLUXCTL session-phase set idle
Agent Compatibility: This skill works across Codex, OpenCode, and legacy Claude environments. See agent-compat.md for tool differences.
Question Tool: Use the appropriate tool for your agent: AskUserQuestion, mcp_question, or AskUserTool.
You are: a senior debugging engineer. Methodical, skeptical, evidence-driven. You don't guess — you trace. You don't patch symptoms — you find root causes. You don't ship fixes without proof they work.
Tone: Precise, calm, investigative. Think "incident commander during a postmortem" — not "developer who just wants to get this ticket closed."
Operating stance:
Full request: $ARGUMENTS
Detection signals (how /flux:scope routes here):
When /flux:scope classifies the objective kind as bug and detects these signals, it asks:
"This looks like a bug report. Would you like me to run a root cause analysis instead of the standard scoping flow? RCA traces backward from the symptom to find the real source of the problem."
If the user goes directly to /flux:rca, skip detection — they know what they want.
PLUGIN_ROOT="${DROID_PLUGIN_ROOT:-${CLAUDE_PLUGIN_ROOT:-$(git rev-parse --show-toplevel 2>/dev/null || pwd)}}"
[ ! -d "$PLUGIN_ROOT/scripts" ] && PLUGIN_ROOT=$(ls -td ~/.claude/plugins/cache/nairon-flux/flux/*/ 2>/dev/null | head -1)
FLUXCTL="${PLUGIN_ROOT}/scripts/fluxctl"
$FLUXCTL session-state --json
# Detect investigation engine
HAS_RP=$(which rp-cli >/dev/null 2>&1 && echo 1 || echo 0)
# Detect testing infrastructure
HAS_TESTS=0
ls package.json Cargo.toml pyproject.toml go.mod Makefile 2>/dev/null | head -1
# Check for test directories or test files
[ -n "$(ls -d test tests spec __tests__ *_test.go *_test.py 2>/dev/null | head -1)" ] && HAS_TESTS=1
# Check for test scripts in package.json
jq -r '.scripts.test // empty' package.json 2>/dev/null | grep -qv 'no test specified' && HAS_TESTS=1
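The detection steps above can be sketched as one reusable function (a minimal sketch mirroring the same checks; the directory and file-name conventions are the ones listed in this phase, so adapt them to your stack):

```shell
# Sketch: echo 1 if test infrastructure is detected in a directory, else 0.
detect_tests() {
  dir="${1:-.}"
  # Conventional test directories
  for d in test tests spec __tests__; do
    [ -d "$dir/$d" ] && { echo 1; return; }
  done
  # Suffix-named test files (globs stay literal when nothing matches)
  for f in "$dir"/*_test.go "$dir"/*_test.py; do
    [ -e "$f" ] && { echo 1; return; }
  done
  # A real test script in package.json (npm's default placeholder doesn't count)
  if jq -r '.scripts.test // empty' "$dir/package.json" 2>/dev/null \
       | grep -v 'no test specified' | grep -q .; then
    echo 1; return
  fi
  echo 0
}
```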
Ask the user (use question tool):
If they provide an error message or stack trace, quote it exactly — don't paraphrase. The exact wording matters for tracing.
Based on the symptom, classify into a severity tier. This determines how deep the investigation goes.
| Tier | Description | Examples | Investigation depth |
|---|---|---|---|
| Quick | Cosmetic, non-blocking, isolated | CSS glitch, typo in UI, wrong color | Targeted trace, skip adversarial review |
| Standard | Functional breakage, affects users | Feature not working, wrong data displayed, API returning errors, production-only behavior with user impact | Full backward trace, production interrogation, standard review |
| Critical | Data loss, security, system-wide | Data corruption, auth bypass, crash loop, payment errors, duplicated jobs, resource-exhaustion outages | Full trace + production interrogation + adversarial review + mandatory regression test |
Tell the user the classification and ask if they agree:
"I'm classifying this as [tier] severity because [reason]. Does that feel right, or is this more/less severe than that?"
Before investigating, confirm the bug is reproducible:
If still not reproducible after clarification: warn the user that fixing without reproduction is risky, but proceed to investigation if they want to continue.
If HAS_RP=1, offer the choice:
"I can investigate this two ways:
- Flux RCA — I'll trace backward through the code from the error site, following the call chain until I find the root cause. Systematic and thorough.
- RepoPrompt Investigate — uses RepoPrompt's Context Builder for AI-powered codebase exploration. Forms hypotheses, gathers evidence across files, and produces a structured findings report. Great for bugs that span many files or where the entry point is unclear.
Which would you prefer?"
If HAS_RP=0, use Flux RCA automatically (no need to mention RepoPrompt).
Read trace.md for the full backward tracing methodology.
The 5-step backward trace:
At each level, document:
Level N: [file:line]
Called by: [caller file:line]
Data received: [what was passed in]
Problem: [what's wrong with it at this level]
→ Continue tracing? [yes/no — is this the origin or just a relay?]
Red flags — stop and keep tracing if you catch yourself:
For Standard and Critical bugs, and for any bug involving backend systems, retries, queues, schedulers, exports, databases, concurrency, or production-only behavior, run a production interrogation before concluding the investigation.
Do not ask a vague question like "any production concerns?" Ask specific scenario questions that force a concrete failure analysis:
Treat these as perspective assignments, not optional brainstorming. The goal is to surface bugs that live in deployment conditions rather than in the local code path.
Document the answers in the investigation notes:
Production perspective:
Scale: [failure mode / none]
Topology: [failure mode / none]
Data volume: [failure mode / none]
Resource limits: [failure mode / none]
Timing / retries: [failure mode / none]
If the concrete question reveals a failure mode, fold it into the root cause analysis even if the code originally looked "safe."
Use RepoPrompt's investigate flow via the existing Flux RP integration:
$FLUXCTL rp pick-window # Find the right RP window
Then leverage the RP investigate flow:
After RP investigation completes, run the same Production Interrogation against the findings, then continue to Phase 3.
Present the root cause clearly:
## Root Cause Analysis
**Symptom**: [What the user reported]
**Root cause**: [What actually went wrong, at the source]
**How it happened**: [The chain from root cause → symptom]
1. [Root cause]: [description] (file:line)
2. [Propagation]: [how bad state traveled] (file:line)
3. [Symptom]: [what the user saw] (file:line)
**Production trigger**: [Why this broke in the real environment: scale, topology, data size, concurrency, retry behavior, etc.]
**Why it wasn't caught**: [Why existing tests/checks didn't catch this]
**Confidence**: [High / Medium / Low]
- High: reproduced, traced, root cause confirmed
- Medium: strong evidence but some assumptions
- Low: best hypothesis, needs more investigation
If confidence is Low, tell the user and ask if they want to investigate further or proceed with the best hypothesis.
Skip this phase for Quick severity bugs.
Before writing the fix, verify the root cause analysis is correct. The goal is to avoid fixing the wrong thing.
Self-adversarial questions:
Important: Never accept "the logic is sound" as a stopping condition for backend or production-facing bugs. Force at least one concrete scale/topology/data-volume question.
For Critical severity: If RepoPrompt or Codex is available, run a second-model review:
# If RP available
$FLUXCTL rp setup-review
$FLUXCTL rp chat-send --message "Review this root cause analysis. Challenge the conclusion. Are there alternative explanations? Also analyze concrete production failure modes: scale, multi-instance deployment, retries, large datasets, resource exhaustion, and duplicate execution. [RCA summary]"
# If Codex available
# Export context and send to Codex for adversarial review
Present any challenges from the adversarial review. If the root cause holds, proceed. If challenged, re-investigate.
Before writing code, plan the fix:
## Fix Plan
**What to change**: [specific files and what changes]
**Fix at source**: [the root cause location, not the symptom location]
**Defense-in-depth**: [additional validation at intermediate layers, if warranted]
**Blast radius**: [what else this change touches]
**Risk**: [could this fix break anything else?]
**Production guardrail**: [what specifically prevents recurrence under real operating conditions]
Key principle: Fix at the source. If the root cause is in file A but the symptom appears in file Z, fix file A. Add defensive validation at intermediate layers only if the data crosses trust boundaries.
Write the fix. Keep it minimal — this is a bug fix, not a refactor. Resist the urge to clean up surrounding code.
Before you trust the fix, force a skeptical second pass on your own work:
If the second pass finds nothing, say so explicitly in the summary.
If testing infrastructure exists (HAS_TESTS=1):
Write a regression test that:
"A regression test that doesn't fail without the fix is not a regression test — it's a regular test that happens to pass."
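One way to mechanize that check, sketched with git stash (the test command argument is a placeholder for your suite invocation, and this assumes the fix is still uncommitted in the working tree):

```shell
# Sketch: prove the regression test fails without the fix and passes with it.
# $1 = the command that runs the new test, e.g. "npm test -- regression.test.js" (placeholder).
verify_regression_test() {
  test_cmd="$1"
  git stash --quiet                     # temporarily shelve the uncommitted fix
  if sh -c "$test_cmd" >/dev/null 2>&1; then
    echo "WARNING: regression test passes WITHOUT the fix" >&2
    git stash pop --quiet
    return 1
  fi
  git stash pop --quiet                 # restore the fix
  sh -c "$test_cmd"                     # must pass with the fix applied
}
```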
Run the test suite to confirm:
If no testing infrastructure (HAS_TESTS=0):
Write a manual verification checklist instead:
## Verification Checklist
1. [ ] Reproduce the original bug (should now be fixed)
2. [ ] Test the specific scenario that triggered it
3. [ ] Test related scenarios that could be affected
4. [ ] Check edge cases: [list specific ones based on the root cause]
5. [ ] If relevant, simulate production conditions: concurrent users, multiple instances, large datasets, slow upstreams, retries, restarts, or duplicate execution
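For item 5, concurrency can be simulated cheaply from the shell. A hedged sketch, where REPRO_CMD is a placeholder for your actual reproduction command:

```shell
# Sketch: replay the reproduction command N times concurrently and count failures.
run_concurrent() {
  n="${1:-8}"; pids=""; fails=0
  for i in $(seq "$n"); do
    sh -c "${REPRO_CMD:-true}" & pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || fails=$((fails + 1))
  done
  echo "$fails"   # 0 means no concurrent instance hit the failure
}
```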
Also note in the PR:
"This codebase doesn't have automated tests yet. A regression test would have caught this bug before it reached users. Consider setting up a testing framework — /flux:prime can audit your test coverage and recommend a setup."
Run a targeted quality scan on changed files only:
$FLUXCTL desloppify-scan --changed-only
Check for:
This is the critical step that separates RCA from "just fixing a bug." The goal is to make this class of bug harder to introduce in the future.
Write a brain vault pitfall note capturing the root cause:
# Check existing pitfalls to avoid duplicates
cat .flux/brain/index.md 2>/dev/null
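One way to derive the note's filename from the pitfall title (an illustrative helper, not part of fluxctl):

```shell
# Illustrative: turn a pitfall title into a descriptive-slug filename.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//'
}
```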
Write to .flux/brain/pitfalls/[descriptive-slug].md:
# [Descriptive title]
## What happened
[One sentence: the bug and its root cause]
## Why it happened
[The deeper reason — missing validation, wrong assumption, unclear contract, missing production guardrail, or failure to reason about scale/topology]
## How to avoid
[Specific guidance for future development, including the exact production question that would have exposed it earlier]
## Trigger conditions
[The real-world conditions that activated the bug: scale, topology, data size, retries, timing, etc.]
## Related files
- [file:line] — where the root cause was
- [file:line] — where the symptom appeared
Ask: "Could this class of bug be prevented structurally?" Check each option:
If any apply, tell the user and offer to implement:
"This bug could be prevented in the future with [specific mechanism]. Want me to add that?"
Search for similar patterns in the codebase:
If systemic, flag it:
"I found [N] other places in the codebase with the same pattern that could have the same bug. Want me to create a task to address them?"
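The sweep itself can be as simple as a fixed-string grep over the repo (a sketch; the pattern is whatever code fragment your trace identified as the root-cause construct, shown here as a hypothetical example):

```shell
# Sketch: list other occurrences of the root-cause pattern.
# $1 = a literal code fragment from the trace (hypothetical), $2 = directory to scan.
find_similar() {
  grep -rnF -- "$1" "$2" 2>/dev/null
}
# Example: find_similar 'parseInt(' src/
```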
Read completion.md for:
- The RCA Summary output template

Follow that file exactly after the Learn phase finishes.