Help us improve
Share bugs, ideas, or general feedback.
From rust-development-pipeline
Debug an existing fixture-anchored system that passes its acceptance test but produces wrong physics or wrong output. Classifies prior investigation notes, establishes external anchor criteria, applies upstream-audit rule, implements fix with discriminator-value tests, and captures resolution. Use when the user says "/debug-outcomes", "debug this failure", "the test passes but the output is wrong", or describes a symptom in a system that already has fixture files and a passing (but loose) acceptance test.
npx claudepluginhub tonywu20/my-claude-marketplace --plugin rust-development-pipelineHow this skill is triggered — by the user, by Claude, or both
Slash command
/rust-development-pipeline:debug-outcomesThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Handles the **debug shape**: an existing fixture-anchored system whose acceptance
Provides behavioral guidelines to reduce common LLM coding mistakes, focusing on simplicity, surgical changes, assumption surfacing, and verifiable success criteria.
Searches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
Provides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Share bugs, ideas, or general feedback.
Handles the debug shape: an existing fixture-anchored system whose acceptance
test passes but whose output violates physics or known-good reference values. This
is distinct from /drive-outcomes (new development against a phase plan) and
/diagnose-tests (static audit). The entry point is a symptom, not a plan file.
The key risks in this shape are:
/debug-outcomes "<symptom>"
Where <symptom> is a free-text description of the observed wrong output
(e.g., "V_ion = +82 Ha at atom site, expected ≈ -21 Ha").
cat CONTEXT.md 2>/dev/null || echo "NO_CONTEXT_MD"
fd --no-ignore -t f . fixtures/ 2>/dev/null | head -5
/init-project first./drive-outcomes — this skill
requires external ground truth. Without fixtures there is no anchor.fd instead of findrg instead of grep${CLAUDE_PLUGIN_ROOT}/skills/drive-outcomes/references/odd-pattern.md${CLAUDE_PLUGIN_ROOT}/skills/drive-outcomes/references/forensic-tasks-spec.mdnotes/debug/<slug>/INVESTIGATION.md — prior-note classificationnotes/debug/<slug>/CRITERIA.md — external anchor criterianotes/debug/<slug>/DIAGNOSTIC_SELFTEST.md — diagnostic self-test resultsnotes/debug/<slug>/DIVERGENCE_SURFACE.md — divergence-surface enumerationnotes/debug/<slug>/RESOLUTION.md — root cause, fix location, reclassified claimsnotes/failure-patterns.md — appended entryecho "debug-outcomes" > .claude/.current_stage
date +%s%3N > .claude/.session_start
SLUG="debug-$(date +%Y%m%d-%H%M)"
mkdir -p notes/debug/$SLUG
Read CONTEXT.md, ADRs, notes/failure-patterns.md, and odd-pattern.md.
Scan notes/ for any investigation notes, session logs, or plan files referencing
the symptom:
rg -l "<symptom-keyword>" notes/ 2>/dev/null
For every numeric claim found in those files, classify it and record in
notes/debug/$SLUG/INVESTIGATION.md:
| Class | Definition | Admissible as criterion? |
|---|---|---|
| EXTERNAL | From a fixture file, published spec, or reference implementation output | Yes |
| DERIVED | Computed by our own pipeline (possibly buggy) | No — not without independent corroboration |
| HYPOTHESIZED | Inferred or estimated from scaling arguments | No |
Classification procedure: for each number, ask "where did this come from?" Trace it to its origin. If the origin is our own code or a prior session's output, it is DERIVED regardless of how it was phrased ("expected", "needed", "target").
Only EXTERNAL claims may become success criteria in Step 3.
Memory citation rule: claims from prior sessions' memory entries auto-classify as DERIVED unless they carry all three of:
file:line in the reference implementation, reachable in
≤60 secondsWithout these three fields, a memory entry is a hypothesis, not evidence.
Discover fixture files:
fd --no-ignore -t f . fixtures/ 2>/dev/null | sort
For each fixture file relevant to the symptom, extract concrete expected values
directly from the file — not from prior notes. Write to
notes/debug/$SLUG/CRITERIA.md:
# Anchor Criteria: <symptom-slug>
## Fixture Files
- `fixtures/<path>` — <what it contains>
## Success Criteria
- <assertion> (Source: <fixture-path>, <field or offset>)
- <assertion> (Source: <reference-impl-file>:<line>)
Each criterion must be EXTERNAL and falsifiable. If no EXTERNAL anchor can be established, stop and tell the user — the bug cannot be debugged without ground truth.
Before testing any hypothesis, enumerate every place the local pipeline could
plausibly disagree with the reference implementation for the observed symptom
class. Write to notes/debug/$SLUG/DIVERGENCE_SURFACE.md.
Use these generic divergence-surface categories as a starting point (project-specific items supersede where applicable):
if / when / case / early-return)For each item, classify as:
Items cannot be ruled out by prior-session memory unless that memory carries verification granularity, citation (file:line), and counter-example/scope. A prior memory that says "X was verified" without these three fields is DERIVED and cannot eliminate a divergence-surface item.
Goal: breadth-first enumeration before diving deep on any single item. This prevents depth-first hypothesis chaining from prior sessions' biases.
Before trusting any diagnostic or comparison code in downstream steps, verify every diagnostic code path against the external anchors from Step 3.
Enumerate diagnostics — list every diagnostic script, comparison
function, and analysis tool that downstream steps will use. Write to
notes/debug/$SLUG/DIAGNOSTIC_SELFTEST.md.
Self-test each diagnostic — for each enumerated path, query ≥10 sample points through 2 structurally independent code paths and assert agreement at every point:
# Path A: high-level diagnostic function being verified
result_a = compute_v_ion(parsed_data)
# Path B: manual computation from first principles
result_b = manual_v_ion(parsed_data, formula="V_eff - V_H - V_xc")
assert abs(result_a - result_b) < 1e-6
Independence means different indexing scheme, iteration order, or formula derivation — not the same logic in different syntax.
Record pass/fail per sample point in DIAGNOSTIC_SELFTEST.md.
If any sample point disagrees — the diagnostic code has a bug. Fix the diagnostic before proceeding to Step 6. Do NOT interpret the diagnostic output; the diagnostic is part of the debug surface and is wrong by default.
Sanity-check against physical intuition — before trusting any diagnostic output, compare against coarse physical expectation:
When diagnostic output conflicts with intuition, treat the diagnostic as the likely error source. Reject any diagnostic whose output cannot be reproduced by at least one independent computation path.
Reference: odd-pattern.md (Diagnostic Self-Verification section) for the
full pattern description and rationale.
Ask the user explicitly:
"Has the algorithm formula been independently validated against the reference implementation (not just against our own pipeline output)?"
This gate enforces the Upstream Audit Rule from odd-pattern.md: if the algorithm
has been verified and outputs still violate criteria, re-examining the algorithm
wastes sessions. The bug is upstream.
In upstream-audit mode, enumerate all inputs to the algorithm and check each:
Which parser produces this input?
What unit convention does the reference implementation use at the read boundary?
Does the parser convert at read time, or does the caller convert?
Find BOTH the read code line AND the write code line in the reference
implementation (Read/Write Symmetry rule from odd-pattern.md).
Conditional-skip table — produce a table mapping every if/when/
case/early-return in the reference function to our implementation's
handling. Empty "Our condition" cells are red flags requiring explicit
justification ("safe because input X is always Y, verified by test Z")
or a fix to add the missing condition.
Reference: odd-pattern.md (Conditional-Skip Audits section) for the
full table format and rationale.
Diagnostic check — if any diagnostic in use emits only summary statistics (no per-point backing), report to the user:
"The diagnostic only emits summary statistics. Should I write a brute-force per-point comparison loop to get ground-truth data? If you know what specific coordinates or pattern to target, tell me."
If the user provides guidance, follow it. If the user agrees or doesn't
know, write a brute-force per-point sweep that dumps every data point
through a simple independently-written loop. Write the output to
notes/debug/$SLUG/PER_POINT_DUMP.md.
Run loose: execute the existing acceptance test as-is. Record the actual
observed value (e.g., max_residual = 103.4 Ha).
Observe: compare against EXTERNAL anchor criteria from Step 3. Compute discrepancy magnitude and direction.
Write a tight test that fails at the observed wrong value and passes only
at the correct value. Apply the Discriminator Value Selection rule from
odd-pattern.md: the threshold must be placed such that correct and incorrect
implementations differ by ≥ 2×. Boundary values are brittle.
// TIGHT: correct ≈ -21 Ha, wrong = +82 Ha → ratio ≈ 4×
// threshold at -10 Ha gives clear binary signal
assert!(v_ion < -10.0, "V_ion = {} Ha (expected < -10 Ha)", v_ion);
// Source: Cu111_CO.pot_fmt, V_eff − V_H − V_xc = -21 Ha
Confirm the tight test fails on the current (broken) code before implementing any fix. If it passes, the threshold is wrong — tighten further.
Brute-force guard — after every 2 full cycles through sub-steps 1-4
without producing a new artifact file in notes/debug/$SLUG/, stop and
report to the user:
"I've cycled through Loose-then-Tighten twice without a breakthrough. Current state: {observed values, hypotheses so far}. Next step: brute-force sweep to find the discrepancy pattern. Any insight on what to target, or should I dump everything?"
If the user guides, follow guidance. Otherwise, write a brute-force
per-point loop that dumps raw data with zero aggregation to
notes/debug/$SLUG/BRUTE_FORCE_DUMP.md. Use the raw dump to identify
the discrepancy pattern (e.g., "every 3rd point is wrong" → stride-3
indexing bug), then return to sub-step 2 with the corrected understanding.
Decompose the residual — if the failing observable is a sum of multiple components (e.g., V_eff = V_H + V_ion + V_xc), each component should have an independent EXTERNAL anchor. Write tight tests per-component, not just at the sum. Residuals are attributable component-by-component, not bundled.
Scope the fix to the upstream component identified in Step 6 (parser, unit conversion, file format) or the algorithm if in algorithm-validation mode.
Standard loop:
cargo check --workspace 2>&1Auto-review before commit:
Commit: fix(<component>): <symptom-slug> — <root-cause-one-line>
Write notes/debug/$SLUG/RESOLUTION.md:
# Resolution: <symptom-slug>
**Symptom**: <original symptom text>
**Root cause**: <one sentence>
**Fix location**: <file:line>
**Fix description**: <what changed>
**Anchor criteria used**: <list EXTERNAL claims from Step 3>
**Prior notes reclassified**:
- "<claim>" — reclassified from EXTERNAL to DERIVED because <reason>
**Date**: <ISO date>
Append a summary entry to notes/failure-patterns.md (create if absent):
## <date>: <symptom-slug>
**Root cause**: <one sentence>
**Fix**: <file:line>
**Pattern**: <upstream-unit-mismatch | algorithm-error | format-assumption | ...>
Update any open investigation notes that reference this symptom: append a
RESOLVED section with a pointer to RESOLUTION.md.
CLAUDE_PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT}" CLAUDE_PROJECT_DIR="${CLAUDE_PROJECT_DIR}" \
uv run --directory "${CLAUDE_PLUGIN_ROOT}" python \
"${CLAUDE_PLUGIN_ROOT}/scripts/eval-session-metrics.py" debug-outcomes
"Debug session complete.
Root cause: {root-cause}. Fix: {file:line}. Tight test anchored to: {fixture-path}. Resolution recorded at notes/debug/{slug}/RESOLUTION.md.
{N} prior-note claims reclassified (DERIVED/HYPOTHESIZED → not used as criteria).
Next step:
/make-judgementif the fix touched multiple components."
Will:
Will not:
odd-pattern.md content — references it by path