Diagnoses causes of observed phenomena via abductive reasoning. Spawns diagnosticians with lenses (technical, human-factors, process, etc.), evaluates fit to evidence, calibrates confidence, reports leaders with distinguishing tests. Feedback only.
`npx claudepluginhub chrisallenlane/claude-swe-workflows --plugin claude-swe-workflows`

This skill uses the workspace's default tool permissions.
Takes a phenomenon — something that was observed and that the user wants to understand — and produces a ranked set of candidate causes with evidence-based confidence calibration. Uses **abductive reasoning**: inference to the best explanation. Distinct from `/bug-fix` (which handles code-specific diagnosis with artifact output and execution tooling); `/think-diagnose` is pure reasoning about causes, applicable to non-code phenomena as readily as code ones.
This skill produces no tangible artifacts. It is a consultant, not an implementer. No code, no tickets, no commits. The output is a structured diagnosis report that the user can act on by gathering more evidence, adopting a leading cause, or piping to /think-brainstorm for remediation.
- **Judge** (you, running this skill): elicits the phenomenon brief, selects lenses, spawns diagnosticians, evaluates their candidate causes against the evidence, and writes the report.
- **Diagnosticians:** each receives a specific reasoning lens and generates candidate causes (with mechanisms, predictions, refuters, and plausibility) in isolation from the other diagnosticians.
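A minimal sketch, in Python, of what each diagnostician hands back to the judge. The type and field names are assumptions drawn from the description above, not the skill's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class CandidateCause:
    """One candidate cause, as produced by a single diagnostician."""
    name: str
    lens: str                                              # which reasoning lens surfaced it
    mechanism: str                                         # how this cause produces the observations
    predictions: list[str] = field(default_factory=list)  # what else should be true if it holds
    refuters: list[str] = field(default_factory=list)     # observations that would rule it out
    plausibility: str = ""                                 # brief domain-knowledge assessment
```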
Whatever form the phenomenon arrives in, produce a written brief of it. Precisely what is the thing to explain? Vague phenomena produce vague diagnoses.
This is the most failure-prone step in the entire workflow, and it has enforced structure. Most bad diagnoses start by accepting interpretations as observations.
Elicit from the user, in three distinct buckets:

- **Observations:** what was concretely seen or measured; the ground truth
- **Interpretations:** causal stories the user or others already hold, held aside rather than accepted as given
- **Unavailable evidence:** what's unknown or wasn't measured, which constrains what can be concluded
Push back on smuggled interpretations. If the user says "the metric dropped because of the migration," that's two claims: (a) the metric dropped (observation) and (b) the migration caused it (interpretation). Separate them before proceeding.
3-6 clarifying questions are typically enough to establish this split. Stop when you have enough to give diagnosticians material they can work with.
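The three buckets can be pictured as a small record. A sketch with hypothetical names, showing the observation/interpretation split from the example above made explicit:

```python
from dataclasses import dataclass, field

@dataclass
class PhenomenonBrief:
    """Step-2 brief: observations kept strictly apart from interpretations."""
    summary: str
    observations: list[str] = field(default_factory=list)     # concrete ground truth
    interpretations: list[str] = field(default_factory=list)  # causal stories held aside, not accepted
    unavailable: list[str] = field(default_factory=list)      # what's unknown or wasn't measured

# The smuggled-interpretation example, split before proceeding:
brief = PhenomenonBrief(
    summary="metric dropped around the migration window",
    observations=["the metric dropped"],
    interpretations=["the migration caused it"],
)
```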
Select 3-6 lenses from the palette based on the phenomenon's shape.
Available lenses include:

- **technical**
- **human-factors**
- **process**
- **incentive-structure**
- **environmental** (external dependencies and conditions)
- **measurement-artifact** (whether the phenomenon is an artifact of how it was observed)
Selection heuristics:
Drop lenses that don't fit. A phenomenon in a closed system without external dependencies probably doesn't need environmental. A phenomenon observed directly (not through metrics) probably doesn't need measurement-artifact.
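A sketch of those two heuristics as a filter; the feature labels are invented names for the conditions just described, not part of the skill:

```python
def select_lenses(palette: list[str], phenomenon_features: set[str]) -> list[str]:
    """Keep only lenses whose preconditions the phenomenon meets."""
    keep = []
    for lens in palette:
        if lens == "environmental" and "external-dependencies" not in phenomenon_features:
            continue  # closed system: the environmental lens adds noise
        if lens == "measurement-artifact" and "observed-via-metrics" not in phenomenon_features:
            continue  # directly observed: measurement artifacts are unlikely
        keep.append(lens)
    return keep[:6]  # aim for 3-6 lenses
```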
Spawn one THK - Diagnostician agent per chosen lens, in parallel. Each receives:

- the phenomenon brief from step 2 (observations, interpretations held aside, unavailable evidence)
- its single assigned lens
No cross-talk between diagnosticians. NGT (nominal group technique) principle — independent reasoning first, evaluation second. Isolated diagnosticians produce more distinct candidate causes; coordinated ones anchor on the first compelling story.
Collect all candidate causes.
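A sketch of the isolation property, assuming a `diagnose(brief, lens)` callable that stands in for dispatching one agent:

```python
from concurrent.futures import ThreadPoolExecutor

def run_diagnosticians(brief, lenses, diagnose):
    """Spawn one diagnostician per lens, in parallel, with no shared state.

    Each call sees only the brief and its own lens: independent generation
    first, evaluation later by the judge.
    """
    with ThreadPoolExecutor(max_workers=len(lenses)) as pool:
        futures = [pool.submit(diagnose, brief, lens) for lens in lenses]
        return [cause for f in futures for cause in f.result()]
```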
This phase is new territory for /think-* skills. The prior skills (brainstorm, reframe, scrutinize, deliberate) are purely divergent or choose among pre-stated options; this skill requires the orchestrator to do evaluation against evidence.
For each candidate cause from step 4, evaluate:

- **Mechanism:** does the proposed mechanism actually produce the observations?
- **Evidence fit:** which observations it explains, which of its predictions are confirmed or still need checking, and whether any refuters were observed
- **Plausibility:** a brief domain-knowledge assessment of the mechanism
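A sketch of the evidence-fit tally, reusing the `CandidateCause` fields from the earlier sketch; the function and parameter names are assumptions:

```python
def evidence_fit(cause, confirmed_predictions: set[str], observed_refuters: set[str]) -> dict:
    """Tally a candidate's contact with the evidence. Confirmed predictions
    support it; observed refuters count against it; unchecked predictions are
    exactly the distinguishing evidence worth gathering."""
    return {
        "confirmed": [p for p in cause.predictions if p in confirmed_predictions],
        "unchecked": [p for p in cause.predictions if p not in confirmed_predictions],
        "refuted_by": [r for r in cause.refuters if r in observed_refuters],
    }
```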
Cluster causes across lenses. Some causes from different lenses are the same underlying mechanism viewed from different angles (e.g., "engineers ship half-finished features" seen through human-factors and incentive-structure may converge on the same root cause). Merge and preserve lens attribution.
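A sketch of the merge, keying on a normalized mechanism string. Real clustering is the judge's call, so this is illustrative only:

```python
def merge_candidates(causes: list) -> list:
    """Merge causes that describe the same underlying mechanism,
    preserving lens attribution."""
    merged = {}
    for c in causes:
        key = c.mechanism.strip().lower()
        if key in merged:
            merged[key].lens += f", {c.lens}"  # keep every lens that surfaced it
        else:
            merged[key] = c
    return list(merged.values())
```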
Resist compelling-narrative bias. Causes with clean stories are dangerous; they feel explanatory even when they don't fit the evidence. Weight evidence fit over story quality. When in doubt, flag "compelling story, weak fit" explicitly.
No fabricated percentages. Use qualitative categories with clear meaning:

- **strong fit** (explains the observations, predictions confirmed, no refuters observed)
- **moderate fit** (some evidential support, but distinguishing checks remain)
- **weak fit** (a compelling story at most; little evidence contact, or a partial refuter observed)
Honest uncertainty is valuable. "Cause A looks most likely but evidence is sparse; disambiguating observation X would shift the picture" is a better output than fake precision.
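One way to picture the mapping from evidence fit to a qualitative category; the thresholds here are illustrative assumptions, not the skill's actual rules:

```python
def fit_category(fit: dict) -> str:
    """Map an evidence-fit tally to a qualitative category; no percentages."""
    if fit["refuted_by"]:
        return "weak fit"      # an observed refuter outweighs confirmations
    if fit["confirmed"] and not fit["unchecked"]:
        return "strong fit"    # every prediction checked and confirmed
    if fit["confirmed"]:
        return "moderate fit"  # some support, distinguishing checks remain
    return "weak fit"          # a story with no evidence contact yet
```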
Final report format:
## Diagnosis Report
**Phenomenon:** [one-line summary]
**Lenses applied:** [list]
### Observations
[Concrete ground-truth observations, as elicited in step 2]
### Interpretations Held Aside
[Interpretations the user or others held, flagged as not-accepted-as-given
for the diagnosis. If any turned out to be correct, that's reported in
the leading candidate section; if not, they remain held aside.]
### Unavailable Evidence
[What's unknown or wasn't measured — constrains what can be concluded]
### Leading Candidate(s)
[1-2 causes with strong fit. For each:]
#### [cause name] — strong fit
**Mechanism:** [how this cause produces the observations]
**Evidence fit:**
- Observations explained: [which]
- Predictions confirmed: [which; or "would need to check X"]
- Refuters: [none observed / note any partial refuters]
**Plausibility:** [brief domain-knowledge assessment]
**Lens(es):** [which diagnostician(s) surfaced this]
### Other Candidates
[Moderate-fit and weak-fit causes, briefly. Include lens attribution.]
### Distinguishing Evidence
[Concrete observations the user could gather to distinguish between the
leading candidates. This is the most actionable part of the report —
not "more data" in the abstract, but specific tests.]
For example:
- To distinguish cause A from cause B: check whether [specific observable].
If [X], cause A. If [Y], cause B.
### What Remains Unknown
[Questions the diagnosis raised but cannot answer with current evidence.
May include "original interpretation X still possible but no stronger
support than alternatives."]
### Recommendation
One of:
- **Act on leading candidate** — evidence fit is strong; proceed with remediation of [cause]
- **Gather distinguishing evidence first** — leading candidates tie; collect [specific observations] to converge
- **The phenomenon may not be real** — measurement-artifact lens produced a strong candidate; verify the observation before diagnosing further
- **Insufficient evidence for confident diagnosis** — current observations don't distinguish among plausible causes; decision needed on whether to gather more data or accept uncertainty
### Suggested Next Steps
- To remediate the leading cause: `/think-brainstorm` for interventions (or `/bug-fix` if the cause is in code)
- To gather distinguishing evidence: [specific checks listed above]
- To pressure-test the leading cause before acting: `/think-scrutinize`
This skill is one-shot. If the user gathers distinguishing evidence and wants an updated diagnosis, they re-invoke with the new evidence in hand. Each invocation is a clean diagnostic consultation.
Good fit:

- The phenomenon isn't code-specific, so `/bug-fix` doesn't apply

Poor fit:

- The need is ideas for what to do rather than an explanation of why: `/think-brainstorm`
- The phenomenon is a code bug to investigate and fix: `/bug-fix` (artifact output, execution tooling)
- The options are already stated and need choosing among: `/think-deliberate`
- A conclusion or plan needs pressure-testing: `/think-scrutinize`

Rule of thumb: "why is this happening?" calls for `/think-diagnose`; "there's a bug in the code" calls for `/bug-fix`; "what should we do about it?" calls for `/think-brainstorm`.

`/think-diagnose` is a hybrid generative + evaluative skill, unlike the purely divergent `/think-brainstorm` and `/think-reframe` or the selective `/think-deliberate` and `/think-scrutinize`. The orchestrator generates candidate causes (divergent), then evaluates them against evidence (evaluative).
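As a sketch of that generate-then-evaluate shape, here is how the earlier sketches compose; every name is an assumption from this document, not the skill's real interface:

```python
FIT_RANK = {"strong fit": 0, "moderate fit": 1, "weak fit": 2}

def think_diagnose(brief, palette, features, diagnose, confirmed, refuted):
    """Generate candidate causes (divergent), then rank by evidence fit
    (evaluative). Illustrative composition of the sketches above."""
    lenses = select_lenses(palette, features)
    raw = run_diagnosticians(brief, lenses, diagnose)    # generate
    causes = merge_candidates(raw)                       # cluster across lenses
    ranked = [(c, fit_category(evidence_fit(c, confirmed, refuted)))
              for c in causes]
    ranked.sort(key=lambda pair: FIT_RANK[pair[1]])      # evaluate
    return ranked
```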
Natural downstreams:
- `/think-brainstorm` for remediation interventions
- `/bug-fix` for targeted investigation and fix, when the cause is in code
- `/think-diagnose` itself, re-invoked once distinguishing evidence is in hand
- `/think-scrutinize` to stress-test the intervention plan

Natural pipeline (for non-code phenomena):
`/think-diagnose` (why?) → `/think-brainstorm` (what to do?) → `/think-deliberate` (which approach?) → `/think-scrutinize` (what's wrong?)
Diagnosis is hard because compelling narratives beat correct ones. Humans prefer causes that tell a good story — they feel explanatory. Good abductive reasoning resists this: the most likely cause is the one that best fits the evidence, not the one that makes the cleanest story.
The enforced observation-vs-interpretation split is the skill's most important contribution. Most bad diagnoses start by accepting an interpretation as if it were an observation — "the drop happened because of the migration" sneaks the causal claim into the description of what happened. Once that interpretation is in the evidence bucket, no diagnostician will challenge it, and the diagnosis inherits the error.
Honest uncertainty is the other key discipline. "I don't know for sure" is often the correct output when evidence is thin — and it's far more useful than a confident-sounding but brittle conclusion. Users can act on acknowledged uncertainty (by gathering more evidence); they can't protect themselves from false confidence.