From agents
Deep multi-source research with confidence scoring. Auto-classifies complexity. Use for technical investigation, fact-checking. NOT for code review or simple Q&A.
`npx claudepluginhub wyattowalsh/agents --plugin agents`

This skill uses the workspace's default tool permissions.
General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).
Skill contents:

- `evals/test-cases.yaml`
- `references/bias-detection.md`
- `references/confidence-rubric.md`
- `references/contradiction-protocol.md`
- `references/dashboard-schema.md`
- `references/evidence-chain.md`
- `references/output-formats.md`
- `references/self-verification.md`
- `references/session-commands.md`
- `references/source-selection.md`
- `references/team-templates.md`
- `scripts/finding-formatter.py`
- `scripts/journal-store.py`
- `scripts/research-scanner.py`
- `scripts/source-deduplicator.py`
- `scripts/verify.py`
- `templates/dashboard.html`
| Term | Definition |
|---|---|
| query | The user's research question or topic; the unit of investigation |
| claim | A discrete assertion to be verified; extracted from sources or user input |
| source | A specific origin of information: URL, document, database record, or API response |
| evidence | A source-backed datum supporting or contradicting a claim; always has provenance |
| provenance | The chain from evidence to source: tool used, URL, access timestamp, excerpt |
| confidence | Score 0.0-1.0 per claim; based on evidence strength and cross-validation |
| cross-validation | Verifying a claim across 2+ independent sources; the core anti-hallucination mechanism |
| triangulation | Confirming a finding using 3+ methodologically diverse sources |
| contradiction | When two credible sources assert incompatible claims; must be surfaced explicitly |
| synthesis | The final research product: not a summary but a novel integration of evidence with analysis |
| journal | The saved markdown record of a research session, stored in `~/.{gemini\|copilot\|codex\|claude}/research/` |
| sweep | Wave 1: broad parallel search across multiple tools and sources |
| deep dive | Wave 2: targeted follow-up on specific leads from the sweep |
| lead | A promising source or thread identified during the sweep, warranting deeper investigation |
| tier | Complexity classification: Quick (0-2), Standard (3-5), Deep (6-8), Exhaustive (9-10) |
| finding | A verified claim with evidence chain, confidence score, and provenance; the atomic unit of output |
| gap | An identified area where evidence is insufficient, contradictory, or absent |
| bias marker | An explicit flag on a finding indicating potential bias (recency, authority, LLM prior, etc.) |
| degraded mode | Operation when research tools are unavailable; confidence ceilings applied |
| capability | A research ability such as web search, docs lookup, extraction, or subagent delegation; tool names are preferred implementations, not guarantees |
| $ARGUMENTS | Action |
|---|---|
| Question or topic text (has verb or `?`) | Investigate — classify complexity, execute wave pipeline |
| Vague input (<5 words, no verb, no `?`) | Intake — ask 2-3 clarifying questions, then classify |
| `check <claim>` or `verify <claim>` | Fact-check — verify claim against 3+ search engines |
| `compare <A> vs <B> [vs <C>...]` | Compare — structured comparison with decision matrix output |
| `survey <field or topic>` | Survey — landscape mapping, annotated bibliography |
| `track <topic>` | Track — load prior journal, search for updates since last session |
| `resume [number or keyword]` | Resume — resume a saved research session |
| `list [active\|domain\|tier]` | List — show journal metadata table |
| `archive` | Archive — move journals older than 90 days |
| `delete <N>` | Delete — delete journal N with confirmation |
| `export [N]` | Export — render HTML dashboard for journal N (default: current) |
| Empty | Gallery — show topic examples + "ask me anything" prompt |
If no mode keyword matches:
- `?` or starts with a question word (who/what/when/where/why/how/is/are/can/does/should/will) → Investigate
- `vs`, `versus`, `compared to`, or `between` noun phrases → Compare

Present research examples spanning domains:
| # | Domain | Example | Likely Tier |
|---|---|---|---|
| 1 | Technology | "What are the current best practices for LLM agent architectures?" | Deep |
| 2 | Academic | "What is the state of evidence on intermittent fasting for longevity?" | Standard |
| 3 | Market | "How does the competitive landscape for vector databases compare?" | Deep |
| 4 | Fact-check | "Is it true that 90% of startups fail within the first year?" | Standard |
| 5 | Architecture | "When should you choose event sourcing over CRUD?" | Standard |
| 6 | Trends | "What emerging programming languages gained traction in 2025-2026?" | Standard |
Pick a number, paste your own question, or type `guide me`.
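The dispatch rules above can be sketched in Python. This is illustrative only: `classify`, the mode names, and the word-count stand-in for the "has verb" test are assumptions, not part of the skill's scripts.

```python
import re

MODES = ("check", "verify", "compare", "survey", "track", "resume",
         "list", "archive", "delete", "export")
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how",
                  "is", "are", "can", "does", "should", "will")

def classify(arguments: str) -> str:
    """Map raw $ARGUMENTS text to a dispatch mode (sketch)."""
    text = arguments.strip()
    if not text:
        return "gallery"                      # empty input → example gallery
    first = text.split()[0].lower()
    if first in MODES:
        return "verify" if first == "check" else first
    # Question-shaped input → Investigate
    if text.endswith("?") or first in QUESTION_WORDS:
        return "investigate"
    # Comparison markers → Compare
    if re.search(r"\b(vs\.?|versus|compared to|between)\b", text.lower()):
        return "compare"
    # Vague input: short, no mode keyword, no question shape → Intake
    if len(text.split()) < 5:
        return "intake"
    return "investigate"
```

Note the mode-keyword check runs first, so `compare A vs B` dispatches on the keyword rather than the `vs` marker.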
Before starting research, check if another skill is a better fit:
| Signal | Redirect |
|---|---|
| Code review, PR review, diff analysis | Suggest /honest-review |
| Strategic decision with adversaries, game theory | Suggest /wargame |
| Multi-perspective expert debate | Suggest /host-panel |
| Prompt optimization, model-specific prompting | Suggest /prompt-engineer |
If the user confirms they want general research, proceed.
Score the query on 5 dimensions (0-2 each, total 0-10):
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Scope breadth | Single fact/definition | Multi-faceted, 2-3 domains | Cross-disciplinary, 4+ domains |
| Source difficulty | Top search results suffice | Specialized databases or multiple source types | Paywalled, fragmented, or conflicting sources |
| Temporal sensitivity | Stable/historical | Evolving field (months matter) | Fast-moving (days/weeks matter), active controversy |
| Verification complexity | Easily verifiable (official docs) | 2-3 independent sources needed | Contested claims, expert disagreement, no consensus |
| Synthesis demand | Answer is a fact or list | Compare/contrast viewpoints | Novel integration of conflicting threads |
| Total | Tier | Strategy |
|---|---|---|
| 0-2 | Quick | Inline, 1-2 searches, fire-and-forget |
| 3-5 | Standard | Subagent wave, 3-5 parallel searchers, report delivered |
| 6-8 | Deep | Agent team (TeamCreate), 3-5 teammates, interactive session |
| 9-10 | Exhaustive | Agent team, 4-6 teammates + nested subagent waves, interactive |
Present the scoring to the user. The user can override the tier with `--depth <tier>`.
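The scoring and tier mapping above can be expressed as a small function. A sketch: the dimension keys are invented labels for the five table rows, not names the skill defines.

```python
def classify_tier(scores: dict[str, int]) -> tuple[int, str]:
    """Sum five 0-2 dimension scores and map the 0-10 total to a tier."""
    dims = ("scope", "sources", "temporal", "verification", "synthesis")
    total = sum(scores[d] for d in dims)
    if total <= 2:
        tier = "Quick"
    elif total <= 5:
        tier = "Standard"
    elif total <= 8:
        tier = "Deep"
    else:
        tier = "Exhaustive"
    return total, tier
```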
Scale work by query complexity and available orchestration capabilities:
| Scope | Strategy | Delegation |
|---|---|---|
| Quick (0-2) | Inline answer after 1-2 searches | No subagents |
| Standard (3-5) | Parallel broad sweep across 2-5 sub-questions | Use available subagent primitive; otherwise batch sequentially |
| Deep (6-8) | Lead-driven team workflow with perspective expansion | Use team/subagent primitives when present; otherwise bounded serial waves |
| Exhaustive (9-10) | Deep workflow plus adversarial and nested waves | Use nested delegation when available; otherwise state degraded throughput explicitly |
Capability resolution: Treat named tools and orchestration APIs as preferred capabilities. Claude Code may use Task/TeamCreate; Codex may use dynamic subagents or parallel tool calls; other agents may use their native delegation or run the wave pipeline serially. If no delegation equivalent exists, use degraded orchestration: preserve wave order, reduce breadth, and report the limitation in methodology. Apply confidence ceilings only when source or retrieval capabilities are unavailable, per references/source-selection.md.
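Capability resolution can be sketched as a fallback chain. The capability names `team` and `subagent` here are placeholders for whatever primitives the host exposes (e.g. TeamCreate, Task, parallel tool calls); the function is illustrative, not part of the skill.

```python
def resolve_orchestration(available: set[str], tier: str) -> str:
    """Pick an orchestration strategy from available primitives (sketch)."""
    if tier in ("Deep", "Exhaustive") and "team" in available:
        return "agent-team"
    if "subagent" in available:
        return "subagent-wave"
    # No delegation primitive: preserve wave order, reduce breadth,
    # and report degraded orchestration in the methodology section.
    return "degraded-serial"
```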
All non-Quick research follows this 5-wave pipeline. Quick merges Waves 0+1+4 inline.
- Run `!uv run python skills/research/scripts/research-scanner.py "$ARGUMENTS"` for a deterministic pre-scan
- Read `references/source-selection.md` to select tools for the detected domain
- If mode is `track` or `resume`, load prior state

Scale by tier:
Quick (inline): 1-2 tool calls sequentially. No subagents.
Standard (subagent wave): Dispatch 3-5 parallel subagents with the platform's available delegation primitive:
Subagent A → brave-search + duckduckgo-search for sub-question 1
Subagent B → exa + g-search for sub-question 2
Subagent C → context7 / deepwiki / arxiv / semantic-scholar for technical specifics
Subagent D → wikipedia / wikidata for factual grounding
[Subagent E → PubMed / openalex if academic domain detected]
Deep (agent team): Create a research team with the platform's available team primitive:
Lead: triage (Wave 0), orchestrate, judge reconcile (Wave 3), synthesize (Wave 4)
|-- web-researcher: brave-search, duckduckgo-search, exa, g-search
|-- tech-researcher: context7, deepwiki, arxiv, semantic-scholar, package-version
|-- content-extractor: fetcher, trafilatura, docling, wikipedia, wayback
|-- [academic-researcher: arxiv, semantic-scholar, openalex, crossref, PubMed]
|-- [adversarial-reviewer: devil's advocate — counter-search all emerging findings]
Spawn academic-researcher if domain signals include academic/scientific. Spawn adversarial-reviewer for Exhaustive tier or if verification complexity >= 2.
Exhaustive: Deep team + each teammate runs nested subagent waves internally when supported; otherwise use serial batches and label the run "degraded orchestration."
Each subagent/teammate returns structured findings:
{
"sub_question": "...",
"findings": [{
"claim": "...",
"confidence": 0.6,
"evidence": [{"tool": "brave-search", "url": "https://...", "timestamp": "2026-04-24T12:00:00Z", "excerpt": "..."}],
"cross_validation": "unknown",
"bias_markers": [],
"gaps": []
}],
"leads": ["url1", "url2"],
"gaps": ["could not find data on X"]
}
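A minimal validator for this structure, including the single-source 0.60 confidence cap from the rubric, might look like this (a sketch; not part of `scripts/`, and the error strings are invented):

```python
def validate_finding(f: dict) -> list[str]:
    """Return a list of schema problems for one finding dict (sketch)."""
    problems = []
    if not f.get("claim"):
        problems.append("missing claim")
    c = f.get("confidence")
    if not isinstance(c, (int, float)) or not 0.0 <= c <= 1.0:
        problems.append("confidence must be in [0.0, 1.0]")
    evidence = f.get("evidence", [])
    if not evidence:
        problems.append("no evidence: provenance is mandatory")
    for i, ev in enumerate(evidence):
        for key in ("tool", "url", "timestamp", "excerpt"):
            if key not in ev:
                problems.append(f"evidence[{i}] missing {key}")
    # Single-source claims may not exceed the 0.60 hard cap
    if len(evidence) < 2 and isinstance(c, (int, float)) and c > 0.6:
        problems.append("single-source claims are capped at 0.60")
    return problems
```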
STORM-style perspective-guided conversation. Spawn 2-4 perspective subagents:
| Perspective | Focus | Question Style |
|---|---|---|
| Skeptic | What could be wrong? What's missing? | "What evidence would disprove this?" |
| Domain Expert | Technical depth, nuance, edge cases | "What do practitioners actually encounter?" |
| Practitioner | Real-world applicability, trade-offs | "What matters when you actually build this?" |
| Theorist | First principles, abstractions, frameworks | "What underlying model explains this?" |
Each perspective agent reviews Wave 1 findings and generates 2-3 additional sub-questions from their viewpoint. These sub-questions feed into Wave 2.
- `cascade-thinking` for multi-perspective analysis of complex findings
- `structured-thinking` for tracking evidence chains and contradictions
- `think-strategies` for complex question decomposition (Standard+ only)

The anti-hallucination wave. Read `references/confidence-rubric.md` and `references/self-verification.md`.
For every claim surviving Waves 1-2:
- `references/contradiction-protocol.md`: identify and classify disagreements
- `references/confidence-rubric.md`: assign confidence scores
- `references/bias-detection.md`: run the bias audit

Self-Verification (3+ findings survive): Spawn devil's advocate subagent per `references/self-verification.md`:
For each finding, attempt to disprove it. Search for counterarguments. Check if evidence is outdated. Verify claims actually follow from cited evidence. Flag LLM confabulations.
Adjust confidence: Survives +0.05, Weakened -0.10, Disproven set to 0.0. Adjustments are subject to hard caps — single-source claims remain capped at 0.60 even after survival adjustment.
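A sketch of that adjustment, with the hard cap re-applied after the survival bonus (the function name and `independent_sources` parameter are illustrative):

```python
def adjust_confidence(c: float, outcome: str, independent_sources: int) -> float:
    """Apply the devil's-advocate adjustment, then re-apply hard caps."""
    if outcome == "disproven":
        return 0.0
    c += 0.05 if outcome == "survives" else -0.10  # "weakened"
    if independent_sources < 2:
        c = min(c, 0.60)  # single-source cap wins over the survival bonus
    return max(0.0, min(c, 0.99))
```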
Produce the final research product. Read references/output-formats.md for templates.
The synthesis is NOT a summary. It must:
Output format adapts to mode:
`--format brief|deep|bib|matrix`

| Score | Basis |
|---|---|
| 0.9-1.0 | Official docs + 2 independent sources agree, no contradictions |
| 0.7-0.8 | 2+ independent sources agree, minor qualifications |
| 0.5-0.6 | Single authoritative source, or 2 sources with partial agreement |
| 0.3-0.4 | Single non-authoritative source, or conflicting evidence |
| 0.2-0.3 | Multiple non-authoritative sources with partial agreement, or single source with significant caveats |
| 0.1-0.2 | LLM reasoning only, no external evidence found |
| 0.0 | Actively contradicted by evidence |
Hard rules:
Merged confidence (for claims supported by multiple sources):
`c_merged = 1 - (1-c1)(1-c2)...(1-cN)`, capped at 0.99
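This noisy-OR merge is a one-liner (sketch):

```python
from functools import reduce

def merged_confidence(scores: list[float]) -> float:
    """Merge per-source confidences: 1 - prod(1 - c_i), capped at 0.99."""
    combined = 1.0 - reduce(lambda acc, c: acc * (1.0 - c), scores, 1.0)
    return min(combined, 0.99)
```

Two mediocre independent sources can thus outweigh one good one, which is exactly the cross-validation incentive the rubric encodes.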
Every finding carries this structure:
FINDING RR-{seq:03d}: [claim statement]
CONFIDENCE: [0.0-1.0]
EVIDENCE:
1. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
2. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
CROSS-VALIDATION: [agrees|contradicts|partial] across [N] independent sources
BIAS MARKERS: [none | list of detected biases with category]
GAPS: [none | what additional evidence would strengthen this finding]
Use `!uv run python skills/research/scripts/finding-formatter.py --format markdown` to normalize.
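For illustration, rendering one finding dict into the layout above could look like this sketch (the real pipeline should use `scripts/finding-formatter.py`; `render_finding` is an assumed name):

```python
def render_finding(seq: int, f: dict) -> str:
    """Render one finding dict in the evidence-chain layout (sketch)."""
    lines = [
        f"FINDING RR-{seq:03d}: {f['claim']}",
        f"CONFIDENCE: {f['confidence']:.2f}",
        "EVIDENCE:",
    ]
    for i, ev in enumerate(f["evidence"], 1):
        lines.append(f"  {i}. [{ev['tool']}] [{ev['url']}] "
                     f"[{ev['timestamp']}] — {ev['excerpt'][:100]}")
    n = len(f["evidence"])
    lines.append(f"CROSS-VALIDATION: {f.get('cross_validation', 'unknown')} "
                 f"across {n} independent sources")
    lines.append(f"BIAS MARKERS: {', '.join(f.get('bias_markers', [])) or 'none'}")
    lines.append(f"GAPS: {', '.join(f.get('gaps', [])) or 'none'}")
    return "\n".join(lines)
```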
Read references/source-selection.md during Wave 0 for the full tool-to-domain mapping. Summary:
| Domain Signal | Primary Tools | Secondary Tools |
|---|---|---|
| Library/API docs | llms.txt/llms-full.txt, context7, deepwiki, package-version | brave-search |
| Academic/scientific | arxiv, semantic-scholar, PubMed, openalex | crossref, brave-search |
| Current events/trends | brave-search, exa, duckduckgo-search, g-search | fetcher, trafilatura |
| GitHub repos/OSS | deepwiki, repomix | brave-search |
| General knowledge | wikipedia, wikidata, brave-search | fetcher |
| Historical content | wayback, brave-search | fetcher |
| Fact-checking | 3+ search engines mandatory | wikidata for structured claims |
| PDF/document analysis | docling | trafilatura |
Multi-engine protocol: For any claim requiring verification, use minimum 2 different search engines. Different engines have different indices and biases. Agreement across engines increases confidence.
Load only the next required reference:
- `references/source-selection.md` during Wave 0 only.
- `references/output-formats.md` or `references/dashboard-schema.md` only when producing final output or exports.

Check every finding against 10 bias categories. Read `references/bias-detection.md` for full detection signals and mitigation strategies.
| Bias | Detection Signal | Mitigation |
|---|---|---|
| LLM prior | Matches common training patterns, lacks fresh evidence | Flag; require fresh source confirmation |
| Recency | Overweighting recent results, ignoring historical context | Search for historical perspective |
| Authority | Uncritically accepting prestigious sources | Cross-validate even authoritative claims |
| Confirmation | Queries constructed to confirm initial hypothesis | Use neutral queries; search for counterarguments |
| Survivorship | Only finding successful examples | Search for failures/counterexamples |
| Selection | Search engine bubble, English-only | Use multiple engines; note coverage limitations |
| Anchoring | First source disproportionately shapes interpretation | Document first source separately; seek contrast |
- Active journals: `~/.{gemini|copilot|codex|claude}/research/`
- Archive: `~/.{gemini|copilot|codex|claude}/research/archive/`
- Filename: `{YYYY-MM-DD}-{domain}-{slug}.md`
- `{domain}`: tech, academic, market, policy, factcheck, compare, survey, track, general
- `{slug}`: 3-5 word semantic summary, kebab-case; repeat sessions get `-v2`, `-v3` suffixes
- State is persisted in `<!-- STATE -->` blocks

Save protocol:
- On completion, set `status: Complete`
- While active, keep `status: In Progress`; update after each wave, finalize after synthesis

Resume protocol:
- `resume` (no args): find `status: In Progress` journals. One → auto-resume. Multiple → show list.
- `resume N`: Nth journal from list output (reverse chronological).
- `resume keyword`: search frontmatter `query` and `domain_tags` for a match.

Use `!uv run python skills/research/scripts/journal-store.py` for all journal operations.
State snapshot (appended after each wave save):
<!-- STATE
wave_completed: 2
findings_count: 12
leads_pending: ["url1", "url2"]
gaps: ["topic X needs more sources"]
contradictions: 1
next_action: "Wave 3: cross-validate top 8 findings"
-->
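Parsing the last such snapshot back out of a journal can be sketched with a regex; for brevity this version keeps list-valued fields as raw strings rather than parsing them:

```python
import re

def parse_state(journal_text: str) -> dict:
    """Extract the last <!-- STATE --> block as key/value pairs (sketch)."""
    blocks = re.findall(r"<!-- STATE\n(.*?)\n-->", journal_text, re.DOTALL)
    if not blocks:
        return {}
    state = {}
    for line in blocks[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state
```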
Available during active research sessions:
| Command | Effect |
|---|---|
| `drill <finding #>` | Deep dive into a specific finding with more sources |
| `pivot <new angle>` | Redirect research to a new sub-question |
| `counter <finding #>` | Explicitly search for evidence against a finding |
| `export` | Render HTML dashboard |
| `status` | Show current research state without advancing |
| `sources` | List all sources consulted so far |
| `confidence` | Show confidence distribution across findings |
| `gaps` | List identified knowledge gaps |
| `?` | Show command menu |
Read references/session-commands.md for full protocols.
| File | Content | Read When |
|---|---|---|
| `references/source-selection.md` | Tool-to-domain mapping, multi-engine protocol, degraded mode | Wave 0 (selecting tools) |
| `references/confidence-rubric.md` | Scoring rubric, cross-validation rules, independence checks | Wave 3 (assigning confidence) |
| `references/evidence-chain.md` | Finding template, provenance format, citation standards | Any wave (structuring evidence) |
| `references/bias-detection.md` | 10 bias categories (7 core + 3 LLM-specific), detection signals, mitigation strategies | Wave 3 (bias audit) |
| `references/contradiction-protocol.md` | 4 contradiction types, resolution framework | Wave 3 (contradiction detection) |
| `references/self-verification.md` | Devil's advocate protocol, hallucination detection | Wave 3 (self-verification) |
| `references/output-formats.md` | Templates for all 5 output formats | Wave 4 (formatting output) |
| `references/team-templates.md` | Team archetypes, subagent prompts, perspective agents | Wave 0 (designing team) |
| `references/session-commands.md` | In-session command protocols | When user issues in-session command |
| `references/dashboard-schema.md` | JSON data contract for HTML dashboard | `export` command |
Loading rule: Load ONE reference at a time per the "Read When" column. Do not preload.
- Journals and verification state live under `~/.{gemini|copilot|codex|claude}/research/`
- `verify.py stop` confirms the skill did not leave tracked research-source files dirty
- Legacy `source_url`, `source_tool`, and `confidence_raw` fields must be converted into the canonical `evidence[]` + `confidence` shape
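That legacy-to-canonical conversion could be sketched as follows; the defaults for missing fields are assumptions, not specified behavior:

```python
def canonicalize(legacy: dict) -> dict:
    """Lift flat legacy fields into the evidence[] + confidence shape (sketch)."""
    return {
        "claim": legacy.get("claim", ""),
        "confidence": float(legacy["confidence_raw"]),
        "evidence": [{
            "tool": legacy["source_tool"],
            "url": legacy["source_url"],
            "timestamp": legacy.get("timestamp", "unknown"),  # assumed default
            "excerpt": legacy.get("excerpt", ""),             # assumed default
        }],
        "cross_validation": "unknown",
        "bias_markers": [],
        "gaps": [],
    }
```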