From agents
Deep multi-source research with confidence scoring. Auto-classifies complexity. Use for technical investigation, fact-checking. NOT for code review or simple Q&A.
`npx claudepluginhub wyattowalsh/agents --plugin agents`

This skill uses the workspace's default tool permissions.
General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).
Skill contents:

- `evals/test-cases.yaml`
- `references/bias-detection.md`
- `references/confidence-rubric.md`
- `references/contradiction-protocol.md`
- `references/dashboard-schema.md`
- `references/evidence-chain.md`
- `references/output-formats.md`
- `references/self-verification.md`
- `references/session-commands.md`
- `references/source-selection.md`
- `references/team-templates.md`
- `scripts/finding-formatter.py`
- `scripts/journal-store.py`
- `scripts/research-scanner.py`
- `scripts/source-deduplicator.py`
- `scripts/verify.py`
- `templates/dashboard.html`
| Term | Definition |
|---|---|
| query | The user's research question or topic; the unit of investigation |
| claim | A discrete assertion to be verified; extracted from sources or user input |
| source | A specific origin of information: URL, document, database record, or API response |
| evidence | A source-backed datum supporting or contradicting a claim; always has provenance |
| provenance | The chain from evidence to source: tool used, URL, access timestamp, excerpt |
| confidence | Score 0.0-1.0 per claim; based on evidence strength and cross-validation |
| cross-validation | Verifying a claim across 2+ independent sources; the core anti-hallucination mechanism |
| triangulation | Confirming a finding using 3+ methodologically diverse sources |
| contradiction | When two credible sources assert incompatible claims; must be surfaced explicitly |
| synthesis | The final research product: not a summary but a novel integration of evidence with analysis |
| journal | The saved markdown record of a research session, stored in `~/.{gemini\|copilot\|codex\|claude}/research/` |
| sweep | Wave 1: broad parallel search across multiple tools and sources |
| deep dive | Wave 2: targeted follow-up on specific leads from the sweep |
| lead | A promising source or thread identified during the sweep, warranting deeper investigation |
| tier | Complexity classification: Quick (0-2), Standard (3-5), Deep (6-8), Exhaustive (9-10) |
| finding | A verified claim with evidence chain, confidence score, and provenance; the atomic unit of output |
| gap | An identified area where evidence is insufficient, contradictory, or absent |
| bias marker | An explicit flag on a finding indicating potential bias (recency, authority, LLM prior, etc.) |
| degraded mode | Operation when research tools are unavailable; confidence ceilings applied |
| capability | A research ability such as web search, docs lookup, extraction, or subagent delegation; tool names are preferred implementations, not guarantees |
| $ARGUMENTS | Action |
|---|---|
| Question or topic text (has verb or `?`) | Investigate — classify complexity, execute wave pipeline |
| Vague input (<5 words, no verb, no `?`) | Intake — ask 2-3 clarifying questions, then classify |
| `check <claim>` or `verify <claim>` | Fact-check — verify claim against 3+ search engines |
| `compare <A> vs <B> [vs <C>...]` | Compare — structured comparison with decision matrix output |
| `survey <field or topic>` | Survey — landscape mapping, annotated bibliography |
| `track <topic>` | Track — load prior journal, search for updates since last session |
| `resume [number or keyword]` | Resume — resume a saved research session |
| `list [active\|domain\|tier]` | List — show journal metadata table |
| `archive` | Archive — move journals older than 90 days |
| `delete <N>` | Delete — delete journal N with confirmation |
| `export [N]` | Export — render HTML dashboard for journal N (default: current) |
| Empty | Gallery — show topic examples + "ask me anything" prompt |
If no mode keyword matches:
- `?` or starts with a question word (who/what/when/where/why/how/is/are/can/does/should/will) → Investigate
- `vs`, `versus`, `compared to`, or `between` noun phrases → Compare

Present research examples spanning domains:
| # | Domain | Example | Likely Tier |
|---|---|---|---|
| 1 | Technology | "What are the current best practices for LLM agent architectures?" | Deep |
| 2 | Academic | "What is the state of evidence on intermittent fasting for longevity?" | Standard |
| 3 | Market | "How does the competitive landscape for vector databases compare?" | Deep |
| 4 | Fact-check | "Is it true that 90% of startups fail within the first year?" | Standard |
| 5 | Architecture | "When should you choose event sourcing over CRUD?" | Standard |
| 6 | Trends | "What emerging programming languages gained traction in 2025-2026?" | Standard |
Pick a number, paste your own question, or type `guide me`.
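The dispatch rules above can be sketched in Python. This is illustrative only: `classify`, the mode names, and the word-count stand-in for the "has verb" test are assumptions, not part of the skill's scripts.

```python
import re

MODES = ("check", "verify", "compare", "survey", "track", "resume",
         "list", "archive", "delete", "export")
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how",
                  "is", "are", "can", "does", "should", "will")

def classify(arguments: str) -> str:
    """Map raw $ARGUMENTS text to a dispatch mode (sketch)."""
    text = arguments.strip()
    if not text:
        return "gallery"                      # empty input → example gallery
    first = text.split()[0].lower()
    if first in MODES:
        return "verify" if first == "check" else first
    # Question-shaped input → Investigate
    if text.endswith("?") or first in QUESTION_WORDS:
        return "investigate"
    # Comparison markers → Compare
    if re.search(r"\b(vs\.?|versus|compared to|between)\b", text.lower()):
        return "compare"
    # Vague input: short, no mode keyword, no question shape → Intake
    if len(text.split()) < 5:
        return "intake"
    return "investigate"
```

Note the mode-keyword check runs first, so `compare A vs B` dispatches on the keyword rather than the `vs` marker.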
Before starting research, check if another skill is a better fit:
| Signal | Redirect |
|---|---|
| Code review, PR review, diff analysis | Suggest /honest-review |
| Strategic decision with adversaries, game theory | Suggest /wargame |
| Multi-perspective expert debate | Suggest /host-panel |
| Prompt optimization, model-specific prompting | Suggest /prompt-engineer |
If the user confirms they want general research, proceed.
Score the query on 5 dimensions (0-2 each, total 0-10):
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Scope breadth | Single fact/definition | Multi-faceted, 2-3 domains | Cross-disciplinary, 4+ domains |
| Source difficulty | Top search results suffice | Specialized databases or multiple source types | Paywalled, fragmented, or conflicting sources |
| Temporal sensitivity | Stable/historical | Evolving field (months matter) | Fast-moving (days/weeks matter), active controversy |
| Verification complexity | Easily verifiable (official docs) | 2-3 independent sources needed | Contested claims, expert disagreement, no consensus |
| Synthesis demand | Answer is a fact or list | Compare/contrast viewpoints | Novel integration of conflicting threads |
| Total | Tier | Strategy |
|---|---|---|
| 0-2 | Quick | Inline, 1-2 searches, fire-and-forget |
| 3-5 | Standard | Subagent wave, 3-5 parallel searchers, report delivered |
| 6-8 | Deep | Agent team (TeamCreate), 3-5 teammates, interactive session |
| 9-10 | Exhaustive | Agent team, 4-6 teammates + nested subagent waves, interactive |
Present the scoring to the user. The user can override the tier with `--depth <tier>`.
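The scoring and tier mapping above can be expressed as a small function. A sketch: the dimension keys are invented labels for the five table rows, not names the skill defines.

```python
def classify_tier(scores: dict[str, int]) -> tuple[int, str]:
    """Sum five 0-2 dimension scores and map the 0-10 total to a tier."""
    dims = ("scope", "sources", "temporal", "verification", "synthesis")
    total = sum(scores[d] for d in dims)
    if total <= 2:
        tier = "Quick"
    elif total <= 5:
        tier = "Standard"
    elif total <= 8:
        tier = "Deep"
    else:
        tier = "Exhaustive"
    return total, tier
```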
Scale work by query complexity and available orchestration capabilities:
| Scope | Strategy | Delegation |
|---|---|---|
| Quick (0-2) | Inline answer after 1-2 searches | No subagents |
| Standard (3-5) | Parallel broad sweep across 2-5 sub-questions | Use available subagent primitive; otherwise batch sequentially |
| Deep (6-8) | Lead-driven team workflow with perspective expansion | Use team/subagent primitives when present; otherwise bounded serial waves |
| Exhaustive (9-10) | Deep workflow plus adversarial and nested waves | Use nested delegation when available; otherwise state degraded throughput explicitly |
Capability resolution: Treat named tools and orchestration APIs as preferred capabilities. Claude Code may use Task/TeamCreate; Codex may use dynamic subagents or parallel tool calls; other agents may use their native delegation or run the wave pipeline serially. If no delegation equivalent exists, use degraded orchestration: preserve wave order, reduce breadth, and report the limitation in methodology. Apply confidence ceilings only when source or retrieval capabilities are unavailable, per references/source-selection.md.
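Capability resolution can be sketched as a fallback chain. The capability names `team` and `subagent` here are placeholders for whatever primitives the host exposes (e.g. TeamCreate, Task, parallel tool calls); the function is illustrative, not part of the skill.

```python
def resolve_orchestration(available: set[str], tier: str) -> str:
    """Pick an orchestration strategy from available primitives (sketch)."""
    if tier in ("Deep", "Exhaustive") and "team" in available:
        return "agent-team"
    if "subagent" in available:
        return "subagent-wave"
    # No delegation primitive: preserve wave order, reduce breadth,
    # and report degraded orchestration in the methodology section.
    return "degraded-serial"
```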
All non-Quick research follows this 5-wave pipeline. Quick merges Waves 0+1+4 inline.
- Run `!uv run python skills/research/scripts/research-scanner.py "$ARGUMENTS"` for a deterministic pre-scan
- Read `references/source-selection.md` to select tools for the detected domain
- If mode is `track` or `resume`, load prior state

Scale by tier:
Quick (inline): 1-2 tool calls sequentially. No subagents.
Standard (subagent wave): Dispatch 3-5 parallel subagents with the platform's available delegation primitive:
Subagent A → brave-search + duckduckgo-search for sub-question 1
Subagent B → exa + g-search for sub-question 2
Subagent C → context7 / deepwiki / arxiv / semantic-scholar for technical specifics
Subagent D → wikipedia / wikidata for factual grounding
[Subagent E → PubMed / openalex if academic domain detected]
Deep (agent team): Create a research team with the platform's available team primitive:
Lead: triage (Wave 0), orchestrate, judge reconcile (Wave 3), synthesize (Wave 4)
|-- web-researcher: brave-search, duckduckgo-search, exa, g-search
|-- tech-researcher: context7, deepwiki, arxiv, semantic-scholar, package-version
|-- content-extractor: fetcher, trafilatura, docling, wikipedia, wayback
|-- [academic-researcher: arxiv, semantic-scholar, openalex, crossref, PubMed]
|-- [adversarial-reviewer: devil's advocate — counter-search all emerging findings]
Spawn academic-researcher if domain signals include academic/scientific. Spawn adversarial-reviewer for Exhaustive tier or if verification complexity >= 2.
Exhaustive: Deep team + each teammate runs nested subagent waves internally when supported; otherwise use serial batches and label the run "degraded orchestration."
Each subagent/teammate returns structured findings:
{
"sub_question": "...",
"findings": [{
"claim": "...",
"confidence": 0.6,
"evidence": [{"tool": "brave-search", "url": "https://...", "timestamp": "2026-04-24T12:00:00Z", "excerpt": "..."}],
"cross_validation": "unknown",
"bias_markers": [],
"gaps": []
}],
"leads": ["url1", "url2"],
"gaps": ["could not find data on X"]
}
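A minimal validator for this structure, including the single-source 0.60 confidence cap from the rubric, might look like this (a sketch; not part of `scripts/`, and the error strings are invented):

```python
def validate_finding(f: dict) -> list[str]:
    """Return a list of schema problems for one finding dict (sketch)."""
    problems = []
    if not f.get("claim"):
        problems.append("missing claim")
    c = f.get("confidence")
    if not isinstance(c, (int, float)) or not 0.0 <= c <= 1.0:
        problems.append("confidence must be in [0.0, 1.0]")
    evidence = f.get("evidence", [])
    if not evidence:
        problems.append("no evidence: provenance is mandatory")
    for i, ev in enumerate(evidence):
        for key in ("tool", "url", "timestamp", "excerpt"):
            if key not in ev:
                problems.append(f"evidence[{i}] missing {key}")
    # Single-source claims may not exceed the 0.60 hard cap
    if len(evidence) < 2 and isinstance(c, (int, float)) and c > 0.6:
        problems.append("single-source claims are capped at 0.60")
    return problems
```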
STORM-style perspective-guided conversation. Spawn 2-4 perspective subagents:
| Perspective | Focus | Question Style |
|---|---|---|
| Skeptic | What could be wrong? What's missing? | "What evidence would disprove this?" |
| Domain Expert | Technical depth, nuance, edge cases | "What do practitioners actually encounter?" |
| Practitioner | Real-world applicability, trade-offs | "What matters when you actually build this?" |
| Theorist | First principles, abstractions, frameworks | "What underlying model explains this?" |
Each perspective agent reviews Wave 1 findings and generates 2-3 additional sub-questions from their viewpoint. These sub-questions feed into Wave 2.
- `cascade-thinking` for multi-perspective analysis of complex findings
- `structured-thinking` for tracking evidence chains and contradictions
- `think-strategies` for complex question decomposition (Standard+ only)

The anti-hallucination wave. Read `references/confidence-rubric.md` and `references/self-verification.md`.
For every claim surviving Waves 1-2:
- `references/contradiction-protocol.md`: identify and classify disagreements
- `references/confidence-rubric.md`: assign confidence scores
- `references/bias-detection.md`: run the bias audit

Self-Verification (3+ findings survive): Spawn devil's advocate subagent per `references/self-verification.md`:
For each finding, attempt to disprove it. Search for counterarguments. Check if evidence is outdated. Verify claims actually follow from cited evidence. Flag LLM confabulations.
Adjust confidence: Survives +0.05, Weakened -0.10, Disproven set to 0.0. Adjustments are subject to hard caps — single-source claims remain capped at 0.60 even after survival adjustment.
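A sketch of that adjustment, with the hard cap re-applied after the survival bonus (the function name and `independent_sources` parameter are illustrative):

```python
def adjust_confidence(c: float, outcome: str, independent_sources: int) -> float:
    """Apply the devil's-advocate adjustment, then re-apply hard caps."""
    if outcome == "disproven":
        return 0.0
    c += 0.05 if outcome == "survives" else -0.10  # "weakened"
    if independent_sources < 2:
        c = min(c, 0.60)  # single-source cap wins over the survival bonus
    return max(0.0, min(c, 0.99))
```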
Produce the final research product. Read references/output-formats.md for templates.
The synthesis is NOT a summary. It must:
Output format adapts to mode:
`--format brief|deep|bib|matrix`

| Score | Basis |
|---|---|
| 0.9-1.0 | Official docs + 2 independent sources agree, no contradictions |
| 0.7-0.8 | 2+ independent sources agree, minor qualifications |
| 0.5-0.6 | Single authoritative source, or 2 sources with partial agreement |
| 0.3-0.4 | Single non-authoritative source, or conflicting evidence |
| 0.2-0.3 | Multiple non-authoritative sources with partial agreement, or single source with significant caveats |
| 0.1-0.2 | LLM reasoning only, no external evidence found |
| 0.0 | Actively contradicted by evidence |
Hard rules:
Merged confidence (for claims supported by multiple sources):
`c_merged = 1 - (1-c1)(1-c2)...(1-cN)`, capped at 0.99
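This noisy-OR merge is a one-liner (sketch):

```python
from functools import reduce

def merged_confidence(scores: list[float]) -> float:
    """Merge per-source confidences: 1 - prod(1 - c_i), capped at 0.99."""
    combined = 1.0 - reduce(lambda acc, c: acc * (1.0 - c), scores, 1.0)
    return min(combined, 0.99)
```

Two mediocre independent sources can thus outweigh one good one, which is exactly the cross-validation incentive the rubric encodes.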
Every finding carries this structure:
FINDING RR-{seq:03d}: [claim statement]
CONFIDENCE: [0.0-1.0]
EVIDENCE:
1. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
2. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
CROSS-VALIDATION: [agrees|contradicts|partial] across [N] independent sources
BIAS MARKERS: [none | list of detected biases with category]
GAPS: [none | what additional evidence would strengthen this finding]
Use `!uv run python skills/research/scripts/finding-formatter.py --format markdown` to normalize.
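For illustration, rendering one finding dict into the layout above could look like this sketch (the real pipeline should use `scripts/finding-formatter.py`; `render_finding` is an assumed name):

```python
def render_finding(seq: int, f: dict) -> str:
    """Render one finding dict in the evidence-chain layout (sketch)."""
    lines = [
        f"FINDING RR-{seq:03d}: {f['claim']}",
        f"CONFIDENCE: {f['confidence']:.2f}",
        "EVIDENCE:",
    ]
    for i, ev in enumerate(f["evidence"], 1):
        lines.append(f"  {i}. [{ev['tool']}] [{ev['url']}] "
                     f"[{ev['timestamp']}] — {ev['excerpt'][:100]}")
    n = len(f["evidence"])
    lines.append(f"CROSS-VALIDATION: {f.get('cross_validation', 'unknown')} "
                 f"across {n} independent sources")
    lines.append(f"BIAS MARKERS: {', '.join(f.get('bias_markers', [])) or 'none'}")
    lines.append(f"GAPS: {', '.join(f.get('gaps', [])) or 'none'}")
    return "\n".join(lines)
```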
Read references/source-selection.md during Wave 0 for the full tool-to-domain mapping. Summary:
| Domain Signal | Primary Tools | Secondary Tools |
|---|---|---|
| Library/API docs | llms.txt/llms-full.txt, context7, deepwiki, package-version | brave-search |
| Academic/scientific | arxiv, semantic-scholar, PubMed, openalex | crossref, brave-search |
| Current events/trends | brave-search, exa, duckduckgo-search, g-search | fetcher, trafilatura |
| GitHub repos/OSS | deepwiki, repomix | brave-search |
| General knowledge | wikipedia, wikidata, brave-search | fetcher |
| Historical content | wayback, brave-search | fetcher |
| Fact-checking | 3+ search engines mandatory | wikidata for structured claims |
| PDF/document analysis | docling | trafilatura |
Multi-engine protocol: For any claim requiring verification, use minimum 2 different search engines. Different engines have different indices and biases. Agreement across engines increases confidence.
Load only the next required reference:
- `references/source-selection.md` during Wave 0 only.
- `references/output-formats.md` or `references/dashboard-schema.md` only when producing final output or exports.

Check every finding against 10 bias categories. Read `references/bias-detection.md` for full detection signals and mitigation strategies.
| Bias | Detection Signal | Mitigation |
|---|---|---|
| LLM prior | Matches common training patterns, lacks fresh evidence | Flag; require fresh source confirmation |
| Recency | Overweighting recent results, ignoring historical context | Search for historical perspective |
| Authority | Uncritically accepting prestigious sources | Cross-validate even authoritative claims |
| Confirmation | Queries constructed to confirm initial hypothesis | Use neutral queries; search for counterarguments |
| Survivorship | Only finding successful examples | Search for failures/counterexamples |
| Selection | Search engine bubble, English-only | Use multiple engines; note coverage limitations |
| Anchoring | First source disproportionately shapes interpretation | Document first source separately; seek contrast |
- Active journals: `~/.{gemini|copilot|codex|claude}/research/`
- Archive: `~/.{gemini|copilot|codex|claude}/research/archive/`
- Filename: `{YYYY-MM-DD}-{domain}-{slug}.md`
- `{domain}`: tech, academic, market, policy, factcheck, compare, survey, track, general
- `{slug}`: 3-5 word semantic summary, kebab-case; repeat sessions get `-v2`, `-v3` suffixes
- State is persisted in `<!-- STATE -->` blocks

Save protocol:
- On completion, set `status: Complete`
- While active, keep `status: In Progress`; update after each wave, finalize after synthesis

Resume protocol:
- `resume` (no args): find `status: In Progress` journals. One → auto-resume. Multiple → show list.
- `resume N`: Nth journal from list output (reverse chronological).
- `resume keyword`: search frontmatter `query` and `domain_tags` for a match.

Use `!uv run python skills/research/scripts/journal-store.py` for all journal operations.
State snapshot (appended after each wave save):
<!-- STATE
wave_completed: 2
findings_count: 12
leads_pending: ["url1", "url2"]
gaps: ["topic X needs more sources"]
contradictions: 1
next_action: "Wave 3: cross-validate top 8 findings"
-->
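Parsing the last such snapshot back out of a journal can be sketched with a regex; for brevity this version keeps list-valued fields as raw strings rather than parsing them:

```python
import re

def parse_state(journal_text: str) -> dict:
    """Extract the last <!-- STATE --> block as key/value pairs (sketch)."""
    blocks = re.findall(r"<!-- STATE\n(.*?)\n-->", journal_text, re.DOTALL)
    if not blocks:
        return {}
    state = {}
    for line in blocks[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state
```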
Available during active research sessions:
| Command | Effect |
|---|---|
| `drill <finding #>` | Deep dive into a specific finding with more sources |
| `pivot <new angle>` | Redirect research to a new sub-question |
| `counter <finding #>` | Explicitly search for evidence against a finding |
| `export` | Render HTML dashboard |
| `status` | Show current research state without advancing |
| `sources` | List all sources consulted so far |
| `confidence` | Show confidence distribution across findings |
| `gaps` | List identified knowledge gaps |
| `?` | Show command menu |
Read references/session-commands.md for full protocols.
| File | Content | Read When |
|---|---|---|
| `references/source-selection.md` | Tool-to-domain mapping, multi-engine protocol, degraded mode | Wave 0 (selecting tools) |
| `references/confidence-rubric.md` | Scoring rubric, cross-validation rules, independence checks | Wave 3 (assigning confidence) |
| `references/evidence-chain.md` | Finding template, provenance format, citation standards | Any wave (structuring evidence) |
| `references/bias-detection.md` | 10 bias categories (7 core + 3 LLM-specific), detection signals, mitigation strategies | Wave 3 (bias audit) |
| `references/contradiction-protocol.md` | 4 contradiction types, resolution framework | Wave 3 (contradiction detection) |
| `references/self-verification.md` | Devil's advocate protocol, hallucination detection | Wave 3 (self-verification) |
| `references/output-formats.md` | Templates for all 5 output formats | Wave 4 (formatting output) |
| `references/team-templates.md` | Team archetypes, subagent prompts, perspective agents | Wave 0 (designing team) |
| `references/session-commands.md` | In-session command protocols | When user issues in-session command |
| `references/dashboard-schema.md` | JSON data contract for HTML dashboard | `export` command |
Loading rule: Load ONE reference at a time per the "Read When" column. Do not preload.
- Journals and verification state live under `~/.{gemini|copilot|codex|claude}/research/`
- `verify.py stop` confirms the skill did not leave tracked research-source files dirty
- Legacy `source_url`, `source_tool`, and `confidence_raw` fields must be converted into the canonical `evidence[]` + `confidence` shape
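That legacy-to-canonical conversion could be sketched as follows; the defaults for missing fields are assumptions, not specified behavior:

```python
def canonicalize(legacy: dict) -> dict:
    """Lift flat legacy fields into the evidence[] + confidence shape (sketch)."""
    return {
        "claim": legacy.get("claim", ""),
        "confidence": float(legacy["confidence_raw"]),
        "evidence": [{
            "tool": legacy["source_tool"],
            "url": legacy["source_url"],
            "timestamp": legacy.get("timestamp", "unknown"),  # assumed default
            "excerpt": legacy.get("excerpt", ""),             # assumed default
        }],
        "cross_validation": "unknown",
        "bias_markers": [],
        "gaps": [],
    }
```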