Generates novel, testable research hypotheses grounded in real literature using a multi-agent loop with generation, literature grounding, and scoring. For exploratory research questions.
Slash command: /agent-loops:hypothesis-gen
A multi-agent, literature-grounded generation loop. The artifact is a growing pool of hypotheses; the feedback signal is the count of strong, distinct hypotheses that clear the bar — where "strong" is decided against real retrieved literature, not assertion. Each round: generate → ground → judge → keep → mutate toward the gaps, until the pool stops growing (saturation).
The discipline: a hypothesis enters the pool only if the literature says it is not already established (novelty), prior work makes it plausible (grounding), and a feasible test exists. Generating is not confirming — the output is a ranked set of strong candidates to test, each stated with how to test it.
The cast (all in roles/):
- roles/Generator.md: proposes a batch of candidate hypotheses aimed at the open gaps.
- roles/LiteratureScout.md: grounds each candidate in real literature (novelty · support · gap); emits litscout.json (validates schemas/litscout.schema.json).
- roles/Judge.md: scores each against the fixed rubric and decides keep/kill/dedupe; emits verdict.json (validates schemas/verdict.schema.json).

Spawn-or-degrade. On Claude Code, spawn Generator / LiteratureScout / Judge as real Agent subagents each round; otherwise adopt each role inline in this context. You are the orchestrator.
Use when the user wants candidate hypotheses generated and vetted for a question or domain. Default: run the full generate→ground→judge loop below until saturation. Escape hatch: if the user only wants a single batch (no looping), run one round and report the kept hypotheses. Not for sharpening or decomposing a question, and not for grading an existing written proposal.
Resolve bindings interactively. If loop.run.yaml exists in the working dir, load it, confirm the
values in one line, and skip to the loop. Otherwise: on Claude Code (the AskUserQuestion tool is
available) infer a likely value for each binding and present it as the recommended option; on other
hosts ask each as a quoted plain-text prompt. Then write loop.run.yaml (format:
examples/run.example.yaml) and confirm every value plus the live/degraded literature tier before
creating any other files.
| binding | meaning | default | how to infer |
|---|---|---|---|
| <question> | the research question / domain, plus any scope (field, population, constraints) | — | ask the user |
| <gen_n> | candidate hypotheses the Generator proposes per round | 6 | — |
| <keep_threshold> | rubric score (0-100) a hypothesis must clear to enter the pool | 65 | — |
| <eval_scale> | LiteratureScout grounding depth (low/medium/high, see below) | medium | — |
| <sandbox_root> | where rounds, ledger, and lit cache live | ./sandbox | — |
| <budget> | max rounds | 6 | — |
| <patience> | stop after this many rounds with no new kept hypothesis | 2 | — |
| <report> | final ranked hypothesis set | <sandbox_root>/hypotheses.md | — |
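As a rough illustration of the bindings table, the defaults could be held in a small Python structure like the one below. The `Bindings` class and its validation rules are assumptions of this sketch, not part of the skill; the real on-disk format is whatever examples/run.example.yaml defines.

```python
from dataclasses import dataclass, asdict


@dataclass
class Bindings:
    """Hypothetical container for the resolved bindings (defaults from the table)."""
    question: str                   # no default: always asked of the user
    gen_n: int = 6                  # candidates proposed per round
    keep_threshold: int = 65        # rubric score (0-100) needed to enter the pool
    eval_scale: str = "medium"      # low / medium / high
    sandbox_root: str = "./sandbox"
    budget: int = 6                 # max rounds
    patience: int = 2               # dry rounds tolerated before stopping
    report: str = ""                # filled from sandbox_root when left empty

    def __post_init__(self):
        if self.eval_scale not in ("low", "medium", "high"):
            raise ValueError(f"unknown eval_scale: {self.eval_scale!r}")
        if not 0 <= self.keep_threshold <= 100:
            raise ValueError("keep_threshold must be within 0-100")
        if not self.report:
            self.report = f"{self.sandbox_root}/hypotheses.md"


bindings = Bindings(question="Does spaced practice aid procedural retention?")
print(asdict(bindings))
```

Only `question` has to be supplied; everything else falls back to the table's defaults, which is what makes a one-line confirmation of the values practical.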
Grounding depth dial (<eval_scale> caps per round — candidates examined deeply · queries each ·
papers read full-text):
| preset | candidates deep | queries each | fulltext reads |
|---|---|---|---|
| low | 2 | 1 | 0 (snippet/abstract only) |
| medium (recommended) | all | 2 | 1 |
| high | all | 3 | 3 |
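The dial above can be read as a lookup from preset to per-round caps. The sketch below is one way to encode it, assuming "all" means every candidate generated that round; the names `EVAL_CAPS`, `round_caps`, and the derived `max_queries` figure are hypothetical and only illustrate the arithmetic.

```python
# Preset table copied from the dial; None stands in for "all" (an assumption of this sketch).
EVAL_CAPS = {
    "low": {"candidates_deep": 2, "queries_each": 1, "fulltext_reads": 0},
    "medium": {"candidates_deep": None, "queries_each": 2, "fulltext_reads": 1},
    "high": {"candidates_deep": None, "queries_each": 3, "fulltext_reads": 3},
}


def round_caps(eval_scale: str, gen_n: int) -> dict:
    """Resolve the caps for one round given the preset and the batch size."""
    caps = dict(EVAL_CAPS[eval_scale])
    if caps["candidates_deep"] is None:
        caps["candidates_deep"] = gen_n          # "all" candidates this round
    else:
        caps["candidates_deep"] = min(caps["candidates_deep"], gen_n)
    # Derived upper bound on search queries per round (not stated in the table).
    caps["max_queries"] = caps["candidates_deep"] * caps["queries_each"]
    return caps


print(round_caps("medium", 6))
print(round_caps("low", 6))
```

With the default `<gen_n>` of 6, the medium preset allows up to 12 queries per round while low allows 2, which gives a feel for the cost difference between presets.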
Literature toolchain. Paper search goes through the sibling literature-search skill — resolve
<lit_skill_dir> (it installs as a sibling, e.g. ~/.claude/skills/literature-search/),
<lit_py> = python3, and <lit> = <lit_skill_dir>/tools/lit_search.py (note the tools/
segment); append --cache-dir <sandbox_root>/literature/.cache after a subcommand to reuse the cache.
Confirm <lit> --help works at setup; if the skill is absent, tell the user and either install it
(copy the repo's loops/literature-search folder into ~/.claude/skills/) or degrade all retrieval to
WebSearch/WebFetch (tag that evidence source:"web"). The keyless S2 + arXiv core needs no setup; a
free S2_API_KEY makes snippet/cite reliable.
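The setup check described above ("confirm `<lit> --help` works") might look like the following sketch. The helper names `lit_available` and `retrieval_mode` and the mode strings are hypothetical; only the `--help` invocation comes from the text.

```python
import subprocess
from pathlib import Path


def lit_available(lit_py: str, lit: str, timeout: float = 30.0) -> bool:
    """Return True when `<lit_py> <lit> --help` exits cleanly."""
    if not Path(lit).is_file():
        return False                      # skill not installed at this path
    try:
        result = subprocess.run(
            [lit_py, lit, "--help"],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False                      # interpreter missing or the tool hung
    return result.returncode == 0


def retrieval_mode(lit_py: str, lit: str) -> str:
    """Pick the literature tool, or the degraded web path (labels are illustrative)."""
    return "lit_search" if lit_available(lit_py, lit) else "web"


print(retrieval_mode("python3", "/nonexistent/tools/lit_search.py"))
```

When the check fails, evidence gathered through the fallback path would carry source:"web", as the paragraph above requires.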
API key (optional, never block). The literature-search skill owns the key convention: run
<lit> keys --init, then have the user fill the printed keys.env themselves and never paste secrets
into chat. Re-run <lit> keys to record the tier in loop.run.yaml (literature_tiers, presence
only). A missing key just degrades to the keyless pool → WebSearch.
Initialise the sandbox once bindings are confirmed:
<sandbox_root>/
├── loop.run.yaml ← resolved bindings + literature_tiers
├── ledger.tsv ← header only (see Ledger)
└── literature/.cache/ ← lit_search on-disk cache
Start with an empty pool and gaps seeded from <question>; create no round files until the loop runs.
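A minimal sketch of that initialisation, assuming loop.run.yaml has already been written during binding resolution (its format is not reproduced here). `init_sandbox` is a hypothetical helper; the column names come from the ledger header defined later in this document.

```python
from pathlib import Path

LEDGER_HEADER = ["round", "generated", "kept_new", "pool_size", "top_kept"]


def init_sandbox(sandbox_root: str) -> Path:
    """Create the cache directory and a header-only ledger; no round files yet."""
    root = Path(sandbox_root)
    (root / "literature" / ".cache").mkdir(parents=True, exist_ok=True)
    ledger = root / "ledger.tsv"
    if not ledger.exists():               # never clobber an existing ledger
        ledger.write_text("\t".join(LEDGER_HEADER) + "\n", encoding="utf-8")
    return root
```

Calling it twice is harmless: the directories are created with `exist_ok=True` and an existing ledger is left untouched.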
pool = the kept hypotheses (starts empty). gaps = open questions the LiteratureScout has surfaced
(starts empty; seed from <question>). dry = consecutive rounds with no new keep (starts 0). <N>
starts at 1.
Copy this checklist and tick items off:
- [ ] Run roles/Generator.md (spawn-or-degrade) with <question>, the current pool, gaps, and <gen_n>; it writes round<N>/candidates.json (<gen_n> specific, testable, plausibly-novel candidates aimed at the gaps, none duplicating the pool).
- [ ] Run roles/LiteratureScout.md (spawn-or-degrade) on candidates.json with <lit> and the <eval_scale> caps; it writes round<N>/litscout.json (validates schemas/litscout.schema.json), giving per candidate: novelty + closest prior work, support, gap, testability, each citing real evidence.
- [ ] As roles/Judge.md, apply the fixed rubric + evidence gate to litscout.json, checking each candidate against the pool for duplicates; write round<N>/verdict.json (validates schemas/verdict.schema.json): scores, total, keep, duplicate_of.
- [ ] Add each keep:true non-duplicate to pool (with scores + grounding + how-to-test); add this round's gap points to gaps. If ≥1 new keep, dry = 0; else dry += 1.
- [ ] N = N + 1.
- [ ] If dry == <patience> (saturation) or N > <budget> → stop (see Stops).

Re-ground every round: novelty is judged from a fresh literature check each round, never carried over, so "the literature already covers this" reliably kills a crowded idea. Every cited snippet comes from a real retrieval that round; on {"error","fallback"} fall back to WebSearch/WebFetch and never invent a paper.
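The control flow of the checklist can be sketched in Python as below. This is only an illustration of the bookkeeping (pool, gaps, dry, N): `run_loop` and the three callables are hypothetical stand-ins for the Generator, LiteratureScout, and Judge roles, which in the skill are agents rather than functions.

```python
def run_loop(question, generate, ground, judge, budget=6, patience=2):
    """Drive generate -> ground -> judge -> keep until saturation or budget."""
    pool, gaps = [], [question]           # gaps are seeded from the question
    dry, n = 0, 1
    history = []                          # (round, generated, kept_new, pool_size)
    while True:
        candidates = generate(question, pool, gaps)
        evaluations = ground(candidates)  # fresh literature check every round
        verdicts = judge(evaluations, pool)
        new_keeps = [v for v in verdicts
                     if v["keep"] and v["duplicate_of"] is None]
        pool.extend(new_keeps)
        for ev in evaluations:
            gaps.extend(ev.get("gap", []))
        dry = 0 if new_keeps else dry + 1
        history.append((n, len(candidates), len(new_keeps), len(pool)))
        n += 1
        if dry == patience or n > budget:
            return pool, history


# Stub roles, purely for demonstration: one keep in round 1, none afterwards.
def fake_generate(question, pool, gaps):
    return [{"hid": "a"}, {"hid": "b"}]


def fake_ground(candidates):
    return [{"hid": c["hid"], "gap": []} for c in candidates]


def fake_judge(evaluations, pool):
    return [{"hid": e["hid"], "keep": (not pool and i == 0), "duplicate_of": None}
            for i, e in enumerate(evaluations)]


pool, history = run_loop("toy question", fake_generate, fake_ground, fake_judge)
print(history)
```

With the default patience of 2, the stub run stops after round 3: two consecutive rounds added nothing, so the loop saturates well before the budget of 6.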
schemas/litscout.schema.json gates the LiteratureScout output — a generic instance:
{"round": 2,
"evaluations": [{"hid": "r2h1", "novelty_assessment": "novel",
"closest_prior_work": [{"claim": "X tested for facts, not skills", "cites": ["E1"]}],
"support": [{"claim": "spacing aids motor consolidation", "cites": ["E2"]}],
"gap": [{"claim": "long-term procedural retention untested at scale", "cites": ["E1"]}],
"testability_note": "RCT: spaced vs massed schedule, 1-month retention."}],
"evidence": [{"key": "E1", "title": "...", "source": "s2", "id": "a1b2", "snippet": "...verbatim..."}]}
schemas/verdict.schema.json gates the Judge output — a generic instance:
{"round": 2,
"verdicts": [{"hid": "r2h1",
"scores": {"novelty": 4, "grounding": 4, "testability": 5, "specificity": 4, "significance": 4},
"total": 82.0, "keep": true, "gate_failures": [], "duplicate_of": null,
"rationale": "Closest work shows X untested for skills -> novel; supported; clean RCT named."}]}
<sandbox_root>/ledger.tsv, tab-separated, never commas in free text. Header:
round generated kept_new pool_size top_kept
Example:
round generated kept_new pool_size top_kept
1 6 3 3 spaced practice aids procedural (not just declarative) retention [82]
2 6 2 5 sleep-timed review beats time-of-day-matched review [78]
3 6 0 5 -
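Appending one row per round could be done along these lines. `append_ledger_row` is a hypothetical helper; collapsing whitespace in the free-text column is an added safeguard of this sketch, so that a stray tab or newline cannot split a row.

```python
def append_ledger_row(path, round_no, generated, kept_new, pool_size, top_kept="-"):
    """Append one tab-separated row; '-' marks a round with no new keep."""
    # Collapse tabs/newlines in the free-text field so the row stays on one line.
    clean = " ".join(str(top_kept).split()) or "-"
    fields = [str(round_no), str(generated), str(kept_new), str(pool_size), clean]
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("\t".join(fields) + "\n")
    return fields
```

A dry round is then recorded as `append_ledger_row(path, 3, 6, 0, 5)`, matching the third row of the example above.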
Per-round candidates.json / litscout.json / verdict.json live in round<N>/. Report the best
state of the pool when stopping, not just the last round. Leave ledger.tsv, round*/, and
literature/ untracked.
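One cheap sanity check on a round's litscout.json, separate from schema validation, is that every key listed under cites resolves to an entry in the evidence list. The sketch below assumes the field names shown in the generic instance earlier; `unresolved_cites` is a hypothetical helper, not the skill's actual evidence gate.

```python
def unresolved_cites(litscout: dict) -> list:
    """List (hid, field, key) for every cite with no matching evidence entry."""
    known = {e["key"] for e in litscout.get("evidence", [])}
    missing = []
    for ev in litscout.get("evaluations", []):
        for field in ("closest_prior_work", "support", "gap"):
            for item in ev.get(field, []):
                for key in item.get("cites", []):
                    if key not in known:
                        missing.append((ev["hid"], field, key))
    return missing


# Toy document: E2 is cited but never retrieved, so it should be flagged.
sample = {
    "round": 2,
    "evaluations": [{
        "hid": "r2h1",
        "closest_prior_work": [{"claim": "c1", "cites": ["E1"]}],
        "support": [{"claim": "c2", "cites": ["E2"]}],
        "gap": [{"claim": "c3", "cites": ["E1"]}],
    }],
    "evidence": [{"key": "E1", "title": "...", "source": "s2", "id": "a1b2", "snippet": "..."}],
}
print(unresolved_cites(sample))
```

A non-empty result would be a reason to re-run retrieval for that candidate rather than let an unsupported claim reach the Judge.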
- Every cited snippet must come from a <lit>/WebFetch retrieval from that round, verbatim; the evidence gate exists to catch fabrication.
- The literature-search skill is stdlib-only; never print or commit API keys (keys.env stays gitignored at the project root).
- The sandbox is self-contained: no ../ escapes.

The loop stops on the first of:
- Saturation: dry == <patience>, meaning that many consecutive rounds add no new kept hypothesis.
- Budget: <budget> rounds reached.

End with the ranked hypothesis set (<report> path), giving each hypothesis with its statement, novelty assessment + closest prior work (cited), supporting evidence (cited), the gap it fills, and how to test it, plus the pool-size trajectory from ledger.tsv and the strongest unexplored gaps, so the user sees both the vetted hypotheses and where a deeper run would look next.
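The pool-size trajectory mentioned above can be read straight from the ledger. The following is a small sketch that assumes the five-column header defined earlier; `pool_trajectory` is a hypothetical helper and the sample text mirrors the example ledger.

```python
import csv
import io


def pool_trajectory(ledger_text: str) -> list:
    """Return (round, pool_size) pairs from tab-separated ledger text."""
    reader = csv.DictReader(io.StringIO(ledger_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [(int(row["round"]), int(row["pool_size"])) for row in reader]


ledger_text = (
    "round\tgenerated\tkept_new\tpool_size\ttop_kept\n"
    "1\t6\t3\t3\tspaced practice aids procedural (not just declarative) retention [82]\n"
    "2\t6\t2\t5\tsleep-timed review beats time-of-day-matched review [78]\n"
    "3\t6\t0\t5\t-\n"
)
print(pool_trajectory(ledger_text))
```

A flat tail in the trajectory is the visible signature of saturation, which is why it is worth showing alongside the ranked hypotheses.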
npx claudepluginhub gaasher/agent-loop-skills --plugin agent-loops