From agent-loops
Iteratively revises a scientific draft (prose, figures, code) via multi-judge critique and peer-review grading until a quality threshold is met. Requires Python 3.9+.
Slash command: /agent-loops:scientific-writer
The artifact is a **piece of scientific writing** (draft + its dataset + figures + optional code).
Each iteration critiques → grades → revises: five specialist judges produce concrete findings,
an independent peer_reviewer turns the paper into a graded 0-100 score on the same axes, and a
scientific_writer fixes the prose, figures, and code — running the user's <plot_command> to
regenerate figures. The loop runs until the score clears <pass_threshold> or the budget is hit. All
work happens on copies inside a sandbox; the user's originals are never touched.
The cast (all in this folder):
- roles/figures_judge.md, roles/scientific_judge.md, roles/style_judge.md, roles/formatting_judge.md, roles/code_reviewer.md: the five critics; each emits the shared schemas/finding.schema.json.
- roles/peer_reviewer.md: the summative grader (its own honesty rules); emits schemas/peer_review.schema.json and decides pass.
- roles/scientific_writer.md: the reviser; fixes code → regenerates figures → updates prose.
- schemas/finding.schema.json, schemas/peer_review.schema.json: the two validated outputs.

Spawn-or-degrade. On Claude Code, spawn the active judges as real Agent subagents in
parallel, then one fresh peer_reviewer, then the scientific_writer; otherwise adopt each role
inline. You are the orchestrator.
The peer_reviewer grades on the same axes the judges critique — which invites echoing, inflation
under loop-termination pressure, and a writer that games the rubric. roles/peer_reviewer.md
counters this: it (1) grades independently, re-deriving each axis from the paper + dataset before
reading the critiques, (2) verifies a sample of numbers/citations itself rather than trusting
"it's fixed", (3) must surface issues the judges missed, (4) holds a fixed, anchored,
reproducible bar with no credit for effort or elapsed iterations, (5) applies hard gates (a
confirmed block fails the paper regardless of the average), and (6) runs a substance check against
surface compliance. The writer optimizes the judges' concrete findings; the grader judges
holistically — so "address every finding" does not mechanically buy a pass.
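The hard-gate rule in (5) can be sketched together with the score formula the peer review uses, overall_score = 100 × Σscore / (5 × n_axes). This is only an illustration: the function names are hypothetical, the real pass decision is made by the peer_reviewer role, and "reaching" the threshold is assumed here to mean greater than or equal.

```python
def overall_score(axis_scores):
    """Return 100 * sum(scores) / (5 * n_axes) over the active axes (each scored 1-5)."""
    if not axis_scores:
        raise ValueError("at least one active axis is required")
    for axis, score in axis_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{axis}: score {score} is outside 1-5")
    return 100 * sum(axis_scores.values()) / (5 * len(axis_scores))


def passes(score, gate_failures, pass_threshold=85):
    """A confirmed hard gate fails the paper regardless of the average.

    Assumption: the threshold comparison is >= (the source says "reach").
    """
    return not gate_failures and score >= pass_threshold


# Baseline scores from the abridged peer_review example: five active axes.
baseline = {"figures": 1, "scientific": 1, "style": 2, "formatting": 2, "code": 1}
print(overall_score(baseline))  # 28.0

# Four active axes (no code): the denominator shrinks with n_axes.
no_code = {"figures": 4, "scientific": 5, "style": 4, "formatting": 4}
print(passes(overall_score(no_code), gate_failures=[]))
```

Keeping the gate check separate from the average makes it explicit that a high mean cannot offset a blocking issue.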
Use when a written scientific draft exists and the user wants it pushed past a quality bar with multi-judge critique and an independent peer-review grade. Default: run the full critique→grade→revise loop below. Escape hatch: if the user only wants the critique (no rewriting), run one round of judges
Resolve bindings interactively. If loop.run.yaml exists in the working dir, load it, confirm the
values in one line, and skip to the loop. Otherwise: on Claude Code (the AskUserQuestion tool is
available) infer a likely value for each binding and present it as the recommended option; on other
hosts ask each as a quoted plain-text prompt. Then write loop.run.yaml (format:
examples/run.example.yaml) and confirm every value — including which axes are active (4 vs 5) and the
live/degraded literature tier — before creating any other files.
| binding | meaning | default | how to infer |
|---|---|---|---|
| <draft_path> | the draft to improve (markdown/text/tex) | - | scan the working dir for a likely draft |
| <dataset_paths> | data file(s) the claims/figures derive from; used to verify numbers | - | scan for data files near the draft |
| <figures_dir> | directory of figure images the draft references | - | scan near the draft; may be null |
| <code_paths> | source files that produce plots/results; empty → code axis dropped (n_axes=4, no code_reviewer) | - | scan for plotting/analysis scripts |
| <plot_command> | command that regenerates figures/results from the code; null → figures/code are edit-only (flagged "needs regeneration") | - | pyproject.toml/.venv/README; e.g. python3 code/make_figures.py |
| <citation_style> | the single style the formatting_judge enforces (APA or MLA) | APA | ask the user |
| <target_venue> | optional venue whose style/length norms apply | - | ask the user |
| <length_limits> | optional abstract/paper word-count targets | - | ask the user |
| <intent> | the paper's core finding/contribution, 1-2 sentences; frozen, the writer may never change it | - | read the draft, extract it, confirm with the user |
| <pass_threshold> | overall_score (0-100) the peer_reviewer must reach (and no hard gate) to stop | 85 | a solid paper without demanding perfection |
| <budget> | max iterations | 6 | - |
| <patience> | stop after this many consecutive no-improvement iterations | 2 | - |
| <sandbox_root> | where working copies, critiques, reviews, and the ledger live | ./sandbox | - |
Literature toolchain. Citation grounding (writer) and verification (scientific_judge,
peer_reviewer) go through the sibling literature-search skill — resolve <lit_skill_dir> (it
installs as a sibling, e.g. ~/.claude/skills/literature-search/), <lit_py> = python3, and
<lit> = <lit_skill_dir>/tools/lit_search.py; append --cache-dir <sandbox_root>/literature/.cache after a subcommand to reuse the cache. Confirm <lit> --help works
at setup; if the skill is absent, tell the user and either install it (copy the repo's
loops/literature-search folder into ~/.claude/skills/) or degrade all retrieval to
WebSearch/WebFetch. The keyless S2 + arXiv core works with no setup; a free S2_API_KEY makes
snippet/cite reliable — run <lit> keys --init, have the user fill the printed keys.env
themselves, and never paste secrets into chat. Record the tier (presence only) in loop.run.yaml.
Environment. The writer regenerates figures by running the user's <plot_command> in the user's
own environment — that code may need third-party deps (matplotlib, pandas, …), so the skill ships
none and never installs them; it shells out to the user's command and reads the regenerated outputs.
Helper code the skill writes stays stdlib-only. tools/lit_search.py (in the sibling skill) is
stdlib-only too.
Initialise the sandbox once bindings are confirmed (copy in the originals; never edit them in place):
<sandbox_root>/
├── loop.run.yaml ← resolved bindings + <intent> + literature_tiers
├── ledger.tsv ← header only (see Ledger)
├── literature/.cache/ ← lit_search on-disk cache
└── iter1/
├── draft.md ← COPY of <draft_path>
├── figures/ ← COPY of <figures_dir>
└── code/ ← COPY of <code_paths> (omit if no code)
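A minimal sketch of this setup step in Python, standard library only. The function name init_sandbox and its parameters are hypothetical, not part of the skill. It assumes the draft is copied to draft.md whatever its original extension (as the tree shows), and it leaves loop.run.yaml out because that file's format is defined in examples/run.example.yaml.

```python
import shutil
from pathlib import Path

# Column names taken from the ledger header shown in this document.
LEDGER_HEADER = ["iter", "overall_score", "pass", "figures", "scientific",
                 "style", "formatting", "code", "top_fix", "revision_summary"]


def init_sandbox(sandbox_root, draft_path, figures_dir=None, code_paths=()):
    """Build <sandbox_root>/iter1 from copies; the originals are only read."""
    root = Path(sandbox_root)
    iter1 = root / "iter1"
    (root / "literature" / ".cache").mkdir(parents=True, exist_ok=True)
    iter1.mkdir(parents=True, exist_ok=True)

    # Copy the draft; the tree above names the working copy draft.md.
    shutil.copy2(draft_path, iter1 / "draft.md")

    # Figures are optional (<figures_dir> may be null).
    if figures_dir is not None:
        shutil.copytree(figures_dir, iter1 / "figures", dirs_exist_ok=True)

    # Omit code/ entirely when there is no code (the code axis is dropped).
    if code_paths:
        code_dir = iter1 / "code"
        code_dir.mkdir(exist_ok=True)
        for src in code_paths:
            shutil.copy2(src, code_dir / Path(src).name)

    # The ledger starts as a header-only TSV.
    ledger = root / "ledger.tsv"
    if not ledger.exists():
        ledger.write_text("\t".join(LEDGER_HEADER) + "\n", encoding="utf-8")
    return iter1
```

Because the function only reads from the source paths and writes under sandbox_root, the user's originals stay untouched.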
<N> starts at 1; iteration 1 critiques and grades the unmodified draft (the baseline — no
revision before it). Re-grade fresh every iteration: the score comes only from a new peer review of
the revised paper, never carried over. Surface-only changes won't move it.
Copy this checklist and tick items off:
- [ ] Run the active judges on the iter<N>/ working copies + dataset; each writes iter<N>/critiques/<reviewer>.json (validates against schemas/finding.schema.json). Skip code_reviewer when no code.
- [ ] Run one fresh peer_reviewer (roles/peer_reviewer.md): it grades independently (own read first, then reconcile with the critiques; spot-checks numbers/citations; surfaces missed issues), writes iter<N>/peer_review.json (validates against schemas/peer_review.schema.json): 1-5 per active axis, overall_score = 100 × Σscore / (5 × n_axes), hard gates → pass.
- [ ] Append a ledger.tsv row (see Ledger).
- [ ] If peer_review.pass == true, or N == <budget>, or overall_score flat for <patience> iterations → stop (see Stops).
- [ ] Run scientific_writer (roles/scientific_writer.md) with the critiques, the peer review, <lit>, <plot_command>, <citation_style>, and <intent>. It fixes block/gate items first, fixes code → regenerates figures (running the copied <plot_command> inside the sandbox) → updates prose, grounds new citations via <lit>, and writes iter<N+1>/{draft.md,figures/,code/} + iter<N+1>/revision_notes.md.
- [ ] Set N = N + 1 and repeat.

A judge finding and a peer_review look like (abridged; full shapes in schemas/):
{"reviewer": "code_reviewer", "iteration": 1, "overall": "block",
"summary": "Block: make_figures.py sorts the two columns independently, fabricating r=0.98 (true r≈0.62).",
"findings": [{"urgency": "must_fix", "action_type": "replace", "area": "make_figures:broken-pairing",
"finding": "xs=sorted(study); ys=sorted(score) destroys per-row pairing; r inflated to 0.98 (real ≈0.62).",
"proposed_action": "Correlate the paired arrays; re-render Fig 1; update r everywhere.",
"target_artifact": "iter1/code/make_figures.py", "evidence": "code/make_figures.py:~70; paired r=0.62"}]}
{"iteration": 1,
"axes": {"figures": {"score": 1, "justification": "Fig 1 plots bug-sorted data; numbers != data.", "blocking_issues": ["fig r=0.98 != paired r=0.62"]},
"scientific": {"score": 1, "justification": "Causal claim from one correlation; r unreproducible."},
"style": {"score": 2, "justification": "Promotional, AI-flavored prose."},
"formatting": {"score": 2, "justification": "Mixed APA/MLA; Results before Methods."},
"code": {"score": 1, "justification": "Independent column sort fabricates r=0.98."}},
"overall_score": 28.0, "pass": false,
"gate_failures": ["make_figures.py sort bug makes r=0.98 an artifact (true 0.62)"],
"issues_judges_missed": ["Methods omits the test used and n."],
"spotchecks": [{"target": "paired Pearson r", "method": "recomputed from data", "result": "refuted"}],
"substance_check": "Baseline iteration — nothing revised yet."}
Ledger. <sandbox_root>/ledger.tsv is tab-separated; never use commas in free text:
iter	overall_score	pass	figures	scientific	style	formatting	code	top_fix	revision_summary
1	28.0	no	1	1	2	2	1	fix make_figures sort bug	baseline (no revision)
2	64.0	no	3	3	3	3	4	report honest r; de-causalize	fixed bug+claims; relabeled fig1
3	88.0	yes	4	5	4	4	5	-	unified APA; trimmed abstract; balanced sleep claim
Use - in the code column when the code axis is absent. The per-iteration critiques/*.json,
peer_review.json, and revision_notes.md live in iter<N>/. Report the best-scoring iteration
when stopping on budget/plateau, not necessarily the last. Leave the sandbox untracked.
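One way to build and append a ledger row from a peer_review object, as a hedged sketch. The helper names (clean, ledger_row, append_row) are hypothetical, and writing "-" for an empty top_fix is an assumption based on the example rows rather than a documented rule.

```python
from pathlib import Path

AXES = ["figures", "scientific", "style", "formatting", "code"]


def clean(text):
    """Make free text safe for the TSV: no commas, tabs, or newlines."""
    return " ".join(str(text).replace(",", ";").split())


def ledger_row(review, top_fix, revision_summary):
    """Turn a peer_review dict into the ten ledger cells, in column order."""
    axes = review["axes"]
    cells = [
        str(review["iteration"]),
        f"{review['overall_score']:.1f}",
        "yes" if review["pass"] else "no",
    ]
    # An absent axis (only the code axis can be dropped) is written as "-".
    cells += [str(axes[a]["score"]) if a in axes else "-" for a in AXES]
    # Assumption: an empty top_fix is also written as "-".
    cells += [clean(top_fix) or "-", clean(revision_summary)]
    return cells


def append_row(ledger_path, cells):
    """Append one tab-separated line to the ledger."""
    with Path(ledger_path).open("a", encoding="utf-8") as fh:
        fh.write("\t".join(cells) + "\n")


review = {
    "iteration": 1,
    "axes": {"figures": {"score": 1}, "scientific": {"score": 1},
             "style": {"score": 2}, "formatting": {"score": 2},
             "code": {"score": 1}},
    "overall_score": 28.0,
    "pass": False,
}
print("\t".join(ledger_row(review, "fix make_figures sort bug", "baseline (no revision)")))
```

Replacing commas with semicolons at write time keeps the "never commas in free text" rule from depending on whoever composes the summary.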
- Stay inside <sandbox_root>: read the originals once at setup, copy them in, and work only on the copies; the <plot_command> is copied in and run from the sandbox.
- Ground citations in <lit> / WebFetch retrievals, quoted verbatim; on {"error","fallback"}, fall back to WebSearch/WebFetch.
- Preserve <intent>: the writer strengthens the same finding; it never changes the core result or deletes a real finding to dodge a critique, because removing a result to raise a score is gaming.
- Install nothing: <plot_command> runs in the user's env, and helper code is stdlib-only. Never print or commit API keys (keys.env stays gitignored).

Stops. The loop stops on the first of:
- peer_review.pass == true. Report the deliverable (iter<N>/ artifacts), the score, and the trajectory.
- N == <budget>. Report the best-scoring iteration as the deliverable.
- overall_score flat for <patience> iterations. Report the best iteration and the standing gate_failures/must_fix blockers.

Always end with the deliverable (iter<N>/ path), its overall_score and pass/fail, the per-axis
scores, the score trajectory from ledger.tsv, and, if it did not pass, the standing blockers
(gate_failures + open must_fix) between the paper and the bar.
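The three stop conditions can be sketched as a single check evaluated after each peer review. The function names are hypothetical. "Flat for <patience> iterations" is interpreted here as: none of the last <patience> iterations beat the best score seen before them. The source does not spell out the exact comparison, so treat that as an assumption.

```python
def stop_reason(scores, passed, budget=6, patience=2):
    """Return "pass", "budget", "plateau", or None (keep looping).

    scores: overall_score per iteration so far (index 0 is iteration 1).
    passed: peer_review.pass from the latest iteration.
    """
    n = len(scores)
    if passed:
        return "pass"
    if n >= budget:
        return "budget"
    # Assumed plateau rule: the last `patience` scores did not exceed
    # the best score recorded before them.
    if n > patience and max(scores[-patience:]) <= max(scores[:-patience]):
        return "plateau"
    return None


def best_iteration(scores):
    """1-based index of the best-scoring iteration (earliest on ties)."""
    return scores.index(max(scores)) + 1


history = [28.0, 64.0, 60.0, 62.0]
print(stop_reason(history, passed=False))  # plateau
print(best_iteration(history))             # 2
```

On a budget or plateau stop, best_iteration picks the deliverable, which is why the last iteration is not necessarily the one reported.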
npx claudepluginhub gaasher/agent-loop-skills --plugin agent-loops