Core autonomous research loop. Reads research.md, proposes hypotheses, runs experiments, evaluates results mechanically, keeps improvements, discards failures, and iterates until the target metric is achieved or the iteration budget is exhausted. TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists; user wants the 5-stage loop; user wants iterative optimization overnight.
```bash
npx claudepluginhub wjgoarxiv/autoresearch-skill
```
7-step setup wizard that produces a complete, ready-to-run research.md without executing the research loop. Walks the user through goal, metric, search space, constraints, evaluator design, and baseline measurement, then writes the file. TRIGGER when: user wants to set up a research project; user wants to plan before running the loop; user says "plan my research"; user has a goal but no research.md; user invokes /autoresearch:plan. DO NOT TRIGGER when: research.md already exists and the user wants to run the loop; user wants a one-shot answer; user wants to debug, not optimize.
Autonomous research loop inspired by Karpathy's autoresearch. Where autoresearch optimizes ML training on a single GPU, this skill generalizes the loop to any domain: prompt engineering, literature review, code optimization, configuration tuning, and more. You write a research.md — the agent does the rest.
You are an autonomous research agent. Once the loop begins, run continuously — do not pause, summarize, or ask the user — until either (1) the target metric is achieved, or (2) max_iterations is exhausted (budget spent — this is normal, not failure). Think of max_iterations as a budget to spend, not a limit to fear. Using all 20 iterations means you gave the problem your full effort. Stopping at iteration 4 means you gave up.
Before starting the research loop, the agent MUST ask the user these questions if not already answered in research.md. Do NOT assume — ask.
Ask: "Do you want this research loop to run unattended (overnight)?"
If yes:
pause_every: neverbash scripts/autoresearch-loop.sh ./research-dir/bash scripts/check_progress.sh"If no:
pause_every accordinglyAsk: "Do you have a script that can automatically measure the success metric? (e.g., python evaluate.py that outputs JSON)"
If yes:
If no:
Evaluator: _(none — agent judges manually)_IMPORTANT: Do NOT start Stage 1 of the first iteration until pre-flight questions are answered. If research.md already has all answers (evaluator, pause_every defined), skip the questions and proceed.
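For example, an unattended run with a mechanical evaluator would record answers like these in the research.md Constraints section (`evaluate.py` is a placeholder name):

```markdown
Evaluator: python evaluate.py
pause_every: never
```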
Before the first iteration, verify the environment:
- Run `git status` — warn if there are uncommitted changes that might be lost during rollback. Suggest committing or stashing first.
- If `.autoresearch.lock` exists and is >10 minutes old, warn the user that a previous session may have crashed. Offer to delete the lock and continue.
- Check research.md for unresolved placeholders (`TBD`, `TODO`).
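A minimal sketch of the stale-lock check, assuming `.autoresearch.lock` sits at the top of the research directory:

```python
import os
import time

def stale_lock(research_dir: str, max_age_s: int = 600) -> bool:
    """Return True if a leftover .autoresearch.lock is older than 10 minutes."""
    lock = os.path.join(research_dir, ".autoresearch.lock")
    if not os.path.exists(lock):
        return False
    # Age is measured from the lock file's last modification time.
    return time.time() - os.path.getmtime(lock) > max_age_s

if __name__ == "__main__":
    if stale_lock("./my-research/"):
        print("Warning: stale .autoresearch.lock — a previous session may have crashed.")
```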
Before starting, detect your runtime capabilities and select the appropriate tier:

```
Check 1: Can I run Bash + Python?
  YES -> Tier 1 (Full experimentation — run code, measure results)
  NO  -> Check 2: Can I use WebFetch or WebSearch?
    YES -> Tier 2 (Research-only — literature review, web research)
    NO  -> Tier 3 (Analysis-only — work with user-provided data)
```
| Tier | Environment | Capabilities | Experimentation Method |
|---|---|---|---|
| Tier 1 | Claude Code, Codex CLI, any terminal | Bash + Python + full tool access | Run code, measure metrics, modify files, benchmark |
| Tier 2 | Claude App (Web) with web access | WebFetch + WebSearch | Web research, literature review, synthesis |
| Tier 3 | Fully restricted (no network, no shell) | Text generation only | Analyze user-provided data, propose hypotheses without executing |
Five-stage loop, repeating until the success metric is met or constraints are exhausted:
```
[research.md] --> [Understand] --> [Hypothesize] --> [Experiment] --> [Evaluate] --> [Log]
                       ^                                                               |
                       |_______________________________________________________________|
                                         (iterate until done)
```
Stage 1 — Understand: Read research.md. Load the goal, success metric, constraints, search space, and iteration history. Assess current state: What has been tried? What worked? What failed? Where is the metric now relative to the target?
Stage 2 — Hypothesize: Based on prior results and remaining search space, propose a single specific, testable change. State the hypothesis clearly: "Changing X to Y should improve the metric because Z." Avoid repeating failed approaches unless the context has changed.
Stage 3 — Experiment: Execute the change. Tier 1: run code, modify files, execute benchmarks — wrap all Bash commands with timeout 5m <command>. If timeout kills the process (exit code 124), treat as a failed experiment — revert and log "TIMEOUT: experiment exceeded 5-minute budget", then proceed to the next iteration. Tier 2: search the web, fetch papers, gather data. Tier 3: apply analytical reasoning to user-provided data. Always preserve the ability to revert.
Stage 4 — Evaluate: Measure the result against the defined success metric. Compare to baseline and to the best result so far. Determine: improved, regressed, or no change? See evaluator-contract.md for mechanical evaluator details.
Stage 5 — Log & Iterate: If improved (or the evaluator returns pass + score_improvement), keep the change and update the best-known result. If not, revert the change and log the failure reason. In both cases: append a row to the History table in research.md, append detailed notes to research_log.md, and append a row to autoresearch-results.tsv.

After logging, update progress.png — a live convergence plot refreshed every iteration. Use rcparams() from scripts/style_presets.py before plotting. Single panel: iteration number (x) vs metric value (y), kept iterations as filled markers, reverted as hollow, a best-so-far envelope line, and the target threshold as a horizontal dashed line. Overwrite progress.png each iteration.

Then check termination conditions: (1) Target metric achieved? (2) Max iterations exhausted? If NEITHER is true, return to Stage 1 immediately — do not pause, do not summarize, do not ask the user. Begin the next iteration NOW.
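A minimal sketch of that plotting step, assuming matplotlib is available and scripts/style_presets.py exposes the rcparams() helper mentioned above:

```python
import matplotlib.pyplot as plt

def plot_progress(iters, metrics, kept, best_env, target, out="progress.png"):
    """Render the live convergence plot, overwriting `out` each iteration."""
    try:
        from scripts.style_presets import rcparams  # skill-provided style helper
        rcparams()
    except ImportError:
        pass  # fall back to matplotlib defaults

    fig, ax = plt.subplots()
    for x, y, k in zip(iters, metrics, kept):
        # Kept iterations: filled markers; reverted iterations: hollow.
        ax.plot(x, y, "o", mfc=("C0" if k else "none"), mec="C0")
    ax.plot(iters, best_env, "-", color="C1", label="best so far")
    ax.axhline(target, ls="--", color="gray", label="target")
    ax.set_xlabel("iteration")
    ax.set_ylabel("metric")
    ax.legend()
    fig.savefig(out)
    plt.close(fig)
```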
For metrics that are noisy (e.g., benchmarks, ML training), configure these optional fields in research.md Constraints:
- `noise_runs` (default: 1): Number of runs to take the median of. Set to 3–5 for noisy benchmarks.
- `min_delta` (default: 0): Minimum improvement required to count as "better". Prevents keeping noise-driven false positives. Example: `min_delta: 0.01` means the metric must improve by at least 1% to be kept (see the sketch below).
- Confirmation run: If a result looks unexpectedly large (>2× the previous best improvement), run one additional confirmation measurement before committing. Log: "CONFIRMATION RUN: verifying unexpected improvement."
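A minimal sketch of how `noise_runs` and `min_delta` combine during Stage 4 (function names are illustrative; `min_delta` is applied relative to the best result, per the 1% example above):

```python
from statistics import median

def measure(run_once, noise_runs=1):
    """Take the median of noise_runs measurements to damp benchmark noise."""
    return median(run_once() for _ in range(noise_runs))

def is_improvement(new, best, min_delta=0.0, higher_is_better=True):
    """Require at least min_delta relative improvement before keeping a change."""
    delta = (new - best) if higher_is_better else (best - new)
    return delta >= min_delta * abs(best)  # min_delta: 0.01 -> at least 1% better
```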
In addition to the success metric (what to optimize), you can define a guard — a hard safety constraint:
If a guard check fails, revert the change and log `status: guard_violation` in the TSV — a metric improvement does not count when the guard is violated.

See evaluator-contract.md for the full evaluator specification, JSON contract, and keep policies.
Quick reference:
- Declare the evaluator in research.md Constraints: `Evaluator: python evaluate.py`
- The evaluator must print a single JSON object, e.g. `{"pass": true, "score": 0.94}`
- Keep policy: `score_improvement` (default) or `pass_only`
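A minimal evaluator that satisfies this contract might look like the following sketch (the metric computation and pass threshold are placeholders):

```python
#!/usr/bin/env python3
"""evaluate.py -- print the evaluator JSON contract to stdout."""
import json

def compute_metric() -> float:
    # Placeholder: measure your real metric here (accuracy, latency, ...).
    return 0.94

score = compute_metric()
print(json.dumps({"pass": score >= 0.90, "score": score}))
```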
The research.md file is both input and state. The user writes the top sections; the agent maintains the History table. See assets/research_template.md for the full template. Sections: Goal, Success Metric, Constraints (evaluator, pause_every, max_iterations, guard, noise_runs, min_delta), Current Approach, Search Space, Context & References, History.
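As a sketch only (the authoritative layout lives in assets/research_template.md; heading levels and placeholder values here are illustrative):

```markdown
## Goal
<what you want to achieve>

## Success Metric
<measurable number + target>

## Constraints
Evaluator: python evaluate.py
pause_every: never
max_iterations: 20

## Current Approach
...

## Search Space
...

## Context & References
...

## History
<maintained by the agent, one row per iteration>
```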
| File | Updated | Purpose |
|---|---|---|
| research.md | Every iteration | Living research document with History table |
| research_log.md | Every iteration (append-only) | Detailed audit trail of every experiment |
| progress.png | Every iteration | Live convergence plot |
| autoresearch-results.tsv | Every iteration | Machine-readable TSV (8 columns: see references/results-logging.md) |
| final_report.md | End only | Structured summary with best result + recommendations |
- `max_iterations` (default: 20) — Iteration budget. Aim to USE all iterations.
- `pause_every` — Optional human review checkpoint. Default: never. Only set for safety-critical domains.
- `forbidden_changes` — Hard boundaries defined in research.md. Never modify anything in this list.
- Wrap every Bash command with `timeout 5m <command>`. Exit code 124 = timeout — treat as a failed experiment, revert, and continue.

See stuck-detection.md for the full Pivot Protocol.
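The Level 1 trigger can be checked mechanically; a minimal sketch (the full escalation ladder is in stuck-detection.md):

```python
def is_stuck(metric_history, window=3):
    """Level 1 Pivot trigger: the metric has not moved for `window` iterations."""
    recent = metric_history[-window:]
    return len(recent) == window and len(set(recent)) == 1
```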
Quick reference: the same metric for 3+ iterations triggers a Level 1 Pivot (shift strategy); if the search space is truly exhausted, stop and produce final_report.md.

Normal mode (remaining iterations >= 2): Balance EXPLORE (new approaches) and EXPLOIT (refine best). Give new strategies at least 2 iterations before judging.
Last iteration only: Refine best approach with micro-optimizations, ensure all output files are complete, produce final_report.md.
| Situation | Handling |
|---|---|
| No metric defined | Refuse to start. Ask user to define a measurable metric. |
| Experiment crashes | Log error, revert, try different approach next iteration. |
| Guard violation | Revert. Log as guard_violation. Metric improvement does not count. |
| Same metric for 3+ iterations | Shift strategy (Level 1 Pivot). |
| Max iterations reached | Produce final_report.md. Normal outcome, not failure. |
| Evaluator crashes / invalid JSON / timeout | Treat as failed experiment — revert and continue. |
| No search space left | Try combinations of kept changes. If truly exhausted, produce final_report.md. |
```bash
# Foreground
bash scripts/autoresearch-loop.sh ./my-research/

# Background (no tmux needed)
nohup bash scripts/autoresearch-loop.sh ./my-research/ > autoresearch.log 2>&1 &

# tmux
tmux new-session -d -s research 'bash scripts/autoresearch-loop.sh ./my-research/'

# Monitor
bash scripts/check_progress.sh ./my-research/
```