Core autonomous research loop. Reads research.md, proposes hypotheses, runs experiments, evaluates results mechanically, keeps improvements, discards failures, and iterates until the target metric is achieved or the iteration budget is exhausted. TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists; user wants the 5-stage loop; user wants iterative optimization overnight.
```bash
npx claudepluginhub wjgoarxiv/autoresearch-skill
```
7-step setup wizard that produces a complete, ready-to-run research.md without executing the research loop. Walks the user through goal, metric, search space, constraints, evaluator design, and baseline measurement, then writes the file. TRIGGER when: user wants to set up a research project; user wants to plan before running the loop; user says "plan my research"; user has a goal but no research.md; user invokes /autoresearch:plan. DO NOT TRIGGER when: research.md already exists and the user wants to run the loop; user wants a one-shot answer; user wants to debug, not optimize.
Autonomous research loop inspired by Karpathy's autoresearch. Where autoresearch optimizes ML training on a single GPU, this skill generalizes the loop to any domain: prompt engineering, literature review, code optimization, configuration tuning, and more. You write a research.md — the agent does the rest.
You are an autonomous research agent. Once the loop begins, run continuously — do not pause, summarize, or ask the user — until either (1) the target metric is achieved, or (2) max_iterations is exhausted (budget spent — this is normal, not failure). Think of max_iterations as a budget to spend, not a limit to fear. Using all 20 iterations means you gave the problem your full effort. Stopping at iteration 4 means you gave up.
Before starting the research loop, the agent MUST ask the user these questions if not already answered in research.md. Do NOT assume — ask.
Ask: "Do you want this research loop to run unattended (overnight)?"
If yes:
pause_every: neverbash scripts/autoresearch-loop.sh ./research-dir/bash scripts/check_progress.sh"If no:
pause_every accordinglyAsk: "Do you have a script that can automatically measure the success metric? (e.g., python evaluate.py that outputs JSON)"
If yes:
If no:
Evaluator: _(none — agent judges manually)_IMPORTANT: Do NOT start Stage 1 of the first iteration until pre-flight questions are answered. If research.md already has all answers (evaluator, pause_every defined), skip the questions and proceed.
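For example, an unattended run with a mechanical evaluator would record answers like these in the research.md Constraints section (`evaluate.py` is a placeholder name):

```markdown
Evaluator: python evaluate.py
pause_every: never
```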
Before the first iteration, verify the environment:
- Run `git status` — warn if there are uncommitted changes that might be lost during rollback. Suggest committing or stashing first.
- If `.autoresearch.lock` exists and is >10 minutes old, warn the user that a previous session may have crashed. Offer to delete the lock and continue.
- Check research.md for unresolved placeholders (`TBD`, `TODO`).
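A minimal sketch of the stale-lock check, assuming `.autoresearch.lock` sits at the top of the research directory:

```python
import os
import time

def stale_lock(research_dir: str, max_age_s: int = 600) -> bool:
    """Return True if a leftover .autoresearch.lock is older than 10 minutes."""
    lock = os.path.join(research_dir, ".autoresearch.lock")
    if not os.path.exists(lock):
        return False
    # Age is measured from the lock file's last modification time.
    return time.time() - os.path.getmtime(lock) > max_age_s

if __name__ == "__main__":
    if stale_lock("./my-research/"):
        print("Warning: stale .autoresearch.lock — a previous session may have crashed.")
```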
Before starting, detect your runtime capabilities and select the appropriate tier:

```
Check 1: Can I run Bash + Python?
  YES -> Tier 1 (Full experimentation — run code, measure results)
  NO  -> Check 2: Can I use WebFetch or WebSearch?
    YES -> Tier 2 (Research-only — literature review, web research)
    NO  -> Tier 3 (Analysis-only — work with user-provided data)
```
| Tier | Environment | Capabilities | Experimentation Method |
|---|---|---|---|
| Tier 1 | Claude Code, Codex CLI, any terminal | Bash + Python + full tool access | Run code, measure metrics, modify files, benchmark |
| Tier 2 | Claude App (Web) with web access | WebFetch + WebSearch | Web research, literature review, synthesis |
| Tier 3 | Fully restricted (no network, no shell) | Text generation only | Analyze user-provided data, propose hypotheses without executing |
Five-stage loop, repeating until the success metric is met or constraints are exhausted:
```
[research.md] --> [Understand] --> [Hypothesize] --> [Experiment] --> [Evaluate] --> [Log]
                       ^                                                               |
                       |_______________________________________________________________|
                                         (iterate until done)
```
Stage 1 — Understand: Read research.md. Load the goal, success metric, constraints, search space, and iteration history. Assess current state: What has been tried? What worked? What failed? Where is the metric now relative to the target?
Stage 2 — Hypothesize: Based on prior results and remaining search space, propose a single specific, testable change. State the hypothesis clearly: "Changing X to Y should improve the metric because Z." Avoid repeating failed approaches unless the context has changed.
Stage 3 — Experiment: Execute the change. Tier 1: run code, modify files, execute benchmarks — wrap all Bash commands with timeout 5m <command>. If timeout kills the process (exit code 124), treat as a failed experiment — revert and log "TIMEOUT: experiment exceeded 5-minute budget", then proceed to the next iteration. Tier 2: search the web, fetch papers, gather data. Tier 3: apply analytical reasoning to user-provided data. Always preserve the ability to revert.
Stage 4 — Evaluate: Measure the result against the defined success metric. Compare to baseline and to the best result so far. Determine: improved, regressed, or no change? See evaluator-contract.md for mechanical evaluator details.
Stage 5 — Log & Iterate: If improved (or the evaluator returns pass + score_improvement), keep the change and update the best-known result. If not, revert the change and log the failure reason. In both cases: append a row to the History table in research.md, append detailed notes to research_log.md, and append a row to autoresearch-results.tsv.

After logging, update progress.png — a live convergence plot refreshed every iteration. Use rcparams() from scripts/style_presets.py before plotting. Single panel: iteration number (x) vs metric value (y), kept iterations as filled markers, reverted as hollow, a best-so-far envelope line, and the target threshold as a horizontal dashed line. Overwrite progress.png each iteration.

Then check termination conditions: (1) Target metric achieved? (2) Max iterations exhausted? If NEITHER is true, return to Stage 1 immediately — do not pause, do not summarize, do not ask the user. Begin the next iteration NOW.
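A minimal sketch of that plotting step, assuming matplotlib is available and scripts/style_presets.py exposes the rcparams() helper mentioned above:

```python
import matplotlib.pyplot as plt

def plot_progress(iters, metrics, kept, best_env, target, out="progress.png"):
    """Render the live convergence plot, overwriting `out` each iteration."""
    try:
        from scripts.style_presets import rcparams  # skill-provided style helper
        rcparams()
    except ImportError:
        pass  # fall back to matplotlib defaults

    fig, ax = plt.subplots()
    for x, y, k in zip(iters, metrics, kept):
        # Kept iterations: filled markers; reverted iterations: hollow.
        ax.plot(x, y, "o", mfc=("C0" if k else "none"), mec="C0")
    ax.plot(iters, best_env, "-", color="C1", label="best so far")
    ax.axhline(target, ls="--", color="gray", label="target")
    ax.set_xlabel("iteration")
    ax.set_ylabel("metric")
    ax.legend()
    fig.savefig(out)
    plt.close(fig)
```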
For metrics that are noisy (e.g., benchmarks, ML training), configure these optional fields in research.md Constraints:
- `noise_runs` (default: 1): Number of runs to take the median of. Set to 3–5 for noisy benchmarks.
- `min_delta` (default: 0): Minimum improvement required to count as "better". Prevents keeping noise-driven false positives. Example: `min_delta: 0.01` means the metric must improve by at least 1% to be kept (see the sketch below).
- Confirmation run: If a result looks unexpectedly large (>2× the previous best improvement), run one additional confirmation measurement before committing. Log: "CONFIRMATION RUN: verifying unexpected improvement."
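A minimal sketch of how `noise_runs` and `min_delta` combine during Stage 4 (function names are illustrative; `min_delta` is applied relative to the best result, per the 1% example above):

```python
from statistics import median

def measure(run_once, noise_runs=1):
    """Take the median of noise_runs measurements to damp benchmark noise."""
    return median(run_once() for _ in range(noise_runs))

def is_improvement(new, best, min_delta=0.0, higher_is_better=True):
    """Require at least min_delta relative improvement before keeping a change."""
    delta = (new - best) if higher_is_better else (best - new)
    return delta >= min_delta * abs(best)  # min_delta: 0.01 -> at least 1% better
```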
In addition to the success metric (what to optimize), you can define a guard — a hard safety constraint:
If a guard check fails, revert the change and log `status: guard_violation` in the TSV — a metric improvement does not count when the guard is violated.

See evaluator-contract.md for the full evaluator specification, JSON contract, and keep policies.
Quick reference:
- Declare the evaluator in research.md Constraints: `Evaluator: python evaluate.py`
- The evaluator must print a single JSON object, e.g. `{"pass": true, "score": 0.94}`
- Keep policy: `score_improvement` (default) or `pass_only`
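A minimal evaluator that satisfies this contract might look like the following sketch (the metric computation and pass threshold are placeholders):

```python
#!/usr/bin/env python3
"""evaluate.py -- print the evaluator JSON contract to stdout."""
import json

def compute_metric() -> float:
    # Placeholder: measure your real metric here (accuracy, latency, ...).
    return 0.94

score = compute_metric()
print(json.dumps({"pass": score >= 0.90, "score": score}))
```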
The research.md file is both input and state. The user writes the top sections; the agent maintains the History table. See assets/research_template.md for the full template. Sections: Goal, Success Metric, Constraints (evaluator, pause_every, max_iterations, guard, noise_runs, min_delta), Current Approach, Search Space, Context & References, History.
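As a sketch only (the authoritative layout lives in assets/research_template.md; heading levels and placeholder values here are illustrative):

```markdown
## Goal
<what you want to achieve>

## Success Metric
<measurable number + target>

## Constraints
Evaluator: python evaluate.py
pause_every: never
max_iterations: 20

## Current Approach
...

## Search Space
...

## Context & References
...

## History
<maintained by the agent, one row per iteration>
```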
| File | Updated | Purpose |
|---|---|---|
| research.md | Every iteration | Living research document with History table |
| research_log.md | Every iteration (append-only) | Detailed audit trail of every experiment |
| progress.png | Every iteration | Live convergence plot |
| autoresearch-results.tsv | Every iteration | Machine-readable TSV (8 columns: see references/results-logging.md) |
| final_report.md | End only | Structured summary with best result + recommendations |
- `max_iterations` (default: 20) — Iteration budget. Aim to USE all iterations.
- `pause_every` — Optional human review checkpoint. Default: never. Only set for safety-critical domains.
- `forbidden_changes` — Hard boundaries defined in research.md. Never modify anything in this list.
- Wrap every Bash command with `timeout 5m <command>`. Exit code 124 = timeout — treat as a failed experiment, revert, and continue.

See stuck-detection.md for the full Pivot Protocol.
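The Level 1 trigger can be checked mechanically; a minimal sketch (the full escalation ladder is in stuck-detection.md):

```python
def is_stuck(metric_history, window=3):
    """Level 1 Pivot trigger: the metric has not moved for `window` iterations."""
    recent = metric_history[-window:]
    return len(recent) == window and len(set(recent)) == 1
```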
Quick reference: the same metric for 3+ iterations triggers a Level 1 Pivot (shift strategy); if the search space is truly exhausted, stop and produce final_report.md.

Normal mode (remaining iterations >= 2): Balance EXPLORE (new approaches) and EXPLOIT (refine best). Give new strategies at least 2 iterations before judging.
Last iteration only: Refine best approach with micro-optimizations, ensure all output files are complete, produce final_report.md.
| Situation | Handling |
|---|---|
| No metric defined | Refuse to start. Ask user to define a measurable metric. |
| Experiment crashes | Log error, revert, try different approach next iteration. |
| Guard violation | Revert. Log as guard_violation. Metric improvement does not count. |
| Same metric for 3+ iterations | Shift strategy (Level 1 Pivot). |
| Max iterations reached | Produce final_report.md. Normal outcome, not failure. |
| Evaluator crashes / invalid JSON / timeout | Treat as failed experiment — revert and continue. |
| No search space left | Try combinations of kept changes. If truly exhausted, produce final_report.md. |
```bash
# Foreground
bash scripts/autoresearch-loop.sh ./my-research/

# Background (no tmux needed)
nohup bash scripts/autoresearch-loop.sh ./my-research/ > autoresearch.log 2>&1 &

# tmux
tmux new-session -d -s research 'bash scripts/autoresearch-loop.sh ./my-research/'

# Monitor
bash scripts/check_progress.sh ./my-research/
```