7-step setup wizard that produces a complete, ready-to-run research.md without executing the research loop. Walks the user through goal, metric, search space, constraints, evaluator design, and baseline measurement, then writes the file. TRIGGER when: user wants to set up a research project; user wants to plan before running the loop; user says "plan my research"; user has a goal but no research.md; user invokes /autoresearch:plan. DO NOT TRIGGER when: research.md already exists and the user wants to run the loop; user wants a one-shot answer; user wants to debug, not optimize.
Core autonomous research loop. Reads research.md, proposes hypotheses, runs experiments, evaluates results mechanically, keeps improvements, discards failures, and iterates until the target metric is achieved or the iteration budget is exhausted. TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists; user wants the 5-stage loop; user wants iterative optimization overnight.
Executes a non-interactive, end-to-end research pipeline: auto-configures program.md with defaults, runs the judge+refine loop for up to 3 iterations, then runs the campaign from a single goal command.
Generates program.md for autonomous AI research experiments (Karpathy's autoresearch). Interviews user on codebase, metrics, constraints; explores code; tailors agent instructions from template.
A 7-step interview that produces a complete `research.md` (and optionally `evaluate.py`) before a single experiment runs. The wizard is conversational — ask each step, wait for the answer, then proceed. Do not batch all questions at once.
One step at a time. Present the step title and question(s). Wait for the user's response. Summarize what you recorded ("Got it — I'll set metric: accuracy, direction: maximize"). Then proceed to the next step.
Do not skip steps. Each step produces a concrete artifact that feeds Step 7. If the user's answer is vague, probe once for specificity, then record your best interpretation and note it as an assumption.
Probe for specificity. Vague goals ("make it faster", "improve quality") produce useless research loops.
Ask:
Probe rules:
Record: goal_statement (1-2 sentences, specific and measurable)
What to measure, how to measure it, and what direction counts as progress.
Ask:
Guide the user if stuck:
Record: metric_name, direction (maximize/minimize), target_value, noise_runs (1 if deterministic, 3-5 if noisy)
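If the user is unsure how these fields interact, a small illustration can help. The following is a sketch for explanation only, not part of the skill's scripts: noise_runs measurements are collapsed to a median, and direction decides which side of target_value counts as passing.

```python
# Illustrative sketch only: how noise_runs, direction, and target_value combine.
# Field names mirror the Record line above; none of this ships with the skill.
import statistics

def measure(run_once, noise_runs=3):
    """Run the measurement noise_runs times and take the median to damp noise."""
    return statistics.median(run_once() for _ in range(noise_runs))

def passed(score, target_value, direction="maximize"):
    """Whether a score meets the target, given the chosen direction."""
    if direction == "maximize":
        return score >= target_value
    return score <= target_value
```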
Enumerate what can change and, critically, what must not.
Ask:
Probe rules:
Record: allowed_changes (bullet list), forbidden_changes (bullet list), guard (optional hard constraint)
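If the user names a hard guard such as "all existing tests must still pass", it can be checked mechanically before any experiment is kept. A hedged sketch, assuming a pytest-based suite; substitute the project's real test command:

```python
# Hypothetical guard check: an experiment that breaks the guard is rejected
# outright, regardless of its metric score. Assumes pytest; adapt to the project.
import subprocess

def guard_holds():
    result = subprocess.run(["python", "-m", "pytest", "-q"], capture_output=True)
    return result.returncode == 0
```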
Scope the loop before it starts.
Ask:
If the user wants overnight/unattended:
Set `pause_every: never`, launch the loop in the background, and check on it later:

```bash
nohup bash scripts/autoresearch-loop.sh ./research-dir/ > autoresearch.log 2>&1 &
bash scripts/check_progress.sh
```

If the user wants periodic reviews, set `pause_every: N`.

Record: max_iterations, pause_every, time_budget_per_experiment (default: 5 minutes), any resource constraints
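For orientation, the interplay of max_iterations and pause_every can be sketched as a simple loop. This is a hypothetical illustration only, not the behavior of the shipped scripts/autoresearch-loop.sh:

```python
# Hypothetical sketch of how max_iterations and pause_every interact;
# the shipped scripts/autoresearch-loop.sh may behave differently.
def loop(run_one_experiment, max_iterations=20, pause_every=5):
    for i in range(1, max_iterations + 1):
        run_one_experiment(i)                 # propose, change, evaluate, keep or revert
        if pause_every and i % pause_every == 0:
            return f"paused after iteration {i} for review"
    return "iteration budget exhausted"
```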
Can measurement be automated? This determines loop speed and quality.
Ask: "Can the success metric be measured by running a script? For example, python evaluate.py that outputs a number."
Guide the user to produce a script that prints:
{"pass": true, "score": 0.94}
Ask clarifying questions:
Offer to write an evaluate.py starter template based on their answers. Use the appropriate example from below as a starting point, adapt it to their domain, and write it to the research directory.
Template — timing/benchmark (minimize):
```python
#!/usr/bin/env python3
# Time TARGET_SCRIPT.py three times and report the median, so one slow run doesn't skew the result.
import json, subprocess, statistics, time

times = []
for _ in range(3):
    t0 = time.perf_counter()
    subprocess.run(["python", "TARGET_SCRIPT.py"], check=True)
    times.append(time.perf_counter() - t0)
median = statistics.median(times)
print(json.dumps({"pass": median < TARGET_VALUE, "score": median}))
```
Template — accuracy/quality (maximize):
```python
#!/usr/bin/env python3
# Run the test suite and parse the last "score: <float>" it prints to stdout.
import json, subprocess

result = subprocess.run(["python", "test_suite.py"], capture_output=True, text=True)
score = float(result.stdout.strip().split("score:")[-1].strip())
print(json.dumps({"pass": score > TARGET_VALUE, "score": score}))
```
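If the metric is maximized but noisy (noise_runs > 1 from Step 2), the two templates can be combined. A sketch that assumes test_suite.py prints a "score:" line as in the accuracy template, with TARGET_VALUE again a placeholder:

```python
#!/usr/bin/env python3
# Sketch: noisy maximize metric. Runs the suite several times and scores the median.
# Assumes test_suite.py prints "score: <float>", like the accuracy template above.
import json, subprocess, statistics

NOISE_RUNS = 3
scores = []
for _ in range(NOISE_RUNS):
    result = subprocess.run(["python", "test_suite.py"], capture_output=True, text=True)
    scores.append(float(result.stdout.strip().split("score:")[-1].strip()))
median = statistics.median(scores)
print(json.dumps({"pass": median > TARGET_VALUE, "score": median}))
```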
Keep policy: Ask: "Keep only if strictly better than best so far (score_improvement), or keep anything that passes the threshold (pass_only)?"
If no automated evaluator is possible, note in research.md: Evaluator: _(none — agent judges manually)_
Explain: "The agent will evaluate each experiment using its own judgment. This is slower and less reliable than a script — consider adding a script later."
Record: evaluator_command (or none), keep_policy
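The two keep policies differ only in what the new score is compared against. A minimal sketch for a maximize metric (flip the comparisons for minimize); min_delta is the Min delta field written to research.md in Step 7:

```python
# Hypothetical keep decision for a maximize metric; not part of the shipped scripts.
def keep(new_score, best_score, target_value, keep_policy="score_improvement", min_delta=0.0):
    if keep_policy == "score_improvement":
        return new_score > best_score + min_delta   # strictly better than the best so far
    return new_score >= target_value                # pass_only: anything meeting the threshold
```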
Run the evaluator NOW to establish iteration 0. This is mandatory before writing research.md.
If an evaluator was designed in Step 5:
```bash
# Run the evaluator dry run
python evaluate.py
```
Expected output: a JSON line like {"pass": true, "score": 0.73}
If the evaluator runs successfully:
If the evaluator fails (non-zero exit, invalid JSON, crash):
Fix evaluate.py and re-run.
If no evaluator (manual evaluation):
Record: baseline_score (iteration 0 value)
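A small sketch of how the dry run could be captured, assuming the evaluator prints a single JSON line as described above; the loop scripts may record this differently:

```python
# Hypothetical baseline capture: run the evaluator once and record iteration 0.
# Assumes it prints one JSON line like {"pass": true, "score": 0.73}.
import json, subprocess

result = subprocess.run(["python", "evaluate.py"], capture_output=True, text=True, check=True)
baseline = json.loads(result.stdout.strip().splitlines()[-1])
baseline_score = baseline["score"]   # recorded as iteration 0 in the History table
```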
Write the fully populated research.md using all recorded values.
If scripts/init_research.py is available:
```bash
python scripts/init_research.py \
  --goal "GOAL_STATEMENT" \
  --metric "METRIC_NAME" \
  --direction "DIRECTION" \
  --target "TARGET_VALUE" \
  --evaluator "EVALUATOR_COMMAND" \
  --output ./research-dir/
```
If the script is not available, write research.md directly using this structure:
```markdown
# Research: GOAL_TITLE

## Goal
GOAL_STATEMENT

## Success Metric
- **Metric:** METRIC_NAME
- **Target:** TARGET_VALUE
- **Direction:** DIRECTION

## Constraints
- **Max iterations:** MAX_ITERATIONS
- **Time budget per experiment:** 5 minutes
- **Pause for review every:** PAUSE_EVERY
- **Evaluator:** EVALUATOR_COMMAND
- **Keep policy:** KEEP_POLICY
- **Guard:** GUARD (if any)
- **Noise runs:** NOISE_RUNS
- **Min delta:** 0

## Current Approach
BASELINE_DESCRIPTION

## Search Space
- **Allowed changes:** ALLOWED_CHANGES
- **Forbidden changes:** FORBIDDEN_CHANGES

## Context & References
REFERENCES (if any)

---

## History
| # | Change | Metric | Result | Timestamp |
|---|--------|--------|--------|-----------|
| 0 | Baseline | BASELINE_SCORE | -- | TODAY |
```
After writing:
Confirm to the user that research.md was written; if evaluate.py was written, confirm that too.
To start the loop, tell your agent:
"Run autoresearch on ./research-dir/research.md"
Or for overnight unattended:
```bash
nohup bash scripts/autoresearch-loop.sh ./research-dir/ > autoresearch.log 2>&1 &
bash scripts/check_progress.sh ./research-dir/
```
Use /autoresearch:ship to publish the results.
Before declaring the wizard complete, verify:
- research.md written with all sections populated (no TBD or TODO placeholders)
- evaluate.py written if automated evaluation was chosen