Set up autonomous experiment loops on any codebase. The pattern: **one file, one metric, one loop.** An agent edits the constrained file, runs the eval, keeps improvements, reverts failures, repeats.
Based on Karpathy's autoresearch pattern, generalized beyond ML training to any code with a measurable outcome.
The user invokes this skill with the path to the constrained file: `/autoresearch path/to/file.py`
If the user invokes /autoresearch without a file path, ask them which file they want the agent to optimize. Don't proceed until you have it — the whole pattern depends on this choice.
Once you have the file path:

1. **Read the constrained file immediately** to understand what the agent will be editing.
2. **Identify the metric.** Ask the user what single number the loop should optimize.
3. **Check eval speed.** The loop needs fast iterations.
4. **Identify guard metrics** (optional but recommended). A guard metric is a secondary metric that must NOT regress while the primary metric improves. For example: optimize precision@12 but require MRR to stay at or above 0.90.
5. **Generate four files**, customized to the user's codebase. Use the templates in templates/ as starting points, but adapt heavily — every autoresearch setup is different.
**instructions.md.** Read templates/instructions_template.md for the full structure. Key sections to customize include the exact eval command (e.g., `uv run manage.py run_autoresearch_eval`), plus several details that are often missed.
**Eval script.** Read templates/eval_template.py for the structure. Key decisions:
**Metric choice:** a single number the agent can optimize, possibly a weighted composite (e.g., `0.8 * precision + 0.2 * mrr`).

**Caching strategy:** use mocking (e.g., `unittest.mock.patch`) to intercept API calls and return cached values. Cache files belong in .gitignore, and the instructions should warn the agent not to delete them. For Django projects: write the eval as a management command, not a standalone script.
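A sketch of that interception, assuming a hypothetical `myapp.search.embed` as the expensive API call; adapt the patch target and cache key to the real call site:

```python
# Sketch of the caching pattern. `myapp.search.embed` is a hypothetical
# stand-in for whatever API call the eval needs to intercept.
import hashlib
import json
import pathlib
from unittest.mock import patch

import myapp.search  # hypothetical module under test

CACHE_DIR = pathlib.Path(".autoresearch_cache")  # keep this in .gitignore
CACHE_DIR.mkdir(exist_ok=True)

_real_embed = myapp.search.embed  # handle to the unpatched function

def cached_embed(text: str):
    """Serve a cached API result if present; otherwise call through once and save it."""
    key = hashlib.sha256(text.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = _real_embed(text)
    path.write_text(json.dumps(result))
    return result

def run_eval_with_cache():
    # Repeated iterations hit the file cache, so they are fast and deterministic.
    with patch("myapp.search.embed", side_effect=cached_embed):
        ...  # run the test queries here
```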
**Output format:** must print `SCORE: X.XXXX` on its own line — this is what the agent parses.
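A minimal sketch of a script honoring that contract; `run_system` is a hypothetical stand-in for the code under test, and the test-file shape is illustrative:

```python
# Sketch of the SCORE contract. `run_system` and the test-file shape are illustrative.
import json

def run_system(query: str) -> list[str]:
    """Placeholder: call the real entry point of the system under test."""
    return []

def precision_at_k(relevant: set[str], ranked: list[str], k: int = 12) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)

def main() -> None:
    with open("test_queries.json") as f:
        cases = json.load(f)  # e.g., [{"query": "...", "relevant_ids": ["..."]}]
    scores = [
        precision_at_k(set(case["relevant_ids"]), run_system(case["query"]))
        for case in cases
    ]
    # The agent parses exactly this line; nothing else should print "SCORE:".
    print(f"SCORE: {sum(scores) / len(scores):.4f}")

if __name__ == "__main__":
    main()
```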
**Test data** (`test_queries.json` or equivalent). The hardest part of the setup.
Minimum 15 test cases. 20-30 is better. Cover the variety of inputs the system handles.
**Launch prompt.** Short — the instructions.md does the heavy lifting. Include an instruction to log every iteration to autoresearch.jsonl (structured state) and to update autoresearch_dashboard.md (human-readable).

Before handing off to Claude Code:
- Run the eval 3 times with no code changes. Confirm it produces a `SCORE` each time.
- Verify the constrained file path and revert command work.
- Check cache behavior — run the eval twice, confirm the second run is faster.
- Review test data with the user — bad labels will send the agent in wrong directions.
- Estimate iteration time — multiply by 30 to set expectations for total runtime.
- Initialize autoresearch.jsonl with the config header line (see the State tracking section below).
Stop and ask the user to review the generated files. Present a summary of what was generated (instructions.md, eval script, test data, launch prompt, and the JSONL config) and ask the user to review them before proceeding. Do not kick off the autonomous loop or hand off the launch prompt until the user confirms they're happy with the setup. This is the last chance to catch bad metric definitions, missing frozen files, wrong strategy guidance, or test data issues before the agent burns 30 iterations on a flawed harness.
After Round 1 completes, read the log. The "what didn't work" and "recommended next steps" sections tell you what Round 2 should target.
For each new round:
- A new instructions file (instructions_2.md)
- A new eval script (run_autoresearch_eval_2.py)
- A fresh JSONL (autoresearch_2.jsonl) and dashboard (autoresearch_dashboard_2.md)

**Caching masks changes.** If the system caches results (Redis, file cache, Django cache), the eval must clear or bypass the cache. Otherwise prompt/logic changes have zero effect.
**Non-deterministic evals.** LLM calls with temperature >0, network timing, random seeds — all add noise. Set temperature=0 for any LLM in the eval path if possible. Run the baseline 3x to measure variance. If variance > expected improvement, the experiment won't produce signal.

**Co-optimization trap.** If Round 1 tunes component A to work with component B's current output, Round 2 can't improve B without re-tuning A. Either co-optimize both simultaneously or accept that sequential optimization has a lower ceiling.

**Metric gaming.** The agent will optimize exactly what you measure. If your metric doesn't capture what you care about, the agent will find exploits. Composite metrics with guardrails (e.g., "optimize precision but warn if MRR drops below 0.9") help.

**Diminishing returns.** Most gains come in the first 10-15 iterations. If the score plateaus for 5+ consecutive iterations, the ceiling is likely architectural, not parametric. The agent's final report should say this.
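A sketch of how the loop might detect that plateau, using the per-iteration result records defined in the State tracking section below:

```python
# Sketch: stop early if the last `window` iterations produced no kept improvement.
def plateaued(results: list[dict], window: int = 5) -> bool:
    recent = results[-window:]
    return len(recent) == window and all(r["status"] != "keep" for r in recent)
```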
The experiment loop must track state in autoresearch.jsonl — one JSON object per line. This format is machine-parseable, survives context window compaction (the agent can re-read it to recover state), and makes the experiment auditable.
Line 0 is the config header:
{"type": "config", "constrained_file": "path/to/file.py", "eval_command": "...", "metric": "precision@12", "guard_metric": "mrr", "guard_threshold": 0.90, "baseline": 0.6930}
Subsequent lines are iteration results:
{"type": "result", "iteration": 1, "commit": "abc1234", "score": 0.7050, "delta": "+0.0120", "guard_score": 0.95, "guard_pass": true, "status": "keep", "description": "Increased location base weight from 2x to 5x", "timestamp": "2025-03-15T02:14:33Z"}
{"type": "result", "iteration": 2, "commit": "def5678", "score": 0.6850, "delta": "-0.0200", "guard_score": null, "guard_pass": null, "status": "discard", "description": "Added title matching as a ranking signal", "timestamp": "2025-03-15T02:16:01Z"}
Status values: `baseline`, `keep`, `discard`, `crash`, `guard_fail`.

`guard_fail` means the primary score improved but the guard metric crossed its threshold — the change is discarded despite the score gain.
After every iteration, regenerate autoresearch_dashboard.md with:
```markdown
# Autoresearch Dashboard

**Constrained file:** `path/to/file.py`
**Baseline:** 0.6930 | **Current best:** 0.7200 | **Iterations:** 14/30
**Guard:** MRR >= 0.90 (current: 0.95)

| # | Score | Delta | Guard | Status | Description |
|---|-------|-------|-------|--------|-------------|
| 1 | 0.7050 | +0.012 | 0.95 PASS | keep | Increased location weight |
| 2 | 0.6850 | -0.020 | — | discard | Title matching signal |
...

**Kept:** 3 | **Discarded:** 10 | **Crashed:** 1 | **Guard failures:** 0
```
This dashboard is what the user checks to monitor progress. The agent should also read it (along with the JSONL) when resuming after context compaction to understand what's been tried.
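A sketch of the regeneration step, abridged to the header and table; the 30-iteration cap and column layout follow the example above:

```python
# Sketch: rebuild the dashboard markdown from the JSONL config and results.
def render_dashboard(config: dict, results: list[dict], max_iters: int = 30) -> str:
    kept = [r["score"] for r in results if r["status"] == "keep"]
    best = max(kept, default=config["baseline"])
    lines = [
        "# Autoresearch Dashboard",
        f"**Constrained file:** `{config['constrained_file']}`",
        f"**Baseline:** {config['baseline']:.4f} | **Current best:** {best:.4f} "
        f"| **Iterations:** {len(results)}/{max_iters}",
        "",
        "| # | Score | Delta | Guard | Status | Description |",
        "|---|-------|-------|-------|--------|-------------|",
    ]
    for r in results:
        guard = f"{r['guard_score']:.2f}" if r["guard_score"] is not None else "—"
        lines.append(f"| {r['iteration']} | {r['score']:.4f} | {r['delta']} "
                     f"| {guard} | {r['status']} | {r['description']} |")
    return "\n".join(lines)
```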
Long experiment loops may hit context window limits. The JSONL file is the agent's persistent memory — the launch prompt should instruct the agent:
> If you lose context or are resuming, read `autoresearch.jsonl` and `autoresearch_dashboard.md` to recover your state. The JSONL has every iteration's result and the config header has all experiment parameters. Continue from the last iteration number.
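A sketch of that recovery read, assuming the JSONL format above:

```python
# Sketch: recover experiment state from autoresearch.jsonl after context loss.
import json

def load_state(path: str = "autoresearch.jsonl"):
    config, results = None, []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["type"] == "config":
                config = record
            else:
                results.append(record)
    return config, results

config, results = load_state()
kept = [r["score"] for r in results if r["status"] == "keep"]
best = max(kept, default=config["baseline"])
next_iteration = results[-1]["iteration"] + 1 if results else 1
print(f"best={best:.4f}, resume at iteration {next_iteration}")
```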
Noisy metrics produce false positives — the agent keeps "improvements" that are just variance. Address this during setup:
Before starting the loop, run the eval 3 times with no code changes. If the scores vary by more than 0.01 (or whatever the expected per-iteration improvement is), the eval is too noisy for autoresearch. Fix the noise source first.
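The check can be scripted; a sketch, reusing the earlier Django-style eval command as a placeholder:

```python
# Sketch of the 3-run stability check. EVAL_CMD is a placeholder; use the
# project's real eval command.
import re
import subprocess

EVAL_CMD = ["uv", "run", "manage.py", "run_autoresearch_eval"]
NOISE_BUDGET = 0.01  # or the expected per-iteration improvement

scores = []
for _ in range(3):
    out = subprocess.run(EVAL_CMD, capture_output=True, text=True, check=True).stdout
    scores.append(float(re.search(r"^SCORE: ([0-9.]+)$", out, re.MULTILINE).group(1)))

spread = max(scores) - min(scores)
print(f"scores={scores}, spread={spread:.4f}")
if spread > NOISE_BUDGET:
    print("Too noisy for autoresearch; fix the noise source before starting the loop.")
```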
Only count a change as an improvement if the score increases by more than a minimum delta. Set this based on the baseline stability check, and record it in instructions.md and the JSONL config header.

If the eval has irreducible noise (e.g., it must call a live API), instruct the agent to run a confirmation eval on every "improvement": re-run the eval and keep the change only if the gain reproduces. This doubles iteration time but prevents noise-driven drift.
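Putting the pieces together, a sketch of the keep/discard decision with the guard threshold, minimum delta, and optional confirmation pass; `rerun_eval` is a hypothetical hook and the default values are illustrative:

```python
# Sketch: keep/discard decision combining the guard metric, the minimum delta,
# and an optional confirmation run. `rerun_eval` is a hypothetical hook that
# re-runs the eval and returns a fresh score.
def decide(score: float, best: float, guard_score, *,
           guard_threshold: float = 0.90, min_delta: float = 0.005,
           confirm: bool = False) -> str:
    if guard_score is not None and guard_score < guard_threshold:
        return "guard_fail"   # discarded even if the primary score improved
    if score <= best + min_delta:
        return "discard"      # within noise: not a real improvement
    if confirm and rerun_eval() <= best + min_delta:
        return "discard"      # the gain did not reproduce
    return "keep"
```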
The templates/ directory contains starter templates. Do not use these as-is — they must be adapted to the specific codebase. They exist to show the structure and remind you of sections to include.
- `templates/instructions_template.md` — Full instructions.md structure with TODOs
- `templates/eval_template.py` — Eval script with multiple metric functions and caching pattern
- `templates/launch_prompt.md` — Claude Code launch prompt template
- `references/lessons.md` — Detailed findings from two rounds of autoresearch on a production search system. Read this before designing a new experiment — it covers caching pitfalls, co-optimization traps, eval design, and when to stop. Especially useful sections: "What didn't work" (saves iteration budget), "General Lessons: On eval design" (caching, metrics, test data), and "On multi-round experiments" (how to plan Round 2).