Skill

autoresearch

Autonomous agent-driven optimization loop inspired by Karpathy's autoresearch. Sets up and runs an iterative hill-climbing harness where subagents modify an artifact, evaluate against a single scalar metric, and keep improvements. Use this skill whenever the user wants to "optimize something iteratively", "run an autoresearch loop", "hill-climb on performance", "auto-optimize", "iterate and improve automatically", "run experiments autonomously", "autonomous optimization", or mentions "autoresearch" in any context. Also triggers when the user describes a workflow like "try variations and measure which is best", "keep tweaking until it's faster", "optimize this config", "find the best prompt", "tune hyperparameters", "benchmark variations", or any scenario where they want an agent to autonomously explore a search space against a measurable objective. Works with any domain — code performance, prompt engineering, config tuning, SQL optimization, CSS optimization, model training, build flags, or anything with a measurable outcome.


Install

Run in your terminal

npx claudepluginhub thomasrohde/marketplace --plugin autoresearch

Tool Access

This skill uses the workspace's default tool permissions.

Stats

Parent Repo Stars: 1

Parent Repo Forks: 0

Last Commit: Mar 15, 2026

autoresearch

An autonomous optimization loop where subagents iteratively modify an artifact, evaluate it against a single scalar metric, and keep improvements. Inspired by Karpathy's autoresearch — generalized beyond ML training to any domain with a measurable objective.

The pattern in four invariants

Component	What it is	Why it matters
Editable artifact	The file(s) the agent modifies each iteration	Bounds the search space
Evaluation oracle	A command that produces a single number on stdout	Unambiguous better/worse signal
Metric direction	`minimize` or `maximize`	Tells the agent which way is "better"
Program document	Human-written intent, constraints, and boundaries	Steers the search without micromanaging

The loop is simple: hypothesize → edit → evaluate → keep or revert → repeat.
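As a rough illustration, the loop can be sketched as plain hill climbing over an in-memory value. This is a toy model, not the skill's implementation: `hill_climb`, `propose`, and `evaluate` are hypothetical names, and in the real harness "propose" is a subagent editing files while "evaluate" is the shell command.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def hill_climb(
    initial: T,
    propose: Callable[[T], T],
    evaluate: Callable[[T], float],
    direction: str = "minimize",
    iterations: int = 20,
) -> tuple[T, float]:
    """Keep a candidate only when it strictly improves the metric."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction!r}")
    best, best_score = initial, evaluate(initial)  # baseline, iteration 0
    for _ in range(iterations):
        candidate = propose(best)    # hypothesize + edit
        score = evaluate(candidate)  # evaluate
        if direction == "minimize":
            improved = score < best_score
        else:
            improved = score > best_score
        if improved:                 # keep
            best, best_score = candidate, score
        # otherwise "revert": the candidate is simply discarded
    return best, best_score


# Toy usage: minimize (x - 3)^2 by random perturbation of x.
rng = random.Random(0)
best_x, best_score = hill_climb(
    initial=10.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: (x - 3.0) ** 2,
    direction="minimize",
    iterations=200,
)
```

Because only strict improvements are kept, the best score never gets worse than the baseline; the trade-off is that the search can stall in a local optimum.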

Phase 1: Setup interview

Before scaffolding anything, understand the use case. Gather these answers from the user — extract what you can from conversation context and ask for the rest:

1. What are you optimizing? Identify the artifact file(s). Examples:
   - A Python script's runtime performance
   - A prompt template's accuracy on a test set
   - An nginx config's throughput
   - A SQL query's execution time
   - A CSS file's Lighthouse score
2. How do we measure success? Identify or help create the evaluation command. It must:
   - Be runnable as a single shell command
   - Print exactly one number to stdout (the metric)
   - Exit 0 on success, non-zero on failure
   - Complete in a bounded time (ideally under 5 minutes)
3. Which direction is better? minimize (latency, error rate, file size) or maximize (throughput, accuracy, score).
4. What's off limits? Constraints the agent must respect. Examples:
   - "Don't change the public API"
   - "Keep Python 3.10 compatibility"
   - "Don't add dependencies"
   - "Output format must stay the same"
5. How many iterations? Default: 20. The user can set a limit or say "run until I stop you."
6. Evaluation budget per iteration? If the evaluation command has a natural time bound (like training for N minutes), note it. Otherwise the evaluation command's own runtime is the budget.

If the user provides a use case description up front (e.g., "optimize my sorting algorithm for speed"), extract as much as possible and confirm the gaps.
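The evaluation-command requirements above (one shell command, exactly one number on stdout, exit code 0, bounded runtime) can be checked mechanically before the loop starts. The following is a minimal sketch under assumptions: `parse_metric` and `run_eval` are hypothetical helper names, and the 300-second default simply mirrors the "ideally under 5 minutes" guideline.

```python
import math
import subprocess
from typing import Optional


def parse_metric(stdout: str) -> float:
    """Return the metric if stdout holds exactly one finite number."""
    tokens = stdout.split()
    if len(tokens) != 1:
        raise ValueError(f"expected exactly one value on stdout, got {len(tokens)}")
    value = float(tokens[0])  # raises ValueError on non-numeric text
    if not math.isfinite(value):
        raise ValueError(f"metric is not finite: {tokens[0]}")
    return value


def run_eval(command: str, timeout_s: float = 300.0) -> Optional[float]:
    """Run the eval command; return the metric, or None if the run failed."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None  # unbounded or too-slow evals count as failures
    if proc.returncode != 0:
        return None  # non-zero exit: the artifact or the eval is broken
    try:
        return parse_metric(proc.stdout)
    except ValueError:
        return None  # stdout broke the one-number contract
```

Returning `None` for every failure mode keeps the caller simple: any result other than a float means "do not keep this iteration".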

Phase 2: Scaffold the harness

Create a .autoresearch/ directory in the project root with these files:

`.autoresearch/config.json`

{
  "artifact_paths": ["path/to/file.py"],
  "eval_command": "python evaluate.py",
  "metric_name": "execution_time_ms",
  "metric_direction": "minimize",
  "max_iterations": 20,
  "branch_prefix": "autoresearch",
  "created_at": "2026-03-15T10:00:00Z"
}
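A coordinator might load and sanity-check this file before each run. The sketch below is one possible shape, not part of the skill: `Config`, `parse_config`, and `load_config` are hypothetical names, and the defaults for `branch_prefix` and `created_at` are assumptions (only the default of 20 iterations comes from the setup interview).

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    artifact_paths: list[str]
    eval_command: str
    metric_name: str
    metric_direction: str
    max_iterations: int = 20              # default from the setup interview
    branch_prefix: str = "autoresearch"   # assumed default
    created_at: str = ""                  # assumed optional


def parse_config(raw: dict) -> Config:
    """Validate a decoded config.json; raises on unknown or missing keys."""
    cfg = Config(**raw)  # TypeError if a required key is missing or a key is unknown
    if cfg.metric_direction not in ("minimize", "maximize"):
        raise ValueError(f"metric_direction must be minimize or maximize, got {cfg.metric_direction!r}")
    if not cfg.artifact_paths:
        raise ValueError("artifact_paths must list at least one file")
    if cfg.max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    return cfg


def load_config(path: str = ".autoresearch/config.json") -> Config:
    """Read and validate the config file from disk."""
    return parse_config(json.loads(Path(path).read_text(encoding="utf-8")))
```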

`.autoresearch/program.md`

Write this collaboratively with the user. It should contain:

# Research Program: [Title]

## Objective
[One sentence: what we're optimizing and why]

## Artifact
[Path to the file(s) being modified, and a brief description of what they contain]

## Evaluation
[The eval command and what the metric means]

## Constraints
[What the agent must NOT change or break]

## Strategy hints
[Optional: suggested directions to explore, known dead ends, domain knowledge]

The program.md is the most important file — it's the agent's "research brief." Help the user write one that's specific enough to be useful but open enough to allow creative exploration. Explain why constraints exist so the agent can reason about edge cases.

`.autoresearch/results.tsv`

Initialize with the header row:

iteration	timestamp	commit	metric_value	hypothesis	changes_summary	kept
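One way to write this log is with the standard `csv` module and a tab delimiter. This is a sketch under assumptions: `init_results`, `append_result`, and `one_line` are hypothetical helpers, the UTC ISO timestamp format is a choice made here, and free-text fields are collapsed to a single line so each experiment stays on one row.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

COLUMNS = ["iteration", "timestamp", "commit", "metric_value",
           "hypothesis", "changes_summary", "kept"]


def init_results(path: Path) -> None:
    """Create results.tsv containing only the header row."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerow(COLUMNS)


def one_line(text: str) -> str:
    """Collapse tabs and newlines so free text cannot break the row layout."""
    return " ".join(text.split())


def append_result(path: Path, *, iteration: int, commit: str,
                  metric_value: Optional[float], hypothesis: str,
                  changes_summary: str, kept: bool) -> None:
    """Append one experiment; a failed eval is logged with an empty metric."""
    row = [
        iteration,
        datetime.now(timezone.utc).isoformat(timespec="seconds"),
        commit,
        "" if metric_value is None else metric_value,
        one_line(hypothesis),
        one_line(changes_summary),
        "true" if kept else "false",
    ]
    with path.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerow(row)
```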

Evaluation script

If the user doesn't already have an evaluation command, help create one. Save it as evaluate.py (or evaluate.sh) in the project root. The script must:

Print exactly one number to stdout
Send all diagnostics to stderr
Exit 0 on success
Exit non-zero if the artifact is broken (syntax error, crash, etc.)

Example patterns:

Performance benchmark:

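import statistics
import subprocess
import sys
import time

# Time five runs of the artifact; abort with a non-zero exit if any run fails.
times = []
for _ in range(5):
    start = time.perf_counter()
    # sys.executable runs the artifact with the same interpreter as this script.
    result = subprocess.run([sys.executable, "solution.py"], capture_output=True)
    if result.returncode != 0:
        print("FAILED", file=sys.stderr)  # diagnostics go to stderr, never stdout
        sys.exit(1)
    times.append(time.perf_counter() - start)

# The only line on stdout: the median wall-clock time in milliseconds.
print(f"{statistics.median(times) * 1000:.1f}")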

Accuracy eval:

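import json, sys

with open("test_results.json") as f:
    results = json.load(f)
if not results:  # an empty result set is a failed evaluation, not a score
    print("no test results found", file=sys.stderr)
    sys.exit(1)
correct = sum(1 for r in results if r["predicted"] == r["expected"])
print(f"{correct / len(results) * 100:.2f}")  # accuracy in percent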

Baseline run

Before starting the loop, run the evaluation once on the unmodified artifact to establish a baseline. Record it as iteration 0 in results.tsv.

Git branch

Create a dedicated branch:

git checkout -b autoresearch/<topic>-<date>

Commit the scaffold files as the first commit on this branch.
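The branch name can be derived from the config's `branch_prefix` plus a topic and date. The sketch below is illustrative only: `branch_name` is a hypothetical helper, and both the slug rule and the ISO date format are assumptions, since the skill only specifies `<topic>-<date>`.

```python
import re
from datetime import date


def branch_name(prefix: str, topic: str, on: date) -> str:
    """Build '<prefix>/<topic-slug>-<YYYY-MM-DD>' from a free-form topic."""
    # Lowercase, then replace every run of other characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    return f"{prefix}/{slug}-{on.isoformat()}"
```

For example, `branch_name("autoresearch", "Sort Speed!", date(2026, 3, 15))` yields `autoresearch/sort-speed-2026-03-15`, which can then be passed to `git checkout -b`.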

Phase 3: Run the optimization loop

This is the core of the skill. You act as the coordinator, spawning a subagent for each iteration.

Loop structure

for each iteration (1 to max_iterations):
    1. Spawn a researcher subagent
    2. Wait for it to complete
    3. Collect the metric
    4. If improved: git commit, log as kept
    5. If not improved: discard the uncommitted edits (git checkout -- <artifact_paths>), log as not kept
    6. Report progress to user
    7. Check stopping criteria

Spawning the researcher subagent

For each iteration, spawn a subagent with this prompt template (adapt the specifics to the use case):

You are a researcher running iteration {N} of an optimization loop.

## Your goal
Improve the metric "{metric_name}" ({metric_direction} is better).
Current best: {best_value} (iteration {best_iteration}).

## Research program
{contents of .autoresearch/program.md}

## History of past experiments
{contents of .autoresearch/results.tsv}

## Instructions
1. Read the artifact file(s): {artifact_paths}
2. Analyze the history — what has been tried, what worked, what didn't
3. Form a hypothesis about what change might improve the metric
4. Edit the artifact file(s) to test your hypothesis
5. Run the evaluation: {eval_command}
6. Report your results

## Output format
After running the evaluation, output exactly this JSON to stdout:
{
  "hypothesis": "what you tried and why",
  "changes_summary": "brief description of edits made",
  "metric_value": <the number from evaluation>,
  "eval_exit_code": <0 or non-zero>,
  "notes": "any observations for future iterations"
}

If evaluation fails (non-zero exit), still report — set metric_value to null.

## Rules
- Make ONE focused change per iteration (easier to attribute improvements)
- Read the full experiment history before deciding what to try
- Don't repeat experiments that already failed
- Stay within the constraints defined in the research program
- If you're stuck after seeing many failures, try a fundamentally different approach

Use mode: "bypassPermissions" for the subagent so it can edit files and run commands without prompting.
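Subagents often wrap the requested JSON in extra prose, so the coordinator may need to dig the report out of the surrounding text. The following is a minimal sketch, assuming the report is the first JSON object in the output that contains all five keys from the output format above; `extract_report` and `REQUIRED_KEYS` are hypothetical names.

```python
import json
from typing import Any

REQUIRED_KEYS = {"hypothesis", "changes_summary", "metric_value",
                 "eval_exit_code", "notes"}


def extract_report(output: str) -> dict[str, Any]:
    """Return the first JSON object in the text that has every required key."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(output):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(output, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            return obj
    raise ValueError("no report object found in subagent output")
```

A JSON `null` in `metric_value` decodes to Python's `None`, which matches the failed-evaluation convention in the prompt template.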

Handling results

After each subagent completes:

Parse the result — extract metric_value from the subagent's output
Compare to best — check if this iteration improved the metric
If improved:
- git add the changed artifact files
- git commit -m "autoresearch: iteration {N} — {hypothesis} ({metric}: {value})"
- Update best_value and best_iteration
- Append to results.tsv with kept=true
If not improved or evaluation failed:
- git checkout -- {artifact_paths} to revert changes
- Append to results.tsv with kept=false
Report to user — brief status line: Iteration {N}: {metric}={value} (best={best}) — {kept/reverted} — "{hypothesis}"
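The keep-or-revert decision can be expressed as two small pure functions. This is a sketch, not the skill's code: `is_improvement` and `git_steps` are hypothetical names, ties are treated as "not improved" (an assumption), and the commit message follows the template above with a plain hyphen as the separator. The git commands are only built here, not executed.

```python
from typing import Optional


def is_improvement(value: Optional[float], best: float, direction: str) -> bool:
    """True only for a strict improvement; a failed eval (None) never counts."""
    if value is None:
        return False
    if direction == "minimize":
        return value < best
    if direction == "maximize":
        return value > best
    raise ValueError(f"unknown metric_direction: {direction!r}")


def git_steps(kept: bool, n: int, hypothesis: str, metric: str,
              value: Optional[float], artifact_paths: list[str]) -> list[list[str]]:
    """Return the git argument lists for one iteration, in execution order."""
    if kept:
        message = f"autoresearch: iteration {n} - {hypothesis} ({metric}: {value})"
        return [["git", "add", *artifact_paths],
                ["git", "commit", "-m", message]]
    # Not improved or eval failed: restore the tracked artifact files.
    return [["git", "checkout", "--", *artifact_paths]]
```

Each returned list could be handed to `subprocess.run(args, check=True)`; passing arguments as a list avoids shell-quoting problems when the hypothesis text contains quotes.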

Stopping criteria

Stop the loop when any of these are true:

Reached max_iterations
The user interrupts (sends a message)
5 consecutive iterations without improvement (plateau detection)
The evaluation command fails 3 times in a row (something is broken)

When stopping, report a summary.
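The automatic stopping criteria can be checked after every iteration from the history alone. The sketch below assumes a history that excludes the baseline (iteration 0) and does not model user interrupts; `Outcome`, `trailing_run`, and `stop_reason` are hypothetical names, and checking failures before the plateau is a choice made here so a broken eval is reported as such.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Outcome:
    kept: bool         # True if the iteration improved the metric
    eval_failed: bool  # True if the eval command exited non-zero


def trailing_run(outcomes: Sequence[Outcome],
                 predicate: Callable[[Outcome], bool]) -> int:
    """Length of the unbroken run at the end of the history matching predicate."""
    count = 0
    for outcome in reversed(outcomes):
        if not predicate(outcome):
            break
        count += 1
    return count


def stop_reason(outcomes: Sequence[Outcome], max_iterations: int,
                plateau: int = 5, max_failures: int = 3) -> Optional[str]:
    """Return why the loop should stop, or None to continue."""
    if len(outcomes) >= max_iterations:
        return "max_iterations"
    if trailing_run(outcomes, lambda o: o.eval_failed) >= max_failures:
        return "eval_failing"
    if trailing_run(outcomes, lambda o: not o.kept) >= plateau:
        return "plateau"
    return None
```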

Phase 4: Summary report

When the loop ends (or is interrupted), produce a summary:

## Autoresearch Summary: [topic]

**Iterations:** {total} ({kept} improvements, {reverted} reverted)
**Baseline:** {metric_name} = {baseline_value}
**Final best:** {metric_name} = {best_value} (iteration {best_iteration})
**Improvement:** {percentage}% {better/worse}

### Top improvements
| Iter | Metric | Hypothesis | Kept |
|------|--------|-----------|------|
| ...  | ...    | ...       | ...  |

### Key observations
- [What worked]
- [What didn't work]
- [Suggested next directions]

Also point the user to:

.autoresearch/results.tsv — full experiment log
The git log on the autoresearch branch — each kept iteration is a commit
The final state of the artifact — the current best version
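The Improvement line in the summary depends on the metric direction: a drop is good when minimizing and bad when maximizing. One possible convention, assumed here rather than taken from the skill, is to report a positive percentage whenever the best value beats the baseline; `improvement_percent` is a hypothetical helper.

```python
def improvement_percent(baseline: float, best: float, direction: str) -> float:
    """Relative change versus baseline in percent; positive means better."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    if direction == "minimize":
        change = baseline - best
    elif direction == "maximize":
        change = best - baseline
    else:
        raise ValueError(f"unknown metric_direction: {direction!r}")
    return change / abs(baseline) * 100.0
```

For instance, going from 200.0 to 150.0 on a minimized metric gives 25.0, and going from 80.0 to 90.0 on a maximized metric gives 12.5.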

Tips for effective use

Writing good evaluation commands

Deterministic is ideal. If your metric has variance (e.g., network latency), run multiple samples and report the median.
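Fast feedback wins. A 10-second eval allows up to 6x as many iterations per hour as a 60-second eval; the real gain is smaller because each iteration also includes the time the agent spends reading history and editing. If your eval is slow, consider a proxy metric.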
Fail loudly. If the artifact is broken (syntax error, crash), exit non-zero so the iteration is marked as failed and reverted.
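The effect of eval speed on iteration count is simple arithmetic. In this sketch, `iterations_per_hour` is a hypothetical helper and `overhead_seconds` (the time the agent spends outside the eval) is an assumption; the 30-second figure in the usage note is illustrative, not measured.

```python
def iterations_per_hour(eval_seconds: float, overhead_seconds: float = 0.0) -> float:
    """Iterations that fit in one hour given eval time plus per-iteration overhead."""
    per_iteration = eval_seconds + overhead_seconds
    if per_iteration <= 0:
        raise ValueError("time per iteration must be positive")
    return 3600.0 / per_iteration
```

With no overhead, a 10-second eval fits 360 iterations per hour against 60 for a 60-second eval, a 6x gap. With 30 seconds of agent overhead per iteration the figures become 90 and 40, so the gap shrinks to 2.25x.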

Writing good program documents

Explain the why. "Don't use recursion" is less useful than "Don't use recursion because the input can be 10M items deep and Python's default recursion limit is 1000."
Seed with domain knowledge. If you know that "batch processing usually helps" or "the bottleneck is probably I/O", say so — it saves the agent from rediscovering basics.
Define the boundaries, not the path. Tell the agent what's off-limits, not what to do. Let it explore creatively within the constraints.

When autoresearch works best

Clear, fast, scalar metric
Single file or small set of files to modify
Large search space with many possible improvements
Domain where LLMs can reason about the artifact (code, configs, prompts, text)

When to use something else

Multi-objective optimization with no clear weighting
Artifacts that require human judgment (visual design, UX, writing tone)
Evaluation that takes hours per run
Changes that require coordinated edits across many files