Autonomous agent-driven optimization loop inspired by Karpathy's autoresearch. Sets up and runs an iterative hill-climbing harness where subagents modify an artifact, evaluate against a single scalar metric, and keep improvements. Use this skill whenever the user wants to "optimize something iteratively", "run an autoresearch loop", "hill-climb on performance", "auto-optimize", "iterate and improve automatically", "run experiments autonomously", "autonomous optimization", or mentions "autoresearch" in any context. Also triggers when the user describes a workflow like "try variations and measure which is best", "keep tweaking until it's faster", "optimize this config", "find the best prompt", "tune hyperparameters", "benchmark variations", or any scenario where they want an agent to autonomously explore a search space against a measurable objective. Works with any domain — code performance, prompt engineering, config tuning, SQL optimization, CSS optimization, model training, build flags, or anything with a measurable outcome.
From autoresearchnpx claudepluginhub thomasrohde/marketplace --plugin autoresearchThis skill uses the workspace's default tool permissions.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Enforces A/B test setup with gates for hypothesis locking, metrics definition, sample size calculation, assumptions checks, and execution readiness before implementation.
Designs, audits, and improves analytics tracking systems using Signal Quality Index for reliable, decision-ready data in marketing, product, and growth.
An autonomous optimization loop where subagents iteratively modify an artifact, evaluate it against a single scalar metric, and keep improvements. Inspired by Karpathy's autoresearch — generalized beyond ML training to any domain with a measurable objective.
| Component | What it is | Why it matters |
|---|---|---|
| Editable artifact | The file(s) the agent modifies each iteration | Bounds the search space |
| Evaluation oracle | A command that produces a single number on stdout | Unambiguous better/worse signal |
| Metric direction | minimize or maximize | Tells the agent which way is "better" |
| Program document | Human-written intent, constraints, and boundaries | Steers the search without micromanaging |
The loop is simple: hypothesize → edit → evaluate → keep or revert → repeat.
Before scaffolding anything, understand the use case. Gather these answers from the user — extract what you can from conversation context and ask for the rest:
What are you optimizing? Identify the artifact file(s). Examples:
How do we measure success? Identify or help create the evaluation command. It must:
Which direction is better? minimize (latency, error rate, file size) or
maximize (throughput, accuracy, score).
What's off limits? Constraints the agent must respect. Examples:
How many iterations? Default: 20. The user can set a limit or say "run until I stop you."
Evaluation budget per iteration? If the evaluation command has a natural time bound (like training for N minutes), note it. Otherwise the evaluation command's own runtime is the budget.
If the user provides a use case description up front (e.g., "optimize my sorting algorithm for speed"), extract as much as possible and confirm the gaps.
Create a .autoresearch/ directory in the project root with these files:
.autoresearch/config.json{
"artifact_paths": ["path/to/file.py"],
"eval_command": "python evaluate.py",
"metric_name": "execution_time_ms",
"metric_direction": "minimize",
"max_iterations": 20,
"branch_prefix": "autoresearch",
"created_at": "2026-03-15T10:00:00Z"
}
.autoresearch/program.mdWrite this collaboratively with the user. It should contain:
# Research Program: [Title]
## Objective
[One sentence: what we're optimizing and why]
## Artifact
[Path to the file(s) being modified, and a brief description of what they contain]
## Evaluation
[The eval command and what the metric means]
## Constraints
[What the agent must NOT change or break]
## Strategy hints
[Optional: suggested directions to explore, known dead ends, domain knowledge]
The program.md is the most important file — it's the agent's "research brief." Help the user write one that's specific enough to be useful but open enough to allow creative exploration. Explain why constraints exist so the agent can reason about edge cases.
.autoresearch/results.tsvInitialize with the header row:
iteration timestamp commit metric_value hypothesis changes_summary kept
If the user doesn't already have an evaluation command, help create one. Save it as
evaluate.py (or evaluate.sh) in the project root. The script must:
Example patterns:
Performance benchmark:
import subprocess, time, statistics, sys
times = []
for _ in range(5):
start = time.perf_counter()
result = subprocess.run(["python", "solution.py"], capture_output=True)
if result.returncode != 0:
print("FAILED", file=sys.stderr)
sys.exit(1)
times.append(time.perf_counter() - start)
print(f"{statistics.median(times) * 1000:.1f}") # median ms
Accuracy eval:
import json, sys
results = json.load(open("test_results.json"))
correct = sum(1 for r in results if r["predicted"] == r["expected"])
print(f"{correct / len(results) * 100:.2f}") # accuracy %
Before starting the loop, run the evaluation once on the unmodified artifact to establish a baseline. Record it as iteration 0 in results.tsv.
Create a dedicated branch:
git checkout -b autoresearch/<topic>-<date>
Commit the scaffold files as the first commit on this branch.
This is the core of the skill. You act as the coordinator, spawning a subagent for each iteration.
for each iteration (1 to max_iterations):
1. Spawn a researcher subagent
2. Wait for it to complete
3. Collect the metric
4. If improved: git commit, log as kept
5. If not improved: git revert changes, log as not kept
6. Report progress to user
7. Check stopping criteria
For each iteration, spawn a subagent with this prompt template (adapt the specifics to the use case):
You are a researcher running iteration {N} of an optimization loop.
## Your goal
Improve the metric "{metric_name}" ({metric_direction} is better).
Current best: {best_value} (iteration {best_iteration}).
## Research program
{contents of .autoresearch/program.md}
## History of past experiments
{contents of .autoresearch/results.tsv}
## Instructions
1. Read the artifact file(s): {artifact_paths}
2. Analyze the history — what has been tried, what worked, what didn't
3. Form a hypothesis about what change might improve the metric
4. Edit the artifact file(s) to test your hypothesis
5. Run the evaluation: {eval_command}
6. Report your results
## Output format
After running the evaluation, output exactly this JSON to stdout:
{
"hypothesis": "what you tried and why",
"changes_summary": "brief description of edits made",
"metric_value": <the number from evaluation>,
"eval_exit_code": <0 or non-zero>,
"notes": "any observations for future iterations"
}
If evaluation fails (non-zero exit), still report — set metric_value to null.
## Rules
- Make ONE focused change per iteration (easier to attribute improvements)
- Read the full experiment history before deciding what to try
- Don't repeat experiments that already failed
- Stay within the constraints defined in the research program
- If you're stuck after seeing many failures, try a fundamentally different approach
Use mode: "bypassPermissions" for the subagent so it can edit files and run
commands without prompting.
After each subagent completes:
git add the changed artifact filesgit commit -m "autoresearch: iteration {N} — {hypothesis} ({metric}: {value})"kept=truegit checkout -- {artifact_paths} to revert changeskept=falseIteration {N}: {metric}={value} (best={best}) — {kept/reverted} — "{hypothesis}"Stop the loop when any of these are true:
When stopping, report a summary.
When the loop ends (or is interrupted), produce a summary:
## Autoresearch Summary: [topic]
**Iterations:** {total} ({kept} improvements, {reverted} reverted)
**Baseline:** {metric_name} = {baseline_value}
**Final best:** {metric_name} = {best_value} (iteration {best_iteration})
**Improvement:** {percentage}% {better/worse}
### Top improvements
| Iter | Metric | Hypothesis | Kept |
|------|--------|-----------|------|
| ... | ... | ... | ... |
### Key observations
- [What worked]
- [What didn't work]
- [Suggested next directions]
Also point the user to:
.autoresearch/results.tsv — full experiment log