Runs autonomous optimization loops: edits code, commits changes, runs benchmarks on metrics like test speed or bundle size, keeps improvements or reverts via git, repeats until stopped.
```bash
npx claudepluginhub proyecto26/autoresearch-ai-plugin --plugin autoresearch-ai-plugin
```

This skill uses the workspace's default tool permissions.
Sets up autonomous experiment loops for code optimization targets. Gathers goal/metric/files, creates git branch/benchmark script/logging, runs baseline via subagent. For 'run autoresearch' or iterative experiments.
Orchestrates autonomous experiments to optimize measurable metrics like build time, latency, accuracy, or configs via git branches and .lab/ logging.
Guides interactive setup of optimization goals, metrics, and scope; runs autonomous git-committed experiment loops: code changes, testing, measurement, keep improvements or revert. For performance tuning in git repos.
An autonomous optimization loop where Claude edits code, runs a benchmark, measures a metric, and keeps improvements or reverts — repeating forever until stopped.
The loop is simple: edit → commit → run → measure → keep or discard → repeat.
Discarded changes are rolled back with `git revert`. Session state is persisted in `autoresearch.jsonl` (append-only log) and `autoresearch.md` (living session document).

When the user triggers autoresearch, gather the following (ask if not provided):
- The optimization goal
- The benchmark command (e.g., `pnpm test`, `uv run train.py`)
- The primary metric and whether lower or higher is better
- The files or directories in scope

Optionally check for `.claude/autoresearch-ai-plugin.local.md` in the project root for persistent configuration:
```markdown
---
enabled: true
max_iterations: 50
working_dir: "/path/to/project"
benchmark_timeout: 600
checks_timeout: 300
---
# Autoresearch Configuration
Additional context or notes for this project's autoresearch setup.
```
- `enabled` — whether autoresearch is active (default: true)
- `max_iterations` — stop after N experiments (default: 0 = unlimited)
- `working_dir` — override directory for experiment files (default: current directory)
- `benchmark_timeout` — benchmark timeout in seconds (default: 600)
- `checks_timeout` — correctness checks timeout in seconds (default: 300)

If the file doesn't exist, use defaults. The file should be added to `.gitignore` (`.claude/*.local.md`).
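As a rough sketch (not part of the plugin), the flat `key: value` frontmatter above could be read in a setup step like this:

```bash
# Sketch: read one key from the flat "key: value" frontmatter, falling back to a default.
CONFIG=".claude/autoresearch-ai-plugin.local.md"

get_config() {
  local key="$1" default="$2" value=""
  if [ -f "$CONFIG" ]; then
    value=$(sed -n "s/^${key}:[[:space:]]*//p" "$CONFIG" | head -n1 | tr -d '"')
  fi
  echo "${value:-$default}"
}

MAX_ITERATIONS=$(get_config max_iterations 0)        # 0 = unlimited
BENCHMARK_TIMEOUT=$(get_config benchmark_timeout 600)
CHECKS_TIMEOUT=$(get_config checks_timeout 300)
```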
Then execute these setup steps:
1. Create an experiment branch: `git checkout -b autoresearch/<goal>-<date>`
2. Add session files to `.gitignore` (`git revert` will fail if `autoresearch.jsonl` is tracked):
   ```bash
   echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
   git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
   ```
3. Create `autoresearch.md` — the session document (see examples/autoresearch.md)
4. Create `autoresearch.sh` — the benchmark script (see examples/autoresearch.sh)
5. Create `autoresearch.checks.sh` — correctness checks (tests, lint, types)
6. Run the baseline with `bash autoresearch.sh` (it must output `METRIC name=value` lines)
7. Record the baseline in `autoresearch.jsonl` (with a `"type":"config"` header line first, then the baseline result)

LOOP FOREVER. Never ask "should I continue?" — just keep going.
The user might be asleep, away from the computer, or expects you to work indefinitely. If each experiment takes ~5 minutes, you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period.
Each iteration (a shell sketch of one pass follows the numbered steps):
1. Read current git state and autoresearch.md
2. Choose an experimental change (informed by past results and ASI notes)
3. Edit files in scope
4. git add <files> && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from output
7. If autoresearch.checks.sh exists, run it (separate timeout, default 300s)
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat
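Steps 4–10 might look roughly like the following shell sequence. This is a sketch only: in practice Claude drives these steps directly, and the commit message, metric name, and `BEST` threshold below are placeholders.

```bash
# Rough sketch of one iteration (illustrative only; the real loop is agent-driven).
BEST=5000                                                        # best total_ms so far (hypothetical, lower is better)
git add src/ && git commit -m "experiment: batch the DB writes"  # step 4 (placeholder change)

timeout "${BENCHMARK_TIMEOUT:-600}" bash autoresearch.sh > run.log 2>&1 || status=crash   # step 5
metric=$(grep '^METRIC total_ms=' run.log | cut -d= -f2)                                  # step 6

if [ -f autoresearch.checks.sh ]; then                                                    # step 7
  timeout "${CHECKS_TIMEOUT:-300}" bash autoresearch.checks.sh || status=checks_failed
fi

# Step 8: keep only if nothing failed and the metric beat the best so far.
if [ -z "${status:-}" ] && awk -v m="$metric" -v b="$BEST" 'BEGIN { exit !(m < b) }'; then
  status=keep
else
  status=${status:-discard}
  git revert "$(git rev-parse HEAD)" --no-edit                                             # step 10
fi
```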
Keep or discard:

- The primary metric improved → keep (commit stays, branch advances)
- The primary metric regressed → discard (run `git revert $(git rev-parse HEAD) --no-edit`)
- The benchmark crashed or checks failed → discard (revert, note the failure in ASI)
- The metric held steady but the code got simpler → keep (removing complexity is a win)
- A catastrophic secondary regression → discard even if the primary improved (e.g., 1% speed gain but 10x memory usage)

If you're running low on ideas, check `autoresearch.ideas.md` if it exists. Re-read source files for new angles. Try combining previous near-misses. Try more radical changes. Read any papers or docs referenced in the code.

All else being equal, simpler is better. Weigh complexity cost against improvement magnitude.
If the user sends a message while the loop is running:
If 3 consecutive experiments fail or get discarded:

- Check `autoresearch.ideas.md` for untried ideas

Benchmark scripts output metrics as structured lines:
```
METRIC total_time=4.23
METRIC memory_mb=512
METRIC val_bpb=1.042
```
Parse these with the helper script at ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh:
```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```
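The plugin ships a full template in examples/autoresearch.sh. As a rough illustration, a script that benchmarks a test suite's wall time might look like this (the `pnpm test` command and metric name are placeholders):

```bash
#!/usr/bin/env bash
# Illustrative benchmark script; the workload command and metric name are placeholders.
set -euo pipefail

start=$(date +%s%N)        # GNU date: nanoseconds since the epoch
pnpm test --silent         # the workload being optimized
end=$(date +%s%N)

echo "METRIC total_ms=$(( (end - start) / 1000000 ))"
```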
Beyond the primary metric, output additional METRIC lines for tradeoff monitoring:
```
METRIC total_ms=4230          # primary
METRIC compile_ms=1200        # secondary — helps identify bottlenecks
METRIC memory_mb=512          # secondary — monitors resource usage
METRIC cache_hit_rate=0.85    # secondary — instrumentation data
```
Secondary metrics are tracked in the JSONL log and help guide future experiments, but they rarely affect keep/discard decisions (only discard if a catastrophic secondary regression accompanies a marginal primary improvement).
Output instrumentation data — phase timings, error counts, cache rates, domain-specific signals. This data guides the next iteration and helps identify where optimization effort should focus.
ASI (Actionable Side Information) is a structured annotation recorded for each experiment that survives reverts. When code changes are discarded, only the description and the ASI remain — making them the only structured memory of what happened.
Record ASI for every experiment:
```json
{
  "hypothesis": "Reducing loop iterations by breaking early",
  "result": "Marginal speedup but code readability suffered",
  "next_action_hint": "Try vectorization instead of loop unrolling",
  "bottleneck": "Memory bandwidth on L2 cache misses"
}
```
ASI fields are free-form — use whatever keys are useful:
- `hypothesis` — what you expected
- `result` — what actually happened
- `next_action_hint` — guidance for the next experiment
- `bottleneck` — identified performance bottleneck
- `error_details` — crash/failure diagnostics

The first line of `autoresearch.jsonl` is a config header:

```json
{"type":"config","name":"Optimize unit test runtime","metricName":"total_ms","metricUnit":"ms","bestDirection":"lower"}
```
Each experiment appends one JSON line:
{"run":5,"commit":"abc1234","metric":4230,"metrics":{"compile_ms":1200,"memory_mb":512},"status":"keep","description":"parallelized test suites","timestamp":1700000000,"segment":0,"confidence":2.3,"asi":{"hypothesis":"parallel tests reduce wall time","next_action_hint":"try worker pool size tuning"}}
Fields:
- `run` — experiment number (1-indexed, sequential)
- `commit` — short git commit hash (7 chars)
- `metric` — primary metric value
- `metrics` — secondary metrics dict (optional)
- `status` — one of: keep, discard, crash, checks_failed
- `description` — brief description of what was tried
- `timestamp` — Unix timestamp (seconds)
- `segment` — session segment index (0-based, incremented when optimization target changes)
- `confidence` — MAD-based confidence score (null if < 3 experiments)
- `asi` — Actionable Side Information dict (optional, omit if empty)

Use `${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh` to append entries:
```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 5 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 4230 \
  --status keep \
  --description "parallelized test suites" \
  --metrics '{"compile_ms":1200,"memory_mb":512}' \
  --segment 0 \
  --confidence 2.3 \
  --asi '{"hypothesis":"parallel tests reduce wall time"}'
```
Valid statuses: keep, discard, crash, checks_failed
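Because the log is plain JSONL, it can also be inspected directly. For example, assuming `jq` is installed, counting results by status and finding the best kept run (here assuming lower is better) might look like:

```bash
# Count results by status (the config header has no status and is skipped)
jq -r '.status // empty' autoresearch.jsonl | sort | uniq -c

# Best kept experiment so far (assuming lower is better)
jq -s '[.[] | select(.status == "keep")] | min_by(.metric) | {run, commit, metric}' autoresearch.jsonl
```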
When the optimization target changes mid-session (different benchmark, metric, or workload):
- Increment the `segment` index
- Append a new `"type":"config"` line to `autoresearch.jsonl` with the updated target

This allows a single session to evolve — e.g., first optimize compilation speed, then switch to runtime performance.
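For example, the appended header could look like this (the goal name and metric values are illustrative):

```bash
# Append a config header for the new optimization target (values are placeholders)
echo '{"type":"config","name":"Optimize runtime performance","metricName":"run_ms","metricUnit":"ms","bestDirection":"lower"}' >> autoresearch.jsonl
```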
If autoresearch.jsonl and autoresearch.md exist in the working directory:
- Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
- Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations
- Check `autoresearch.ideas.md` if it exists — prune stale entries, experiment with remaining ideas

After 3+ experiments, assess whether improvements are real or noise:
Record confidence on each experiment result in the JSONL log. When confidence is low, consider:
- Running the benchmark multiple times in `autoresearch.sh` and reporting the median (see the sketch below)

See references/confidence-scoring.md for detailed methodology.
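As a sketch of that first option, `autoresearch.sh` could repeat the measurement a few times and report the median (the workload command is a placeholder):

```bash
# Sketch: inside autoresearch.sh, repeat the measurement and report the median.
runs=5
times=()
for _ in $(seq "$runs"); do
  start=$(date +%s%N)
  pnpm test --silent                          # placeholder workload
  end=$(date +%s%N)
  times+=( $(( (end - start) / 1000000 )) )
done
median=$(printf '%s\n' "${times[@]}" | sort -n | awk -v n="$runs" 'NR == int((n + 1) / 2)')
echo "METRIC total_ms=$median"
```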
| File | Purpose | Created by |
|---|---|---|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings | Setup phase |
| `autoresearch.sh` | Benchmark script — outputs METRIC name=value lines | Setup phase |
| `autoresearch.checks.sh` | Optional correctness checks (tests, lint, types) | Setup phase |
| `autoresearch.jsonl` | Append-only experiment log (survives restarts) | First experiment |
| `autoresearch.ideas.md` | Optional backlog of ideas to try | Anytime |
| `.claude/autoresearch-ai-plugin.local.md` | Optional persistent configuration (max_iterations, working_dir, timeouts) | User-provided |
When the user asks to cancel or stop autoresearch:
- Read `autoresearch.jsonl` to count total experiments and results
- Check `.claude/autoresearch-ai-plugin.local.md` if it exists
- Do not delete `autoresearch.jsonl` or `autoresearch.md` — they contain valuable history
- A new session can be started later with `/autoresearch`

When the user asks about autoresearch status or progress:

- Check that `autoresearch.jsonl` exists — if not, report "No active session"
- Read `autoresearch.md` for the goal and primary metric
- Read `autoresearch.jsonl` to compute: total runs, kept/discarded/crashed counts, baseline vs best, improvement percentage, confidence score

Additional resources:

- `references/confidence-scoring.md` — Detailed MAD-based confidence methodology
- `references/best-practices.md` — Tips for writing good benchmarks, choosing experiments, ASI patterns, and avoiding pitfalls
- `examples/autoresearch.md` — Example session document template
- `examples/autoresearch.sh` — Example benchmark script with METRIC output
- `examples/autoresearch.checks.sh` — Example correctness checks script
- `scripts/parse-metrics.sh` — Extract METRIC lines from benchmark output
- `scripts/log-experiment.sh` — Append an experiment result to autoresearch.jsonl