# autoresearch-claude-code

Autonomous experiment loop for Claude Code. Give it a goal, a benchmark, and files to modify — it loops forever: try ideas, measure results, keep winners, discard losers.
Port of pi-autoresearch as a pure skill — no MCP server, just instructions the agent follows with its built-in tools.
## Install

### Option A: Let Claude do it (easiest)

```bash
git clone https://github.com/drivelineresearch/autoresearch-claude-code.git ~/autoresearch-claude-code
claude -p "Install the autoresearch plugin from ~/autoresearch-claude-code"
```

Claude will read the repo, run `install.sh`, and configure everything.
### Option B: Plugin flag

```bash
# One-session test drive
claude --plugin-dir /path/to/autoresearch-claude-code

# Permanent — add to ~/.claude/settings.json:
# { "plugins": ["~/autoresearch-claude-code"] }

# Toggle on/off
claude plugin disable autoresearch
claude plugin enable autoresearch
```
### Option C: Manual symlinks

```bash
git clone https://github.com/drivelineresearch/autoresearch-claude-code.git ~/autoresearch-claude-code
cd ~/autoresearch-claude-code && ./install.sh
```

To remove: `./uninstall.sh`
## Quick Start

```bash
/autoresearch optimize test suite runtime
/autoresearch       # resume existing loop
/autoresearch off   # pause (in-session)
```
The agent creates a branch, writes a session doc + benchmark script, runs a baseline, then loops autonomously. Send messages mid-loop to steer the next experiment.
## What Can You Optimize?
Anything with a measurable metric:
- ML models — R², RMSE, accuracy, F1 (see the OpenBiomechanics example)
- Code performance — runtime, memory usage, throughput
- Build systems — bundle size, compile time, dependency count
- Frontend — Lighthouse score, load time, CLS
- Prompt engineering — eval scores, parameter-golf
- Any script that prints `METRIC name=number` to stdout

The only requirement: a bash command that runs your benchmark and prints `METRIC name=number` lines.
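A benchmark satisfying that contract can be a few lines. The sketch below is a hypothetical example (the script name, workload, and metric name are all invented for illustration) that times a function and emits one `METRIC` line:

```python
# benchmark.py — hypothetical minimal benchmark; workload and metric name are invented
import time

def workload():
    # stand-in for whatever you actually want to optimize
    return sum(i * i for i in range(100_000))

start = time.perf_counter()
workload()
elapsed = time.perf_counter() - start

# The whole contract: emit "METRIC name=number" lines on stdout
metric_line = f"METRIC runtime_s={elapsed:.4f}"
print(metric_line)
```

Run it with any bash command (e.g. `python benchmark.py`); the agent parses the `METRIC` lines to score each experiment.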
## Example: Fastball Velocity Prediction
Included in `examples/` — predicts fastball velocity from biomechanical data using the Driveline OpenBiomechanics dataset and a model zoo of 19 algorithms.

22 autonomous experiments took R² from 0.44 to 0.78 (+78%), predicting a new player's velocity within ~2 mph from biomechanics alone.
| Metric | Baseline | Best | Change |
|---|---|---|---|
| R² | 0.440 | 0.783 | +78% |
| RMSE | 3.53 mph | 2.20 mph | -38% |
### Setup

```bash
# Clone data
mkdir -p third_party
git clone https://github.com/drivelineresearch/openbiomechanics.git third_party/openbiomechanics

# Install dependencies with uv (https://docs.astral.sh/uv/)
cd examples
uv sync              # core deps (xgboost, sklearn, rich, etc.)
uv sync --extra all  # all model backends (PyTorch, CatBoost, LightGBM, TabPFN, TabNet)

# Copy example files to working directory and run
cd ..
cp examples/train.py examples/models.py examples/autoresearch.sh .
uv run python train.py
```
See `examples/obp-autoresearch.md` for the session config and `experiments/worklog.md` for the full experiment narrative.
## Model Zoo

The example ships with 19 models the agent can swap between. All use a common interface — change `MODEL_TYPE` in `train.py` to switch.
| Category | Models | GPU | Extra Deps |
|---|---|---|---|
| Boosting | xgboost, catboost, lightgbm, histgb | xgb/catboost/lgbm | catboost, lightgbm |
| Neural | pytorch_mlp, mc_dropout, ft_transformer, tabpfn, tabnet, mlp | torch-based | torch, tabpfn, pytorch-tabnet |
| Linear | ridge, elasticnet, lasso, huber | — | — |
| Bayesian | bayesian_ridge, gp | — | — |
| Other | svr, knn | — | — |
| Ensemble | stacking | — | — |
Models use lazy imports — missing optional deps produce clear error messages, not crashes. Install what you need:
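The registry-plus-lazy-import pattern can be sketched roughly as follows. This is an illustrative sketch, not the actual code in `examples/models.py` — the factory names and error wording here are invented: each `MODEL_TYPE` maps to a factory that imports its backend only when selected, so a missing optional dependency fails with an actionable message instead of crashing every run.

```python
# Illustrative sketch of a model-zoo registry with lazy imports.
# NOT the actual examples/models.py code — names and messages are hypothetical.

def _ridge():
    from sklearn.linear_model import Ridge  # core dep, imported only when selected
    return Ridge(alpha=1.0)

def _catboost():
    try:
        from catboost import CatBoostRegressor  # optional dep
    except ImportError as e:
        raise ImportError(
            "MODEL_TYPE 'catboost' needs an optional dep: run `uv sync --extra boost`"
        ) from e
    return CatBoostRegressor(verbose=0)

MODEL_REGISTRY = {"ridge": _ridge, "catboost": _catboost}

def make_model(model_type: str):
    if model_type not in MODEL_REGISTRY:
        raise ValueError(
            f"Unknown MODEL_TYPE {model_type!r}; choices: {sorted(MODEL_REGISTRY)}"
        )
    return MODEL_REGISTRY[model_type]()
```

Because imports happen inside the factories, listing or validating `MODEL_TYPE` never touches a backend you haven't installed.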
uv sync # core (xgboost, sklearn, rich)
uv sync --extra torch # + PyTorch/CUDA models
uv sync --extra boost # + CatBoost, LightGBM
uv sync --extra all # everything
GPU is auto-detected. When CUDA is available, XGBoost/CatBoost/LightGBM/PyTorch models use it automatically.
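Per-backend detection of this kind is usually a small probe. A hedged sketch of what it might look like (hypothetical, not the repo's actual code; `torch.cuda.is_available()` and XGBoost's `device` parameter are real APIs, the helper names are invented):

```python
# Hypothetical sketch of GPU auto-detection — not the repo's actual code.

def cuda_available() -> bool:
    """True if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

def xgb_device() -> str:
    # XGBoost >= 2.0 accepts device="cuda" or device="cpu"
    return "cuda" if cuda_available() else "cpu"
```

The same probe-then-fall-back shape works for CatBoost (`task_type="GPU"`) and LightGBM (`device_type="gpu"`).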
## How It Works