Use this skill when you want to run an autonomous ML research loop that iteratively modifies training code, runs timed experiments, evaluates a single metric, and keeps only improvements — repeating indefinitely until manually stopped. Activate when the user wants to autonomously discover better model architectures, hyperparameters, or training strategies overnight or over a defined session, with full cross-platform support (CUDA / Apple Silicon MPS / CPU / Windows).
# AutoResearch

Install:

```bash
npx claudepluginhub aviskaar/open-org --plugin autoresearch
```
Autonomously run an indefinite ML research loop: modify training code, run a timed experiment, measure one metric, keep improvements, discard regressions — repeat until manually stopped. Inspired by Karpathy's autoresearch and extended with cross-platform support (CUDA / Apple Silicon MPS / CPU / Windows) and user-configurable experiment parameters.
## Overview

AutoResearch turns an AI agent into a tireless overnight researcher. Given a training codebase and a metric to minimize (or maximize), the agent continuously proposes code modifications, runs fixed-budget training runs, logs results, and builds a growing record of what works — without requiring human involvement between iterations.
The core loop:
```
┌────────────────────────────────────────────────────────────────────┐
│                      AUTONOMOUS RESEARCH LOOP                      │
│                                                                    │
│   [1] Read current   →   [2] Propose        →   [3] Apply change   │
│       codebase               modification           to train.py    │
│        ↑                                                ↓          │
│   [7] Keep / discard  ←  [6] Log to             [4] Commit &       │
│       & iterate              results.tsv            train (T min)  │
│                                ↑                        ↓          │
│                          [5] Evaluate metric ←──────────┘          │
└────────────────────────────────────────────────────────────────────┘
```
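The loop above can be sketched as a minimal driver. This is an illustrative sketch, not part of the skill itself: the helper callables (`propose_change`, `run_experiment`, etc.) are hypothetical stand-ins for the agent's actual actions.

```python
def research_loop(run_experiment, propose_change, apply_change, revert_change,
                  log_result, baseline, minimize=True, max_iters=None):
    """Minimal sketch of the loop: propose, apply, run, evaluate, keep or discard."""
    best = baseline
    i = 0
    while max_iters is None or i < max_iters:
        change = propose_change()                 # [2] one focused modification
        apply_change(change)                      # [3] edit train.py, [4] commit + train
        metric = run_experiment()                 # [5] extract the single metric
        improved = metric < best if minimize else metric > best
        log_result(change, metric, "keep" if improved else "discard")  # [6]
        if improved:
            best = metric                         # [7] keep the commit
        else:
            revert_change()                       # [7] discard (git reset --hard HEAD~1)
        i += 1
    return best
```

With `max_iters=None` this runs until interrupted, matching the indefinite mode.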
Key design principles (from Karpathy's autoresearch):
- The agent modifies only the training script (`train.py` or its equivalent) — data prep and evaluation are read-only.
- Every experiment is logged to `results.tsv` — a permanent, auditable record.

## Phase 0: Research Config

Before touching any code, collect all required parameters. Ask explicitly for any missing values.
| # | Parameter | Description | Default if omitted |
|---|---|---|---|
| 1 | Training script path | Path to the file the agent will modify (e.g., train.py) | train.py in cwd |
| 2 | Read-only script path | Path to the file the agent must never modify (e.g., prepare.py) | prepare.py in cwd |
| 3 | Time budget per experiment | How long each training run is allowed to run | Ask — do not assume |
| 4 | Metric to optimize | Metric name and direction (minimize val_bpb, maximize val_acc, etc.) | Ask — do not assume |
| 5 | Max iterations | How many experiment cycles to run (or unlimited for indefinite) | unlimited |
| 6 | Allowed modification scope | What can be changed: architecture / hyperparameters / optimizer / all | all |
| 7 | Branch tag | Short identifier for the research branch (e.g., mar8, exp-attention) | Date-based (e.g., mar8) |
| 8 | Results file | Where to log results | results.tsv in cwd |
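One way to hold the collected parameters is a small config object. This is a sketch with illustrative field names, not a fixed schema; note that the two "Ask — do not assume" parameters deliberately have no default:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResearchConfig:
    train_script: str = "train.py"         # file the agent may modify
    readonly_script: str = "prepare.py"    # file the agent must never touch
    time_budget_min: Optional[int] = None  # must be asked — no assumed default
    metric: Optional[str] = None           # e.g., "val_bpb" — must be asked
    minimize: bool = True                  # direction of optimization
    max_iterations: Optional[int] = None   # None = unlimited
    scope: str = "all"                     # architecture / hyperparameters / optimizer / all
    branch_tag: str = "mar8"               # short identifier, e.g., date-based
    results_file: str = "results.tsv"

    def validate(self):
        """Refuse to start until the ask-only parameters are filled in."""
        missing = [name for name, value in
                   [("time_budget_min", self.time_budget_min), ("metric", self.metric)]
                   if value is None]
        if missing:
            raise ValueError(f"Ask the user for: {', '.join(missing)}")
```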
The original autoresearch fixed the training window at 5 minutes. This skill lets the user choose:
"How long should each training run be? (e.g., 3 minutes, 10 minutes, 30 minutes)"
Likewise, the user decides the metric and its direction, the iteration cap, and the modification scope.
Produce and confirm a Research Config before proceeding:
## AutoResearch Config
| Parameter | Value |
|------------------------|-------------------------------|
| Training script | train.py |
| Read-only script | prepare.py |
| Time budget / run | [USER SPECIFIED] |
| Metric | [METRIC NAME] ([min/max]) |
| Max iterations | [N or unlimited] |
| Modification scope | [architecture / hyper / all] |
| Branch | autoresearch/[TAG] |
| Results file | results.tsv |
| Platform | [detected — see Phase 1] |
Get explicit user confirmation before starting Phase 1.
## Phase 1: Platform Detection & Setup

Detect the compute backend and configure the training environment accordingly.
Run this detection sequence before touching any training code:
```python
# Pseudo-detection logic (adapt to actual runtime)
import torch

if torch.cuda.is_available():
    platform = "CUDA"
    device = "cuda"
    notes = "FlashAttention-3 may be available; use torch.compile if supported"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    platform = "MPS (Apple Silicon)"
    device = "mps"
    notes = ("Disable torch.compile; use SDPA instead of FlashAttention; "
             "lower batch size for Metal bounds; cast optimizer states explicitly")
else:
    platform = "CPU"
    device = "cpu"
    notes = "Use smaller batch sizes; expect slower iteration; torch.compile may help"
```
Windows note: On Windows, MPS is not available. CUDA detection is the same as Linux. CPU fallback applies for systems without a supported GPU.
Apply these adjustments based on detected platform before the first run:
| Platform | Adjustments |
|---|---|
| CUDA | Enable torch.compile if PyTorch >= 2.0. FlashAttention available if installed. Standard batch sizes apply. |
| MPS (Apple Silicon) | Disable torch.compile (unsupported paths). Replace FlashAttention with PyTorch native SDPA + manual sliding window causal masking. Lower device batch size (reduce by 2–4×). Explicitly cast optimizer states to float32 if Metal errors appear. |
| CPU | torch.compile optional (test it — it may help). Use smaller batch sizes with gradient accumulation to maintain the effective batch. Expect 5–10× slower runs — advise the user to use shorter time budgets. |
Log the detected platform and applied adjustments in the research config.
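The adjustments table can be encoded as a lookup the loop consults before the first run. The batch-size divisors below follow the 2–4× / 4–8× guidance above; the exact values are judgment calls, not fixed requirements:

```python
# Platform → adjustments derived from the table above (values are guidance, not law)
ADJUSTMENTS = {
    "CUDA": {"torch_compile": "enabled",  "attention": "flash-attn or SDPA", "batch_divisor": 1},
    "MPS":  {"torch_compile": "disabled", "attention": "SDPA + manual mask", "batch_divisor": 4},
    "CPU":  {"torch_compile": "optional", "attention": "SDPA + manual mask", "batch_divisor": 8},
}

def adjusted_batch_size(platform: str, base_batch: int) -> int:
    """Scale the configured batch size down per the platform guidance."""
    return max(1, base_batch // ADJUSTMENTS[platform]["batch_divisor"])
```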
Once the platform is configured:

1. Create the research branch: `git checkout -b autoresearch/<tag>` from the main branch.
2. Run the unmodified script once to measure the baseline metric.
3. Initialize `results.tsv` with a header row and the baseline entry:

```
commit     val_bpb           mem_gb  status  description
baseline   [BASELINE_VALUE]  [MEM]   keep    Baseline — no modifications
```
## Phase 2: The Research Loop

This loop runs indefinitely (or until the configured max iterations). Do NOT pause to ask the human between iterations. Work continuously.
At the beginning of each iteration:
1. Read `train.py` (or equivalent) in full.
2. Read `results.tsv` to understand what has been tried and what worked.

Propose one focused modification per iteration. Good candidates include:
**Architecture changes:** e.g., activation functions (ReLU² instead of GELU), sliding-window attention variants, MLP depth or width.

**Optimizer changes:** e.g., learning rate and schedule, weight decay, optimizer choice.

**Training dynamics:** e.g., batch size, gradient accumulation, warmup, mixed precision (bf16 vs. fp32).
**Hypothesis discipline:** make one focused change per iteration, each backed by a one-line hypothesis.

Commit each change with an `exp:`-prefixed message (e.g., `exp: try ReLU² activation instead of GELU`). Then run the experiment and extract the metric:

```bash
# Example — adapt to actual metric name and log format
uv run train.py 2>&1 | tee run.log
VAL_BPB=$(grep "val_bpb:" run.log | tail -1 | awk '{print $2}')
```
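The same run-and-extract step can be done from Python, which also makes the time budget a hard timeout rather than something the training script must honor itself. A sketch, assuming the `val_bpb:` log format from the example above:

```python
import re
import subprocess

def run_timed_experiment(time_budget_min: int, cmd=("uv", "run", "train.py")):
    """Run one training experiment, enforce the time budget, return the last metric seen."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=time_budget_min * 60)
        log = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired as exc:
        # Hitting the timeout is normal: the run simply used its full budget.
        out, err = exc.stdout or "", exc.stderr or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        if isinstance(err, bytes):
            err = err.decode(errors="replace")
        log = out + err
    matches = re.findall(r"val_bpb:\s*([0-9.]+)", log)
    return float(matches[-1]) if matches else None
```

Returning `None` signals a crash or a run that never logged the metric.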
Compare the extracted metric to the current best:
| Outcome | Action |
|---|---|
| Improved (metric better by any amount) | Keep the commit. Update current best. Mark keep in results. |
| No change or regression | Discard: git reset --hard HEAD~1. Mark discard in results. |
| Crash / error | Discard. Log the failure reason. Mark crash in results. Do not retry the same change. |
| Memory OOM | If on MPS: reduce device batch size by 2×, retry once. If still OOM: discard. If on CUDA: reduce batch or enable gradient checkpointing, retry once. |
Simplicity principle (from Karpathy's autoresearch): If two configurations achieve the same metric, prefer the one with fewer lines of code. A deletion that maintains performance is better than an addition.
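The outcome table reduces to a small decision function (a sketch; a tied metric falls to discard, consistent with "no change or regression"):

```python
def decide(metric, best, minimize=True, crashed=False):
    """Map one run's outcome to keep / discard / crash, per the table above."""
    if crashed or metric is None:
        return "crash"
    improved = metric < best if minimize else metric > best
    return "keep" if improved else "discard"  # ties count as "no change" → discard
```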
After every run (kept or discarded), append to `results.tsv`:

```
<7-char-commit>  <metric-value>  <mem_gb>  <keep|discard|crash>  <one-line description>
```

Example:

```
a3f12bc  0.9821  43.2  keep     ReLU² activation: -0.0158 improvement
d9e44a1  1.0023  44.1  discard  Deeper MLP: regression
c0011f2  —       —     crash    OOM: doubled batch size without accumulation fix
```
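Appending rows through one helper keeps the file machine-parseable (a sketch; tab-separated, with an em dash standing in for values lost in a crash, as in the example above):

```python
import csv

def log_result(path, commit, metric, mem_gb, status, description):
    """Append one experiment row to results.tsv (tab-separated, append-only)."""
    dash = "—"  # placeholder for values unavailable after a crash
    row = [commit,
           metric if metric is not None else dash,
           mem_gb if mem_gb is not None else dash,
           status, description]
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerow(row)
```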
Every 10 iterations (or when manually queried), emit a progress report:
## AutoResearch Progress — Iteration [N]
**Best so far:** [METRIC_VALUE] (iteration [K], commit [HASH])
**vs. Baseline:** [DELTA] ([+/-]%)
**Iterations:** [N completed] / [max or unlimited]
**Time elapsed:** [HH:MM]
**Recent results (last 5):**
| Iter | Commit | Metric | Status | Change |
|------|--------|--------|--------|--------|
| N | abc1234| 0.9821 | keep | ReLU² activation |
| N-1 | def5678| 1.0023 | discard| Deeper MLP |
| ... | | | | |
**Current direction:** [what the agent is exploring next]
## Phase 3: Stop & Summarize

The loop stops when:

- the configured max iterations is reached, or
- the user manually stops the session.
When stopped, produce a Research Summary document (autoresearch-summary.md):
# AutoResearch Summary
**Research session:** autoresearch/<tag>
**Platform:** [CUDA / MPS / CPU]
**Date:** [date range]
**Total iterations:** [N]
**Time budget / run:** [T minutes]
## Results
| Baseline metric | Best metric | Improvement | Best commit |
|-----------------|-------------|-------------|-------------|
| [BASELINE] | [BEST] | [DELTA] | [HASH] |
## Top Improvements (kept commits)
| Rank | Commit | Metric | Description |
|------|--------|--------|-------------|
| 1 | abc123 | 0.9612 | [description] |
| 2 | def456 | 0.9744 | [description] |
| ... | | | |
## What Didn't Work
| Category | Description | Outcome |
|----------|-------------|---------|
| [category] | [description] | [why it failed] |
## Recommended Next Steps
- [3–5 actionable recommendations based on the session findings]
After summarizing, stop and hand the branch and artifacts back to the user for review.

## Platform Compatibility Matrix
| Feature | CUDA (Linux/Windows) | MPS (macOS) | CPU (any OS) |
|---|---|---|---|
| `torch.compile` | ✅ Enabled | ❌ Disabled | ✅ Optional |
| FlashAttention | ✅ If installed | ❌ Use SDPA | ❌ Use SDPA |
| Sliding window attn | ✅ via FA or SDPA | ✅ SDPA + manual mask | ✅ SDPA + manual mask |
| Mixed precision (bf16) | ✅ | ✅ (M2+) | ❌ Use fp32 |
| `torch.compile` modes | `reduce-overhead` | Disabled | `reduce-overhead` |
| Batch size guidance | Full config | Reduce 2–4× | Reduce 4–8× |
| Optimizer state casting | Not needed | fp32 explicit cast | Not needed |
The agent may only modify the training script (`train.py` or equivalent). It must never modify:

- the data preparation script (`prepare.py` or equivalent)
- the evaluation code
- the `results.tsv` history (append-only)

The agent must never install new packages mid-session — all dependencies must already be available.
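A defensive path check before applying any edit can catch violations mechanically. A sketch; `readonly_paths` is an illustrative parameter, and real enforcement remains the agent's own discipline:

```python
from pathlib import Path

def assert_edit_allowed(target, train_script, readonly_paths):
    """Refuse any modification that is not to the configured training script."""
    target_r = Path(target).resolve()
    for p in readonly_paths:
        if target_r == Path(p).resolve():
            raise PermissionError(f"{p} is read-only")
    if target_r != Path(train_script).resolve():
        raise PermissionError(f"only {train_script} may be modified, not {target}")
```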
AutoResearch is designed to plug directly into the research pipeline as a specialized experiment execution layer:

- **From lead-researcher:** a hypothesis about a training approach to explore (e.g., "try hierarchical attention with reduced context window").
- **From experiment-design:** a structured experiment plan specifying what to vary, what to hold fixed, and what metric to optimize.
- **To principal-scientist:** the `results.tsv` and `autoresearch-summary.md` with the best configuration found and all supporting evidence.
- **To research-writing:** a structured table of experiments suitable for the methodology and results sections of a paper.

When operating within the research team, the research config (Phase 0) is pre-populated by the lead-researcher or experiment-design teammates rather than entered manually by the user.
| User intent | Entry point |
|---|---|
| "Run autoresearch overnight on my training script" | Phase 0 (full config) → Phase 1 → Phase 2 loop |
| "I have a hypothesis — test it and keep going" | Phase 0 (pass hypothesis as first iteration direction) → loop |
| "Run this for exactly 20 iterations and report" | Phase 0 (max_iterations=20) → loop → Phase 3 |
| "Continue a previous session" | Phase 1.3 (skip branch creation, re-read existing results.tsv) → Phase 2 |
| "Analyze what happened in last night's session" | Phase 3 only — summarize existing results.tsv |
| Phase | Artifact |
|---|---|
| 0 | Research Config (confirmed by user) |
| 1 | Platform detection log, baseline entry in results.tsv |
| 2 | Growing results.tsv + per-iteration progress reports |
| 3 | autoresearch-summary.md with ranked improvements and recommendations |