```bash
npx claudepluginhub proyecto26/autoresearch-ai-plugin --plugin autoresearch-ai-plugin
```
An autonomous experiment loop for single-GPU LLM pretraining. Edit `train.py` → commit → run 5-minute training → measure `val_bpb` → keep improvement or revert → **repeat forever**.
This skill is self-contained — it includes everything needed to set up and run the loop.
Copy the bundled training template files to the project directory and set up the environment:

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
uv sync            # Install dependencies
uv run prepare.py  # Download data shards, train tokenizer (~2 min)
```
```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```
Create a session branch, then add the session files to .gitignore (git revert will fail if they are tracked):

```bash
git checkout -b autoresearch/<tag>-<date>
echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
```
Then, before the first experiment:

1. Read prepare.py and train.py thoroughly to understand the codebase
2. Create autoresearch.md — a living session document recording goal, metrics, files in scope, constraints, and learnings
3. Create autoresearch.sh — the benchmark script (see the Benchmark Script section below)
4. Run bash autoresearch.sh to establish a baseline (it prints METRIC name=value lines)
5. Initialize autoresearch.jsonl:
{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}LOOP FOREVER. Never ask "should I continue?" — just keep going.
The user might be asleep, away from the computer, or expects you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder — re-read train.py for new angles, try combining previous near-misses, try more radical architectural changes.
Each iteration:
1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh (the script writes train.py output to run.log itself; read the METRIC lines from its stdout rather than redirecting stdout into run.log, which would clobber the log)
6. Parse METRIC lines from output
7. If output is empty (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat
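To make the mechanics concrete, here is a minimal Python sketch of steps 4-10 under the simplest possible keep/discard rule. It is illustrative only: run_one_experiment is a hypothetical name, and the real decision step should follow the full decision rules below.

```python
import json
import subprocess
import time

def run_one_experiment(run_id: int, description: str, best_val_bpb: float) -> dict:
    """Steps 4-10 of one iteration; assumes train.py was already edited (step 3)."""
    # Step 4: commit the experimental change.
    subprocess.run(["git", "add", "train.py"], check=True)
    subprocess.run(["git", "commit", "-m", f"experiment: {description}"], check=True)
    commit = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()

    # Step 5: run the benchmark; METRIC lines arrive on stdout.
    proc = subprocess.run(["bash", "autoresearch.sh"], capture_output=True, text=True)

    # Step 6: parse "METRIC name=value" lines.
    metrics = {}
    for line in proc.stdout.splitlines():
        if line.startswith("METRIC "):
            name, _, value = line[len("METRIC "):].partition("=")
            metrics[name] = float(value)

    # Steps 7-8: val_bpb of 0 means the script's grep fallback fired (crash).
    if metrics.get("val_bpb", 0.0) == 0.0:
        status = "crash"   # step 7: tail -n 50 run.log for the stack trace
    else:
        status = "keep" if metrics["val_bpb"] < best_val_bpb else "discard"

    # Step 10: revert anything that is not a keep.
    if status != "keep":
        subprocess.run(["git", "revert", "HEAD", "--no-edit"], check=True)

    # Step 9: append one JSON line to autoresearch.jsonl.
    record = {"run": run_id, "commit": commit, "metric": metrics.get("val_bpb"),
              "metrics": metrics, "status": status, "description": description,
              "timestamp": int(time.time())}
    with open("autoresearch.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```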
Decision rules:

- val_bpb improved beyond the noise floor → keep (commit stays, branch advances)
- val_bpb regressed or unchanged → discard (run git revert $(git rev-parse HEAD) --no-edit)
- Run crashed → discard (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- Code got simpler at equal val_bpb → keep (removing complexity is a win)
- Hacky complexity bought only a marginal gain → discard even if val_bpb improved slightly

All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep. A code sketch of this policy follows below.
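A sketch of the same policy as code. The thresholds are illustrative assumptions drawn from the examples above, not fixed rules; measure the real noise floor as described in the Noise Floor section.

```python
def decide(new_val_bpb: float, best_val_bpb: float, lines_added: int) -> str:
    """Keep/discard policy from the rules above; NOISE_FLOOR and the 20-line
    complexity cutoff are assumed placeholders."""
    NOISE_FLOOR = 0.0005
    delta = best_val_bpb - new_val_bpb           # positive means improvement

    if delta > NOISE_FLOOR:
        # Real improvement: keep, unless it was bought with a pile of hacky code.
        return "discard" if lines_added >= 20 and delta < 0.002 else "keep"
    if abs(delta) <= NOISE_FLOOR and lines_added < 0:
        return "keep"                            # equal metric, simpler code
    return "discard"                             # regression or noise
```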
Constraints:

- Only train.py changes; prepare.py is immutable. This ensures fair comparison (same data, same evaluation).
- Do not add new dependencies to pyproject.toml.
- If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read train.py for new angles. Try a fundamentally different approach.
If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.
Each experiment appends one JSON line:
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
Use the shared logging script:
```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```
Parse metrics from benchmark output:
```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```
Valid statuses: keep, discard, crash, checks_failed
ASI is a structured annotation recorded per experiment that survives reverts. When code changes are discarded, the description and ASI remain — the only structured memory of what happened.
Record ASI for every experiment:
```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```
If autoresearch.jsonl and autoresearch.md exist in the working directory:
- Read autoresearch.md for full context (goal, metrics, files, constraints, learnings)
- Read autoresearch.jsonl to see all past experiments, current best, and ASI annotations

After 3+ experiments, assess whether improvements are real or noise:
ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
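One way to estimate it: re-run the benchmark a few times on an unchanged commit and take the spread of val_bpb as the noise floor. A minimal sketch (measure_noise_floor is an illustrative name; each repeat costs a full ~5-minute run, so this is worth doing once early in the session):

```python
import subprocess

def measure_noise_floor(repeats: int = 3) -> float:
    """Re-run the benchmark on an unchanged commit and report the spread of
    val_bpb across runs. Deltas smaller than this spread are likely noise."""
    values = []
    for _ in range(repeats):
        out = subprocess.run(["bash", "autoresearch.sh"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if line.startswith("METRIC val_bpb="):
                values.append(float(line.split("=", 1)[1]))
    return max(values) - min(values) if values else float("nan")
```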
evaluate_bpb() computes bits-per-byte (a vocab-size-independent metric).
Key constants: MAX_SEQ_LEN = 2048, TIME_BUDGET = 300, EVAL_TOKENS = 40 * 524288, VOCAB_SIZE = 8192
Editable: ASPECT_RATIO, DEPTH, WINDOW_PATTERN, TOTAL_BATCH_SIZE, learning rates, LR schedule phases, and the full model architecture.
| Tier | GPUs | VRAM | Notes |
|---|---|---|---|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| Consumer+ | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| Enthusiast | RTX 5090 | 32GB | Excellent — larger models possible |
| Datacenter | A100, H100 | 40-80GB | Original development target |
For GPUs with limited VRAM (< 16GB), apply these changes to train.py during the first experiment:
- Flash Attention: the from kernels import get_kernel block (lines 20-24) runs unconditionally at startup and will fail on non-Hopper GPUs. Replace the entire block and the fa3.flash_attn_func() call in CausalSelfAttention.forward() with torch.nn.functional.scaled_dot_product_attention (see the sketch after the config table below). Also remove kernels from pyproject.toml and run uv sync again.
- Gradient checkpointing: wrap the transformer blocks in torch.utils.checkpoint.checkpoint() with use_reentrant=False to trade ~30% compute for ~50% VRAM savings.
- Reduce DEPTH and DEVICE_BATCH_SIZE to fit the VRAM budget (see table below).

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|---|---|---|---|---|---|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |
Note: n_embd must be a multiple of HEAD_DIM (default 128). Config search: start with the largest depth that fits, reduce DEVICE_BATCH_SIZE then MAX_SEQ_LEN if OOM.
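A minimal sketch of the first two changes. The tensor layout and helper names (sdpa_attention, run_blocks) are assumptions based on the description above, not the actual code in train.py; check the real shapes and signatures before applying.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# (1) Replacement for the fa3.flash_attn_func() call using built-in SDPA.
# Assumes q, k, v shaped (batch, n_heads, seq_len, head_dim); flash-attn
# kernels often use (batch, seq_len, n_heads, head_dim), so transpose if
# train.py stores them that way. is_causal=True gives plain causal masking;
# if train.py uses WINDOW_PATTERN for sliding-window attention, pass an
# explicit attn_mask instead.
def sdpa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# (2) Gradient checkpointing over the transformer blocks: recompute
# activations in the backward pass, trading ~30% compute for ~50% VRAM.
def run_blocks(blocks: torch.nn.ModuleList, x: torch.Tensor, training: bool) -> torch.Tensor:
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False) if training else block(x)
    return x
```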
val_bpb measures how well the model compresses text, normalized by byte count. It is vocabulary-size-independent — all architectures are directly comparable. Lower is better. See references/gpu-training-guide.md for the formula and interpretation table.
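For intuition, a sketch of the standard bits-per-byte computation, assuming the evaluator accumulates summed cross-entropy in nats over the eval set (the exact formula evaluate_bpb() uses is in the guide):

```python
import math

def bits_per_byte(total_ce_nats: float, total_bytes: int) -> float:
    """Summed token-level cross-entropy (in nats) converted to bits and
    normalized by the number of raw text bytes the tokens cover."""
    return total_ce_nats / (math.log(2) * total_bytes)

# Example: mean loss 0.70 nats/token over 1000 tokens covering 4000 bytes:
# bits_per_byte(0.70 * 1000, 4000) ≈ 0.252 bpb
```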
Use this as autoresearch.sh:
```bash
#!/usr/bin/env bash
set -euo pipefail
uv run train.py > run.log 2>&1
val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")
echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```
| File | Purpose |
|---|---|
| autoresearch.md | Living session document — goal, metrics, scope, learnings |
| autoresearch.sh | Benchmark script — outputs METRIC name=value lines |
| autoresearch.jsonl | Append-only experiment log with ASI (survives restarts) |
- references/gpu-training-guide.md — Detailed GPU setup, CUDA configuration, OOM troubleshooting, BPB formula, and performance tuning
- scripts/parse-metrics.sh — Extract METRIC lines from benchmark output
- scripts/log-experiment.sh — Append experiment results to autoresearch.jsonl
- assets/prepare.py — Data preparation (download, tokenizer, dataloader, evaluation)
- assets/train.py — Model architecture and training loop
- assets/program.md — Self-contained agent instructions for the ML loop
- assets/pyproject.toml — Python dependencies (PyTorch, Flash Attention, etc.)