# autoloop
Generates program.md and auto/run.sh for autonomous iterative code optimization loops with Claude CLI, git checkpoints, tiered quality gates, and structured metric output.
```bash
npx claudepluginhub joshuaoliphant/claude-plugins --plugin autoloop
```
Turn an LLM coding agent into an autonomous scientist. Generate a self-contained `program.md` + `auto/run.sh` that lets the agent loop forever — edit code, run experiment, parse metric, git commit (keep) or git reset (revert) — while the human walks away.
The skill guides interactive setup of the optimization goal, metric, and scope; creates the git branch, benchmark script, and logging; and establishes a baseline via a subagent. The generated loop then modifies files in scope, measures the metric via a shell command, keeps gains and reverts regressions, and repeats until a stop condition such as stagnation or hitting the target. Use it for requests like "run autoresearch", performance tuning, or other iterative experiments in git repos.
The skill's job is the design thinking: mapping an arbitrary project onto the seven essential components that make this loop work, then generating the files. Getting the components right is the difference between a loop that runs 126 experiments overnight and one that crashes after 3.
This skill is limited to using the following tools:

- `autoloop:codebase-scout` — a subagent that explores the project directory to identify the build system, test commands, source files, and candidate metrics. Delegated via `Agent(subagent_type="autoloop:codebase-scout", model="haiku")`.
- `git` — used for checkpoint/rollback (commit to keep, reset to revert). Must be available in the project.
- `claude` CLI — the generated loop runs via `claude --dangerously-skip-permissions -p "Read program.md and execute the loop protocol."`.

Every autoloop maps onto the same seven components: the mutable artifact, the primary metric, the execution command, the time budget, the files in scope and off limits, the allowed change types, and the quality gates with their structured METRIC output (designed in steps 2a–2g below). There is no orchestration code — program.md IS the entire system. On top of those, a results ledger (results.tsv) and an embedded progress log (in program.md itself) give the agent full history every iteration.
| Goal | Likely mutable file |
|---|---|
| ML training improvement | The training script (train.py, train.rs) |
| Test coverage | The source files being tested (pick the lowest-coverage one) |
| Performance | The module containing the hot path |
| Lint score | Source files with the most violations |
| Prompt engineering | The prompt template file |
| Config tuning | The config file being tuned |
The mutable file should be small enough for the agent to read in one pass. If >500 lines, suggest a focused subset or ask the user to extract the relevant section.
Check what the project already has:
| Project has... | Candidate metric |
|---|---|
| Tests | Test count, coverage %, pass rate |
| Benchmarks | Execution time, throughput, ops/sec |
| Linting | Ruff/pylint issue count (lower is better) |
| ML training | Validation loss, accuracy, perplexity |
| Eval suite | Accuracy, F1, score |
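When the project already has pytest or coverage.py, the raw numbers are usually one pipeline away. A couple of illustrative extraction one-liners, assuming a uv-managed Python project (swap in whatever the scout actually finds):

```bash
# Test pass count from pytest's summary line (e.g. "34 passed in 1.2s")
uv run pytest -q | tail -1 | grep -oE '[0-9]+ passed'

# Total coverage percentage from coverage.py's report (e.g. "87%")
uv run coverage report | awk '/^TOTAL/ {print $NF}'
```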
Common secondary metrics by domain (guardrails, NOT optimized):
| Primary metric | Good secondary metrics |
|---|---|
| Execution time (µs) | Allocations, memory usage, code complexity |
| Test coverage (%) | Test count, test execution time |
| Lint score | Lines of code, cyclomatic complexity |
| Validation loss | Training time, GPU memory, inference latency |
| Throughput (req/s) | P99 latency, error rate, CPU usage |
Gates run before the benchmark, ordered fastest-first. Early gate failure → immediate exit → no wasted benchmark time.
| Gate | Purpose | Failure mode | Example |
|---|---|---|---|
| Unit tests (fast) | Correctness | Hard fail (exit 1) | uv run pytest tests/unit -x |
| Conformance/lint | Style + spec | Soft fail with threshold | ruff check --statistics, allow ≤N issues |
| Type check | Type safety | Hard fail | uv run mypy src/ |
Use what the project already has — don't add new tooling.
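As a rough illustration only (not the project's real tooling), a hard-fail gate and a soft-fail gate with a threshold might look like this inside the generated script; the commands and the threshold value are placeholders:

```bash
# Gate 1: fast unit tests (hard fail, stop immediately)
uv run pytest tests/unit -x -q || exit 1

# Gate 2: lint (soft fail, tolerate up to MAX_LINT issues)
MAX_LINT=25   # illustrative threshold agreed during design
issues=$(uv run ruff check . --statistics --exit-zero | awk '{s+=$1} END {print s+0}')
if [ "$issues" -gt "$MAX_LINT" ]; then
  echo "Gate failed: $issues lint issues (limit $MAX_LINT)" >&2
  exit 1
fi
```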
For detailed allowed change types per domain (ML, test coverage, performance, lint, prompts, config tuning), consult:
→ references/domain-examples.md
Load any stored feedback preferences before starting:
```bash
python ${PLUGIN_ROOT}/scripts/feedback_manager.py autoloop show-feedback
```
If feedback entries exist, apply them throughout the loop design.
Delegate to the codebase-scout agent:
```
Agent(
  subagent_type="autoloop:codebase-scout",
  model="haiku",
  prompt="Explore {cwd} and return a structured summary of: project type, language, build/test/bench commands, source files, config files, candidate metrics, and immutable files. See your instructions for the full output format.",
  description="Scout project for autoloop"
)
```
Tell the user: "I'm exploring your project to understand the build system, test infrastructure, and what metrics we can optimize. This takes about 15 seconds."
When results come back, summarize in 3-5 bullet points. Don't dump the raw output.
Using the scout results AND the user's stated goal, design all seven components. Think carefully — wrong choices here waste hours of autonomous runtime.
2a. Infer the mutable artifact — Use the selection table from Context. If the answer isn't obvious, present 2-3 options with trade-offs.
2b. Infer the metric — Use the metric inference tables from Context. Determine the direction: "lowest" (minimize) or "highest" (maximize). Identify 1-3 secondary metrics as guardrails.
STOP if no metric can be inferred. Do not guess. Ask the user: "I can see how to run experiments, but I can't determine what metric to optimize. What number should I be trying to improve? It needs to be something I can parse from command output."
2c. Infer the execution command — Usually comes directly from scout results. The command should redirect output to a log file: `{cmd} > run.log 2>&1`.
2d. Design the time budget: estimate how long a single experiment should take and set a hard timeout after which a run is killed (see the sketch after this list).
2e. Define files in scope and off limits — Be specific with paths. "Don't touch tests" is vague; "`test/**/*.py` — test suite, must continue to pass unchanged" is clear.
2f. Define allowed change types — Read the appropriate domain block from references/domain-examples.md.
2g. Design quality gates — Use the gate design table from Context. For each gate, determine: command, failure mode (hard/soft), threshold (for soft fails).
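For the time budget in 2d, one common pattern is to wrap the expensive step in coreutils `timeout` inside `auto/run.sh`; the command and the 600-second limit below are placeholders, not values this skill prescribes:

```bash
# Kill the run if it blows past the hard timeout (placeholder command and limit)
TIMEOUT_S=600
timeout "$TIMEOUT_S" python train.py > run.log 2>&1 || {
  echo "Experiment failed or exceeded ${TIMEOUT_S}s budget" >&2
  exit 1
}
```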
Present the complete design as a single summary:
## Autoloop Design
**Goal**: {what we're optimizing}
**Mutable file**: `{path}` — {description}
**Primary metric**: {metric_name} ({units}, {direction} is better)
**Secondary metrics**: {name1} ({units}), {name2} ({units}) — tracked, not optimized
**Quality gates**:
1. {gate1_name}: `{command}` — {hard/soft fail}
2. {gate2_name}: `{command}` — {hard/soft fail, threshold if soft}
3. Benchmark: `{bench_command}`
**Time budget**: ~{budget} per experiment (timeout: {timeout})
**Files in scope**: {list}
**Off limits**: {list}
**Strategy**: {domain} — {brief description of change types}
Does this look right? I'll adjust anything before generating.
Wait for user confirmation before proceeding.
3a. Generate auto/run.sh — Read references/runner-script-template.sh and fill in quality gates + metric extraction from the design.
```bash
mkdir -p auto
```
The runner script structure:
- `#!/usr/bin/env bash` with `set -euo pipefail`
- `cd "$(dirname "$0")/.."`
- `METRIC key=value` output lines

Make it executable: `chmod +x auto/run.sh`
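For orientation only, here is a minimal sketch of the shape such a script might take for a hypothetical Python benchmark; the authoritative structure lives in references/runner-script-template.sh, and every command and metric name below is a placeholder:

```bash
#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")/.."              # run from the repo root regardless of caller

# Quality gates, fastest first (filled in from the design)
uv run pytest tests/unit -x -q || exit 1

# Benchmark (placeholder command)
start=$SECONDS
uv run python bench.py > bench.out 2>&1
elapsed=$(( SECONDS - start ))

# Structured output: one METRIC key=value line per metric
echo "METRIC wall_time_s=$elapsed"
echo "METRIC bench_score=$(tail -1 bench.out)"   # assumes bench.py prints the score last
```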
3b. Verify baseline — Run the script once and check:
```bash
./auto/run.sh > run.log 2>&1
echo "Exit code: $?"
grep '^METRIC ' run.log
```
Verify: exit code 0, METRIC lines present, and values that look reasonable (not NaN, not zero when they shouldn't be).
Do not proceed to generation until the baseline passes. If anything fails, debug it with the user.
Record the baseline commit hash: `git rev-parse --short HEAD`
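One hedged way to keep the baseline numbers at hand for later comparison (where the baseline is actually recorded is up to the generated program.md):

```bash
BASELINE_SHA=$(git rev-parse --short HEAD)
echo "Baseline at $BASELINE_SHA:"
grep '^METRIC ' run.log
```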
4a. Read the template — Read references/program-md-template.md.
4b. Read domain strategy — Read the appropriate section from references/domain-examples.md.
4c. Fill variables — Replace all {VARIABLE} placeholders with values from the design.
For the complete variable mapping, consult:
→ references/program-md-template.md (variables are documented inline)
Show the user the generated program.md content:
"Here's the program.md I'll write to your project root. Review it — once you confirm, I'll create the files."
Wait for user confirmation before writing.
On confirmation, write:
- `program.md` to the project root
- `results.tsv` with just the header row
- `results.tsv` and `run.log` to `.gitignore` (append if it exists, create if not, skip if already listed)

Do NOT git commit. Leave that to the user.
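A small sketch of the append-if-missing behaviour described above; how the agent actually implements it is its own call:

```bash
for f in results.tsv run.log; do
  touch .gitignore                                       # create if missing
  grep -qxF "$f" .gitignore || echo "$f" >> .gitignore   # skip if already listed
done
```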
| File | Purpose | Mutable by agent? |
|---|---|---|
| `auto/run.sh` | Quality gates + METRIC output | Never |
| `program.md` | Loop instructions + embedded progress log | Progress log only |
| `results.tsv` | Experiment ledger (append-only) | Append only, never committed |
Print to the user after file generation:
## Ready to Launch
To start:
1. Review `auto/run.sh` and `program.md`.
2. Start the loop:
```bash
claude --dangerously-skip-permissions -p "Read program.md and execute the loop protocol. Do not stop until I interrupt you."
```
3. Walk away. The agent will:
- Create a branch (autoloop/{tag})
- Establish baseline via ./auto/run.sh
- Loop: edit → run → measure → keep/revert
- Log every experiment to results.tsv
- Update the Progress Log in program.md
4. When you come back:
```bash
cat results.tsv                     # Full experiment trajectory
grep '^- ' program.md | tail -20    # Progress log of kept changes
git log --oneline                   # Which iterations were kept
git diff main..HEAD                 # Cumulative changes
```
5. If you like the results:
```bash
git checkout main
git merge autoloop/{tag}   # Or cherry-pick specific commits
```
For common issues (agent stops early, every experiment crashes, metric not improving, etc.), consult:
→ references/troubleshooting.md