From agent-loops
Runs population-based evolutionary search over ML model code using parallel SEARCH/REPLACE diffs, cascade-evaluated training runs, and MAP-Elites archiving across islands. For multi-variant exploration, not single-thread refinements.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agent-loops:alpha-evolveThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
> Reference (read if you need the algorithm's details): AlphaEvolve — https://arxiv.org/abs/2506.13131 ·
Reference (read if you need the algorithm's details): AlphaEvolve — https://arxiv.org/abs/2506.13131 · OpenEvolve (open-source impl) — https://github.com/algorithmicsuperintelligence/openevolve
A population-based evolutionary loop over a program. The artifact is the editable model code; a
child is one analysis-informed SEARCH/REPLACE diff to a parent, and the feedback signal is a
cascade-evaluated training run (<metric>, smoke→full). Children are placed in a MAP-Elites
archive across islands (complexity × diversity axes), so a child survives by being either better or
more novel, not just better. The discipline this enforces: diversity is preserved, not collapsed —
diverse high performers co-exist instead of one local optimum winning. You are the controller: sample
a parent + inspirations, spawn parallel Mutators to propose and evaluate children, place them, migrate
between islands, checkpoint. Loops to a fixed compute budget or until interrupted.
Use this for parallel, diversity-preserving search over a model/program where many variants explore at once and the archive keeps the illuminated frontier. Default to broad island coverage; if quality stalls, bias selection toward exploiting top elites; if coverage stalls, bias toward empty cells. Not for the sequential autoresearch loops (one change at a time), and not for fixing a known anomaly.
The cast (both in this folder): roles/Mutator.md produces + cascade-evaluates one child (the
generation step); schemas/result.schema.json is the result a Mutator returns.
Resolve bindings interactively. If loop.run.yaml exists in the working dir, load it, confirm the
values in one line, and skip to the loop. Otherwise: on Claude Code (the AskUserQuestion tool is
available, <host> = claude-code) infer a likely value for each binding and present it as the
recommended option; on other hosts (<host> = other) ask each as a quoted plain-text prompt. Then
write loop.run.yaml (format: examples/run.example.yaml) and confirm the values before creating any
other files. <host> also decides execution: Claude Code spawns real Agent Mutators in parallel
(capped at <concurrency>); other hosts degrade to running a generation's children serially (identical
algorithm).
Probe the box first (mandatory — measure, never assume <concurrency>). Record and report:
<cores>: python3 -c "import os; print(os.cpu_count())".<ram_gb>: macOS sysctl -n hw.memsize; Linux grep MemTotal /proc/meminfo.<accelerator>/<vram>/<gpu_count>: nvidia-smi --query-gpu=name,memory.total,count --format=csv (NVIDIA); else macOS Apple GPU/MPS; else CPU-only.| binding | meaning | default | how to infer |
|---|---|---|---|
<metric> + <metric_direction> | scalar to optimize; min/maximize | — | ask; scan eval output for the reported metric |
<run_cmd> / <entrypoint> | command for one training run (the evaluator) | — | pyproject.toml/.venv/uv/README |
<editable_files> | the program being evolved (e.g. model.py, config.yaml); never the harness or data | — | ask explicitly — this is the code that gets mutated; do not default it (multi-select on Claude Code) |
<sandbox_root> | where lae/ is created | ./sandbox | — |
<gate> + <budget> | one full run's size: time/epochs + amount; the FIXED eval budget applied to every program | — | identify the duration key now (e.g. train.epochs) so the controller can override it |
<total_budget> | total compute = number of full training runs (or wall-clock minutes); the single cost dial | — | ask |
<concurrency> | parallel evaluations C | derived from the probe | CPU-only → max(1, <cores>//4); single GPU/MPS → 1 (ask if more fit <vram>); multi-GPU → <gpu_count> (pin one child/GPU) |
num_generations is derived: ceil(<total_budget> / <concurrency>). The cascade is derived
from <budget> (not asked): smoke = ~1 epoch / a small subset, full = <budget>, gate = child's
smoke <metric> ≥ parent's smoke. <budget>/<metric>/eval split are FIXED — never mutation targets
(a child may not "train longer" to look better); changing them means re-running the whole loop.
Advanced (opt-in). Ask one yes/no: "Use defaults for the evolutionary settings, or customize?"
Defaults are faithful to AlphaEvolve/OpenEvolve — use them and ask nothing more. Only on "customize"
ask for each (showing the default as recommended): num_islands (4), num_top (3), num_diverse (2),
num_bins (10), migration_interval (5), diversity_reference_size (10), pop_per_island (40),
seed (42). Axes are fixed: complexity × diversity. See examples/run.example.yaml for the shape.
Print the resolved bindings + the probe + derived num_generations, and do not create files or
launch until the user confirms. Then initialise the sandbox (header rows only; programs/ is created
as children are evaluated):
<sandbox_root>/lae/
├── archive.tsv ← current elites = program database + checkpoint
├── history.tsv ← append-only record of every child
├── leaderboard.md ← rendered UI
└── programs/ ← one self-contained dir per program
You maintain num_islands MAP-Elites maps in archive.tsv, the append-only history.tsv, running
per-axis percentile stats, and leaderboard.md. You are the sole writer of all shared logs —
Mutators only return results, so there are no write races. Copy this checklist and tick items off:
num_generations derived.<editable_files>) + optionally a few stochastic variants; cascade-evaluate; place in the archive.<concurrency> tasks (round-robin island, seeded-rule parent, top num_top + num_diverse most-diverse inspirations); make each child dir by copying the parent program + harness.C Mutators (spawn-or-degrade), each with roles/Mutator.md, parent code, inspirations, parent artifacts, its child dir, and the smoke/full budgets.history.tsv row; if evaluated, compute its niche → cell and place it in the island map iff <metric> is better (kept=y); record smoke_dropped/crash without placing.leaderboard.md; checkpoint (archive.tsv is the checkpoint); print a status line.migration_interval generations: ring-migrate top elites island k → k+1.<total_budget> (reserve a little for synthesis), then synthesize the final report.Niche computation (you do this, from a child's sandbox):
complexity = trainable param count (fallback: total LOC of the editable files + any files
the child added), log10-scaled.diversity = average normalized edit distance of the program's concatenated code (editable +
added files) to a random sample of diversity_reference_size programs from its island (vs the
baseline if the island is near-empty). Higher = more novel.scaled = clamp01((v − p5)/(p95 − p5)); bin = min(num_bins−1, int(scaled × num_bins)); cell = (complexity_bin, diversity_bin). Re-bin existing elites when a percentile
shifts enough to move an edge (keep the higher <metric> on collisions; the archive is small).The Mutator's prompt (the sampler): parent code + inspirations + the parent's rendered artifacts
(<metric>, per-class accuracy, loss curve, stderr) + the instruction to return one SEARCH/REPLACE
diff. Single harness model — no LLM ensemble. The Mutator applies its diff in the child dir,
cascade-evaluates at the FIXED <budget> (the controller injects/caps the duration key on the run
command), and returns a result validated against schemas/result.schema.json:
{"child_id": "g3-i1-a2", "parent_id": "g1-i1-a0", "approach_summary": "add BatchNorm after conv2",
"sandbox_path": "<sandbox_root>/lae/programs/g3-i1-a2", "status": "evaluated",
"smoke_metric": 0.61, "metric": 0.71}
status ∈ {evaluated, smoke_dropped, crash}; metric is null unless evaluated. Mutators
compute nothing about the archive — the controller derives every niche from the sandbox.
Program sandboxes. A parallel population doesn't map onto branches, so every program is a
self-contained, fully-runnable dir <sandbox_root>/lae/programs/<child_id>/; the archive references it
by id. Build each child dir by copying real files (the parent's <editable_files>, then apply the
diff, plus the harness/entrypoint code it imports) and evaluate from inside it
(cd <child_dir> && <entrypoint>). Symlink only large read-only data, never the entrypoint or any
imported .py: Python resolves a symlinked script's __file__ to the link target, so sys.path[0]
becomes the original dir and the child's model.py/dataset.py are silently shadowed by the baselines
— every architecture/data mutation becomes a no-op (tell-tale: identical loss curves across different
"architectures"). Isolation sanity gate: the harness logs the param count / a code fingerprint;
flag any child whose code changed but whose metric/loss curve is identical to its parent's (shadowed),
and fix the sandbox before placing it. The repo working tree is never mutated.
Final synthesis. Report the global-best program + its lae/programs/<id>/ path, the illuminated
complexity×diversity map (coverage + who won each region), per-island bests, and 2–3 notably diverse
runners-up.
All three logs live under <sandbox_root>/lae/, tab-separated, never commas in free text. The
controller is the sole writer; resume from archive.tsv + history.tsv if interrupted.
archive.tsv — current elites + checkpoint. Header
island cell metric child_id parent_id sandbox_path complexity diversity:
island cell metric child_id parent_id sandbox_path complexity diversity
0 (2,7) 0.7100 g4-i0-a1 g2-i0-a3 lae/programs/g4-i0-a1 2.1M 0.71
history.tsv — every child, append-only. Header
gen island parent_id child_id smoke_metric full_metric status kept cell:
gen island parent_id child_id smoke_metric full_metric status kept cell
4 0 g2-i0-a3 g4-i0-a1 0.61 0.71 evaluated y (2,7)
4 1 g2-i1-a0 g4-i1-a2 0.40 - smoke_dropped n -
leaderboard.md — re-rendered each generation: global best + per-island coverage + the archive
ranked by <metric>. Report the best program at stop (not the last), the archive coverage, and a
few diverse runners-up. Leave lae/ untracked.
lae/programs/<child_id>/ dir — it may edit the copied
<editable_files> and create new files there, but never modify any file outside it (the repo, the
read-only harness, the data, other programs' dirs are ground truth or shared state).archive.tsv/history.tsv/leaderboard.md, so parallel
Mutators never race on the logs.<concurrency> comes from the probe + the user's confirmation — never assume the box; pin one
child per GPU on multi-GPU; if a run OOMs/thrashes, lower C and say so (don't rewrite a child's
config to fit), because the box's limit is real and rewriting the child corrupts the comparison.<budget> (epochs/time), <metric>, and the eval/test split are FIXED and out-of-bounds for
mutation. The controller injects <budget> on every run, overriding any duration the child set —
so "train longer" / change-the-metric / change-the-test-set can never win. Evolve the
model/optimizer/data pipeline, not the compute or the scoring; comparability across programs depends
on it..py into a child dir (it shadows the child's code
via sys.path[0]); copy harness code, symlink only data, and run the isolation sanity gate before
placing a child — a shadowed result is a phantom.<metric> is ground truth.<total_budget> (reserving a little for
synthesis) or interrupt; if coverage stalls bias toward empty cells, if quality stalls exploit top
elites. A child that overruns its gate is killed and recorded as crash.npx claudepluginhub gaasher/agent-loop-skills --plugin agent-loopsRuns propose-evaluate-iterate loop to optimize and evolve AI agent performance using LangSmith evaluations and git worktrees for isolation. Requires .evolver.json setup.
Starts an autonomous evolutionary code optimization run using Claude Code models as mutation operators, iteratively improving code via selection, crossover, and evaluation.
Evolves any measurable artifact (prompt, skill, code) through autonomous mutation-evaluate-gate loops. Supports GT case suites and scalar metric loops with automatic keep/revert.