Skill

evolve

Orchestrates multi-cycle code improvements on rubrics: forms hypotheses on low scores, validates with scouts, dispatches axis attacks, extracts patterns, persists belief model across sessions.

code-quality

developer-tools

npx claudepluginhub sethgammon/citadel --plugin citadel

Tool Access

This skill uses the workspace's default tool permissions.

Preview

**Use when:** You want sustained autonomous quality advancement — the director

Supporting Assets

__benchmarks__/happy-path-status.md__benchmarks__/no-rubric.md

SKILL.md

Similar Skills

improve

533

Runs autonomous loops to improve a target: scores against rubric, selects highest-leverage axis, attacks it, verifies changes, documents, and repeats until plateau or all axes >=8.0. Requires rubric file.

2 files

citadel

evolve

330

Runs autonomous improvement loops on codebases, selecting tasks from failing goals, test gaps, dep audits, perf profiles, and more via RPI cycles until dormancy.

13 files

agentops

harness:evolve

Runs propose-evaluate-iterate loop to optimize and evolve AI agent performance using LangSmith evaluations and git worktrees for isolation. Requires .evolver.json setup.

8 tools

harness-evolver

Stats

Stars533

Forks53

Last CommitMay 3, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

/evolve — Improvement Director

Orientation

Use when: You want sustained autonomous quality advancement — the director forms hypotheses, scouts before attacking, and builds a belief model that compounds across cycles. Runs until a natural ceiling, budget exhaustion, or you say stop.

Don't use when: You want a single scored loop (/improve), a known axis attacked directly (/improve --axis), or a one-time audit (/improve --score-only).

Key difference from /improve: /improve follows the rubric mechanically. /evolve asks why scores are where they are, validates those theories before spending fleet budget, and extracts cross-skill patterns that propagate to skills never directly attacked.

Invocation

/evolve {target}                  # run until ceiling, velocity drop, or budget
/evolve {target} --n={N}          # exactly N director cycles then stop
/evolve {target} --budget=${X}    # run until cumulative spend reaches $X
/evolve {target} --continue       # resume from saved director state
/evolve {target} --status         # show belief model, velocity, spend — no attack
/evolve {target} --axis={name}    # focus director on one axis (scout + attack only)

target maps to .planning/rubrics/{target}.md. If no rubric exists, run /improve {target} Phase 0 first — /evolve requires an approved rubric and will not auto-generate one.

Campaign Artifacts

All findings are externalized incrementally — written after every phase, not only at cycle end. A crashed or compacted session resumes with full context.

Artifact	Path	Contents
Director state	`.planning/evolve/{target}/director-state.json`	cycle count, spend, velocity history, current phase, halt status
Belief model	`.planning/evolve/{target}/belief-model.jsonl`	one record per (axis, skill) per cycle: score, hypothesis, evidence, confidence
Experiment log	`.planning/evolve/{target}/experiment-log.jsonl`	every experiment: hypothesis → prediction → actual delta → mechanism confirmed
Pattern library	`.planning/evolve/{target}/pattern-library.md`	transferable patterns: what change to what axis class caused what delta in which skills
Cycle digest	`.planning/evolve/{target}/cycle-{n}-digest.md`	human-readable per-cycle summary for review
Global patterns	`.planning/research/patterns.md`	cross-target patterns written outside campaign scope; available to future sessions and other targets

Create .planning/evolve/{target}/ on first invocation. Create .planning/research/ if absent.

Cycle Digest Format

# Cycle {n} — {target} | {date}

## Scores
| Axis | Prior | This Cycle | Delta |

## Hypotheses
| ID | Axis | Hypothesis | Scout Result | Confidence |

## What Was Attacked
| Axis | Skill | Delta | Mechanism Confirmed |

## Patterns Discovered This Cycle
- {pattern}: {evidence}

## Belief Model Updates
- {hypothesis confirmed / rejected / revised}

## Spend: ${cycle} this cycle | ${cumulative} cumulative | Velocity: {v}

Director Cycle Protocol

Phase 1: Survey

Run /improve {target} --score-only. Record scores to belief model with delta from prior cycle (empty on cycle 1). Flag any axis that dropped since last cycle as regression-watch — these are checked first in Phase 2.

Phase 2: Hypothesize

For every axis below 8.0, generate one primary hypothesis in this form:

HYPOTHESIS: {axis} scores {n}/10 because {specific mechanism},
            not because {common misread}.
PREDICTION: Fixing {mechanism} will raise score ≥ {delta} across {N} skills.
FALSIFICATION: If we apply {change} and score does not rise > 0.5, hypothesis rejected.

Draw hypotheses from: evaluator justifications in Phase 1, prior evidence in the belief model, and programmatic check failures. Do not hypothesize from score alone — the number is the symptom.

Write each hypothesis to the experiment log as { id, status: "pending", ... }.

Skip hypothesis generation for an axis if the belief model already has a confidence >= 0.8 confirmed hypothesis for it that has not yet been attacked.

Phase 3: Scout

For axes below 7.0, or axes with unconfirmed hypotheses: dispatch one scout agent per hypothesis. Scouts read — they do not modify files.

Each scout returns:

{ "hypothesis_id": "...", "confirmed": true, "evidence": "...", "confidence": 0.85 }

Scout confidence protocol: Scouts read relevant files only — no edits, no test runs. Assign confidence:

0.9+: mechanism is directly observable (explicit absence, missing section, wrong value in file)
0.7–0.89: strong indirect evidence from 2+ corroborating observations
0.4–0.69: single observation that supports the hypothesis; alternative explanations plausible
< 0.4: no direct evidence found; hypothesis is speculative from this file set

Run scouts in parallel. Update experiment log:

confidence >= 0.7 → confirmed
confidence 0.4–0.69 → needs-evidence (do not attack; add to next cycle)
confidence < 0.4 → rejected

Skip Phase 3 for any hypothesis already confirmed at confidence >= 0.8 in the belief model from a prior cycle.

Phase 4: Prioritize

For each confirmed hypothesis compute:

EV = (delta_estimate × axis_weight × confidence) / (effort_tier × collision_multiplier)

effort_tier: low=1.0, medium=1.5, high=2.5
collision_multiplier: 2.0 if axis shares primary files with another attack in this cycle

Select top K axes where K = min(confirmed count, 4). Document selection rationale in cycle digest. If --axis was set, skip ranking — attack only that axis.

Phase 5: Fleet Attack

Dispatch one agent per selected axis in an isolated worktree (Agent tool, isolation: "worktree"). Each agent receives:

The confirmed hypothesis and its falsification criterion
The specific files to modify
Verification oracle: node scripts/run-with-timeout.js 300 node scripts/test-all.js

Each agent returns a structured result:

{
  "axis": "...", "skill": "...",
  "delta": 1.2,
  "mechanism_confirmed": true,
  "files_changed": ["..."],
  "approach": "..."
}

Merge rules:

Non-conflicting worktrees: merge all
Conflicting worktrees (same file): keep higher delta, discard lower
Regression on any previously passing programmatic check: abort that worktree, do not merge
mechanism_confirmed: false (score improved but not via predicted mechanism): record as incidental_improvement, mark hypothesis as needs-revision

Commit each merged worktree with a message citing the hypothesis ID.

Phase 6: Synthesize

For each result:

Update belief model — append evidence record for (axis, skill)
Update experiment log — mark verified / refuted / incidental
Identify transferable patterns:

PATTERN: {axis_class} | Mechanism: {what caused improvement} | Delta: {avg} across {N} instances | Applies to: {skill list} | Confidence: high/medium/low

Write patterns to .planning/evolve/{target}/pattern-library.md.

Phase 7: Cross-Pollinate

For each confidence: high pattern, or any pattern confirmed in 2+ skills: apply to all other applicable skills as targeted single-file edits — without running a full attack cycle.

Run verification oracle per cross-pollinated skill. Commit only if all programmatic checks pass and no axis drops > 0.3. Revert on regression; mark pattern as context-dependent.

Write patterns that apply beyond this target to .planning/research/patterns.md.

Phase 8: Loop or Halt

Compute learning velocity:

velocity = Σ(delta across all attacked axes this cycle) / axes_attacked

Append to director-state.json velocity history.

Halt conditions (check in order):

--n cycles completed
--budget reached (cumulative cost ≥ limit)
All axes ≥ 9.0 across all scored skills
velocity < 0.2 for 3 consecutive cycles AND no needs-evidence hypotheses remain
Level-up triggered (see below)
User says stop

On velocity drop, before halting: attempt one axis-class switch — attack the highest-EV axis from a category not touched in the last 2 cycles. If velocity is still < 0.2 after that cycle, halt.

On level-up trigger (no axis improved > 0.5 for 2 loops, ≥ 3 loops run, no programmatic failures): write level-up proposals to .planning/rubrics/{target}-proposals.md, set status: level-up-pending in director state, halt. The campaign resumes only after the human approves and edits the live rubric.

On normal loop: increment cycle, compress prior cycle findings to continuation context, return to Phase 1.

Unlimited Mode

No --n and no --budget = unlimited. Declare before starting:

/evolve running in unlimited mode.
Target: {target} | Exit: all axes ≥ 9.0 OR velocity < 0.2 for 3 cycles
Estimated cost: $12–18/cycle | Spend so far: $0
To halt after current cycle: type /stop or press Escape.

Every cycle, report:

Cycle {n} complete. Spend: ${cycle} | Cumulative: ${total} | Velocity: {v}

When context approaches compression territory (session duration > 30 min or /compact recommended): write continuation checkpoint to director state, surface the --continue command. The next session picks up exactly where this one stopped.

For overnight / unattended runs: combine with /daemon. The director is daemon-compatible — daemon calls /evolve {target} --continue each session. Set --budget to cap total spend.

Fringe Cases

.planning/ does not exist: error — run /do setup first to initialize the harness state directory, then retry.
No rubric: error — run /improve {target} Phase 0 first. List available targets in .planning/rubrics/ as hint.
No prior scores in belief model: proceed from cycle 1; all deltas empty on first survey. Expected.
All scouts return needs-evidence: attack the top-EV axis anyway under low-confidence flag; record as exploratory. Mark result regardless.
Scout agent hangs or times out (dispatched scout never returns): After 10 minutes without a response, log the scout as status: timed-out in the experiment log with confidence: 0. Proceed with the remaining returned scouts. Never let a hung scout block the cycle — if all scouts time out, treat as "all scouts return needs-evidence" and attack the top-EV axis under low-confidence flag.
All axes collide (every axis shares files): serialize top 2 axes; parallelize remainder. Log collision.
Cross-pollination causes regression: revert that skill, mark pattern context-dependent, do not propagate further.
Level-up mid-campaign: pause, write proposals, set level-up-pending. /evolve --continue after human approval resumes cycle numbering from where it stopped.
Budget overrun risk: if projected spend for current cycle would exceed --budget by > 20%, warn and confirm before dispatching fleet.
--continue with no director state: error — no campaign to resume. Suggest /evolve {target} to start fresh.
Pattern library > 50 entries: consolidate — group by axis class, merge similar patterns, keep highest-confidence instance of each class. Log consolidation.
Zero skills match target rubric: error with message listing all .planning/rubrics/*.md targets.

Quality Gates

Every hypothesis must have an explicit falsification criterion before Phase 3
Scouts must run before fleet dispatch on any unconfirmed hypothesis
Belief model written after every phase, not only end of cycle
Cross-pollination requires passing verification oracle before commit
Regression on any previously-passing axis aborts that worktree commit
Pattern library and global patterns updated at every cycle end, even zero-improvement cycles
Cycle digest written even on abort or no-change cycles

Contextual Gates

Disclosure:

State mode (unlimited / fixed / budget) and exit conditions before first cycle
Estimate $12–18 per full cycle before starting
Report spend and velocity at end of every cycle
Confirm before continuing if cumulative spend exceeds $50

Reversibility: Red. Cross-pollination modifies many files across the repo; level-up rewrites rubric anchors permanently. Each commit is individually revertable; high volume. Range: git revert {first}^..{last}.

Trust gates:

Novice (0-4 sessions): --status and --n=1 only; unlimited blocked
Familiar (5-19): up to --n=5; unlimited requires explicit --budget cap
Trusted (20+): no cap; confirm if projected total > $100

Exit Protocol

---HANDOFF---
- Target: {target} | Cycles: {n} | Spend: ${total} | Mode: {unlimited/n/budget}
- Axes improved: {list with deltas}
- Belief model: .planning/evolve/{target}/belief-model.jsonl ({N} confirmed, {M} rejected)
- Pattern library: .planning/evolve/{target}/pattern-library.md ({N} patterns)
- Global patterns: .planning/research/patterns.md
- Cycle digests: .planning/evolve/{target}/cycle-*-digest.md
- Halt reason: {ceiling/velocity/budget/n-complete/user-stop/level-up-pending}
- Level-up proposals: {path or N/A}
- Reversibility: red — {N} commits across {M} files; revert range: git revert {range}
- Recommended next: {level-up and re-run / new target / done}
---