From citadel
Orchestrates multi-cycle code improvements on rubrics: forms hypotheses on low scores, validates with scouts, dispatches axis attacks, extracts patterns, persists belief model across sessions.
npx claudepluginhub sethgammon/citadel --plugin citadel
This skill uses the workspace's default tool permissions.
Use when: You want sustained autonomous quality advancement — the director forms hypotheses, scouts before attacking, and builds a belief model that compounds across cycles. Runs until a natural ceiling, budget exhaustion, or you say stop.
Don't use when: You want a single scored loop (/improve), a known axis
attacked directly (/improve --axis), or a one-time audit (/improve --score-only).
Key difference from /improve: /improve follows the rubric mechanically.
/evolve asks why scores are where they are, validates those theories before
spending fleet budget, and extracts cross-skill patterns that propagate to skills
never directly attacked.
/evolve {target} # run until ceiling, velocity drop, or budget
/evolve {target} --n={N} # exactly N director cycles then stop
/evolve {target} --budget=${X} # run until cumulative spend reaches $X
/evolve {target} --continue # resume from saved director state
/evolve {target} --status # show belief model, velocity, spend — no attack
/evolve {target} --axis={name} # focus director on one axis (scout + attack only)
target maps to .planning/rubrics/{target}.md.
If no rubric exists, run /improve {target} Phase 0 first — /evolve requires
an approved rubric and will not auto-generate one.
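The target-to-rubric mapping above can be sketched as a small lookup. This is a minimal illustration in Python, not citadel's implementation; the function name `rubric_path` and the `root` parameter are hypothetical.

```python
from pathlib import Path


def rubric_path(root: Path, target: str) -> Path:
    """Resolve {target} to .planning/rubrics/{target}.md or fail with a hint."""
    path = root / ".planning" / "rubrics" / f"{target}.md"
    if not path.is_file():
        # No auto-generation: list the rubrics that do exist as a hint.
        available = sorted(p.stem for p in path.parent.glob("*.md"))
        raise FileNotFoundError(
            f"No rubric for {target!r}; run /improve {target} Phase 0 first. "
            f"Available targets: {available}"
        )
    return path
```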
All findings are externalized incrementally — written after every phase, not only at cycle end. A crashed or compacted session resumes with full context.
| Artifact | Path | Contents |
|---|---|---|
| Director state | .planning/evolve/{target}/director-state.json | cycle count, spend, velocity history, current phase, halt status |
| Belief model | .planning/evolve/{target}/belief-model.jsonl | one record per (axis, skill) per cycle: score, hypothesis, evidence, confidence |
| Experiment log | .planning/evolve/{target}/experiment-log.jsonl | every experiment: hypothesis → prediction → actual delta → mechanism confirmed |
| Pattern library | .planning/evolve/{target}/pattern-library.md | transferable patterns: what change to what axis class caused what delta in which skills |
| Cycle digest | .planning/evolve/{target}/cycle-{n}-digest.md | human-readable per-cycle summary for review |
| Global patterns | .planning/research/patterns.md | cross-target patterns written outside campaign scope; available to future sessions and other targets |
Create .planning/evolve/{target}/ on first invocation. Create .planning/research/ if absent.
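One way to picture incremental externalization is an append-only JSON Lines write after each phase. The sketch below assumes Python; `ensure_campaign_dirs` and `append_jsonl` are hypothetical helper names, and the record fields in the usage note are illustrative only.

```python
import json
from pathlib import Path


def ensure_campaign_dirs(root: Path, target: str) -> Path:
    """Create the per-target and research directories; return the per-target one."""
    planning = root / ".planning"
    if not planning.is_dir():
        # The harness state directory is expected to exist already.
        raise FileNotFoundError(f"{planning} does not exist; run /do setup first")
    evolve_dir = planning / "evolve" / target
    evolve_dir.mkdir(parents=True, exist_ok=True)
    (planning / "research").mkdir(exist_ok=True)
    return evolve_dir


def append_jsonl(path: Path, record: dict) -> None:
    """Append one JSON object per line, so earlier records survive a crash."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

For example, `append_jsonl(evolve_dir / "belief-model.jsonl", {"axis": "clarity", "score": 6.5})` adds one belief record without rewriting the file.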
# Cycle {n} — {target} | {date}
## Scores
| Axis | Prior | This Cycle | Delta |
## Hypotheses
| ID | Axis | Hypothesis | Scout Result | Confidence |
## What Was Attacked
| Axis | Skill | Delta | Mechanism Confirmed |
## Patterns Discovered This Cycle
- {pattern}: {evidence}
## Belief Model Updates
- {hypothesis confirmed / rejected / revised}
## Spend: ${cycle} this cycle | ${cumulative} cumulative | Velocity: {v}
Run /improve {target} --score-only. Record scores to belief model with delta
from prior cycle (empty on cycle 1). Flag any axis that dropped since last cycle
as regression-watch — these are checked first in Phase 2.
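The delta and regression-watch bookkeeping can be sketched as follows. This is an illustration under assumptions: scores are held as plain axis-to-score dicts, and `score_deltas` is a hypothetical name.

```python
from typing import Optional


def score_deltas(current: dict, prior: Optional[dict]):
    """Return per-axis deltas and the axes that dropped since the prior cycle."""
    deltas = {}
    regression_watch = []
    for axis, score in current.items():
        if not prior or axis not in prior:
            deltas[axis] = None  # no prior score, so the delta stays empty
            continue
        delta = round(score - prior[axis], 2)
        deltas[axis] = delta
        if delta < 0:
            regression_watch.append(axis)  # checked first in Phase 2
    return deltas, regression_watch
```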
For every axis below 8.0, generate one primary hypothesis in this form:
HYPOTHESIS: {axis} scores {n}/10 because {specific mechanism},
not because {common misread}.
PREDICTION: Fixing {mechanism} will raise score ≥ {delta} across {N} skills.
FALSIFICATION: If we apply {change} and score does not rise > 0.5, hypothesis rejected.
Draw hypotheses from: evaluator justifications in Phase 1, prior evidence in the belief model, and programmatic check failures. Do not hypothesize from score alone — the number is the symptom.
Write each hypothesis to the experiment log as { id, status: "pending", ... }.
Skip hypothesis generation for an axis if the belief model already has a
confidence >= 0.8 confirmed hypothesis for it that has not yet been attacked.
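A sketch of the hypothesis record and the skip rule, assuming Python. Only `id` and `status: "pending"` come from the text above; the other dataclass fields, the `attacked` key, and the function name are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class Hypothesis:
    id: str
    axis: str
    mechanism: str        # the "because {specific mechanism}" part
    common_misread: str   # the "not because {common misread}" part
    predicted_delta: float
    status: str = "pending"


def needs_new_hypothesis(axis: str, score: float, beliefs: list) -> bool:
    """Hypothesize only for axes below 8.0 that lack a strong unattacked belief."""
    if score >= 8.0:
        return False
    for b in beliefs:
        if (b["axis"] == axis and b["status"] == "confirmed"
                and b["confidence"] >= 0.8 and not b["attacked"]):
            return False  # a confirmed hypothesis is already waiting to be attacked
    return True
```

`asdict(Hypothesis(...))` yields a dict that can be written to the experiment log as one record.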
For axes below 7.0, or axes with unconfirmed hypotheses: dispatch one scout agent per hypothesis. Scouts read — they do not modify files.
Each scout returns:
{ "hypothesis_id": "...", "confirmed": true, "evidence": "...", "confidence": 0.85 }
Scout confidence protocol: Scouts read relevant files only — no edits, no test runs. Assign confidence:
Run scouts in parallel. Update experiment log:
- confidence >= 0.7 → confirmed
- confidence 0.4–0.69 → needs-evidence (do not attack; add to next cycle)
- confidence < 0.4 → rejected

Skip Phase 3 for any hypothesis already confirmed at confidence >= 0.8 in the belief model from a prior cycle.
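The confidence thresholds map to a three-way classification. A minimal sketch, assuming Python; `classify_scout` is a hypothetical name, and values between 0.69 and 0.7 are treated here as needs-evidence.

```python
def classify_scout(confidence: float) -> str:
    """Map a scout's confidence to the experiment-log status."""
    if confidence >= 0.7:
        return "confirmed"
    if confidence >= 0.4:
        return "needs-evidence"  # do not attack; carry into the next cycle
    return "rejected"
```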
For each confirmed hypothesis compute:
EV = (delta_estimate × axis_weight × confidence) / (effort_tier × collision_multiplier)
- effort_tier: low=1.0, medium=1.5, high=2.5
- collision_multiplier: 2.0 if axis shares primary files with another attack in this cycle

Select top K axes where K = min(confirmed count, 4). Document selection rationale in cycle digest. If --axis was set, skip ranking and attack only that axis.
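The EV formula and top-K selection can be sketched like this. It assumes Python, and it assumes a collision_multiplier of 1.0 when no files are shared, which the text does not state. The function names and the `ev` dict layout are hypothetical.

```python
EFFORT_TIER = {"low": 1.0, "medium": 1.5, "high": 2.5}


def expected_value(delta_estimate: float, axis_weight: float, confidence: float,
                   effort: str, collides: bool) -> float:
    """EV = (delta_estimate * axis_weight * confidence) / (effort_tier * collision_multiplier)."""
    # Assumption: the multiplier is 1.0 when there is no file collision.
    collision_multiplier = 2.0 if collides else 1.0
    return (delta_estimate * axis_weight * confidence) / (
        EFFORT_TIER[effort] * collision_multiplier
    )


def select_axes(confirmed: list, max_k: int = 4) -> list:
    """Rank confirmed hypotheses by EV and keep the top K = min(count, max_k)."""
    ranked = sorted(confirmed, key=lambda h: expected_value(**h["ev"]), reverse=True)
    k = min(len(ranked), max_k)
    return [h["axis"] for h in ranked[:k]]
```

As a worked case: delta_estimate 1.5, axis_weight 1.0, confidence 0.8, medium effort, and no collision gives EV = 1.2 / 1.5 = 0.8. The same attack with a file collision drops to 0.4.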
Dispatch one agent per selected axis in an isolated worktree
(Agent tool, isolation: "worktree"). Each agent receives:
node scripts/run-with-timeout.js 300 node scripts/test-all.js

Each agent returns a structured result:
{
"axis": "...", "skill": "...",
"delta": 1.2,
"mechanism_confirmed": true,
"files_changed": ["..."],
"approach": "..."
}
Merge rules:
- mechanism_confirmed: false (score improved but not via predicted mechanism): record as incidental_improvement, mark hypothesis as needs-revision

Commit each merged worktree with a message citing the hypothesis ID.
For each result:
- verified / refuted / incidental

PATTERN: {axis_class} | Mechanism: {what caused improvement} | Delta: {avg} across {N} instances | Applies to: {skill list} | Confidence: high/medium/low
Write patterns to .planning/evolve/{target}/pattern-library.md.
For each confidence: high pattern, or any pattern confirmed in 2+ skills:
apply to all other applicable skills as targeted single-file edits — without
running a full attack cycle.
Run verification oracle per cross-pollinated skill. Commit only if all
programmatic checks pass and no axis drops > 0.3. Revert on regression; mark
pattern as context-dependent.
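The commit gate for a cross-pollinated skill reduces to two checks. A minimal sketch, assuming Python and axis-to-score dicts taken before and after the edit; the function and parameter names are hypothetical.

```python
def cross_pollination_verdict(checks_pass: bool, before: dict, after: dict,
                              max_drop: float = 0.3) -> str:
    """Return "commit" only if all checks pass and no axis drops by more than max_drop."""
    worst_drop = max(
        (before[axis] - after[axis] for axis in before if axis in after),
        default=0.0,
    )
    if checks_pass and worst_drop <= max_drop:
        return "commit"
    # On "revert" the caller would mark the pattern context-dependent.
    return "revert"
```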
Write patterns that apply beyond this target to .planning/research/patterns.md.
Compute learning velocity:
velocity = Σ(delta across all attacked axes this cycle) / axes_attacked
Append to director-state.json velocity history.
Halt conditions (check in order):
- --n cycles completed
- --budget reached (cumulative cost ≥ limit)
- velocity < 0.2 for 3 consecutive cycles AND no needs-evidence hypotheses remain

On velocity drop, before halting: attempt one axis-class switch by attacking the highest-EV axis from a category not touched in the last 2 cycles. If velocity is still < 0.2 after that cycle, halt.
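The velocity formula and the ordered halt checks can be sketched together. This assumes Python; the function names and parameters are hypothetical, the returned labels reuse the halt reasons listed in the handoff block, and the one-time axis-class switch before a velocity halt is not modelled.

```python
from typing import Optional


def learning_velocity(deltas: list) -> float:
    """Mean delta across the axes attacked this cycle."""
    if not deltas:
        return 0.0  # assumption: a cycle with no attacks counts as zero velocity
    return sum(deltas) / len(deltas)


def halt_reason(cycles_done: int, n_limit: Optional[int], spend: float,
                budget: Optional[float], velocity_history: list,
                pending_needs_evidence: int) -> Optional[str]:
    """Check halt conditions in order; None means run another cycle."""
    if n_limit is not None and cycles_done >= n_limit:
        return "n-complete"
    if budget is not None and spend >= budget:
        return "budget"
    last_three = velocity_history[-3:]
    if (len(last_three) == 3 and all(v < 0.2 for v in last_three)
            and pending_needs_evidence == 0):
        return "velocity"
    return None
```

For instance, deltas of 1.2, 0.4, and 0.2 on three attacked axes give a velocity of 1.8 / 3 = 0.6.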
On level-up trigger (no axis improved > 0.5 for 2 loops, ≥ 3 loops run,
no programmatic failures): write level-up proposals to
.planning/rubrics/{target}-proposals.md, set status: level-up-pending in
director state, halt. The campaign resumes only after the human approves and
edits the live rubric.
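The level-up trigger can be read as a predicate over recent loops. A sketch in Python, assuming "for 2 loops" means the two most recent loops; the function name and its inputs are hypothetical.

```python
def level_up_triggered(best_delta_per_loop: list, programmatic_failures: int) -> bool:
    """best_delta_per_loop[i] is the largest single-axis improvement in loop i."""
    if len(best_delta_per_loop) < 3:
        return False  # at least 3 loops must have run
    if programmatic_failures > 0:
        return False
    # Assumption: "2 loops" refers to the last two consecutive loops.
    return all(delta <= 0.5 for delta in best_delta_per_loop[-2:])
```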
On normal loop: increment cycle, compress prior cycle findings to continuation context, return to Phase 1.
No --n and no --budget = unlimited. Declare before starting:
/evolve running in unlimited mode.
Target: {target} | Exit: all axes ≥ 9.0 OR velocity < 0.2 for 3 cycles
Estimated cost: $12–18/cycle | Spend so far: $0
To halt after current cycle: type /stop or press Escape.
Every cycle, report:
Cycle {n} complete. Spend: ${cycle} | Cumulative: ${total} | Velocity: {v}
When context approaches compression territory (session duration > 30 min or
/compact recommended): write continuation checkpoint to director state,
surface the --continue command. The next session picks up exactly where this
one stopped.
For overnight / unattended runs: combine with /daemon. The director is
daemon-compatible — daemon calls /evolve {target} --continue each session.
Set --budget to cap total spend.
- .planning/ does not exist: error. Run /do setup first to initialize the harness state directory, then retry.
- No rubric exists for the target: run /improve {target} Phase 0 first. List available targets in .planning/rubrics/ as hint.
- All scouts return needs-evidence: attack the top-EV axis anyway under low-confidence flag; record as exploratory. Mark result regardless.
- Scout times out: record status: timed-out in the experiment log with confidence: 0. Proceed with the remaining returned scouts. Never let a hung scout block the cycle; if all scouts time out, treat as "all scouts return needs-evidence" and attack the top-EV axis under low-confidence flag.
- Cross-pollination causes a regression: revert, mark the pattern context-dependent, do not propagate further.
- … level-up-pending. /evolve --continue after human approval resumes cycle numbering from where it stopped.
- … --budget by > 20%, warn and confirm before dispatching fleet.
- --continue with no director state: error, no campaign to resume. Suggest /evolve {target} to start fresh.
- … .planning/rubrics/*.md targets.

Disclosure:
Reversibility: Red. Cross-pollination modifies many files across the repo;
level-up rewrites rubric anchors permanently. Each commit is individually
revertable; high volume. Range: git revert {first}^..{last}.
Trust gates:
- --status and --n=1 only; unlimited blocked
- --n=5; unlimited requires explicit --budget cap

---HANDOFF---
- Target: {target} | Cycles: {n} | Spend: ${total} | Mode: {unlimited/n/budget}
- Axes improved: {list with deltas}
- Belief model: .planning/evolve/{target}/belief-model.jsonl ({N} confirmed, {M} rejected)
- Pattern library: .planning/evolve/{target}/pattern-library.md ({N} patterns)
- Global patterns: .planning/research/patterns.md
- Cycle digests: .planning/evolve/{target}/cycle-*-digest.md
- Halt reason: {ceiling/velocity/budget/n-complete/user-stop/level-up-pending}
- Level-up proposals: {path or N/A}
- Reversibility: red — {N} commits across {M} files; revert range: git revert {range}
- Recommended next: {level-up and re-run / new target / done}
---