From citadel
Directs multi-cycle improvement campaigns by forming hypotheses, scouting before attacking, and extracting transferable patterns. Use for sustained autonomous quality advancement across sessions.
Slash command
/citadel:evolve
Use when: You want sustained autonomous quality advancement — the director forms hypotheses, scouts before attacking, and builds a belief model that compounds across cycles. Runs until a natural ceiling, budget exhaustion, or you say stop.
Don't use when: You want a single scored loop (/improve), a known axis
attacked directly (/improve --axis), or a one-time audit (/improve --score-only).
Key difference from /improve: /improve follows the rubric mechanically.
/evolve asks why scores are where they are, validates those theories before
spending fleet budget, and extracts cross-skill patterns that propagate to skills
never directly attacked.
/evolve {target} # run until ceiling, velocity drop, or budget
/evolve {target} --n={N} # exactly N director cycles then stop
/evolve {target} --budget=${X} # run until cumulative spend reaches $X
/evolve {target} --continue # resume from saved director state
/evolve {target} --status # show belief model, velocity, spend — no attack
/evolve {target} --axis={name} # focus director on one axis (scout + attack only)
target maps to .planning/rubrics/{target}.md.
If no rubric exists, run /improve {target} Phase 0 first — /evolve requires
an approved rubric and will not auto-generate one.
All findings are externalized incrementally — written after every phase, not only at cycle end. A crashed or compacted session resumes with full context.
| Artifact | Path | Contents |
|---|---|---|
| Director state | .planning/evolve/{target}/director-state.json | cycle count, spend, velocity history, current phase, halt status |
| Belief model | .planning/evolve/{target}/belief-model.jsonl | one record per (axis, skill) per cycle: score, hypothesis, evidence, confidence |
| Experiment log | .planning/evolve/{target}/experiment-log.jsonl | every experiment: hypothesis → prediction → actual delta → mechanism confirmed |
| Pattern library | .planning/evolve/{target}/pattern-library.md | transferable patterns: what change to what axis class caused what delta in which skills |
| Cycle digest | .planning/evolve/{target}/cycle-{n}-digest.md | human-readable per-cycle summary for review |
| Global patterns | .planning/research/patterns.md | cross-target patterns written outside campaign scope; available to future sessions and other targets |
| Knowledge wiki | .planning/wiki/ | compiled wiki pages from /learn; integrates evolve discoveries across sessions |
Create .planning/evolve/{target}/ on first invocation. Create .planning/research/ if absent.
Cycle digest contents: scores table (axis, prior, this cycle, delta), hypotheses table (id, axis, hypothesis, scout result, confidence), what was attacked (axis, skill, delta, mechanism confirmed), patterns discovered this cycle, belief model updates, and the spend/velocity line. Full template: docs/QUALITY_LOOPS.md#cycle-digest-format.
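The incremental externalization described above can be sketched as an append-only JSONL writer. This is a minimal illustration, not Citadel's implementation: the helper names `append_record` and `read_records` and the record field names are assumptions, and the demo writes to a temporary directory rather than a real repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def append_record(root: Path, target: str, artifact: str, record: Dict) -> Path:
    """Append one JSON object as a single line to a campaign artifact (hypothetical helper)."""
    campaign_dir = root / ".planning" / "evolve" / target
    campaign_dir.mkdir(parents=True, exist_ok=True)  # created on first invocation
    path = campaign_dir / artifact
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path


def read_records(path: Path) -> List[Dict]:
    """Replay every record, e.g. when resuming after a crash or compaction."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


# Usage: the field names below are illustrative assumptions.
demo_root = Path(tempfile.mkdtemp())
belief_path = append_record(
    demo_root, "demo-target", "belief-model.jsonl",
    {"cycle": 1, "axis": "clarity", "skill": "example-skill",
     "score": 6.5, "hypothesis": "H1", "evidence": [], "confidence": 0.0},
)
print(read_records(belief_path))
```

Appending one line per phase means a partially completed cycle still leaves every earlier record readable.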
Run /improve {target} --score-only. Record scores to belief model with delta
from prior cycle (empty on cycle 1). Flag any axis that dropped since last cycle
as regression-watch — these are checked first in Phase 2.
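As a rough sketch of the delta and regression-watch step, assuming scores are held as plain axis-to-score mappings (the function names here are hypothetical, not part of the skill):

```python
from typing import Dict, List, Optional


def score_deltas(prior: Dict[str, float], current: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Delta per axis versus the prior cycle; None when no prior score exists (cycle 1)."""
    return {
        axis: (round(score - prior[axis], 2) if axis in prior else None)
        for axis, score in current.items()
    }


def regression_watch(deltas: Dict[str, Optional[float]]) -> List[str]:
    """Axes whose score dropped since the last cycle, to be checked first in Phase 2."""
    return sorted(axis for axis, delta in deltas.items() if delta is not None and delta < 0)


prior_scores = {"clarity": 7.0, "coverage": 8.5}
current_scores = {"clarity": 7.5, "coverage": 8.0, "safety": 6.0}
deltas = score_deltas(prior_scores, current_scores)
print(deltas, regression_watch(deltas))
```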
For every axis below 8.0, generate one primary hypothesis in this form:
HYPOTHESIS: {axis} scores {n}/10 because {specific mechanism},
not because {common misread}.
PREDICTION: Fixing {mechanism} will raise score ≥ {delta} across {N} skills.
FALSIFICATION: If we apply {change} and score does not rise > 0.5, hypothesis rejected.
Draw hypotheses from: evaluator justifications in Phase 1, prior evidence in the belief model, and programmatic check failures. Do not hypothesize from score alone — the number is the symptom.
Write each hypothesis to the experiment log as { id, status: "pending", ... }.
Skip hypothesis generation for an axis if the belief model already has a
confidence >= 0.8 confirmed hypothesis for it that has not yet been attacked.
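The hypothesis rules above could be sketched as follows. The record fields `status`, `attacked`, and `axis`, and both function names, are assumptions made for illustration; only the 8.0 and 0.8 thresholds and the `{ id, status: "pending", ... }` shape come from the text.

```python
from typing import Dict, List

HYPOTHESIS_THRESHOLD = 8.0  # axes below this score get a primary hypothesis
SKIP_CONFIDENCE = 0.8       # confirmed, not-yet-attacked hypotheses at or above this are reused


def needs_hypothesis(axis: str, score: float, belief_records: List[Dict]) -> bool:
    """True when Phase 2 should generate a new hypothesis for this axis."""
    if score >= HYPOTHESIS_THRESHOLD:
        return False
    for rec in belief_records:
        reusable = (
            rec.get("axis") == axis
            and rec.get("status") == "confirmed"
            and rec.get("confidence", 0.0) >= SKIP_CONFIDENCE
            and not rec.get("attacked", False)
        )
        if reusable:
            return False
    return True


def pending_entry(hyp_id: str, axis: str, score: float, mechanism: str, misread: str) -> Dict:
    """Build the experiment-log record for a freshly generated hypothesis."""
    text = f"HYPOTHESIS: {axis} scores {score}/10 because {mechanism}, not because {misread}."
    return {"id": hyp_id, "status": "pending", "axis": axis, "hypothesis": text}


entry = pending_entry("H1", "clarity", 6.5, "steps lack exit criteria", "the prose is too long")
print(entry["hypothesis"])
```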
For axes below 7.0, or axes with unconfirmed hypotheses: dispatch one scout agent per hypothesis. Scouts read — they do not modify files.
Each scout returns { "hypothesis_id", "confirmed", "evidence", "confidence" }
(schema example: docs/QUALITY_LOOPS.md#scout-result-schema).
Scout confidence protocol: Scouts read relevant files only — no edits, no test runs. Assign confidence:
Run scouts in parallel. Update experiment log:
- confidence >= 0.7 → confirmed
- confidence 0.4–0.69 → needs-evidence (do not attack; add to next cycle)
- confidence < 0.4 → rejected

Skip Phase 3 for any hypothesis already confirmed at confidence >= 0.8 in
the belief model from a prior cycle.
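A minimal sketch of the confidence-to-status mapping follows. The function name `classify_scout` is hypothetical, and treating every value in the half-open band from 0.4 up to 0.7 as needs-evidence is an assumption, since the text only states 0.4–0.69.

```python
def classify_scout(confidence: float) -> str:
    """Map a scout's confidence to the experiment-log status."""
    if confidence >= 0.7:
        return "confirmed"
    if confidence >= 0.4:
        # Assumption: values between 0.69 and 0.7 also land here.
        return "needs-evidence"  # do not attack; carry into the next cycle
    return "rejected"


for value in (0.9, 0.55, 0.1):
    print(value, classify_scout(value))
```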
For each confirmed hypothesis compute:
EV = (delta_estimate × axis_weight × confidence) / (effort_tier × collision_multiplier)
- effort_tier: low=1.0, medium=1.5, high=2.5
- collision_multiplier: 2.0 if axis shares primary files with another attack in this cycle

Select top K axes where K = min(confirmed count, 4). Document selection rationale
in cycle digest. If --axis was set, skip ranking and attack only that axis.
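The EV ranking can be sketched as below. Two things are assumptions rather than stated behavior: a collision_multiplier of 1.0 when no files are shared, and the candidate dictionary keys. The function names are hypothetical.

```python
from typing import Dict, List

EFFORT_TIER = {"low": 1.0, "medium": 1.5, "high": 2.5}


def expected_value(delta_estimate: float, axis_weight: float, confidence: float,
                   effort: str, collides: bool) -> float:
    """EV = (delta_estimate * axis_weight * confidence) / (effort_tier * collision_multiplier)."""
    collision_multiplier = 2.0 if collides else 1.0  # 1.0 is an assumed default
    return (delta_estimate * axis_weight * confidence) / (EFFORT_TIER[effort] * collision_multiplier)


def select_attacks(candidates: List[Dict], max_k: int = 4) -> List[Dict]:
    """Rank confirmed hypotheses by EV and keep K = min(confirmed count, 4)."""
    ranked = sorted(
        candidates,
        key=lambda c: expected_value(c["delta_estimate"], c["axis_weight"],
                                     c["confidence"], c["effort"], c["collides"]),
        reverse=True,
    )
    return ranked[:min(len(ranked), max_k)]


candidates = [
    {"axis": f"axis-{i}", "delta_estimate": 1.0, "axis_weight": float(i),
     "confidence": 0.8, "effort": "low", "collides": False}
    for i in range(1, 6)
]
print([c["axis"] for c in select_attacks(candidates)])
```

Dividing by the collision multiplier halves the EV of an attack that would touch the same primary files as another attack in the cycle, which pushes overlapping attacks down the ranking.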
Dispatch one agent per selected axis in an isolated worktree
(Agent tool, isolation: "worktree"). Each agent receives:
node scripts/run-with-timeout.js 300 node scripts/test-all.js

Each agent returns a structured result:
{ "axis", "skill", "delta", "mechanism_confirmed", "files_changed", "approach" }
(schema example: docs/QUALITY_LOOPS.md#fleet-agent-result-schema).
Merge rules:
- mechanism_confirmed: false (score improved but not via predicted mechanism): record as incidental_improvement, mark hypothesis as needs-revision

Commit each merged worktree with a message citing the hypothesis ID.
After committing changes to any SKILL.md (here or in Phase 7), run /reload-skills if the running Claude Code version supports it so the change is live this session; otherwise note that a fresh session is required before the updated skill takes effect.
For each result:
verified / refuted / incidental

PATTERN: {axis_class} | Mechanism: {what caused improvement} | Delta: {avg} across {N} instances | Applies to: {skill list} | Confidence: high/medium/low
Write patterns to .planning/evolve/{target}/pattern-library.md.
Compile into wiki: After writing to the pattern library, call
/learn --from-evolve {target} --cycle {n}. This compiles cycle discoveries
into .planning/wiki/ — integrating with findings from prior cycles and
campaigns rather than siloing them in the evolve directory. Skip if /learn
is not available in this session (log the skip, do not block the cycle).
For each confidence: high pattern, or any pattern confirmed in 2+ skills:
apply to all other applicable skills as targeted single-file edits — without
running a full attack cycle.
Run verification oracle per cross-pollinated skill. Commit only if all
programmatic checks pass and no axis drops > 0.3. Revert on regression; mark
pattern as context-dependent.
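The commit gate for a cross-pollinated skill might look like the sketch below. The function name and the per-axis score dictionaries are assumptions; the 0.3 threshold and the commit/revert outcomes come from the text.

```python
from typing import Dict


def cross_pollination_decision(checks_passed: bool, before: Dict[str, float],
                               after: Dict[str, float], max_drop: float = 0.3) -> str:
    """Return "commit" only if all programmatic checks pass and no axis drops by more than max_drop."""
    worst_drop = max((before[axis] - after[axis] for axis in before if axis in after), default=0.0)
    if checks_passed and worst_drop <= max_drop:
        return "commit"
    return "revert"  # the caller would also mark the pattern as context-dependent


before = {"clarity": 8.0, "coverage": 7.0}
print(cross_pollination_decision(True, before, {"clarity": 7.8, "coverage": 7.4}))
```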
Write patterns that apply beyond this target to .planning/research/patterns.md.
Compute learning velocity:
velocity = Σ(delta across all attacked axes this cycle) / axes_attacked
Append to director-state.json velocity history.
Halt conditions (check in order):
1. --n cycles completed
2. --budget reached (cumulative cost ≥ limit)
3. velocity < 0.2 for 3 consecutive cycles AND no needs-evidence hypotheses remain

On velocity drop, before halting: attempt one axis-class switch by attacking the highest-EV axis from a category not touched in the last 2 cycles. If velocity is still < 0.2 after that cycle, halt.
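Velocity and the ordered halt checks can be sketched as follows. This is an illustration under assumptions: the function names and argument shapes are hypothetical, the returned labels reuse the halt reasons from the handoff template, and the one-time axis-class switch before a velocity halt is left to the caller.

```python
from typing import List, Optional


def learning_velocity(deltas: List[float]) -> float:
    """Sum of deltas across attacked axes divided by the number of axes attacked."""
    return sum(deltas) / len(deltas) if deltas else 0.0


def halt_reason(cycles_done: int, n_limit: Optional[int], spend: float,
                budget: Optional[float], velocity_history: List[float],
                open_needs_evidence: int) -> Optional[str]:
    """Check halt conditions in the documented order; None means keep going."""
    if n_limit is not None and cycles_done >= n_limit:
        return "n-complete"
    if budget is not None and spend >= budget:
        return "budget"
    recent = velocity_history[-3:]
    if len(recent) == 3 and all(v < 0.2 for v in recent) and open_needs_evidence == 0:
        return "velocity"
    return None


print(learning_velocity([0.5, 1.0, 0.0]))
print(halt_reason(2, None, 30.0, 100.0, [0.6, 0.1, 0.1], 0))
```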
On level-up trigger (no axis improved > 0.5 for 2 loops, ≥ 3 loops run,
no programmatic failures): write level-up proposals to
.planning/rubrics/{target}-proposals.md, set status: level-up-pending in
director state, halt. The campaign resumes only after the human approves and
edits the live rubric.
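One possible reading of the level-up trigger is sketched below. It assumes "for 2 loops" means the two most recent loops and that per-loop deltas are available as axis-to-delta mappings; the function name is hypothetical.

```python
from typing import Dict, List


def level_up_triggered(delta_history: List[Dict[str, float]], programmatic_failures: int) -> bool:
    """True when at least 3 loops ran, nothing failed, and no axis gained > 0.5 in the last 2 loops."""
    if len(delta_history) < 3 or programmatic_failures > 0:
        return False
    return all(max(loop.values(), default=0.0) <= 0.5 for loop in delta_history[-2:])


history = [{"clarity": 1.2}, {"clarity": 0.3, "coverage": 0.1}, {"clarity": 0.0, "coverage": 0.4}]
print(level_up_triggered(history, programmatic_failures=0))
```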
On normal loop: increment cycle, compress prior cycle findings to continuation context, return to Phase 1.
No --n and no --budget = unlimited. Declare before starting: target, exit
conditions (all axes ≥ 9.0 OR velocity < 0.2 for 3 cycles), estimated cost
($12–18/cycle), spend so far ($0), and how to halt (type /stop or press Escape
to stop after the current cycle). At the end of every cycle report cycle spend,
cumulative spend, and velocity. Literal declaration and report templates:
docs/QUALITY_LOOPS.md#unlimited-mode-templates.
When context approaches compression territory (session duration > 30 min or
/compact recommended): write continuation checkpoint to director state,
surface the --continue command. The next session picks up exactly where this
one stopped.
For overnight / unattended runs: combine with /daemon. The director is
daemon-compatible — daemon calls /evolve {target} --continue each session.
Set --budget to cap total spend.
- .planning/ does not exist: error; run /do setup first to initialize the harness state directory, then retry.
- No rubric exists for the target: run /improve {target} Phase 0 first. List available targets in .planning/rubrics/ as hint.
- All scouts return needs-evidence: attack the top-EV axis anyway under low-confidence flag; record as exploratory. Mark result regardless.
- […] status: timed-out in the experiment log with confidence: 0. Proceed with the remaining returned scouts. Never let a hung scout block the cycle; if all scouts time out, treat as "all scouts return needs-evidence" and attack the top-EV axis under low-confidence flag.
- […] context-dependent, do not propagate further.
- […] level-up-pending. /evolve --continue after human approval resumes cycle numbering from where it stopped.
- […] --budget by > 20%, warn and confirm before dispatching fleet.
- --continue with no director state: error; no campaign to resume. Suggest /evolve {target} to start fresh.
- […] .planning/rubrics/*.md targets.

Disclosure:
Reversibility: Red. Cross-pollination modifies many files across the repo;
level-up rewrites rubric anchors permanently. Each commit is individually
revertable; high volume. Range: git revert {first}^..{last}.
Trust gates:
- […] --status and --n=1 only; unlimited blocked
- […] --n=5; unlimited requires explicit --budget cap

---HANDOFF---
- Target: {target} | Cycles: {n} | Spend: ${total} | Mode: {unlimited/n/budget}
- Axes improved: {list with deltas}
- Belief model: .planning/evolve/{target}/belief-model.jsonl ({N} confirmed, {M} rejected)
- Pattern library: .planning/evolve/{target}/pattern-library.md ({N} patterns)
- Global patterns: .planning/research/patterns.md
- Knowledge wiki: .planning/wiki/index.md (compiled via /learn --from-evolve after each cycle)
- Cycle digests: .planning/evolve/{target}/cycle-*-digest.md
- Halt reason: {ceiling/velocity/budget/n-complete/user-stop/level-up-pending}
- Level-up proposals: {path or N/A}
- Reversibility: red — {N} commits across {M} files; revert range: git revert {range}
- Recommended next: {level-up and re-run / new target / done}
---
npx claudepluginhub sethgammon/citadel --plugin citadel