From citadel
Runs autonomous loops to improve a target: scores against rubric, selects highest-leverage axis, attacks it, verifies changes, documents, and repeats until plateau or all axes >=8.0. Requires rubric file.
npx claudepluginhub sethgammon/citadel --plugin citadel

This skill uses the workspace's default tool permissions.
**Use when:** Scoring a target against a rubric and iteratively improving it. Rubric required at `.planning/rubrics/{target}.md` (Phase 0 creates one if missing).
Don't use when: Refactoring without a rubric (use /refactor), one-time code review (use /review), or debugging a specific bug (use /systematic-debugging).
/improve {target} # Loop until plateau or all axes >= 8.0
/improve {target} --n=3 # Run exactly N loops then stop
/improve {target} --axis={name} # Force-attack a specific axis (skips scoring)
/improve {target} --score-only # Score and report, no attack
/improve {target} --continue # Resume from campaign state (used by daemon)
/improve citadel # Targets Citadel itself
target is a slug that maps to .planning/rubrics/{target}.md.
If no rubric exists, run Phase 0 first.
When invoked with --n or --continue, /improve operates in campaign mode and maintains a campaign file that the daemon can attach to.
Campaign file: .planning/campaigns/improve-{target}.md. Created automatically on the first invocation with --n. Format:
---
version: 1
id: "improve-{target}-{ISO-date-slug}"
status: active
type: improve
target: {target}
total_loops: {n or "unlimited"}
completed_loops: 0
current_level: {rubric level from frontmatter}
estimated_cost_per_loop: 12
started: "{ISO timestamp}"
---
# Campaign: Improve {target}
Status: active
Direction: Improve {target} for {n} loops at Level {level}
## Loop History
| Loop | Axis Attacked | Outcome | Score Movement |
|------|---------------|---------|----------------|
(populated after each loop)
## Continuation State
next_loop: 1
last_scorecard_log: (none)
last_outcome: (none)
phase_within_loop: not-started
level_up_triggered: false
Update phase_within_loop at each phase: scoring → selected-{axis} → attacking-{axis} → verifying → not-started.
On loop complete: increment completed_loops, update next_loop/last_scorecard_log/last_outcome, append Loop History row.
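The loop-completion bookkeeping above can be sketched in Python. This is an illustrative helper only: `complete_loop` is a hypothetical name, the state is modeled as a plain dict mirroring the campaign file's fields, and parsing/serialising the markdown file is omitted.

```python
def complete_loop(state, axis, outcome, movement, scorecard_log):
    """Apply the per-loop bookkeeping: bump counters, record the last
    scorecard/outcome, reset the phase, and append a Loop History row."""
    state["completed_loops"] += 1
    state["next_loop"] = state["completed_loops"] + 1
    state["last_scorecard_log"] = scorecard_log
    state["last_outcome"] = outcome
    state["phase_within_loop"] = "not-started"
    state.setdefault("loop_history", []).append(
        {"loop": state["completed_loops"], "axis": axis,
         "outcome": outcome, "movement": movement})
    return state
```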
--continue flag:
- Read .planning/campaigns/improve-{target}.md — error if missing or status not active.
- If completed_loops >= total_loops: mark completed, exit.
- If phase_within_loop is not not-started: restart the current loop from Phase 1 (it was interrupted mid-loop).
- Read last_scorecard_log for delta comparison, then run Phase 1 onwards.

Phase 0: Create the rubric. Run only when .planning/rubrics/{target}.md does not exist.
- Consult .planning/research/ if available.
- Run /research-fleet to survey comparable products if no research exists.
- Write the rubric to .planning/rubrics/{target}.md.

Phase 1: Score every axis in the rubric. No shortcuts. No cached scores from the previous loop.
Execute the programmatic verification steps from the rubric. A programmatic failure caps that axis at 5 regardless of evaluator scores. Record raw results: which checks passed, which failed, what the failure was.
Execute structural checks from each axis's verification spec:
Spawn three evaluator agents in parallel. Each receives:
Each evaluator scores independently. For each axis:
Flag axes with strong evaluator disagreement as needs-refinement. needs-refinement axes are logged but still scored. Do not halt on evaluator disagreement.
| Axis | A | B | C | Prog | Final | Delta | Flag |
|------|---|---|---|------|-------|-------|------|
| {axis_name} | {n} | {n} | {n} | PASS/FAIL | {n.n} | +{n.n} | cap |
Final = min(A, B, C), then apply programmatic cap (sets Flag=cap). Delta = current − prior loop score (empty on loop 1).
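The aggregation rule above can be sketched in Python. The function name and signature are illustrative assumptions, not part of the skill; it implements Final = min(A, B, C) with the programmatic cap at 5 and the loop-over-loop delta.

```python
def final_axis_score(a, b, c, programmatic_pass, prior=None):
    """Final = min of the three evaluator scores; a programmatic
    failure caps the axis at 5 and sets Flag=cap. Delta compares
    against the prior loop's score (None on loop 1)."""
    final = min(a, b, c)
    capped = (not programmatic_pass) and final > 5
    if capped:
        final = 5.0
    delta = None if prior is None else round(final - prior, 1)
    return final, delta, "cap" if capped else ""
```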
Choose the single axis to attack this loop.
Selection formula:
score(axis) = (10 - current_score) × weight × effort_multiplier × recency_penalty
- effort_multiplier: low = 1.0, medium = 0.7, high = 0.4
- recency_penalty: 0.5 if attacked in the previous 2 loops, otherwise 1.0

If the --axis flag was set, skip selection and attack the specified axis.
Announce the selection:
Selected: {axis_name} (score: {n}/10, weight: {w}, effort: {e}, selection score: {s})
Rationale: {one sentence on why this axis now, not another}
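The selection formula can be sketched in Python. The helper names and the axis-dict shape are illustrative assumptions; the skill itself does not expose these functions.

```python
def selection_score(current, weight, effort, recently_attacked):
    """score(axis) = (10 - current_score) * weight * effort_multiplier
    * recency_penalty, per the formula above."""
    effort_multiplier = {"low": 1.0, "medium": 0.7, "high": 0.4}[effort]
    recency_penalty = 0.5 if recently_attacked else 1.0
    return (10 - current) * weight * effort_multiplier * recency_penalty

def select_axis(axes):
    """axes: dicts with keys name/score/weight/effort/recent.
    Returns the name of the single highest-leverage axis to attack."""
    return max(axes, key=lambda a: selection_score(
        a["score"], a["weight"], a["effort"], a["recent"]))["name"]
```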
Execute the improvement. Dispatch strategy depends on the axis category.
ISOLATION MANDATE: When dispatching to /experiment, /fleet, or /research-fleet, always use the Agent tool with isolation: "worktree". Sub-agents in worktrees get their own context windows; the orchestrator only receives their HANDOFF results.
technical axes (test_coverage, hook_reliability, api_surface_consistency):
- /experiment for measurable improvements with before/after comparison
- node scripts/run-with-timeout.js 300 node scripts/test-all.js as the verification oracle

documentation axes (documentation_coverage, documentation_accuracy):
experience axes (onboarding_friction, error_recovery, command_discoverability):
positioning axes (differentiation_clarity, competitive_feature_coverage):
- /research to verify the current competitive landscape is accurate

presentation axes (demo_page_effectiveness, readme_quality, visual_coherence):
- /live-preview or /qa to verify visual changes render correctly

security axes (security_posture):
When the attack involves trying multiple approaches:
APPROACH COMPARISON: [approach A] vs [approach B] — winner: [A] because [reason]

After the attack, re-score only the targeted axis (not a full re-score).
Run the four verification tiers from the rubric for the targeted axis:
- Behavioral simulation via the /do command (onboarding_friction, error_recovery, documentation_accuracy, command_discoverability): report PASS {wall_time} or FAIL at step {n}: {what broke}
- Perceptual check (visual_coherence, api_surface_consistency)
- Regression check (run on all axes, not just targeted):
On abort: revert the changes, log the failure, treat as "no improvement this loop".
On pass: commit the changes with a descriptive message.
Write the loop log. Always. Even on abort.
Log path: .planning/improvement-logs/{target}/loop-{n}.md
# Improvement Loop {n}: {target}
> Date: {ISO date} | Loop: {n} | Selected axis: {axis_name} | Outcome: {improved | no-change | aborted}
## Scorecard
| Axis | Loop {n-1} | Loop {n} | Delta |
|------|------------|----------|-------|
## Attack summary
**What was changed:** ... **Approach:** experiment / direct / research+update **Files:** ...
**APPROACH COMPARISON:** (if multiple tried) {A} vs {B} — winner: {A} because {reason}
## Verification results
**Programmatic:** PASS/FAIL **Structural:** PASS/FAIL
**Perceptual:** {score}/10 — {one-line rationale}
**Behavioral:** PASS {wall_time} | FAIL at step {n}: {reason} | SKIPPED
## Proposed axis additions
PROPOSED AXIS: {name} | Rationale | Category | Weight | Anchors: 0=... 5=... 10=...
(or: None proposed this loop.)
## What was learned
{2-3 sentences}
All proposals go to .planning/rubrics/{target}-proposals.md. Never to the live rubric.
Exit conditions (check in order):
- --n flag was set and N loops have completed: exit, report scorecard
- On Level-Up: do not exit. Escalate. See the Level-Up Protocol section.
On ceiling (all >= 8.0): report the final scorecard and recommend a Level-Up run.
On normal loop: return to Phase 1. Re-score everything from scratch.
Campaign mode exit handling:
- N loops complete: status: completed, move to completed/
- Ceiling: status: completed, move to completed/
- Level-Up triggered: status: level-up-pending (daemon will pause, not retry)
- Plateau: status: parked
- Aborted: status: parked with reason
- Interrupted: status: paused

Level-Up Protocol: triggers when no axis improved > 0.5 in the last 2 consecutive loops, no programmatic cap is active, and at least 3 loops have completed.
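The Level-Up trigger condition can be sketched as follows. This is an assumed helper, not part of the skill: `history` holds the best single-axis improvement observed in each loop, most recent last.

```python
def level_up_triggered(history, cap_active):
    """True when >= 3 loops have completed, no programmatic cap is
    active, and no axis improved by more than 0.5 in the last 2 loops."""
    return (len(history) >= 3
            and not cap_active
            and all(best <= 0.5 for best in history[-2:]))
```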
Step 1: Freeze the snapshot
Write .planning/rubrics/{target}-level-{n}-final.md with: date, loops completed, final scorecard, axes at ceiling (≥9.0 — their 10 anchors become Level {n+1}'s 5 anchors), and axes that plateaued below 9.0 with why.
Step 2: Write proposals
For each axis: propose Level {n+1} re-anchoring (current 10 → new 5, propose new 10). For plateaued axes: re-anchor, replace with measurable proxy, or retire.
Auto-include these three process axes if not already in the rubric: decomposition_quality, scope_appropriateness, verification_depth.
Write to .planning/rubrics/{target}-proposals.md: re-anchored axes (current 10 anchor, proposed 0/5/10), proposed new axes, axes proposed for retirement.
Step 3: Halt -- human approval required
Do not self-approve. Do not continue looping.
In campaign mode:
- status: level-up-pending
- level_up_triggered: true
- awaiting: human approval of level-up proposals

Report:
The loop resumes only when the human edits the live rubric with approved proposals
and sets the campaign status back to active. Level {n+1} loops continue incrementing
the loop number (they do not reset to 1).
Step 4: Historical context for future evaluators
When the loop resumes after a level-up, every evaluator in Phase 1c receives:
Edge cases:
- Evaluator flags needs-refinement: use the minimum score, continue.
- --continue with no campaign file: error, suggest --n.
- --continue with level-up-pending: halt, point to the proposals file, require human approval then status: active.
- --continue with completed: do not resume, report the final scorecard.
- --n with an existing active campaign: treat as --continue. If completed/parked: new campaign, incremented slug.
- Rubric changes go to .planning/rubrics/{target}-proposals.md only. Human approval required.
- On level-up halt: status: level-up-pending, not parked or active.

Disclosure: State loop count, target, per-loop cost (~$12), total estimate. For --continue: loops remaining and spend so far. For unlimited: state the exit conditions (plateau or all axes >= 8.0).
Reversibility: Green = --score-only | Amber = standard loops (each commits separately) | Red = level-up (rewrites rubric anchors permanently). Red requires explicit confirmation.
Proportionality: No rubric + no explicit request → suggest /review. All axes > 8.0 + --n=1 → suggest --axis. Cost > $50 → confirm.
Trust gating: Novice (0-4): --score-only / --n=1 only. Familiar (5-19): up to --n=5. Trusted (20+): no cap; confirm unlimited or cost > $50.
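The trust-gating tiers can be sketched in Python. This is illustrative only: `gate`, its parameters, and its string return values are assumptions, not part of the skill; --score-only is treated as always allowed (Green).

```python
def gate(sessions, n=None, est_cost=0.0, score_only=False):
    """Map a requested run to allowed / blocked / confirm by trust tier.
    sessions = prior session count; n=None means an unlimited run."""
    if score_only:
        return "allowed"                     # Green at every trust level
    if sessions <= 4:                        # Novice (0-4): --n=1 only
        return "allowed" if n == 1 else "blocked"
    if sessions <= 19:                       # Familiar (5-19): up to --n=5
        return "allowed" if n is not None and n <= 5 else "blocked"
    # Trusted (20+): no cap, but confirm unlimited runs or spend > $50
    return "confirm" if n is None or est_cost > 50 else "allowed"
```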
---HANDOFF---
- Target: {target} — Loop {n} of {n_total or "∞"} — Level {current_level}
- Outcome: {improved | plateau | ceiling | aborted | n-complete | level-up-triggered}
- Score movement: {axis} {before} → {after} (+{delta})
- Behavioral simulation: {PASS {wall_time} | FAIL | SKIPPED}
- Proposed rubric additions: {count} — written to .planning/rubrics/{target}-proposals.md
- Loop log: .planning/improvement-logs/{target}/loop-{n}.md
- Reversibility: amber -- each loop commits separately, revert individual loops with git revert
- Next recommended axis: {axis_name} (if not exiting)
- Level-up snapshot: .planning/rubrics/{target}-level-{n}-final.md (if level-up triggered)
---