Help us improve
Share bugs, ideas, or general feedback.
From claude-delegator
Runs an arbiter-mediated consensus loop between Claude, GPT, Gemini, and Grok to refine a plan, design, or spec. Iterates up to 5 rounds with cross-model agreement as convergence criteria.
npx claudepluginhub antonbabenko/claude-delegator --plugin claude-delegatorHow this command is triggered — by the user, by Claude, or both
Slash command
/claude-delegator:consensusThis command is limited to the following tools:
The summary Claude sees in its command listing — used to decide when to auto-load this command
# Consensus (arbiter-mediated GPT + Gemini + Grok + Claude convergence loop) Iterate up to 5 rounds. Each round refines the plan based on GPT + Gemini + Grok feedback. This is **arbiter-mediated consensus, not pure democracy**: the external models vote independently, but Claude (the orchestrator) authors the review prompt, adjudicates which critical issues are real, and rewrites the plan between rounds. To keep that power accountable, Claude commits a **blind verdict** before reading the reviewers (Round loop below), cannot reach consensus on its own vote alone (Convergence check below), a...
/plan-reviewGenerates independent implementation plans from multiple AI models including Claude, compares them, and synthesizes a unified best plan with attribution.
/consensusRuns multi-agent consensus review on code changes, docs/specs, or launch decisions using git PR/diff context and parallel agents for fast pragmatic outcomes.
/deliberateRuns producer and reviewer agents in sequential rounds to deliberate a task until consensus or max rounds, producing final output, agreement level, and feedback summary.
/multi-planRuns /plan on 2+ models (Claude + Codex) in parallel, compares outputs, and surfaces only disagreements for user decision. Convergences are auto-approved.
/war-roomConvenes multi-LLM expert panel to pressure-test strategic decisions via reversibility assessment, adversarial review, and deliberation modes, producing synthesized recommendation.
/cc-councilRuns multi-agent council review on code targets (files, dirs, PRs, git diffs, architecture) with protocols, scoped scoring, voting, and orchestration. Produces findings, weighted scores, and optional auto-fixes.
Share bugs, ideas, or general feedback.
Iterate up to 5 rounds. Each round refines the plan based on GPT + Gemini + Grok feedback. This is arbiter-mediated consensus, not pure democracy: the external models vote independently, but Claude (the orchestrator) authors the review prompt, adjudicates which critical issues are real, and rewrites the plan between rounds. To keep that power accountable, Claude commits a blind verdict before reading the reviewers (Round loop below), cannot reach consensus on its own vote alone (Convergence check below), and must show a reason for every dismissed issue. Stop when the convergence rule is met or when 5 rounds are exhausted.
Plan, design, spec, or proposal to refine: $ARGUMENTS
/ask-gpt or /ask-gemini)/ask-all)Identify expert. Default is Plan Reviewer. Override only if $ARGUMENTS clearly maps to another role:
Read expert prompt ONCE via this resolution sequence:
~/.claude/plugins/cache/*/claude-delegator/*/prompts/[expert].md. Pick the match with the highest semver version segment (the segment immediately after claude-delegator/, parsed as semver - not lexical string compare).## Inlined fallback - [Expert] in this command file (see end of this file).Error: claude-delegator plugin cache missing for expert "[Expert]". Run /plugin install claude-delegator or /reload-plugins.Reuse the loaded contents across all rounds.
Set cwd: use process.cwd() as the MCP cwd for every call; agy print mode needs no folder-trust pre-check (Grok and Codex have no trusted-directory concept either).
Initialize state:
plan = original $ARGUMENTSround = 0history = empty list of {round, plan_diff_summary, claude_blind_verdict, gpt_verdict, gemini_verdict, grok_verdict, claude_decision, dismissals, cat_hits, parse_fallbacks, stage_2_status, stage_2_results, stage_2_shuffle}
stage_2_status is one of fired | skip_no_div | skip_s1_quorum | skip_sandbox | err_quorum.stage_2_results is {matrix: {<reviewer-model-id>: {<answer-model-id>: {vote, category|null, reason}}}, accepted_nv_count, raw_nv_count} when stage_2_status == fired; otherwise the reason string.stage_2_shuffle is {<reviewer-model-id>: {A: <answer-model-id>, B?: <answer-model-id>}} (operator-visible debug mapping; absent when Stage 2 did not fire).Print:
/consensus: starting consensus loop (max 5 rounds, expert=[Expert])
For each round R:
Print round header before anything else this round:
--- Round R/5 ---
Build identical review prompt (7-section format per ~/.claude/rules/delegator/delegation-format.md). Include:
Review mode: strict so the Plan Reviewer applies full four-criteria rigor. Consensus is a rigorous convergence loop, not a quick blocker check.security - auth, secrets, injection, data exposure, privilege boundarycorrectness - wrong behaviour, broken invariant, missing case, race conditionscope - undefined boundary, missing acceptance criteria, deliverable unclearambiguity - reference too vague to act on, contradictory steps, missing contextperformance - latency, throughput, resource use, scaling limitops - rollback, observability, deploy, migration, on-call surface**Verdict**: APPROVE | REQUEST CHANGES | REJECT
**Critical issues** (must-fix; empty list = none):
- `[category]` issue description
- ...
**Recommendations** (nice-to-have; empty list = none): [bullets]
**One-line bottom line**: [single sentence]
[category] MUST be exactly one of: security, correctness, scope, ambiguity, performance, ops. Reviewers that emit a critical issue without a category tag get that issue parsed as ambiguity by default, and the fallback is recorded in a NEW per-round field history[R].parse_fallbacks (an array of {source, issue_excerpt, reason: "reviewer omitted category tag"}). Do NOT write parse fallbacks into history[R].dismissals; that field is reserved for adjudicated dismiss/defer decisions and feeds the "Dismissed / deferred issues" section of the final report. Parse fallbacks surface separately as a one-line footnote under the Round history table.Claude's BLIND verdict - EMIT BEFORE DISPATCHING: in a message of its OWN, BEFORE the message that calls any MCP tool, Claude writes its own verdict in the strict OUTPUT FORMAT above using ONLY the review prompt from step 2 (it has not seen any reviewer output yet), and stores it verbatim in history[R].claude_blind_verdict. Print:
Claude blind (R{R}): APPROVE | REQUEST CHANGES | REJECT (N critical)
This is the orchestrator's PEER vote. Emitting it in an earlier message than the dispatch makes the pre-commitment visible in the transcript. Claude must NOT edit the blind verdict after seeing reviewers; it appears verbatim in the final report. (Claude's adjudication in step 7 is a separate, arbiter-role decision, recorded distinctly.)
Parallel dispatch - in the NEXT message, all three calls in ONE message with three tool blocks. Identical prompt body, identical expert prompt:
mcp__codex__codex({
prompt: "[identical 7-section prompt for round R]",
"developer-instructions": "[expert prompt]",
sandbox: "read-only",
cwd: "[cwd]"
})
mcp__gemini__gemini({
prompt: "[identical 7-section prompt for round R]",
"developer-instructions": "[expert prompt]",
sandbox: "read-only",
model: "auto-gemini-3",
cwd: "[cwd]"
})
mcp__grok__grok({
prompt: "[identical 7-section prompt for round R]",
"developer-instructions": "[expert prompt]",
sandbox: "read-only",
cwd: "[repo root - same cwd as the other calls]",
roots: ["[absolute repo root]"], // optional; for cross-repo plans pass multiple
files: [{ path: "path/relative/to/root" }] // attach referenced files by default
})
Files: if the plan under review references local files, pass them to
Grok via files:[{path}] (or {dir} for whole directories) each round. Path/dir
entries accept mode: "auto" | "inline" | "upload" (default "upload"); use
mode: "auto" so text files inline as input_text and Grok reads them
line-by-line instead of as searchable attachments — this is the difference
between a citing review and a hand-wavy one. Resolution is against roots[]
(first-root-wins) or cwd when roots is omitted; a path outside every root is
refused. For cross-repo plans (auditing two services together) pass roots: [repoA, repoB]. Uploaded files are SHA-256 dedup-cached locally so the same
bundle on rounds 2-5 uploads nothing (inline files always cost prompt tokens
but are always fully read). GPT and Gemini read the named paths from their
cwd. Full reference: TECHNICAL.md § "Grok files and cleanup".
Grok context parity (CRITICAL): GPT and Gemini walk the filesystem at cwd
under sandbox: "read-only"; Grok only sees files in the files array. For any
plan that asks reviewers to verify against the repo (cross-file invariants,
"audit this codebase", architectural claims), attach the same orientation bundle
to Grok EVERY round (same files payload so the comparison stays fair):
project CLAUDE.md / AGENTS.md if present, top-level entrypoints, and the
modules the plan touches - 2-6 files, under 48 MB total. Fallback when
CLAUDE.md/AGENTS.md is absent: substitute README.md, then the top-level
entrypoint inferred from project type. If you knowingly skip the bundle,
mark Grok's verdict as ERRORED (context-starved) so the convergence parser
(which keys on ERRORED, same format as ERRORED (provider error: ...) from
step 5) excludes it - never let an uninformed APPROVE drive convergence.
Verification (scoped to plans that touch the repo): if the plan asks
reviewers to verify against the repo, before each round's parallel dispatch
sanity-check that the files array passed to mcp__grok__grok is non-empty.
The default is to keep the bundle stable round-over-round so reviewer
comparisons stay fair. However, if iterative plan refinement adds or
changes which modules the plan touches, update the bundle to match the
new touch-set (otherwise Grok is frozen while GPT/Gemini can freely read
the new files via cwd - reintroducing the exact asymmetry this section
exists to prevent). Record the bundle change in history[R].dismissals
with reason "bundle updated: plan touch-set changed" so the audit trail
shows why round-over-round comparison shifted. If files would still be
empty after refresh, mark Grok ERRORED (context-starved) for that round.
For purely conceptual consensus loops (no repo files relevant to the
plan), the verification check does NOT apply - run Grok with an empty
files array as a normal parallel reviewer.
Stream short status as each return arrives. Do not wait until all are back to print anything. Mark a provider that returned an MCP error or result.isError as ERRORED. Examples:
GPT (R{R}): APPROVE
Gemini (R{R}): REQUEST CHANGES (3 critical)
Grok (R{R}): ERRORED (provider error: missing-auth)
Stage 2 (blind cross-review, conditional) - inserted between Stage 1 returns and arbiter adjudication. Stage 2 is decision input only: Stage 2 errors or skips never block convergence (the convergence rule in step 8 is unchanged).
6a. Trigger check (deterministic). Stage 2 fires iff EITHER:
SKIPPED and ERRORED Stage 2 rounds do NOT reset the lookback. Only quorum-successful Stage 2 rounds with completed adjudication count as the lookback anchor (prevents partial-round reads).
If /consensus was invoked with sandbox: workspace-write, set history[R].stage_2_status = "skip_sandbox" and SKIP the rest of step 6 (Stage 2 is incompatible with the implementation path - anonymization leaks through patch style).
If the trigger condition is not met, set history[R].stage_2_status = "skip_no_div" and SKIP the rest of step 6.
6b. Quorum check. Stage 2 requires Stage 1 to have at least 2 responding externals. Count the externals whose Stage 1 verdict in step 5 was NOT ERRORED:
history[R].stage_2_status = "skip_s1_quorum" and SKIP the rest of step 6.6c. Anonymize and build the fan-out bundle. For each Stage 1 responding external (the answer pool):
Response A to one reviewer and Response B to another in the 3-reviewer case).history[R].stage_2_shuffle = {<reviewer-model-id>: {A: <answer-model-id>, B?: <answer-model-id>}}. This is operator-visible debug observability; reviewers do NOT receive this mapping.$ARGUMENTS (the user's request), the current plan revision (with a 1-line indicator: "Plan is round R revision"), and the anonymized peer answers. NO Stage 1 verdict metadata, NO full round history, NO dismissals from prior rounds.6d. Parallel dispatch to the panel. One MCP call per reviewer, all dispatched in a single message with N parallel tool blocks. Reuse the existing expert developer-instructions (the same one Stage 1 used this round; do NOT swap experts). Each reviewer call uses a tighter 5-section Stage 2 prompt body:
TASK: Score the anonymized peer answers below as viable or not-viable on substance, relative to the user's original request. Do NOT review your own answer (it is excluded).
INPUT:
- ORIGINAL USER REQUEST: [verbatim $ARGUMENTS to /consensus]
- Current plan (round R revision): [full current plan text]
- Anonymized peer answers:
Response A:
[anonymized answer body]
[Response B:
[anonymized answer body]]
- Scoring guidance: when judging viability, use the same closed taxonomy `/consensus` uses for issue categories: security | correctness | scope | ambiguity | performance | ops. For not-viable votes, select exactly one category.
OUTPUT FORMAT (strict):
Response A: viable - [one-line reason] (if viable)
Response A: not-viable - [category] - [one-line reason] (if not-viable)
[Response B: viable - [one-line reason] (if viable)
Response B: not-viable - [category] - [one-line reason] (if not-viable)]
Bottom line: [single sentence]
CONSTRAINTS:
- Score substance, ignore style. Anonymization is best-effort; do not try to deanonymize.
- Category is REQUIRED for not-viable; OMITTED for viable.
- `[category]` MUST be exactly one of: security, correctness, scope, ambiguity, performance, ops.
Print a status line as each reviewer returns:
Stage 2 [Provider] (R{R}): 0 NV | 1 NV | 2 NV | ERRORED
where the NV count is the number of not-viable votes that reviewer cast.
6e. Reviewer quorum re-check. Count Stage 2 reviewers that returned without error:
history[R].stage_2_status = "err_quorum"; no Stage 2 issues added this round.history[R].stage_2_status = "err_quorum"; no Stage 2 issues added this round.If quorum passes, set history[R].stage_2_status = "fired".
6f. Parse each reviewer's output, case-insensitive and bracket-tolerant. For each response line:
[security] and security for the category. Lowercase normalize before matching against the 6-cat enum.ambiguity and append {source: "Stage 2: <reviewer-model-id> on <answer-position>", issue_excerpt: <one-line reason>, reason: "Stage 2: omitted/invalid category"} to history[R].parse_fallbacks (same array Stage 1 already uses).6g. Build candidate critical issues from not-viable votes. For each Response X: not-viable - [category] - [reason] from a reviewer:
history[R].stage_2_shuffle[<reviewer-model-id>][X].[Stage 2: <reviewer-model-id> on <answer-model-id>] [category] [one-line reason].(N of 2) where N is the count of those 2 reviewers that marked the answer not-viable. In the 2-reviewer panel, each answer is seen by 1 reviewer; weight is (1 of 1) or (0 of 1).history[R].stage_2_results.matrix AND to the pool of issues passed to step 7 (Adjudicate) below. The category goes through cat_hits the same way Stage 1 categories do.Store the final structure: history[R].stage_2_results = {matrix, accepted_nv_count: <to_be_filled_in_step_7>, raw_nv_count: <count_of_all_not_viable_votes_this_round>}.
Adjudicate (arbiter role) - reconcile issues against the blind verdict:
Verdict, Critical issues, Recommendations. An ERRORED provider is excluded (see Convergence check); it contributes no verdict and no Stage 1 issues.{source, category, description} entry where source is the string "Stage 2: <reviewer-model-id> on <answer-model-id>" and includes its weight tag (N of M reviewers WHO SAW THIS ANSWER marked it not-viable) appended to the description. Stage 2 issues participate in dismiss/accept/defer the same way Stage 1 issues do.accept (real problem, fix), dismiss (false positive - reason REQUIRED), or defer (out of scope - reason REQUIRED). Append every dismiss/defer to history[R].dismissals. This includes Claude walking back one of its own blind critical issues - that is a dismiss and needs a recorded reason too.accept it by default; only dismiss with a concrete factual reason ("out of scope" alone is not enough).{category: source_count} map counting how many DISTINCT sources raised at least one critical issue in each category. Store as history[R].cat_hits. Any category with source_count >= 2 is a cross-source category hit and surfaces in the Round history table's Cat hits column. Cell format: comma-separated category x<source_count> entries, ordered by source_count descending then category name ascending; render - if no category reached the 2+ threshold. Example cell: security x4, ops x2. This is a reporting-only signal; it does not change the convergence rule.accept-ed this round and write the count back to history[R].stage_2_results.accepted_nv_count. The lookback condition in step 6a's trigger uses this count, NOT the raw not-viable vote count.Convergence check - the loop CONVERGES only when ALL of these hold:
If not converged:
accept-ed critical issues from all responding reviewers plus any Claude found.history with: Claude blind verdict, GPT/Gemini/Grok verdicts (or ERRORED), Claude's per-issue decisions + reasons, the diff summary applied to the plan.If round R == 5 and still not converged:
Derive a one-word confidence label from the number of rounds the loop took to converge. The label appears in the Final output's outcome line. A plan that converges in round 1 is a stronger signal than a plan that needed every round to settle.
high - converged in round 1medium - converged in round 2 or 3low - converged in round 4 or 5none - UNRESOLVED after 5 rounds (no convergence to grade)This is a copy-only signal: it is computed at the end from round, does not affect the convergence rule itself, and never inflates an UNRESOLVED outcome into a converged one.
## /consensus result
**Mode**: arbiter-mediated consensus (external models vote; Claude adjudicates + synthesizes)
**Outcome**: CONVERGED in N rounds (confidence: high|medium|low) | UNRESOLVED after 5 rounds (confidence: none)
**Final plan**:
[full converged plan, or last revision]
**Round history** (CB = Claude blind verdict; reviewer cols use APPR/RC/REJ or ERR; Adj = Claude adjudicated; S2 = Stage 2 status; Cat hits = categories raised by 2+ sources this round):
| Round | CB | GPT | Gemini | Grok | Adj | S2 | Cat hits | Changes applied |
| 1 | RC | RC | RC | RC | RC | fired (2 NV) | security x4, ops x2 | added rollback step, clarified ownership |
| 2 | RC | RC | APPR | ERR | RC | fired (1 NV) | correctness x2 | tightened error handling on step 3 |
| 3 | APPR | APPR | APPR | ERR | APPR | skip (no div) | - | - (Grok unconfigured; converged on responding externals) |
S2 column cell values:
- `fired (N NV)` - Stage 2 ran with quorum, surfaced N not-viable votes (raw count across all reviewers).
- `skip (no div)` - Stage 2 skipped because Stage 1 had no non-APPR verdicts AND the lookback condition was false.
- `skip (S1 quorum)` - Stage 2 skipped because fewer than 2 Stage 1 externals responded.
- `skip (sandbox)` - Stage 2 skipped because /consensus was invoked with `sandbox: workspace-write`.
- `ERR (quorum)` - Stage 2 reviewers below quorum (<2 responses).
**Stage 2 shuffle mapping** (operator debug - per round, per reviewer; shows which `Response A`/`Response B` corresponded to which model in Stage 2):
- [R{n}] {reviewer-model-id}: A = {answer-model-id}, B = {answer-model-id}
- [R{n}] {reviewer-model-id}: A = {answer-model-id} <- 2-responder panel
- ...
If Stage 2 did not fire in any round, render `**Stage 2 shuffle mapping**: none.`
**Parse fallbacks** (reviewers that omitted a category tag; auto-parsed as `ambiguity`):
- [R{n}] {source}: "{issue excerpt}" - reason: reviewer omitted category tag
If there were zero fallbacks across the whole loop, render `**Parse fallbacks**: none.`
**Dismissed / deferred issues** (every dismiss/defer, with reason - no silent dismissal; includes Claude walking back its own blind issues):
- [R{n}] {source} raised "{issue}" -> dismissed: {one-line reason}
- [R{n}] {source} raised "{issue}" -> deferred (out of scope): {one-line reason}
**Residual disagreements** (if any):
- GPT (held out on R5): [issue + reason Claude dismissed it]
*-reply with stored threadId. Cross-round state lives in the prompt body, not in provider memory. Avoids contamination if one provider went off track.cwd for Gemini - use process.cwd() (Setup step 3). agy print mode needs no folder-trust pre-check, so there is nothing to abort on.missing-auth, transient API error), mark it ERRORED with a note "provider error: <truncated msg>" and EXCLUDE it from the convergence bar for that round (it counts as neither APPROVE nor REQUEST CHANGES). The loop still converges when every responding external and Claude APPROVE and at least one external responded. If ALL externals errored in a round, there is no responding external, so that round cannot converge.model: "auto-gemini-3". Grok uses its bridge default (GROK_DEFAULT_MODEL or grok-4.3); no in-command pin.dismiss/defer of a critical issue (from a reviewer OR from Claude's own blind verdict) carries a one-line reason that appears in the final report. Repeated cross-source issues are accepted by default.fired/skipped/errored) does NOT participate in the convergence rule. The convergence rule in step 8 reads only Stage 1 verdicts + Claude's adjudication + accepted critical issues (from any source, Stage 1 or Stage 2 alike)./consensus is invoked with sandbox: workspace-write (Claude making code changes via consensus), step 6 sets stage_2_status = "skip_sandbox" and skips the entire Stage 2 sub-loop. Anonymization leaks too much on patches. Plan reviews that merely CONTAIN embedded diff text as prose still run Stage 2.history AND surface it in the final report's "Dismissed / deferred issues" section - never dismiss silently.Out of scope section explicitly so the next round doesn't re-flag it.When Claude dismisses or defers an issue, the next round's prompt should include:
PREVIOUSLY DISMISSED (do not re-raise unless you have new information):
- [issue]: [reason for dismissal]
This prevents oscillation between rounds.
Adapted from oh-my-opencode by @code-yeongyu
You are a work plan reviewer. You verify that a plan can actually be executed before anyone starts building.
You review a plan passed inline in the request. You are an advisory reviewer: you cannot open the files the plan references, so judge whether references are named precisely enough to be found (exact path, function, doc section), not whether they exist on disk. Each review is standalone. You have only the context supplied.
Default - Blocker-only (approval bias): You answer ONE question: "Can a capable developer execute this plan without getting stuck?" Approve when the plan is about 80% clear; a developer can resolve minor gaps. When in doubt, APPROVE.
Strict: Use this only when the request signals it - it contains "Review mode: strict", or the words strict / exhaustive / ruthless, or the plan is high-risk or architectural. In Strict mode you apply the full four-criteria rigor below and may list more issues.
Non-goals (do NOT check): whether the approach is optimal, whether there is a better way, every edge case, code style, performance, or security unless plainly broken. You are a blocker-finder, not a perfectionist.
You DO check:
Not blockers (never reject for these): "could be clearer", "consider adding X", "might be suboptimal", "missing a nice-to-have edge case", "I would do it differently".
On REJECT, list at most 3 blocking issues, each specific, actionable, and genuinely blocking.
Apply four criteria:
In Strict mode, list the top 3-5 improvements on REJECT.
[APPROVE / REJECT]
Justification: concise explanation of the verdict.
Summary (Strict mode only): one line each on Clarity, Verifiability, Completeness, Big Picture.
Blocking issues (on REJECT): default mode at most 3; Strict mode top 3-5, ordered worst-first. Each: specific location + what needs to change.
<SUMMARY> verdict + the blocking issues (if any) + confidence, under ~120 words </SUMMARY>.
Advisory Mode (default): Review and return the verdict above.
Implementation Mode: When asked to fix the plan, rewrite it addressing the issues you found.