Help us improve
Share bugs, ideas, or general feedback.
From review-council
Orchestrates multi-agent code review with Codex CLI, Gemini CLI, and five Claude specialist subagents (security, performance, logic, regression, robustness) then synthesizes findings into verified fixes. Use for deep reviews, second opinions, or council reviews on PRs, commits, or branches.
npx claudepluginhub yeameen/claude-code-review-councilHow this skill is triggered — by the user, by Claude, or both
Slash command
/review-council:review-councilThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run seven flagship-effort reviewers on the same change in parallel — Codex CLI, Gemini CLI, and five Claude specialist subagents (security, performance, logic, regression, robustness — the last one run blind) — then synthesize all of their feedback into a unified, verified report. Apply fixes only after you've confirmed each finding against the real code.
Enables cross-model code review where one AI model writes code and a different model reviews it, reducing self-approval bias. Handles setup, CLI subprocess invocation, and host detection.
Performs deep code quality reviews focusing on structure, refactoring, bugs, and best practices. Supports model flags for Claude reviewers and external backends.
Runs cross-LLM iterative code reviews with Codex or Gemini CLI peers, applying accepted fixes until consensus on improved code and report.
Share bugs, ideas, or general feedback.
Run seven flagship-effort reviewers on the same change in parallel — Codex CLI, Gemini CLI, and five Claude specialist subagents (security, performance, logic, regression, robustness — the last one run blind) — then synthesize all of their feedback into a unified, verified report. Apply fixes only after you've confirmed each finding against the real code.
Different model families miss different things, and even within one family, a focused specialist prompt finds issues that a holistic prompt sails past. Codex (GPT lineage) and Gemini bring vendor independence; the five Claude specialists bring axis depth. Their union catches substantially more real issues than any one alone — but they also produce false positives that look authoritative. The value is in the synthesis: dedupe across all seven streams, verify each claim by reading the code, decide which fixes to apply, and use the resume / re-prompt mechanisms to push back on dubious findings rather than blindly applying them.
One reviewer (robustness) deliberately runs blind — diff only, no intent briefing. Field experience: a context.md note saying "unresolvable values are dropped by design" anchored six intent-briefed reviewers away from the drop path, and a human reviewer without that framing then found real hardening gaps there (case-sensitive marker matching, curly-apostrophe lookups). Intent briefing catches divergence-from-intent; blindness catches what the framing declared uninteresting. You want both.
Do not just paste the seven reports together and call it done. The user wants your judgment layered on top.
Trigger on any of: "review this change/PR/commit", "second opinion", "council review", "multi-tool review", "review with codex and gemini", "what does codex/gemini think", "thorough review", "deep review". Also offer it proactively before risky merges (security-sensitive code, migrations, infra changes) if the user is pondering review depth.
Ask the user only if ambiguous. Otherwise infer from context:
| User intent | Scope |
|---|---|
| "review my changes" with dirty tree | uncommitted (staged + unstaged + untracked) |
| "review this branch" / "review the PR" | diff vs main (or whatever the PR's base is) |
| "review commit abc123" | that specific commit |
| GitHub PR number given | fetch with gh pr diff <num> |
Capture the scope as a short label (e.g. uncommitted, branch:feature/x vs main, commit:abc123) — you'll feed this to both tools and use it in the final report.
Create a workspace directory to hold both reports and any follow-up exchanges. This keeps everything inspectable and lets you reference the raw outputs later.
WS="/tmp/review-council-$(date +%s)"
mkdir -p "$WS"
Save the diff once so both tools see exactly the same input:
# Pick the right one for your scope
git diff HEAD > "$WS/changes.diff" # uncommitted (tracked)
git diff main...HEAD > "$WS/changes.diff" # branch vs main
git show <sha> > "$WS/changes.diff" # specific commit
gh pr diff <num> > "$WS/changes.diff" # GitHub PR
Also write a short context.md with: scope label, repo path, key files touched, and any project conventions worth flagging (CLAUDE.md highlights, e.g. "this repo uses dynaconf settings — no hardcoded thresholds"). Both reviewers will benefit from this framing.
This step is one single message with multiple tool calls so everything runs concurrently:
& and captures their PIDs (don't wait yet — let the Agent calls run in parallel with them).Codex/gemini take 3–8 minutes at flagship effort; the Claude specialists usually finish faster. Tell the user up front: "Kicking off seven reviewers — codex, gemini, and five Claude specialists (security, performance, logic, regression, robustness). Expect 3–8 min wall time." Don't go silent.
Codex's review subcommand writes a structured report with severity tags. Always use it in prompt-mode so Codex sees the same intent (PR description, project conventions, out-of-scope items) that Gemini and the intent-briefed Claude specialists get from context.md (the blind robustness agent is the deliberate exception). Without intent, Codex misses entire classes of "implementation diverges from stated rule" findings — verified empirically: in an A/B test on the same diff, prompt-mode caught 3 P1s (including a stated-rule violation) that no-prompt mode could not surface.
Important CLI constraint: --base/--commit/--uncommitted are mutually exclusive with [PROMPT]. The CLI rejects the combo with error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'. So when you want intent, inline the diff into the prompt yourself and drop the scope flag:
codex review \
--title "<commit subject or scope label>" \
-c model="gpt-5.5" \
-c model_reasoning_effort="xhigh" \
- <<EOF > "$WS/codex.out" 2> "$WS/codex.err" &
$(cat "$WS/context.md")
When reviewing, focus on whether the diff actually accomplishes its stated intent above. Severity-tag (P0–P3) as usual. Call out where the implementation diverges from what the description claims, AND respect any out-of-scope items the description says are deliberately deferred (don't re-flag those).
Diff:
$(cat "$WS/changes.diff")
EOF
CODEX_PID=$!
The structured P0/P1/P2/P3 output behavior is preserved when you use [PROMPT] — you don't lose the review template, you augment it.
When prompt-mode is impractical: if the diff is so large that inlining it would blow Codex's context window (rare — a few thousand lines fits), fall back to flag-only mode (codex review --base main ...) and accept that Codex will be intent-blind. Flag that limitation in the synthesis so the user knows the council was uneven.
Do not use codex exec for code review — codex review with a prompt gives you the same custom-prompt flexibility AND the structured severity-tagged output. codex exec is a generic agent runner and produces unstructured prose.
Gemini's non-interactive mode is -p. Use --yolo so it doesn't stall on tool-confirmation prompts when it wants to read related files for context:
gemini \
-m gemini-3.1-pro-preview \
--yolo \
-p "$(cat <<EOF
You are a senior code reviewer. Review the diff below against the repo
at $(pwd). Focus on correctness, security, performance, SOLID violations,
race conditions, and project-convention drift. Cite file:line for each
finding. Categorize as P0 (block merge) / P1 (fix before merge) / P2
(follow-up) / P3 (nit). Be specific — vague findings like "consider
adding tests" are not useful.
Scope: <scope label>
Diff:
$(cat "$WS/changes.diff")
EOF
)" > "$WS/gemini.out" 2> "$WS/gemini.err" &
GEMINI_PID=$!
In the same message that launches the bash backgrounded codex/gemini, also issue five Agent tool calls. Use subagent_type: "general-purpose" (no specialist subagent type exists for these axes; the focus comes from the prompt, not the type). Each gets the diff path and the context file. Pass run_in_background: false so the results return inline once they finish — but because they're issued in the same message as the others, they all start in parallel.
Fallback when the Agent tool is unavailable — if you're running this skill from inside a subagent, the Agent tool may not be in your toolset (nested-subagent restriction). In that case do not skip the specialist pass — instead run the five reviews in-context yourself. Read the diff and PR body once, then produce five sequential focused analyses keyed to security/performance/logic/regression/robustness with the same axis discipline described below (for the robustness pass, deliberately re-read the raw diff ignoring the intent notes), and save each to <workspace>/claude-<role>.md as if a real subagent had written it. The output quality is comparable; you just lose parallelism within this turn. Note the fallback in the final report so the user knows specialists ran serially.
For each subagent, the prompt should:
<workspace>/changes.diff), and the repo path.context.md, the PR description, or any "by design / out of scope" notes. Those notes anchor reviewers away from the very paths the author declared intentional, which is exactly where this agent must look.The five roles:
pyproject.toml / requirements) and to weight Spark-specific concerns accordingly.Ask each to write its report to <workspace>/claude-<role>.md so all seven streams land in the same place for synthesis.
After the message that kicks everything off, you'll have:
The Agent results return when each subagent finishes. Once you have all four, run a wait for the bash PIDs to drain codex/gemini:
wait $CODEX_PID $GEMINI_PID
If the Claude specialists finish well before codex/gemini, you can start verifying their findings against the code (Step 4 work) while waiting — useful work in dead time.
Read all seven outputs (codex.out, gemini.out, claude-security.md, claude-performance.md, claude-logic.md, claude-regression.md, claude-robustness.md). Build a single deduplicated finding list. For each finding, do all of the following:
[CODEX|GEMINI|sec|reg] listing every reviewer that flagged it (codex, gemini, and which Claude specialists by short name: sec, perf, logic, reg, rob). The more reviewers that independently flagged something, the higher confidence — but verify even the unanimous ones, because shared blindspots happen too.If a finding is suspicious — citation looks wrong, severity feels inflated, fix suggestion would break something — don't just dismiss it silently. Ask the original reviewer to defend itself. This is faster and more accurate than litigating from scratch.
Codex — verified syntax is codex exec resume --last "<prompt>". --last resumes the most recently recorded session without a session ID. Confirmed against codex exec resume --help.
codex exec resume --last \
"You flagged a P0 in pipeline/foo.py:42 about X. I read the code and
the function actually does Y, which handles that case. Am I missing
something, or should this be downgraded?" \
> "$WS/codex.followup1.out" 2>&1
Gemini — resume with -r latest, then push the follow-up via -p:
gemini -m gemini-3.1-pro-preview --yolo -r latest \
-p "You flagged X. On re-reading, the code does Y. Defend or retract." \
> "$WS/gemini.followup1.out" 2>&1
Claude specialists — Agent tool calls don't have a built-in resume. Either re-spawn a fresh Agent of the same role with the disputed finding + your counter-evidence in the prompt, or (if the original agent is still running in the background, which they aren't by default in this skill) use SendMessage to its name. For most cases, a fresh re-spawn is simplest and cheap:
Agent(subagent_type: "general-purpose",
description: "Re-examine flagged finding",
prompt: "Earlier review flagged <finding> in <file:line>. On re-reading
the code, <counter-evidence>. Defend or retract this finding
with specific code references. Under 200 words.")
Save each round to the workspace (*.followupN.out or claude-<role>.followupN.md) so the audit trail is intact.
Before applying any fixes, show the user a synthesized report. Structure:
# Review Council Report — <scope label>
## Summary
- 7 reviewers ran: codex (GPT-5.5 xhigh), gemini (3.1 Pro), claude-{security,performance,logic,regression,robustness(blind)}
- N findings total (X verified, Y dismissed as false-positive after re-reading)
- M items flagged by 3+ reviewers independently (high-confidence)
- K items where reviewers disagreed (called out below)
## P0 — Block merge
- [CODEX|GEMINI|sec] file.py:42 — <one-line issue>. <why it matters>. Fix: <concrete>.
- [reg] file.py:108 — schema-breaking change to `gold_posts.author_id` type;
consumed by narrative_aggregations.py:55. (Holistic reviewers missed this;
the regression specialist caught it via grep for callers.) Fix: <concrete>.
## P1 — Fix before merge
- [GEMINI|perf|logic] file.py:230 — N+1 query inside `for user in users:` loop;
perf agent flagged the latency, logic agent flagged that `user.posts` is also
loaded eagerly. Codex thought this was fine — its reasoning relied on a cache
that doesn't actually exist on this code path. Fix: <concrete>.
## P2 — Follow-up
...
## P3 — Nits
...
## Dismissed (false positives)
- [GEMINI] file.py:200 — claimed missing null check, but `foo` is non-null per
the type signature in bar.py:14. Skipped after asking gemini to defend (see
gemini.followup1.out — it retracted).
- [sec] file.py:300 — flagged SQL injection, but the value is a hardcoded enum
validated at the boundary in validators.py:22. Skipped.
## Proposed actions
1. Apply fix for P0 #1
2. Apply fix for P0 #2
3. Open follow-up issue for P2 #1
4. Skip the rest
Want me to apply these fixes?
Then stop and wait for the user. Do not apply fixes without confirmation — the user may have context that changes what should land.
Once approved, edit the files. After the edits:
pytest <touched paths>, mypy, etc.). Don't run a 20-minute full suite unless the user asked.Seven flagship-effort reviewers cost real money and 3–8 minutes of wall time. Don't run review-council for trivial changes (typo fixes, doc tweaks, single-file refactors with tests). Reach for it when the change is non-trivial: new business logic, security-sensitive code, infra/migrations, schema changes, or anything the user signals as high-stakes.
If the user invokes the skill on a trivial change, say so and offer the lighter /pr-review skill instead.
wait-ing on bash wastes wall-time.--yolo on gemini in non-interactive mode — without it the process can hang on tool-confirmation that will never come.TerminalQuotaError 429, ~24h reset). Time-box each at a few minutes; if one returns empty/errored, surface it clearly in the report's "Reviewer execution status" section and synthesize from whatever streams did produce output. The five Claude specialists are reliable enough to carry a review on their own when the holistic reviewers are degraded — do not block the user waiting for retries that won't change anything.# Codex review — prompt-mode (preferred: passes intent)
# Note: --base/--commit/--uncommitted are mutually exclusive with [PROMPT].
# Inline the diff in the prompt and drop the scope flag.
codex review \
--title "<commit subject>" \
-c model="gpt-5.5" -c model_reasoning_effort="xhigh" \
- <<<"$CONTEXT_AND_DIFF"
# Codex review — flag-only mode (fallback for huge diffs; loses intent)
codex review --base main -c model="gpt-5.5" -c model_reasoning_effort="xhigh"
codex review --uncommitted -c model="gpt-5.5" -c model_reasoning_effort="xhigh"
codex review --commit <sha> -c model="gpt-5.5" -c model_reasoning_effort="xhigh"
# Codex follow-up (verified syntax)
codex exec resume --last "<prompt>"
# Gemini non-interactive
gemini -m gemini-3.1-pro-preview --yolo -p "<prompt>"
# Gemini follow-up
gemini -m gemini-3.1-pro-preview --yolo -r latest -p "<prompt>"