# codex-peer-review

Performs symmetric two-AI peer reviews using OpenAI Codex CLI: independent blind reviews followed by structured per-issue debate for plans, code reviews, architecture, and recommendations.

```bash
npx claudepluginhub jcputney/agent-peer-review --plugin codex-peer-review
```

This skill uses the workspace's default tool permissions.
Symmetric two-AI peer review using OpenAI Codex CLI. Both AIs review independently, then debate per-issue with terminal states until convergence — not a one-shot validation.
Core principle: Asymmetric "validate my answer" loops anchor on the proposer's framing. Symmetric blind passes catch ~2x as many issues because each AI examines the work without priming. The debate phase then resolves conflicts deterministically via per-issue terminal states.
This is the single source of truth for the peer review protocol. The agent file is a thin dispatcher that loads this skill.
- @discussion-protocol.md — debate mechanics (round-by-round prompts)
- @escalation-criteria.md — when to skip the debate and go to external research
- @common-mistakes.md — anti-patterns and recovery
| Mode | Command | When |
|---|---|---|
| blind-debate (default) | /codex-peer-review | Symmetric blind pass + structured debate. Best signal. |
| classic (deprecated) | /codex-peer-review --mode classic | Old single-pass validation. Cheap, weaker signal. Will be removed. |
Auto-trigger (proactive validation before presenting plans/designs/reviews) always uses blind-debate.
Tested against codex-cli 0.118.0. Hard requirements:
- `codex exec` for ALL machine-readable output (`codex review` does NOT support `--json` or `-o` in 0.118.0 — do not use it for parseable workflows)
- `jq` (fail fast if missing — required for session ID extraction and JSONL parsing)
- A `peer-review` profile in `~/.codex/config.toml` (see "Codex Profile Setup" below)
- Do NOT use `--output-schema`. It is unstable in 0.118.0 under `--json` (timeouts, runtime panics, no output written). Schema is enforced via the prompt template instead, with regex/jq extraction as the parser.
Model selection lives in ~/.codex/config.toml, not in this plugin. This keeps CLI flags out of prompts and lets users tune without editing the plugin.
Run this once per machine (the plugin's init flow does this automatically if the profiles are missing):
```toml
# ~/.codex/config.toml

[profiles.peer-review]
model = "gpt-5.4"
model_reasoning_effort = "high"

[profiles.peer-review-summarizer]
model = "gpt-5.4-mini"
model_reasoning_effort = "low"
```
The plugin invokes Codex as `codex exec --profile peer-review ...`. Never hardcode `-m gpt-5.x-...` in agent prompts.
`gpt-5.3-codex-spark` still exists in 0.118.0 but is legacy/niche. `gpt-5.4-mini` is the durable cheap-workhorse choice.
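A minimal sketch of that init flow, assuming the profile bodies above and the same grep used by the preflight check at the end of this skill:

```bash
CONFIG="$HOME/.codex/config.toml"
mkdir -p "$(dirname "$CONFIG")"

# Append the profiles only if they are not already present
grep -qs '\[profiles.peer-review\]' "$CONFIG" || cat >> "$CONFIG" <<'EOF'

[profiles.peer-review]
model = "gpt-5.4"
model_reasoning_effort = "high"

[profiles.peer-review-summarizer]
model = "gpt-5.4-mini"
model_reasoning_effort = "low"
EOF
```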
```dot
digraph blind_debate {
  rankdir=TB;
  node [shape=box];

  start [label="User invokes /codex-peer-review\nOR Claude is about to present\na plan/design/review" shape=ellipse];
  blind [label="ROUND 0: Blind pass\nClaude review ‖ Codex review\n(neither sees the other)"];
  canon [label="Canonicalize issues\nid = sha1(file + normalized claim)\nMerge duplicates\nDrop severity:style"];
  states [label="All issues start in state: proposed"];
  debate [label="ROUND N (N≥1): Per-issue debate\nEach side responds: concede/defend/dismiss\nNew issues allowed (also start as proposed)"];
  transition [label="Transition states based on responses"];
  converged [label="All issues in terminal state?\n(accepted/rejected/merged/escalated/deferred)" shape=diamond];
  cap [label="Round count >= cap?" shape=diamond];
  extend [label="Ask user: extend cap?" shape=diamond];
  synth [label="Verdict synthesis\nCritical / Important / Contested / Dismissed" shape=ellipse];

  start -> blind;
  blind -> canon;
  canon -> states;
  states -> debate;
  debate -> transition;
  transition -> converged;
  converged -> synth [label="yes"];
  converged -> cap [label="no"];
  cap -> debate [label="no"];
  cap -> extend [label="yes"];
  extend -> debate [label="yes"];
  extend -> synth [label="no — escalate remaining"];
}
```
Default cap: 3 rounds total (1 blind + 2 debate). The peer review found that 5 (the dg default) is too expensive for serious code review and risks rationalization loops. Allow extension only when new evidence appears in the final round.
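As a sketch, the cap and convergence check combine into a simple driver loop. `run_debate_round` is a hypothetical helper that runs one debate round and updates the canonical table; the jq convergence test is spelled out later in this skill:

```bash
CAP=3       # 1 blind + 2 debate rounds
ROUND=1     # round 0 (the blind pass) has already run
while [ "$ROUND" -lt "$CAP" ]; do
  run_debate_round "$ROUND"   # hypothetical: one debate round, rewrites canonical_issues.json
  if jq -e 'map(select(.status == "proposed" or .status == "escalated")) | length == 0' \
       /tmp/canonical_issues.json >/dev/null; then
    break                     # converged: every issue is in a terminal state
  fi
  ROUND=$((ROUND + 1))
done
```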
Both AIs review the same scope with the same prompt, neither seeing the other's work.
| User input | Scope |
|---|---|
| `/codex-peer-review` (no args) | Use AskUserQuestion to select: changes vs branch / uncommitted / specific commit |
| `/codex-peer-review --base X` | `git diff X...HEAD` |
| `/codex-peer-review --uncommitted` | Staged + unstaged + untracked |
| `/codex-peer-review --commit SHA` | Single commit |
| `/codex-peer-review <question>` | Question text — no diff, validate the answer |
| Auto-trigger from Claude's plan | Plan text + affected files |
Never guess the base branch. Always ask via AskUserQuestion if not specified.
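A sketch of how those flags could map to git commands. This covers diff assembly only; the untracked-file handling is one possible approach, not the only one:

```bash
case "$1" in
  --base)
    git diff "$2...HEAD" > /tmp/scope.diff ;;
  --uncommitted)
    git diff HEAD > /tmp/scope.diff   # staged + unstaged
    git ls-files --others --exclude-standard | while read -r f; do
      git diff --no-index /dev/null "$f" >> /tmp/scope.diff || true   # untracked
    done ;;
  --commit)
    git show "$2" > /tmp/scope.diff ;;
esac
```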
Both Claude and Codex receive this exact template (variables filled in):
You are performing an independent code review. Another AI is reviewing the same
work in parallel. You will not see their findings until after this pass.
## Scope
{scope_description}
## Files / Diff
{files_or_diff}
## Review lenses (apply BOTH)
1. CRITIC LENS: For each issue you raise, you MUST provide ONE of:
- A concrete exploit path or attack scenario
- A failing test case (input + expected vs actual)
- A specific failure mode (e.g., "concurrent writes to map at handler.go:42 will panic under load")
Vague concerns ("could be improved", "might be fragile") are REJECTED.
2. DEFENDER LENS: Before raising an issue, check whether:
- An existing test covers it
- A codebase invariant or convention makes it impossible
- It is intentional per a comment, ADR, or commit message
If yes, do not raise it.
## Output format (strict)
Emit a single fenced code block tagged `findings` containing JSONL — one finding per line:
```findings
{"id":"<sha1(file+claim)>","file":"path:line","severity":"critical|high|medium|low|style","claim":"<one sentence>","evidence":"<exploit/test/failure mode>","category":"security|correctness|performance|maintainability|style"}
{"id":"...","file":"...","severity":"...","claim":"...","evidence":"...","category":"..."}
```

If you find no issues, emit an empty findings block.
After the block, write a 2-3 sentence summary of the work's overall quality and your confidence level.
### Codex invocation
```bash
codex exec --profile peer-review --sandbox read-only \
  -o /tmp/codex_round0.txt \
  --json <<'EOF' 2>&1 | tee /tmp/codex_round0.jsonl
[blind-pass prompt above]
EOF
```

`-o` writes the final assistant message only — that is what we parse for the findings block. The JSONL stream is for progress polling and session ID extraction.
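The findings block is pulled out of the `-o` file with plain text tools. A minimal extraction sketch, assuming the model followed the fence format exactly; the `_source` tag added here is what the merge step below reads, and the Claude-side file is produced the same way with `_source: "claude"`:

```bash
# Extract the ```findings fenced block, validate each line as JSON,
# and tag each finding with its origin for the merge step.
awk '/^```findings$/{f=1; next} /^```$/{f=0} f' /tmp/codex_round0.txt \
  | jq -c '. + {_source: "codex"}' > /tmp/codex_findings.json
```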
After both sides emit findings, normalize and merge:
```bash
# Concatenate both findings blocks
jq -s '.' /tmp/claude_findings.json /tmp/codex_findings.json > /tmp/all_findings.json

# Merge by id; if both AIs reported the same id, mark source="both"
jq '
  group_by(.id)
  | map({
      id: .[0].id,
      file: .[0].file,
      severity: .[0].severity,
      claim: .[0].claim,
      evidence: (map(.evidence) | unique | join(" || ")),
      category: .[0].category,
      source: (if length == 2 then "both" else .[0]._source end),
      status: "proposed"
    })
' /tmp/all_findings.json > /tmp/canonical_issues.json
```
Issue IDs are content-hashed, not positional. The same finding from both AIs collapses to one row with source: "both" (a strong signal — these usually become accepted immediately).
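A minimal sketch of the ID computation, assuming "normalized claim" means lowercased with whitespace collapsed (the skill does not pin the exact normalization rule; `sha1sum` is `shasum` on macOS):

```bash
finding_id() {
  local file="$1" claim="$2"
  # Normalize the claim: lowercase, collapse runs of whitespace
  local norm
  norm=$(printf '%s' "$claim" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' ' ')
  printf '%s%s' "$file" "$norm" | sha1sum | cut -d' ' -f1
}

finding_id "handler.go:42" "Concurrent writes to map will panic under load"
```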
Drop severity:style from the debate. Style issues never converge through debate; they're resolved by project conventions. Surface them in the final report as a separate "style notes" section, not in the verdict.
Each side sees the canonical issue table with current states and the other AI's previous-round response (the round prompt below injects both). Each side then emits per-issue stances for every non-terminal issue, plus any new issues (which start as proposed).
```
            ┌──────────┐
            │ proposed │ ◄── new in this round
            └────┬─────┘
                 │
     ┌───────────┼────────────┐
     │           │            │
     ▼           ▼            ▼
┌─────────┐ ┌──────────┐ ┌───────────┐
│accepted │ │ rejected │ │ escalated │
└─────────┘ └──────────┘ └─────┬─────┘
     ▲           ▲             │
     │           │             ▼
     │           │       (next round)
     │           │
┌────┴────┐ ┌────┴────┐
│ merged  │ │deferred │
└─────────┘ └─────────┘
```
For each non-terminal issue, after both sides respond in round N:
| Claude stance | Codex stance | New state |
|---|---|---|
| concede | concede | rejected (both withdrew) |
| defend | defend | escalated (carry to next round) |
| dismiss | dismiss | rejected |
| concede / dismiss | accept / defend | accepted (one side conceded the other's claim) |
| accept / defend | concede / dismiss | accepted |
| proposed (new evidence merges with existing) | — | merged into target id |
Terminal states: accepted, rejected, merged, deferred. Issues in these states drop out of the debate.
escalated issues survive to round N+1, but require new evidence to remain alive past round 2. An issue defended only by re-asserting the same point in round 3 auto-transitions to deferred (presented to user as "contested, both held position").
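A rough jq sketch of the transition table (jq 1.6+; file names are assumptions). It omits merges and the new-evidence rule, and treats any stance pair not listed above as `escalated`:

```bash
# Assumes one {"id","stance",...} object per line, extracted from each stances block.
jq -nc --slurpfile cl /tmp/claude_stances.jsonl --slurpfile cx /tmp/codex_stances.jsonl '
  ($cl | INDEX(.id)) as $claude | ($cx | INDEX(.id)) as $codex
  | (($claude + $codex) | keys[]) as $id
  | { id: $id,
      claude: ($claude[$id].stance // "none"),
      codex:  ($codex[$id].stance  // "none") }
  | .state = (
      if   .claude == "defend" and .codex == "defend" then "escalated"
      elif .claude == "accept" or  .codex == "accept" then "accepted"
      elif (.claude | IN("concede","dismiss"))
       and (.codex  | IN("concede","dismiss"))        then "rejected"
      else "escalated" end)
' > /tmp/transitions.jsonl
```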
converged = (no issues in state ∈ {proposed, escalated})
This is derived, not declared. No top-level converged: true flag from the model — that just invites premature claims of consensus.
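Concretely, the check is one jq expression over the canonical table (same file name as above):

```bash
jq 'map(select(.status == "proposed" or .status == "escalated")) | length == 0' \
  /tmp/canonical_issues.json   # prints true once every issue is terminal
```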
You are continuing the peer review debate. The other AI has emitted findings.
You must respond to each non-terminal issue with a stance.
## Canonical issue table (current state)
{table_of_issues_with_states}
## The other AI's round N response
{other_ai_response}
## Your task
1. For each non-terminal issue, emit ONE stance line:
```stances
{"id":"<id>","stance":"concede|defend|dismiss|accept","reasoning":"<one sentence>","new_evidence":"<optional — required to keep defend alive past round 2>"}
```

   - `concede` = you no longer believe this issue is real
   - `defend` = you maintain this issue (must include `new_evidence` in round 3+)
   - `dismiss` = you believe the OTHER AI's claim is wrong
   - `accept` = you accept the OTHER AI's claim that the issue you originally raised was wrong (use when conceding YOUR OWN issue under their dismiss)

2. After the stances block, you may emit a findings block with NEW issues only.
   New issues must follow all CRITIC LENS rules (concrete evidence required).
3. Apply both lenses (critic and defender) as before.
### Session resume (optimization, not correctness)
```bash
# Round 1 — extract session ID from JSONL
SESSION_ID=$(jq -r 'select(.type=="thread.started") | .thread_id' /tmp/codex_round0.jsonl | head -1)
# Round 2 — try to resume
if [ -n "$SESSION_ID" ]; then
  codex exec --profile peer-review --sandbox read-only resume "$SESSION_ID" \
    -o /tmp/codex_round1.txt --json <<'EOF' 2>&1 | tee /tmp/codex_round1.jsonl
[round prompt]
EOF
fi
# If resume fails (session store error, missing ID), fall back to fresh exec with full canonical table re-injected
```

Session resume is best-effort. Each round prompt re-injects the canonical issue table, so a session-store error degrades latency but not correctness.
After convergence (or cap exhausted), categorize all issues by terminal state:
| Verdict | Source states | Meaning |
|---|---|---|
| Critical | accepted AND severity ∈ {critical, high} | Real bugs, ship as required fixes |
| Important | accepted AND severity = medium | Strong recommendation |
| Contested | escalated (cap hit) OR deferred | Both held positions — present both views, user decides |
| Dismissed | rejected | Raised but withdrawn — informational only |
| Style notes | severity = style | Bypassed debate entirely — informational |
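A sketch of deriving those buckets mechanically from the canonical table (file name as used above). The table leaves accepted low-severity findings unspecified, so this sketch folds them into Important; style findings were split out before the debate and are appended separately:

```bash
jq 'map(.verdict =
      (if   .status == "rejected"                           then "Dismissed"
       elif .status == "escalated" or .status == "deferred" then "Contested"
       elif .severity == "critical" or .severity == "high"  then "Critical"
       else "Important" end))
    | group_by(.verdict)' /tmp/canonical_issues.json > /tmp/verdict_buckets.json
```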
Final report template:

```markdown
## Peer Review Result — {scope}
**Mode:** blind-debate
**Rounds:** N (converged | cap reached)
**Issues canonical:** X total, Y from both AIs, Z unique to one side
### Critical
- `file:line` — {claim}
- Evidence: {evidence}
- Source: both | claude | codex
### Important
- `file:line` — {claim}
- Evidence: {evidence}
### Contested (both AIs held position)
- `file:line` — {claim}
- Claude's view: {summary}
- Codex's view: {summary}
- Recommendation: {how user should decide}
### Dismissed (raised but withdrawn)
- `file:line` — {claim} — {why withdrawn}
### Style notes
- `file:line` — {note}
### Process notes
- {anything notable about the debate, e.g., "Codex raised 3 security issues Claude missed"}
- {immediate-escalation triggers fired, if any}
```
Some classes of finding skip the per-issue debate and go straight to external research arbitration:
If either AI flags a security concern in the blind pass, escalate that issue immediately to external research. The peer review skill is agnostic about which research tool to use — pick the best one available (web search, an MCP research tool, vendor docs, etc.). See @escalation-criteria.md.
Always run via the codex-peer-reviewer subagent. The main conversation must never see raw Codex output, JSONL streams, or per-round transcripts — only the final synthesized verdict.
The agent file (agents/codex-peer-reviewer.md) is a thin dispatcher that loads this skill and runs the protocol.
Preflight checks (run before any round):

```bash
# Required
command -v codex >/dev/null || { echo "ERROR: install codex CLI: npm i -g @openai/codex"; exit 1; }
command -v jq >/dev/null || { echo "ERROR: install jq: brew install jq"; exit 1; }

# Verify peer-review profile exists
grep -q '\[profiles.peer-review\]' ~/.codex/config.toml || {
  echo "ERROR: ~/.codex/config.toml missing [profiles.peer-review]. Run: codex-peer-review init"
  exit 1
}

# Verify auth
codex login --check 2>/dev/null || echo "WARNING: run 'codex login'"
```
| Scenario | Action |
|---|---|
| About to present design/plan | Auto-trigger blind-debate, no args |
| About to present code review | Auto-trigger blind-debate, scope = the diff |
| User asks broad question | blind-debate with question as scope, no diff |
| User runs `/codex-peer-review` | Ask for scope, run blind-debate |
| User runs `/codex-peer-review --mode classic` | Single-pass validation (deprecated) |
| Security issue surfaces | Immediate escalation, skip per-issue debate |
| Cap hit, issues still escalated | Mark as Contested, present both views |
Key changes from prior versions:

- `codex review --json` removed (the flag does not exist in 0.118.0)
- `gpt-5.3-codex-spark` removed (use Codex profiles instead)
- `converged: bool` removed (now derived from per-issue states)
- Schema enforcement moved into the prompt template (instead of `--output-schema`, which is unstable)