Implements peer review patterns with a second AI agent critiquing primary builder code via six modes (Diff Critique, Design Challenge, etc.), MCP setup, iteration loops, and Codex reviewer. Uncovers missed issues and blind spots.
npx claudepluginhub juliusbrussee/cavekit
This skill uses the workspace's default tool permissions.
Use a second AI agent to review and challenge the first agent's work. The peer reviewer exists to find what the builder missed -- not to agree, not to be polite, and not to rubber-stamp. This is the single most effective quality gate you can add beyond automated tests.
The peer reviewer's job is to find what the builder missed, not to agree.
A review that says "looks good" is a wasted review. The peer review model should be given explicit instructions to be critical, to challenge assumptions, and to look for what is not there rather than what is.
LLMs have blind spots. Every model has patterns it over-relies on, edge cases it misses, and architectural assumptions it makes implicitly. A second model -- or the same model with a different prompt and role -- catches a different set of issues.
The analogy: In traditional engineering, code review exists because the author has cognitive blind spots about their own work. The same principle applies to AI agents, but the blind spots are different: they are systematic patterns in training data, context window limitations, and prompt interpretation biases.
What peer review catches that automated tests miss:
| Mode | Timing | Mechanism |
|---|---|---|
| Diff Critique | After implementation completes | A second model inspects the changeset with a fault-finding prompt; the builder incorporates valid fixes |
| Design Challenge | During the planning phase | A second model proposes alternative designs; the builder evaluates both against spec requirements and selects the stronger option |
| Threaded Debate | When exploring complex trade-offs | Multiple exchanges occur on a persistent conversation thread so context accumulates across turns |
| Delegated Scrutiny | For substantial review tasks | A dedicated teammate agent manages the full peer review interaction and delivers a consolidated findings report to the lead |
| Deciding Vote | When two approaches conflict | The lead presents both options to the peer review model, which analyzes trade-offs and recommends a path forward |
| Coverage Audit | During the validation phase | Test coverage data and gap analysis are fed to the peer review model for independent assessment of testing thoroughness |
Need peer review
├─ Reviewing completed code?
│ ├─ Small changeset (< 500 lines) → Diff Critique
│ └─ Large changeset or full feature → Delegated Scrutiny
├─ Designing architecture?
│ ├─ Single decision point → Deciding Vote
│ └─ Full system design → Design Challenge
├─ Debating trade-offs?
│ ├─ Need extended back-and-forth → Threaded Debate
│ └─ Need a decisive answer → Deciding Vote
└─ Validating test quality?
└─ Coverage Audit
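The decision tree above can be encoded as a small helper. This is a sketch: the function and parameter names are illustrative, not part of the toolkit.

```python
def pick_review_mode(activity: str, *, small_change: bool = False,
                     single_decision: bool = False,
                     extended: bool = False) -> str:
    """Map a situation to one of the six peer review modes.

    activity: "code" (reviewing completed code), "design" (architecture),
    "tradeoffs" (debating options), or "tests" (validating test quality).
    """
    if activity == "code":
        # Small changesets fit in one review prompt; large ones need delegation.
        return "Diff Critique" if small_change else "Delegated Scrutiny"
    if activity == "design":
        return "Deciding Vote" if single_decision else "Design Challenge"
    if activity == "tradeoffs":
        return "Threaded Debate" if extended else "Deciding Vote"
    if activity == "tests":
        return "Coverage Audit"
    raise ValueError(f"unknown activity: {activity}")
```
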
Any AI model that exposes an MCP server interface can serve as a peer reviewer. The setup is model-agnostic: any model that supports the MCP protocol will work.
Add the peer review model as an MCP server in your project's .mcp.json:
{
"mcpServers": {
"peer reviewer": {
"command": "{ADVERSARY_CLI}",
"args": ["mcp-server"],
"env": {
"API_KEY": "{ADVERSARY_API_KEY}"
}
}
}
}
Replace {ADVERSARY_CLI} with the CLI command for your chosen model (any CLI tool that supports MCP server mode) and {ADVERSARY_API_KEY} with the appropriate credentials.
Most peer review model MCP servers expose two tools:
Start session -- Begin a new conversation with the peer review model
Reply to session -- Continue an existing conversation
The thread/session identifier is critical -- it allows multi-turn conversations where the peer reviewer builds on previous context.
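Client-side, the two tools compose into a multi-turn exchange. The sketch below assumes a hypothetical `mcp_call` helper that invokes an MCP tool by name; the helper and the result keys are illustrative, not a real SDK.

```python
def start_session(mcp_call, prompt: str, model: str) -> str:
    """Begin a new review conversation; returns the thread id for follow-ups."""
    result = mcp_call("start_session", {"prompt": prompt, "model": model})
    return result["thread_id"]

def reply_to_session(mcp_call, thread_id: str, message: str) -> str:
    """Continue an existing conversation so the reviewer keeps prior context."""
    result = mcp_call("reply_to_session",
                      {"thread_id": thread_id, "message": message})
    return result["text"]
```

Keeping the thread id stable across calls is what lets the reviewer build on its own earlier findings instead of starting cold each turn.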
Tool: peer-reviewer.start_session
Parameters:
prompt: "Review the following code changes for bugs, security issues,
missing edge cases, and spec compliance. Be critical -- your
job is to find problems, not to agree. Here are the changes:
{DIFF_CONTENT}"
model: "{ADVERSARY_MODEL}"
Tool: peer-reviewer.reply_to_session
Parameters:
thread_id: "{THREAD_ID_FROM_PREVIOUS}"
message: "Good findings. Now focus specifically on error handling paths.
For each function that can fail, verify there is explicit
error handling and that errors propagate correctly."
When: After a builder agent completes implementation of a feature or fix.
Process:
git diff {BASE_BRANCH}...HEAD
Review Prompt Template:
You are a senior code reviewer. Review the following code changes critically.
## What to look for:
- Bugs, logic errors, off-by-one errors
- Security vulnerabilities (injection, auth bypass, data exposure)
- Missing error handling and edge cases
- Performance issues (N+1 queries, unnecessary allocations, blocking calls)
- Cavekit compliance: does this implementation match the requirements?
- Code quality: naming, structure, unnecessary complexity
## What NOT to do:
- Do not say "looks good" unless you genuinely found zero issues
- Do not suggest stylistic changes unless they affect readability significantly
- Do not rewrite the code -- describe the problem and where it is
## Cavekit requirements for this feature:
{CAVEKIT_REQUIREMENTS}
## Code changes:
{DIFF_CONTENT}
## Output format:
For each finding:
- **Severity:** CRITICAL / HIGH / MEDIUM / LOW
- **File:** path and line range
- **Issue:** what is wrong
- **Why:** why this matters
- **Suggestion:** how to fix it
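The diff-collection and prompt-assembly steps can be sketched in Python. This assumes a hypothetical harness: `collect_diff` and `build_review_prompt` are illustrative names, and the prompt constant is abbreviated from the full template above.

```python
import subprocess

REVIEW_PROMPT = """You are a senior code reviewer. Review the following code changes critically.

## Cavekit requirements for this feature:
{requirements}

## Code changes:
{diff}
"""

def collect_diff(base_branch: str = "main") -> str:
    # Three-dot diff: changes on HEAD since the merge-base with the base branch.
    result = subprocess.run(
        ["git", "diff", f"{base_branch}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def build_review_prompt(diff: str, requirements: str) -> str:
    # Fill the template; the result is passed to the reviewer's start_session tool.
    return REVIEW_PROMPT.format(requirements=requirements, diff=diff)
```
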
When: During the planning phase, before implementation begins.
Process:
Architecture Review Prompt Template:
You are a systems architect reviewing a proposed design. Your goal is to
find weaknesses, over-engineering, missing considerations, and better
alternatives.
## Kits (what must be built):
{CAVEKIT_CONTENT}
## Proposed architecture:
{PLAN_CONTENT}
## Evaluate:
1. Does this architecture satisfy all cavekit requirements?
2. Is it over-engineered for the scope?
3. Are there simpler alternatives that meet the same requirements?
4. What failure modes exist? How does the system recover?
5. What are the scaling bottlenecks?
6. What dependencies introduce risk?
When: Complex design discussions that require extended back-and-forth.
Process:
Key consideration: Thread-based conversations accumulate context. Keep the conversation focused on a single topic to avoid context dilution.
When: Large tasks where the peer review itself is substantial.
Process:
Why delegate: The peer review back-and-forth can consume significant context window. Delegating it to a dedicated teammate preserves the team lead's context for coordination.
When: The builder agent and human (or two agents) disagree on an approach.
Process:
Tie-Breaking Prompt Template:
Two approaches have been proposed for the same problem. Evaluate both
critically and recommend one.
## Context:
{PROBLEM_DESCRIPTION}
## Approach A:
{APPROACH_A}
## Approach B:
{APPROACH_B}
## Evaluation criteria:
- Correctness: which approach is more likely to be correct?
- Simplicity: which is easier to understand and maintain?
- Performance: which performs better for the expected use case?
- Risk: which has fewer failure modes?
## Your recommendation:
Pick one and explain why. If neither is clearly better, say so and
explain what additional information would break the tie.
When: During validation, after tests have been generated and run.
Process:
Instead of a single build-then-review pass, run convergence loops that alternate between building and reviewing on each iteration.
Iteration 1: Builder runs against spec → produces code
Iteration 2: Reviewer runs against code + spec → produces findings
Iteration 3: Builder runs against spec + findings → fixes code
Iteration 4: Reviewer runs against updated code + spec → produces new findings
...repeat until findings converge to zero (or trivial)
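The alternation above can be sketched as a driver function. This is a sketch only: `build` and `review` stand in for whatever invokes the builder and reviewer agents, and the convergence test (zero findings) is the simplest possible one.

```python
def convergence_loop(build, review, max_iters: int = 10) -> int:
    """Alternate build and review until the reviewer reports no findings.

    build(findings) -> code state, addressing any prior findings.
    review(code)    -> list of new findings (empty list means converged).
    Returns the number of build iterations run.
    """
    findings = []
    for i in range(max_iters):
        code = build(findings)    # builder fixes what the reviewer flagged
        findings = review(code)   # reviewer inspects the updated code
        if not findings:          # converged: nothing left to flag
            return i + 1
    return max_iters              # hit the cap; escalate to a human
```

The `max_iters` cap matters in practice: without it, a builder and reviewer that disagree on an ambiguous spec will loop forever.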
Create two prompt files:
prompts/build.md -- The builder prompt:
Implement the requirements in the cavekit. Read implementation tracking for
context on what has been done. Read any review findings and address them.
Input: kits/, plans/, impl/, review-findings.md (if exists)
Output: source code, updated impl tracking
Exit: all cavekit requirements implemented, all review findings addressed
prompts/review.md -- The reviewer prompt:
Review the current implementation against the cavekit. Be critical. Find
bugs, missing requirements, security issues, and quality problems.
Input: kits/, plans/, source code, impl/
Output: review-findings.md
Exit: all source files reviewed against all cavekit requirements
# Terminal 1: Builder convergence loop
{LOOP_TOOL} prompts/build.md -n 5 -t 2h
# Terminal 2: Reviewer convergence loop (staggered by 30 min)
{LOOP_TOOL} prompts/review.md -n 5 -t 2h -d 30m
The builder and reviewer share the same git repository. The reviewer reads the
builder's latest committed code; the builder reads the reviewer's latest
review-findings.md. They converge naturally through git.
The peer review loop has converged when:
Problem: The peer review model says "looks good" without finding real issues. Fix: Explicitly instruct the peer reviewer to find problems. Add to the prompt: "If you find zero issues, explain what areas you checked and why you believe they are correct. An empty review is suspicious."
Problem: The peer reviewer provides complete rewrites instead of identifying issues. Fix: Instruct the peer reviewer to describe problems and locations, not to write code. "Your output is a list of findings, not a pull request."
Problem: The builder agent dismisses peer reviewer findings without addressing them. Fix: Require the builder to explicitly respond to each finding: "For each review finding, either fix it and explain the fix, or explain why the finding is not valid. You may not skip any finding."
Problem: Builder and reviewer keep going back and forth without converging. Fix: Set a maximum iteration count. After N iterations, escalate to human. If the disagreement persists, it likely indicates an ambiguous spec that needs human clarification.
Problem: Using the same model with the same prompt for both building and reviewing. Fix: At minimum, use different prompts with different roles. Ideally, use a different model or a different model version. The value of peer review comes from diverse perspectives.
| Mode | Key Prompt Instruction |
|---|---|
| Diff Critique | "Find bugs, security issues, missing edge cases. Do not say 'looks good'." |
| Design Challenge | "Find weaknesses and simpler alternatives. Evaluate failure modes." |
| Threaded Debate | "Continue the discussion. Build on previous context." |
| Delegated Scrutiny | "Own the peer reviewer interaction. Summarize findings for the lead." |
| Deciding Vote | "Evaluate both approaches. Recommend one with explicit reasoning." |
| Coverage Audit | "Identify untested edge cases and spec requirements without tests." |
Peer review fits into the Hunt lifecycle at multiple points:
| Hunt Phase | Peer Review Role |
|---|---|
| Draft | Review kits for completeness, ambiguity, missing edge cases |
| Architect | Architecture Review: challenge the plan before implementation begins |
| Build | Code Review: review implementation against kits after each feature |
| Inspect | Peer Review iteration loop: alternate build/review convergence |
| Monitor | Test Coverage Review: validate that monitoring covers all failure modes |
The most impactful point is during Inspect -- peer review iteration catches issues that neither automated tests nor single-agent convergence loops find.
The most rigorous automated quality process available: run a cavekit through a Ralph Loop where Claude builds and Codex adversarially reviews every few iterations. A completely different model (different training data, different biases, different blind spots) challenges your implementation.
| Factor | Single-Model Loop | Codex Loop Mode |
|---|---|---|
| Blind spots | Same model, same blind spots every iteration | Two models catch different classes of issues |
| Cavekit drift | Builder may silently deviate from cavekit | Peer reviewer checks cavekit compliance explicitly |
| Quality floor | Converges to "good enough for one model" | Converges to "survives cross-examination" |
| Dead ends | May retry failed approaches | Peer reviewer flags repeated patterns |
┌─────────────────────────────────────────────────────┐
│ Ralph Loop │
│ (Stop hook feeds same prompt each iteration) │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Claude │───▶│ Build from │───▶│ Commit │ │
│ │ (Build) │ │ cavekit │ │ changes │ │
│ └──────────┘ └──────────────┘ └──────┬─────┘ │
│ ▲ │ │
│ │ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Fix │◀──│ Parse │◀──│ Codex CLI │ │
│ │ findings │ │ findings │ │ (Review) │ │
│ └──────────┘ └──────────────┘ └────────────┘ │
│ │
│ Completion: all cavekit requirements met + │
│ no CRITICAL/HIGH findings │
└─────────────────────────────────────────────────────┘
CLI delegation: scripts/codex-review.sh calls codex directly in --approval-mode full-auto with a structured review prompt. Faster, with no MCP server overhead. Findings are parsed and appended to context/impl/impl-review-findings.md.
MCP fallback: the reviewer is configured in .mcp.json, and Claude calls the MCP tool on review iterations. Used only when Codex CLI delegation is unavailable.
setup-build.sh auto-detects which path to use: if codex-review.sh is present and the codex CLI is available, CLI delegation is used; otherwise it falls back to the MCP configuration.
/ck:make --peer-review # activates Codex Loop Mode (default interval: every 2nd iteration)
/ck:make --peer-review --review-interval 1 # review every iteration (maximum rigor)
/ck:make --peer-review --codex-model gpt-5.4-mini # faster, cheaper reviewer
--peer-review falls back to .mcp.json if CLI delegation is unavailable. To invoke the CLI reviewer directly:
source scripts/codex-review.sh
bp_codex_review --base main
The CLI path produces structured findings with severity levels (P0–P3) and handles fallback gracefully if Codex is unavailable.
{
"mcpServers": {
"codex-reviewer": {
"command": "codex",
"args": ["mcp-server", "-c", "model=\"gpt-5.4\""]
}
}
}
Iteration 1: BUILD — Read cavekit, implement first requirement
Iteration 2: REVIEW — Call Codex CLI (or MCP fallback), get findings, fix CRITICAL/HIGH
Iteration 3: BUILD — Continue implementing, address remaining findings
Iteration 4: REVIEW — Call Codex CLI again, new findings on new code
...
Iteration N: BUILD — All requirements met, all findings fixed
→ outputs <promise>CAVEKIT COMPLETE</promise>
Default review interval: every 2nd iteration. --review-interval 1 = review every iteration.
Review findings tracked in context/impl/impl-review-findings.md:
# Peer Review Findings
## Latest Review: Iteration 4 — 2026-03-14T10:30:00Z
### Reviewer: Codex (gpt-5.4)
| # | Severity | File | Issue | Status |
|---|----------|------|-------|--------|
| 1 | CRITICAL | src/auth.ts:L42 | Missing input validation on token | FIXED |
| 2 | HIGH | src/auth.ts:L67 | Race condition in session refresh | FIXED |
| 3 | MEDIUM | src/auth.ts:L15 | Unused import | NEW |
| 4 | LOW | src/auth.ts:L3 | Comment typo | WONTFIX |
## History
### Iteration 2
| # | Severity | File | Issue | Status |
|---|----------|------|-------|--------|
| 1 | CRITICAL | src/auth.ts:L20 | SQL injection in login query | FIXED |
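A findings file in this pipe-delimited layout is easy to parse back into structured records, e.g. to count open CRITICAL/HIGH findings when checking the completion promise. A minimal sketch, assuming exactly the five-column table shown above:

```python
def parse_findings(markdown: str) -> list[dict]:
    """Extract finding rows from a pipe-delimited markdown table.

    Assumes the layout: | # | Severity | File | Issue | Status |.
    Header and separator rows are skipped because their first cell
    is not a number.
    """
    findings = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 5 and cells[0].isdigit():
            findings.append({
                "severity": cells[1],
                "file": cells[2],
                "issue": cells[3],
                "status": cells[4],
            })
    return findings

def open_blockers(findings: list[dict]) -> list[dict]:
    # Unresolved CRITICAL/HIGH findings block the completion promise.
    return [f for f in findings
            if f["severity"] in ("CRITICAL", "HIGH") and f["status"] != "FIXED"]
```
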
The loop exits when the completion promise is output. The prompt instructs Claude to ONLY output it when ALL of these are true:
For reviewing existing code against a cavekit without building:
/ck:review --codex # single Codex-only review (see /ck:review command)
Each iteration calls Codex to review existing code against the cavekit, then fixes issues found.
Install the Codex CLI: npm install -g @openai/codex (authenticate with codex login or an env var).
The peer review loop has converged when:
If the loop hits max iterations without converging:
Check context/impl/impl-review-findings.md for persistent issues. Use /ck:revise --trace to trace issues back to kits.