How It Works · Installation · Mission Lifecycle · FAQ
TandemKit
Describe your goal, approve the spec, then step away — Claude and Codex loop together until it's right.
TandemKit is a Claude Code plugin that runs three sessions — Planner, Generator, and Evaluator — two of which pair Claude and Codex as independent reviewers. You are needed at only two points: planning (answering questions and approving the spec) and review (when evaluation passes and you give feedback or call it done). Between those points, the Generator implements and the Evaluator verifies in a tight loop, with no manual review or copy-pasting from you.

In both the Planner and Evaluator sessions, Claude automatically launches Codex as a background task using the official Codex plugin, so two different models independently investigate and converge on a result — all inside Claude Code.
Why TandemKit?
Who Is It For?
TandemKit is for developers who have a Claude Max subscription (which includes Claude Code) and a ChatGPT subscription (which includes Codex), and who work on tasks complex enough to warrant the extra cost. It is not recommended for simple, small, or mechanical tasks: the multi-session loop uses more tokens than a regular Claude session.
The Reasoning
Anthropic's Harness article (March 2026) identified the core problem with agentic sessions: Claude stops too early. A single session anchors on its own work, declares "looks good!" prematurely, and misses real bugs. The fix is a separate evaluator session that verifies independently rather than rubber-stamping its own output.
The article puts it well: "tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work." That separation is TandemKit's foundation. On top of that, it pairs Claude + Codex in both planning and evaluation — two models that approach problems differently. Codex tends to explore more files and dig into details Claude passes over, and in practice it finds real bugs Claude has already marked as passing.
Concrete verification tools matter too: the Harness article showed that without them, evaluators guess from surface impressions. /tandemkit:init sets up project-type-specific tools (build, run, navigate, screenshot) so the Evaluator can do what a human reviewer would.
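To make "concrete verification" concrete: the checks below are a minimal sketch of the kind of project-type-specific tools described above, for a hypothetical Node web project. The command and URL are illustrative assumptions, not what `/tandemkit:init` actually generates — the point is that each check produces a hard pass/fail signal instead of a surface impression.

```python
import subprocess
import urllib.request


def build_passes() -> bool:
    """Run the project's build and require a zero exit code.

    "npm run build" is an illustrative assumption for a Node project.
    """
    return subprocess.run(["npm", "run", "build"]).returncode == 0


def page_renders(url: str = "http://localhost:3000") -> bool:
    """Fetch the running app and require an HTTP 200 — a real signal,
    not a guess from reading the source. URL/port are assumptions."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```

An evaluator armed with checks like these can fail a submission for an objective reason ("build exited 1", "page returned 500") rather than an impression.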
How It Works
All three sessions are Claude Code sessions. Claude orchestrates everything — including launching Codex as a background task when a second opinion is needed.
    USER ── step 1: planning
     │
     └──> [1] Planner Session
            Claude ───────────────► Codex (launched by Claude, runs in background)
              │  ◄──── findings ──   │
              └─────── converge ─────┘
                         │
                      Spec.md  ◄── you review and approve before continuing

    USER ── step 2: open both sessions in parallel — they coordinate autonomously from here
     │
     ├──> [2] Generator Session (reads Spec.md)
     │      Claude implements, commits at milestones
     │
     └──> [3] Evaluator Session
            Claude ───────────────► Codex (launched by Claude, runs in background)
              │  ◄──── findings ──   │
              └─────── converge ─────┘
                         │
                         ├──> FAIL ──> Generator fixes & resubmits ──> back to [3]
                         └──> PASS ──> Review Briefing ──> User
Three Claude Code sessions for the entire mission:
- Planner — Claude investigates the codebase and launches Codex in the background to do the same. Both produce independent findings. Claude reads Codex's results, merges them, and they iterate until converged. You answer questions and approve the final spec.
- Generator — Claude implements against the spec, committing at milestones. Fully autonomous — no Codex needed here since the Evaluator handles verification.
- Evaluator — Claude evaluates independently while Codex does the same in the background. They converge on a verdict. On FAIL, the Generator fixes and resubmits. On PASS, you review the result.
You are active during planning, then step away while Generator and Evaluator loop autonomously. When evaluation passes, you receive a Review Briefing — approve it or give feedback, and the loop continues with updated requirements until you're satisfied.
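The generate/evaluate loop described above can be sketched in a few lines. The three callables (`generate`, `evaluate`, `review`) are hypothetical stand-ins for the real Claude Code sessions — injected here so the control flow itself is visible and testable.

```python
from typing import Callable, Optional, Tuple


def run_mission(
    spec: str,
    generate: Callable[[str, Optional[str]], str],    # [2] Generator session
    evaluate: Callable[[str, str], Tuple[bool, str]], # [3] Evaluator: (passed, findings)
    review: Callable[[str], Optional[str]],           # user feedback; None means "done"
) -> str:
    """Loop Generator and Evaluator until evaluation passes AND the user approves."""
    feedback: Optional[str] = None
    while True:
        work = generate(spec, feedback)
        passed, findings = evaluate(work, spec)
        if not passed:
            feedback = findings   # FAIL: Generator fixes & resubmits
            continue
        feedback = review(work)   # PASS: Review Briefing goes to the user
        if feedback is None:
            return work           # user calls it done; mission complete
```

Note the two exits from the inner `if`: a FAIL feeds the Evaluator's findings straight back to the Generator without involving you, while a PASS surfaces a Review Briefing — your feedback simply becomes the next iteration's input, matching the lifecycle above.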