How It Works · Installation · Mission Lifecycle · FAQ
TandemKit
Describe your goal, approve the spec, then step away — Claude and Codex loop together until it's right.
TandemKit is a Claude Code plugin that runs three sessions — Planner, Generator, and Evaluator — two of which pair Claude and Codex as independent reviewers. You are needed at only two points: planning (answering questions and approving the spec) and review (when evaluation passes and you give feedback or call it done). Between those points, the Generator implements and the Evaluator verifies in a tight loop, with no manual review or copy-pasting from you.

In both the Planner and Evaluator sessions, Claude automatically launches Codex as a background task using the official Codex plugin, so two different models independently investigate and converge on a result — all inside Claude Code.
Why TandemKit?
Who Is It For?
TandemKit is for developers who have a Claude Max subscription (which includes Claude Code) and a ChatGPT subscription (which includes Codex), and who work on tasks complex enough to warrant the extra cost. It is not recommended for simple, small, or mechanical tasks: the multi-session loop uses more tokens than a regular Claude session.
The Reasoning
Anthropic's Harness article (March 2026) identified the core problem with agentic sessions: Claude stops too early. A single session anchors on its own work, declares "looks good!" prematurely, and misses real bugs. The fix is a separate evaluator session that verifies independently rather than rubber-stamping its own output.
The article puts it well: "tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work." That separation is TandemKit's foundation. On top of that, it pairs Claude + Codex in both planning and evaluation — two models that approach problems differently. Codex tends to explore more files and dig into details Claude passes over, and in practice it finds real bugs Claude has already marked as passing.
Concrete verification tools matter too: the Harness article showed that without them, evaluators guess from surface impressions. /tandemkit:init sets up project-type-specific tools (build, run, navigate, screenshot) so the Evaluator can do what a human reviewer would.
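To make "concrete verification" concrete: the checks below are a minimal sketch of the kind of project-type-specific tools described above, for a hypothetical Node web project. The command and URL are illustrative assumptions, not what `/tandemkit:init` actually generates — the point is that each check produces a hard pass/fail signal instead of a surface impression.

```python
import subprocess
import urllib.request


def build_passes() -> bool:
    """Run the project's build and require a zero exit code.

    "npm run build" is an illustrative assumption for a Node project.
    """
    return subprocess.run(["npm", "run", "build"]).returncode == 0


def page_renders(url: str = "http://localhost:3000") -> bool:
    """Fetch the running app and require an HTTP 200 — a real signal,
    not a guess from reading the source. URL/port are assumptions."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```

An evaluator armed with checks like these can fail a submission for an objective reason ("build exited 1", "page returned 500") rather than an impression.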
How It Works
All three sessions are Claude Code sessions. Claude orchestrates everything — including launching Codex as a background task when a second opinion is needed.
    USER ── step 1: planning
     │
     └──> [1] Planner Session
            Claude ───────────────► Codex (launched by Claude, runs in background)
              │  ◄──── findings ──   │
              └─────── converge ─────┘
                         │
                      Spec.md  ◄── you review and approve before continuing

    USER ── step 2: open both sessions in parallel — they coordinate autonomously from here
     │
     ├──> [2] Generator Session (reads Spec.md)
     │      Claude implements, commits at milestones
     │
     └──> [3] Evaluator Session
            Claude ───────────────► Codex (launched by Claude, runs in background)
              │  ◄──── findings ──   │
              └─────── converge ─────┘
                         │
                         ├──> FAIL ──> Generator fixes & resubmits ──> back to [3]
                         └──> PASS ──> Review Briefing ──> User
Three Claude Code sessions for the entire mission:
- Planner — Claude investigates the codebase and launches Codex in the background to do the same. Both produce independent findings. Claude reads Codex's results, merges them, and they iterate until converged. You answer questions and approve the final spec.
- Generator — Claude implements against the spec, committing at milestones. Fully autonomous — no Codex needed here since the Evaluator handles verification.
- Evaluator — Claude evaluates independently while Codex does the same in the background. They converge on a verdict. On FAIL, the Generator fixes and resubmits. On PASS, you review the result.
You are active during planning, then step away while Generator and Evaluator loop autonomously. When evaluation passes, you receive a Review Briefing — approve it or give feedback, and the loop continues with updated requirements until you're satisfied.
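The generate/evaluate loop described above can be sketched in a few lines. The three callables (`generate`, `evaluate`, `review`) are hypothetical stand-ins for the real Claude Code sessions — injected here so the control flow itself is visible and testable.

```python
from typing import Callable, Optional, Tuple


def run_mission(
    spec: str,
    generate: Callable[[str, Optional[str]], str],    # [2] Generator session
    evaluate: Callable[[str, str], Tuple[bool, str]], # [3] Evaluator: (passed, findings)
    review: Callable[[str], Optional[str]],           # user feedback; None means "done"
) -> str:
    """Loop Generator and Evaluator until evaluation passes AND the user approves."""
    feedback: Optional[str] = None
    while True:
        work = generate(spec, feedback)
        passed, findings = evaluate(work, spec)
        if not passed:
            feedback = findings   # FAIL: Generator fixes & resubmits
            continue
        feedback = review(work)   # PASS: Review Briefing goes to the user
        if feedback is None:
            return work           # user calls it done; mission complete
```

Note the two exits from the inner `if`: a FAIL feeds the Evaluator's findings straight back to the Generator without involving you, while a PASS surfaces a Review Briefing — your feedback simply becomes the next iteration's input, matching the lifecycle above.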