The Bridge · 12 min mission

Plan, Execute, Red-Team: Splitting Roles Across Two Models

Let one model plan, the other implement, and a fresh model red-team what self-review misses.

tandemreviewworkflowFact-checked 2026-06-13
On this page

One model, one session, doing everything — plan, write, and judge its own work — has a structural flaw you can feel by the third hour: the model that wrote the code is the worst possible reviewer of it. It is anchored to its own choices. Ask it "is this correct?" and it explains why it is, fluently, because justifying the output it just produced is the path of least resistance. The bug it planted in step two is invisible in step nine because step nine is reading through step two's eyes.

The fix is division of labor across models. Let one model plan — turn a fuzzy task into a numbered, checkable spec. Let another model execute — follow that plan section by section, running tests as it goes. Then bring in a third, fresh model to red-team — review the diff cold, with no memory of why any line was written, framed as an adversary instead of an author. Three roles, three contexts, one of them deliberately hostile.

This maps onto the two tools you already have. Claude Code's plan mode is a natural planner; Codex's exec non-interactive runner is a natural executor; and either tool, opened in a new session on a branch it did not write, is a natural red-team. The art is in the handoffs — and the handoffs are manual, which this guide is honest about throughout. [P]

Why the planner/executor split works

The reason this is not just ceremony: a good plan changes what the executor is being asked to do. Without a plan, "build the rate limiter" is an open-ended generation task — the model invents architecture and implementation simultaneously, and architectural mistakes get baked into code before anyone can object. With a plan, the same request becomes "implement step 3 of this 8-step plan," which is a constrained task. The hard, reversible decisions (where state lives, what the interface is, what the failure mode is) are made and reviewed while they are still cheap words, not expensive diffs.

Numbered plans matter specifically because executors follow them reliably. A list of imperative steps — "1. Add the TokenBucket class to lib/ratelimit.ts. 2. Wire it into the middleware. 3. Add a test that exhausts the bucket and asserts a 429." — is a checklist the executor can work through one item at a time, run tests after each, and report against. Prose like "build a robust, production-grade rate limiter" gives it nothing to check itself against. The plan is the contract; the executor's job shrinks to fulfilling it. [P]

A second, underrated benefit: the plan is reviewable on its own, before a single line is written. A wrong plan caught at the planning stage costs you one message. The same wrong plan caught after implementation costs you the implementation. We come back to reviewing the plan (not just the code) later — it is one of the highest-leverage moves in the whole loop.

One model does everything vs. roles split across models

Single model, single session

  • Plans, writes, and reviews in one context.
  • Review is self-review — anchored to its own output, biased toward defending it. [P]
  • Architecture decisions are made inside the code-writing step, so mistakes land as diffs, not words.
  • Fast for trivial tasks. Degrades as the task grows: the bug from step 2 is invisible by step 9.

Plan / Execute / Red-team across models

  • Planner turns the task into a numbered, reviewable spec.
  • Executor follows the plan step by step, tests after each. [P]
  • Red-team is a fresh model on a branch it did not write — no anchoring, framed as an adversary. [P]
  • More moving parts and manual handoffs, but each role does the one thing it is good at.

Handoff mechanics: PLAN.md from one tool, implementation from the other

The concrete pipeline most practitioners converge on is plan in Claude Code, implement in Codex, with a file as the contract between them. The file is the handoff. There is no API between the tools — you write the plan to disk and you point the executor at it. [P]

Step 1 — plan, and write it to a file. Start Claude Code in plan mode and ask it to produce a numbered plan saved as PLAN.md. Plan mode is a real Claude Code feature [V]: claude --permission-mode plan begins the session in a read-only planning mode (the docs list plan among the accepted values for --permission-mode, alongside default, acceptEdits, auto, dontAsk, and bypassPermissions). In that mode Claude researches and proposes but does not edit, so you can iterate on the plan cheaply before any code exists.

Step 2 — execute the plan, section by section. Hand PLAN.md to Codex's non-interactive runner. codex exec (alias codex e) runs Codex headless against a prompt [V]; --sandbox workspace-write is the exact sandbox value that lets it write inside the workspace directory while staying contained [V] (the other two values are read-only and the high-risk danger-full-access). Tell it to implement one section at a time and run the test suite after each file — that cadence is the practitioner part [P]; the sandbox flag and the runner are the product part [V].

Step 3 — keep them out of each other's way with git worktrees. Run the planner and the executor (or two competing executors) in separate git worktree checkouts of the same repo, each on its own branch. Worktrees are plain git [V]; using them so two agents never fight over the same working tree is the practice [P]. Each tool gets its own directory, its own dirty state, its own test run — and you diff the branches when you are ready.

plan in Claude Code → implement in Codex
… scroll to run this session
Plan mode writes PLAN.md; codex exec implements it in a sandbox and runs tests after each file. The cp of PLAN.md between worktrees is the manual handoff — no tool does it for you.

Cross-Context Review: why a fresh model catches what the author misses

The red-team step is the one with actual research behind it. A reviewer anchored to its own output tends to defend rather than critique — when the model reviewing the code is the same model, in the same session, that produced it, it is reading its own reasoning and is primed to confirm it. Switching to a different model in a fresh session breaks that anchoring, and the breakage is the point.

This is the finding of arXiv:2603.12123, "Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions" (Tae-Eun Song) [P, research]. The paper tests four review conditions over 30 artifacts seeded with 150 injected errors. Cross-Context Review (CCR) — reviewing in a separate session with no access to the production conversation — reached an F1 of 28.6%, beating same-session self-review. The sharpest detail: reviewing twice in the same session did not help. The gain comes from context separation itself, not from more passes. Repetition inside the original context does not break the anchoring; a clean context does.

The practitioner translation is direct: don't ask the author to re-read its own work. Open a new session — ideally a different model — give it only the artifact, and it evaluates without the cognitive anchoring of having written it. That is the entire justification for handing a Claude-authored branch to Codex, or a Codex-authored branch to Claude, instead of asking either to grade its own homework.

The adversarial code-review loop

Now make the red-team hostile. A neutral "please review this" invites a neutral, agreeable read. Framing the reviewer as someone who did not write the code — and is looking for what is wrong with it — pulls in the critical findings. Two concrete ways to run it, both built on real, verified tooling [V] wrapped in a method that is practice [P]:

Option A — pipe a diff into Claude with adversarial framing. Claude Code is composable; the official docs show git diff main --name-only | claude -p "review these changed files for security issues" verbatim [V]. Take that pattern and load the content of the diff plus a hostile system instruction:

bash
git diff main...feat/ratelimit-impl | \
  claude -p --append-system-prompt \
  "You did NOT write this code. A teammate did. Your job is to find what is wrong with it: bugs, races, missing tests, security holes. Be specific and skeptical." \
  "Review this diff and list concrete problems, most severe first."

claude -p (its long form --print) runs headless and prints the response [V]; --append-system-prompt "…" appends your framing to the system prompt [V]. The "you did not write this" line is the [P] part — it is prompt craft, not a feature, but it reliably shifts the model from author-mode to critic-mode.

Option B — run Codex /review on a Claude-authored branch. Codex ships a real /review slash command [V]: it asks for a working-tree review and summarizes issues, focusing on behavior changes and missing tests. Check out the branch Claude wrote, open Codex fresh, and run /review. Because Codex never saw the code being written, it is structurally cross-context — the arXiv condition, for free.

Run 2–3 rounds. Feed the red-team's findings back to the executor, let it fix, then review again — but in a new red-team session each round so you keep the context separation that makes the review work. Two to three rounds is where most practitioners see diminishing returns. [P]

The adversarial review loop, end to end

  1. 1. Freeze the work on a branch

    Implementation lands on feat/ratelimit-impl. The red-team will only ever see this branch — never the session that produced it. That separation is the whole mechanism. [P]

  2. 2. Red-team it cold

    Either git diff main...feat/ratelimit-impl | claude -p --append-system-prompt "You did NOT write this…" [V command, P framing], or check out the branch and run Codex /review [V]. Different model from the author is ideal.

  3. 3. Triage and fix

    Hand the findings back to the executor (the model that wrote it) to fix — fixing is an author task, not a review task. Keep the planner and red-team out of this step.

  4. 4. Re-review in a fresh session

    Round 2 (and maybe 3) starts a new red-team session — not a continuation. Reviewing again in the same session does not help (arXiv:2603.12123); a clean context does. Stop at 2–3 rounds. [P]

Get a second opinion on the PLAN, not just the code

Here is the move most people skip, and it is the cheapest high-value review you can run: red-team the plan before anyone writes code. Reviewing a diff catches implementation bugs. Reviewing a plan catches the far more expensive category — wrong approach, missing requirement, an interface that will be painful in six months. A flawed plan that ships produces a flawed implementation that passes its own tests, because the tests were written to the flawed plan.

The technique is the same cross-context idea, applied earlier. Hand a fresh model only PLAN.md plus the original requirements — deliberately not the planning conversation, and not any code. Stripped to plan-and-requirements, the reviewer cannot anchor on the planner's reasoning; it has to check the plan against the goal from scratch. Ask it: "Here is the requirement and here is a proposed plan. What is wrong, missing, or risky in this plan?" [P]

bash
cat REQUIREMENTS.md PLAN.md | \
  codex exec --sandbox read-only \
  "You are reviewing someone else's plan against the stated requirements. \
   List gaps, wrong assumptions, and risks. Do not write code."

Note --sandbox read-only [V] — a plan reviewer has no business writing files, so deny it the ability entirely. Catching one bad architectural decision here saves the entire implementation built on top of it.

The four-phase loop: prototype → review → refactor → polish

For a non-trivial feature, sequence the roles into four passes instead of one big push. Each phase has a different goal and a different model posture, and crucially you do not skip ahead — a polished version of the wrong prototype is wasted polish. [P]

  1. Prototype. Executor builds the simplest thing that runs end to end. Optimize for learning whether the approach works, not for quality. Throwaway-grade is fine.
  2. Review. Fresh red-team model reads the prototype cold (cross-context) and reports what is actually wrong — design flaws, not nits. This is where you decide keep the approach or pivot.
  3. Refactor. Executor reshapes the kept prototype into the real structure, guided by the review. Tests stay green throughout; behavior is held constant while structure changes.
  4. Polish. Final pass for edge cases, error handling, docs, and the last red-team round. Only now do nits matter.

The discipline is matching effort to phase: no polishing in phase 1, no architectural pivots in phase 4. The review in phase 2 is the hinge — it is the moment the cross-context model earns its keep, because pivoting after a prototype is cheap and pivoting after polish is not.

Spec → parallel implementations → merge

The most advanced pattern uses both tools as competing executors on the same spec, then picks the better result — or fuses them. It is the most expensive workflow here and the most powerful for genuinely hard or ambiguous problems, where there is real value in seeing two independent attempts. [P]

  1. Write one SPEC.md. A single, unambiguous spec is the shared contract — both implementations are judged against it.
  2. Run both tools in isolated worktrees. git worktree add ../impl-claude -b impl/claude and git worktree add ../impl-codex -b impl/codex; point Claude Code at the first and codex exec --sandbox workspace-write at the second [V], each implementing the same SPEC.md with no knowledge of the other. Isolation is what makes them genuinely independent. [P]
  3. Diff the two branches against each other. git diff impl/claude..impl/codex shows exactly where the two models disagreed — and those disagreement points are usually the interesting parts of the problem, the spots where the spec was ambiguous or the design was non-obvious.
  4. Pick or synthesize. Keep the better implementation wholesale, or cherry-pick the stronger half of each. A fresh red-team session reading both, cross-context, is well placed to judge which is sounder — and to flag where both got it wrong, which is the signal that your spec, not the models, was the problem.
two implementations, one spec, diffed head to head
… scroll to run this session
Both tools implement the same SPEC.md in isolated worktrees, then git diff between the branches surfaces exactly where they disagreed — the interesting part of the problem.
WorkflowPlannerExecutorRed-teamThe manual handoff
Plan → executeclaude --permission-mode planPLAN.mdcodex exec --sandbox workspace-writeYou cp PLAN.md into the executor worktree
Adversarial reviewEither tool, on a branchclaude -p --append-system-prompt … or Codex /reviewYou pipe git diff / check out the branch
Plan reviewProduced PLAN.mdFresh model, fed PLAN.md + requirements onlyYou strip out the planning chat and code
Spec → parallelShared SPEC.mdBoth tools, isolated worktreesFresh model reads both branchesYou git diff branchA..branchB and pick
The role-split workflows at a glance. The tooling column is verified [V]; the role assignment and handoff are practitioner method [P].

A staff engineer's real-world run

A payments team needs an idempotency layer so retried charge requests never double-charge. High blast radius, easy to get subtly wrong — the exact profile where role-splitting pays for itself.

Plan. The lead opens Claude Code with claude --permission-mode plan and iterates on the approach until PLAN.md reads as eight numbered, testable steps: idempotency-key schema, storage with TTL, the dedup check, the race between concurrent retries, and a test per step. Plan mode never touched a file — the debate happened in words. Plan review: before any code, they pipe REQUIREMENTS.md + PLAN.md into a fresh Codex session with --sandbox read-only. It flags that the plan never specifies behavior when the same key arrives with a different request body — a real gap, fixed in the plan in two minutes. Caught in code, that is a production incident.

Execute. PLAN.md is copied into an impl worktree; codex exec --sandbox workspace-write implements it section by section, running the suite after each file and halting on the first red. Step 4 (the concurrent-retry race) fails its test twice before passing — exactly the step the plan made explicit so it could not be skipped.

Red-team, 2 rounds. Round 1: git diff main...impl | claude -p --append-system-prompt "You did NOT write this code…" — a different model, cold. It finds the TTL is refreshed on every read, so a hammered key never expires: a slow memory leak no test covered. Fixed. Round 2 in a new session (not a continuation — the arXiv point) comes back clean. The layer ships. The bug that would have paged someone at 3am was caught by a model that never wrote a line of it.

Knowledge check

You want a second model to red-team a branch Claude Code just wrote. Which setup best matches the Cross-Context Review finding (arXiv:2603.12123)?

Build your own plan / execute / red-team pipeline

Compose a cross-model pipeline

Tag each stage with the model that runs it, wire the typed artifacts between them, and copy the handoff as a shell script or a GitHub Actions workflow. One model plans, another implements, a fresh one red-teams.

Reasonable practice. These handoffs are not automatic — a human or a CI script passes PLAN.md and the git diff between the tools. The CLI commands are verified [V]; the planner / executor / red-team split is a method [P].

Plan

Write the numbered plan

PLAN.md

Execute

Implement the plan

git diff

Review

Red-team the diff

Red-team round 2 runs on Claude — the other model. A fresh context that never saw the plan breaks the self-defense bias of single-model review.

Red-team round

Add a second review pass run by the OTHER model for cross-context review.

Output format

Ship it as a local shell script or wire it into CI as a workflow.

plan-execute-review.sh
#!/usr/bin/env bash
# Plan -> Execute -> Review handoff across two models.
# NOTE: not automatic — this script is the human/CI glue that moves
# PLAN.md and the diff between the tools.
set -euo pipefail
# 1. PLAN (Claude) -> writes PLAN.md
claude --permission-mode plan 'draft a numbered implementation plan' > PLAN.md
# 2. EXECUTE (Codex) -> reads PLAN.md, edits files
codex exec 'implement PLAN.md sections 1-4; run tests after each file' --sandbox workspace-write
# capture the work for the reviewer (the typed artifact on the edge)
git diff > diff.patch
# 3. REVIEW (Codex) -> red-teams the diff
codex exec /review --sandbox read-only # red-team the working tree
# 4. RED-TEAM round 2 (Claude, the other model) ->
# a fresh context that never saw the plan breaks the self-defense bias
claude -p 'you did NOT write this code; find the bugs, security holes, and missed edge cases' < diff.patch
Assign each role to a tool and command, place the manual handoffs, and see the exact sequence you would run — every arrow a cp, git diff, or check-out you do by hand.

Reach the end and this star joins your charted sky.