The Forge · 11 min mission

Programmatic Codex: SDK & Subagents

Drive Codex from TypeScript or Python and fan tasks out to parallel sub-agents.

sdk · subagents · codex · Fact-checked 2026-06-13

On this page

When the SDK beats codex exec
TypeScript: @openai/codex-sdk
Resuming threads and streaming progress
Python: openai-codex, sync and async
Structured output: get JSON back, not prose
Subagents: one orchestrator, many workers
spawn_agents_on_csv: batch the same task over a table
Senior scenario: an async PR-review bot over 50 repos
The throughline

codex exec is great until you need a program around the agent. The moment you want to loop over 50 repositories, parse the agent's answer as typed data, run ten reviews at once, or wire Codex into a webhook handler, shelling out to a CLI and scraping its stdout stops being clever and starts being a liability.

The Codex SDK is the same agent — same model, same sandbox, same config — exposed as a library you import. You get a thread object, you call run(), you get a result back in your own process. No subprocess parsing, no fragile string munging, and you can fan the work out across subagents that run in parallel. This guide is about that: when the SDK beats the CLI, the exact TypeScript and Python surface, how to get structured output back, and how to orchestrate a swarm of agents without melting your machine.

When the SDK beats codex exec

codex exec "..." is the right tool for one-shot automation: a CI step, a git hook, a shell pipeline. It streams progress to stderr and prints only the final agent message to stdout [V], so codex exec "summarize the diff" | pbcopy just works. Reach for the SDK the moment any of these is true:

You need the result as typed data, not prose to re-parse — a JSON object with fields you can branch on.
You need to run many agents concurrently and collect their results (the async/fan-out case).
The agent is one step in a larger program — a server, a queue worker, a bot — where managing a child process per request is the wrong abstraction.
You want to resume a long-lived thread across separate invocations and keep its context.

The dividing line is simple: if a single string in and a single string out is enough, use codex exec. If you need control flow, types, or parallelism around the agent, use the SDK.

codex exec vs. the SDK

codex exec (CLI)

Shape: one prompt in, final message on stdout.

codex exec "fix the failing test" \
  --sandbox workspace-write \
  --output-schema ./schema.json \
  -o result.json

Perfect for CI steps, git hooks, and shell pipelines. You get JSON Lines with --json and a final message you can pipe. But control flow lives in bash, and parallelism means juggling child processes.

@openai/codex-sdk / openai-codex

Shape: a thread object in your process; run() returns a typed result.

const codex = new Codex();
const thread = codex.startThread();
const turn = await thread.run("fix the failing test");
console.log(turn.finalResponse);

Control flow lives in your language. Loop, await, Promise.all, branch on a parsed object. This is the only sane path once you have dozens of tasks or need the answer as data.

TypeScript: @openai/codex-sdk

Install @openai/codex-sdk [V] and you get three calls that cover almost everything: construct a client, start a thread, run a turn.

import { Codex } from "@openai/codex-sdk";
 
const codex = new Codex();                  // uses your existing Codex auth/config
const thread = codex.startThread();         // a fresh conversation
const turn = await thread.run("Make a plan to diagnose and fix the CI failures");
 
console.log(turn.finalResponse);            // the agent's final message
console.log(turn.items);                    // every item it produced this turn

A thread is a conversation with memory; a turn is one run() and the items it produced [V]. run() resolves to a turn object whose finalResponse is the agent's last message and whose items array holds everything that happened — reasoning, command executions, file changes [V]. Authentication and configuration are inherited from your normal Codex setup, so a script that runs locally needs no extra wiring; in CI you pass credentials through the environment the SDK reads [P].

startThread() takes options to pin the thread to a project and relax the git guardrail:

const thread = codex.startThread({
  workingDirectory: "/path/to/project",
  skipGitRepoCheck: true,
});

Both workingDirectory and skipGitRepoCheck are verified SDK options [V] — the second is the SDK equivalent of the CLI's --skip-git-repo-check, which you need when the agent runs somewhere that isn't a git repo.

Resuming threads and streaming progress

Threads are persisted to ~/.codex/sessions [V]. If your process restarts — or a webhook fires a follow-up an hour later — you do not lose the conversation. Reconstruct it from its id and keep going:

// First invocation
const thread = codex.startThread();
await thread.run("Start refactoring the auth module");
const savedThreadId = thread.id;  // persist this id somewhere durable (assumes the thread exposes its id as thread.id)
 
// Later, in a fresh process
const thread2 = codex.resumeThread(savedThreadId);
await thread2.run("Now add tests for what you changed");

resumeThread(threadId) reconnects to an existing thread by id and returns a thread you can run() again, with all prior context intact [V].
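Where the id is kept is up to you. As a language-agnostic illustration, here is a minimal Python sketch that stores ids in a small JSON file keyed by whatever triggers the follow-up. The helper names, the store path, and the key scheme are all hypothetical and not part of the SDK; a database row would be the more likely choice in production.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_thread_id(store: Path, key: str, thread_id: str) -> None:
    """Record thread_id under key, keeping any ids already stored."""
    data = json.loads(store.read_text()) if store.exists() else {}
    data[key] = thread_id
    store.write_text(json.dumps(data, indent=2))


def load_thread_id(store: Path, key: str) -> Optional[str]:
    """Return the saved id for key, or None if nothing was stored."""
    if not store.exists():
        return None
    return json.loads(store.read_text()).get(key)


# Hypothetical usage: the key and the id value are placeholders.
store = Path(tempfile.mkdtemp()) / "threads.json"
save_thread_id(store, "auth-refactor", "thread-id-from-first-run")
print(load_thread_id(store, "auth-refactor"))
```

The later process looks the id up by the same key and passes it to resumeThread.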

For long turns where you want to react to intermediate progress — show a tool call, stream tokens to a UI, surface file diffs as they happen — use runStreamed() instead of run() [V]. It hands back an async iterable of events:

const { events } = await thread.runStreamed("Audit the codebase for N+1 queries");
for await (const event of events) {
  switch (event.type) {
    case "item.completed":
      // a reasoning step, command, or file edit finished
      break;
    case "turn.completed":
      // the whole turn is done
      break;
  }
}

The same event vocabulary (thread.started, item.completed, turn.completed, error) is what the CLI emits with --json [V] — runStreamed is that stream, in your language, without parsing JSON Lines by hand.
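For comparison, this is roughly the hand-parsing you take on when you consume the CLI's --json output yourself. It is a sketch only: the event type names are the ones listed above, but the sample payload fields (such as item and its contents) are invented for illustration and may not match the real event shape.

```python
import json
from collections import Counter

# Illustrative sample; real --json output will carry different payloads.
SAMPLE_STREAM = """\
{"type": "thread.started"}
{"type": "item.completed", "item": {"kind": "command"}}
{"type": "item.completed", "item": {"kind": "file_change"}}
{"type": "turn.completed"}
"""


def summarize_events(lines):
    """Count events by type, stopping at turn.completed or raising on error."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between events
        event = json.loads(raw)
        counts[event["type"]] += 1
        if event["type"] == "error":
            raise RuntimeError(f"agent reported an error: {event}")
        if event["type"] == "turn.completed":
            break
    return counts


counts = summarize_events(SAMPLE_STREAM.splitlines())
print(dict(counts))
```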

Python: openai-codex, sync and async

The Python package is openai-codex [V]. It ships two clients used as context managers: Codex (synchronous) and AsyncCodex (asyncio). You start a thread with thread_start(...), which takes model and sandbox directly:

from codex import Codex, Sandbox
 
with Codex() as codex:
    thread = codex.thread_start(model="gpt-5.4", sandbox=Sandbox.workspace_write)
    result = thread.run("Make a plan to diagnose and fix the CI failures")
    print(result.final_response)

Note the snake_case: the Python result exposes final_response [V] where TypeScript exposes finalResponse. The sandbox argument takes one of the presets below — Sandbox.workspace_write lets the agent edit files inside the workspace, which is what you want for "fix this" tasks.

The async client is the one that matters for scale. AsyncCodex [V] gives you awaitable thread_start and run, which means you can launch dozens of independent agents and gather their results with asyncio.gather — no subprocess pool, no thread pool, just coroutines. That is the senior scenario later in this guide.

SDK preset (Python)	CLI flag	What the agent can touch	Use it for
`Sandbox.read_only`	`--sandbox read-only` (default)	Read files only — no writes, no network	Review, audit, Q&A over a repo
`Sandbox.workspace_write`	`--sandbox workspace-write`	Read and write inside the workspace	Fixes, refactors, codegen — the common case
`Sandbox.full_access`	`--sandbox danger-full-access`	Unrestricted filesystem and network	Only in throwaway/controlled environments

Sandbox presets — the same three across the CLI and SDK, named workspace-write on the command line and Sandbox.workspace_write in Python. Default is read-only.
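One way to keep the "weakest sandbox that works" habit honest is to encode the table as a lookup. A minimal sketch: the task categories are hypothetical labels chosen for this example, and the returned strings are the CLI sandbox values from the table above.

```python
# Hypothetical task categories mapped to the CLI sandbox values in the table.
SANDBOX_FOR_TASK = {
    "review": "read-only",
    "audit": "read-only",
    "qa": "read-only",
    "fix": "workspace-write",
    "refactor": "workspace-write",
    "codegen": "workspace-write",
}


def sandbox_for(task_kind: str) -> str:
    """Return the least-privileged sandbox for a task kind.

    Unknown kinds fall back to read-only, matching the default preset.
    danger-full-access is deliberately never returned.
    """
    return SANDBOX_FOR_TASK.get(task_kind, "read-only")


print(sandbox_for("review"))
print(sandbox_for("refactor"))
print(sandbox_for("something-new"))
```

Defaulting unknown kinds to read-only means a new task type has to be granted write access explicitly.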

Structured output: get JSON back, not prose

The single biggest reason to drive Codex programmatically is to stop parsing English. Both the CLI and the SDK can enforce a JSON Schema on the final answer so run() hands you data your code can branch on.

On the CLI, --output-schema <file> points at a JSON Schema and the final message is guaranteed to match it [V]:

{
  "type": "object",
  "properties": {
    "project_name": { "type": "string" },
    "languages":    { "type": "array", "items": { "type": "string" } }
  },
  "required": ["project_name", "languages"]
}

codex exec "Extract this repo's metadata" --output-schema ./schema.json -o output.json

In the TypeScript SDK the same idea is a per-turn option — pass outputSchema to run() and the agent's answer conforms to it [V]:

const schema = {
  type: "object",
  properties: { severity: { type: "string" }, files: { type: "array", items: { type: "string" } } },
  required: ["severity", "files"],
  additionalProperties: false,
};
 
const turn = await thread.run("Triage this PR's risk", { outputSchema: schema });
const verdict = JSON.parse(turn.finalResponse); // typed, branchable

You do not have to hand-write the schema: generate it from a Zod schema with zod-to-json-schema (target "openAi") and keep one source of truth for both validation and the agent contract [V].
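Even with a schema enforced on the agent side, it is cheap to check the parsed object before branching on it. Below is a minimal standard-library sketch in Python for the severity/files shape used above. The check_verdict helper is hypothetical, and a full JSON Schema validator would be the more general choice.

```python
import json


def check_verdict(raw: str) -> dict:
    """Parse the final response and verify the fields the schema requires."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("expected a JSON object")
    if set(verdict) != {"severity", "files"}:
        raise ValueError(f"unexpected keys: {sorted(verdict)}")
    if not isinstance(verdict["severity"], str):
        raise ValueError("severity must be a string")
    files = verdict["files"]
    if not isinstance(files, list) or not all(isinstance(f, str) for f in files):
        raise ValueError("files must be a list of strings")
    return verdict


# Stand-in for the agent's final response string.
sample = '{"severity": "high", "files": ["src/auth.ts"]}'
verdict = check_verdict(sample)
if verdict["severity"] == "high":
    print("block merge:", verdict["files"])
```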

Reasoning effort controls how hard the agent thinks before it answers. On the CLI it is set with -c model_reasoning_effort=<level> [V] (e.g. low, medium, high), the same key the custom agent files use, and via config in the SDK; higher effort buys deeper analysis at the cost of latency and tokens. The practical pattern [P] is a two-phase pipeline: a cheap, low-effort pass to classify or filter (does this PR even need review?), then a high-effort pass with a strict outputSchema only on the items that survived. You spend your expensive reasoning where it changes a decision, not on everything.
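The two-phase pattern can be sketched without any SDK calls by treating each pass as a function. Everything here is a stand-in: cheap_pass and deep_pass are hypothetical stubs marking where a low-effort and a high-effort agent run would go, and the PR data is made up.

```python
from typing import Callable, Iterable, List


def two_phase(
    items: Iterable[dict],
    cheap_pass: Callable[[dict], bool],
    deep_pass: Callable[[dict], dict],
) -> List[dict]:
    """Run deep_pass only on the items that cheap_pass lets through."""
    survivors = [item for item in items if cheap_pass(item)]
    return [deep_pass(item) for item in survivors]


def cheap_pass(pr: dict) -> bool:
    # Stub for a low-effort classification run: skip docs-only PRs.
    return not all(path.endswith(".md") for path in pr["files"])


def deep_pass(pr: dict) -> dict:
    # Stub for a high-effort run with a strict output schema.
    return {"pr": pr["id"], "needs_review": True}


prs = [
    {"id": 1, "files": ["README.md"]},
    {"id": 2, "files": ["src/auth.py", "README.md"]},
]
results = two_phase(prs, cheap_pass, deep_pass)
print(results)
```

Only PR 2 reaches the expensive pass, because PR 1 touches nothing but documentation.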

Subagents: one orchestrator, many workers

A single thread is one worker. Subagents let a primary agent spawn child agents that run in their own context and report back — so a big task splits into parallel pieces instead of one long serial slog. Codex ships three built-in agents [V]:

default — the general-purpose fallback agent.
worker — an execution-focused agent for implementation and fixes.
explorer — a read-heavy agent tuned for codebase exploration.

The mental model: an explorer maps the territory (where does auth live? which files import this?), a worker changes it (apply the fix, write the test), and the orchestrator stitches their results together. You spawn explorers to investigate in parallel without polluting the main thread's context, then hand the findings to workers.

You define custom agents as standalone TOML files — ~/.codex/agents/ for personal agents, .codex/agents/ for project-scoped ones you commit with the repo [V]. Each file needs name (how it's spawned), description (when to use it), and developer_instructions (the core behavior), with optional model, model_reasoning_effort, sandbox_mode, mcp_servers, and skills.config [V]:

# .codex/agents/migration-checker.toml
name = "migration-checker"
description = "Audits a service for a specific framework migration and reports gaps."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
You audit one repository for the v2 migration. Check imports, config keys, and
deprecated calls. Report only concrete, file-anchored findings — never speculate.
"""

Verified fact: The two numbers that govern the swarm: max_threads and max_depth

Orchestration limits live under [agents] in your Codex config. max_threads defaults to 6 [V] — the number of subagent threads that can run concurrently, your parallelism ceiling. max_depth defaults to 1 [V], which "allows a direct child agent to spawn but prevents deeper nesting" — i.e. the orchestrator can spawn workers, but those workers cannot spawn workers of their own. Raise max_threads to widen the fan-out (bounded by your machine and rate limits); raise max_depth only if you genuinely need agents that spawn agents, because every level multiplies cost and the blast radius of a bad prompt. There is also job_max_runtime_seconds (no default; falls back to 1800s per worker) [V] to keep a stuck worker from running forever.

Key	Default	What it controls
`max_threads`	`6`	Max subagent threads running concurrently — your fan-out width
`max_depth`	`1`	Nesting depth; `1` = orchestrator spawns workers, workers cannot spawn
`job_max_runtime_seconds`	(none → 1800/worker)	Per-worker wall-clock cap so a stuck agent dies

The [agents] orchestration knobs and their defaults, verified against the subagents docs.
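A back-of-the-envelope calculation shows how these knobs interact. Assuming 50 tasks, the default max_threads of 6, and the pessimistic case where every worker runs into the 1800-second fallback cap, the batch needs at most ceil(50 / 6) = 9 rounds. The function name below is hypothetical.

```python
import math


def worst_case_batch_seconds(tasks: int, max_threads: int, cap_seconds: int) -> int:
    """Upper bound on wall-clock time if every worker runs to its cap."""
    rounds = math.ceil(tasks / max_threads)
    return rounds * cap_seconds


bound = worst_case_batch_seconds(tasks=50, max_threads=6, cap_seconds=1800)
print(bound, "seconds, or", bound / 3600, "hours")
```

That works out to 16200 seconds, or 4.5 hours, which is why a tighter per-worker cap matters for anything that has to finish the same day.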

spawn_agents_on_csv: batch the same task over a table

For the "run one prompt over every row of a spreadsheet" shape, Codex has an experimental built-in: spawn_agents_on_csv [V]. You give it a CSV and a worker prompt with {column_name} placeholders, and it spins up one worker per row, bounded by your concurrency settings.

Its parameters [V]: csv_path (the source table), instruction (the worker prompt, with {column} substitutions per row), optional id_column (a stable per-item identifier), output_schema (a JSON Schema each worker's result must match), plus job control — output_csv_path, max_concurrency, and max_runtime_seconds. The hard rule: each worker must call report_agent_job_result exactly once [V], or that row is marked as an error in the exported CSV. It is the cleanest path when your fan-out is genuinely tabular — one row, one task, one structured result.
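To make the {column} substitution concrete, here is a small Python sketch of the idea: one rendered instruction per CSV row. It illustrates the templating concept only and is not how Codex implements spawn_agents_on_csv; the table contents and column names are made up.

```python
import csv
import io

# Made-up table; the column names are placeholders for this example.
CSV_TEXT = """repo,owner
billing-api,payments
search-web,discovery
"""

INSTRUCTION = "Audit {repo} (owned by {owner}) for the v2 migration."


def render_instructions(csv_text: str, template: str) -> list:
    """Fill the template once per CSV row using that row's column values."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [template.format_map(row) for row in rows]


prompts = render_instructions(CSV_TEXT, INSTRUCTION)
for prompt in prompts:
    print(prompt)
```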

Senior scenario: an async PR-review bot over 50 repos

You run platform engineering for an org with 50 services and a shared @company/auth library that just shipped a breaking v2. You need a same-day report: for each repo, is it still on v1, and what exactly has to change? Doing this by hand is a day of grep. codex exec in a bash loop is serial and gives you 50 blobs of prose to read. The right tool is AsyncCodex + asyncio.gather — 50 read-only agents, each scoped to one repo, each returning a typed verdict.

pr_review_bot.py — fan 50 read-only reviewers out with asyncio

import asyncio
import json
from codex import AsyncCodex, Sandbox

# load_repo_names() is your own helper that returns the 50 service names.
REPOS = [f"/srv/checkouts/{name}" for name in load_repo_names()]  # 50 paths
 
SCHEMA = {
    "type": "object",
    "properties": {
        "on_v1": {"type": "boolean"},
        "blocking_changes": {"type": "array", "items": {"type": "string"}},
        "risk": {"type": "string", "enum": ["none", "low", "medium", "high"]},
    },
    "required": ["on_v1", "blocking_changes", "risk"],
    "additionalProperties": False,
}
 
# Bound concurrency so we never exceed the agent thread budget / rate limits.
gate = asyncio.Semaphore(6)  # mirrors agents.max_threads default of 6
 
async def review(codex: AsyncCodex, repo: str) -> dict:
    async with gate:
        thread = await codex.thread_start(model="gpt-5.4", sandbox=Sandbox.read_only)
        result = await thread.run(
            f"Audit {repo} for the @company/auth v2 migration. "
            "Report whether it still uses v1 and the exact blocking changes.",
            output_schema=SCHEMA,
        )
        return {"repo": repo, **json.loads(result.final_response)}
 
async def main() -> None:
    async with AsyncCodex() as codex:
        reports = await asyncio.gather(*(review(codex, r) for r in REPOS))
    blocked = [r for r in reports if r["on_v1"] and r["risk"] in ("medium", "high")]
    print(f"{len(blocked)}/{len(reports)} repos need urgent migration work")
 
asyncio.run(main())

Three things make this production-grade rather than a toy. Sandbox.read_only means no agent can mutate a repo while reviewing it — a reviewer that writes is a bug. The asyncio.Semaphore(6) caps in-flight agents at the same default max_threads ceiling, so you do not open 50 connections at once and get rate-limited into failure; tune it to your account's limits, not your optimism. And output_schema turns every agent's answer into a dict you can filter, sort, and gate a deploy on — blocked is computed, not eyeballed. Swap gather for as_completed if you want results to stream into a dashboard as each repo finishes, and add a job_max_runtime_seconds-style timeout per task so one pathological repo cannot stall the batch [P].
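The as_completed and per-task timeout suggestions can be sketched with the standard library alone. In this minimal example fake_review is a stub that sleeps instead of calling Codex, and the limit and timeout values are placeholders to tune for your own account.

```python
import asyncio
import random


async def fake_review(repo: str) -> dict:
    """Stand-in for one agent run; sleeps briefly instead of calling Codex."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"repo": repo, "on_v1": repo.endswith("1")}


async def bounded(gate: asyncio.Semaphore, repo: str, timeout: float) -> dict:
    """Run one review under the concurrency gate with a per-task timeout."""
    async with gate:
        try:
            return await asyncio.wait_for(fake_review(repo), timeout)
        except asyncio.TimeoutError:
            return {"repo": repo, "error": "timeout"}


async def run_batch(repos, limit: int = 6, timeout: float = 1.0) -> list:
    gate = asyncio.Semaphore(limit)
    tasks = [asyncio.ensure_future(bounded(gate, r, timeout)) for r in repos]
    reports = []
    for finished in asyncio.as_completed(tasks):
        report = await finished
        reports.append(report)  # a dashboard update would go here
    return reports


reports = asyncio.run(run_batch([f"svc-{i}" for i in range(12)]))
print(len(reports), "reports collected")
```

The timeout is applied inside the gate, so time spent waiting for a free slot does not count against a task's budget.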

Fifty read-only reviewers, six at a time, each returning schema-validated JSON. The bot prints a computed verdict, not 50 paragraphs to read.

Designing your own fan-out

Pick the weakest sandbox that works
Read-only for review/audit/Q&A; workspace_write only when agents must edit. Never full_access in a fan-out — one bad prompt multiplies across every worker.
Bound concurrency to your real limits
Start at the max_threads default of 6 and a matching asyncio.Semaphore. Raise it only after watching for rate-limit errors; parallelism past your account ceiling makes throughput worse, not better.
Make every worker return a schema
Give each agent an output_schema / outputSchema so results are dicts, not prose. Compute the final verdict in code — the whole point of going programmatic is that the decision is deterministic.
Cap runtime per worker
Set a per-task timeout (or job_max_runtime_seconds for the built-in CSV job) so one stuck agent cannot hold the whole batch hostage.
Keep nesting shallow
Leave max_depth at 1 unless you have a concrete need for agents that spawn agents. Deeper nesting multiplies cost and makes a runaway much harder to reason about.
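One practical addition when you gather results yourself: by default asyncio.gather raises the first exception to the caller, so the results of the other workers are never collected. A minimal sketch using return_exceptions=True to keep the good results and list the failures; the worker is a stub, not an SDK call, and the repo names are made up.

```python
import asyncio


async def worker(repo: str) -> dict:
    """Stub worker that fails for one repo to simulate a bad run."""
    await asyncio.sleep(0)
    if repo == "svc-broken":
        raise RuntimeError("agent run failed")
    return {"repo": repo, "risk": "low"}


async def collect(repos):
    """Return (successful results, names of repos whose worker raised)."""
    outcomes = await asyncio.gather(
        *(worker(r) for r in repos), return_exceptions=True
    )
    ok = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [r for r, o in zip(repos, outcomes) if isinstance(o, BaseException)]
    return ok, failed


ok, failed = asyncio.run(collect(["svc-a", "svc-broken", "svc-b"]))
print(len(ok), "succeeded; failed:", failed)
```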

Knowledge check

You write a bot that reviews 50 repos with AsyncCodex and asyncio.gather. Codex config has agents.max_threads = 6 and max_depth = 1, and each reviewer is started with sandbox=Sandbox.read_only. You spawn all 50 reviewers at once and they all hit the API simultaneously. What is the most likely problem, and the cleanest fix?

The throughline

Going programmatic is not about replacing codex exec — it is about earning types, parallelism, and control flow when a single string in and out is no longer enough. Drive a thread with startThread() / thread_start, get data back with outputSchema, persist and resumeThread across invocations, and when the work is wide, fan it out with subagents or AsyncCodex under a sane concurrency bound. Pick the weakest sandbox that still works, keep max_depth shallow, and make every worker answer with a schema. Do that and the agent stops being a chat box and becomes a component you can build a system around.
