
Using LangGraph and LangChain to Orchestrate Codex and Claude Code in a Multi-Agent Engineering Workflow
How I used LangGraph and LangChain to coordinate Codex and Claude Code as separate planning, implementation, review, risk, and evidence agents inside the TradeX engineering workflow.
On this page
- The Core Idea
- Why One Coding Assistant Was Not Enough
- The TradeX Architecture as an Orchestration Map
- The Agent Roles I Used
- The State Object Became More Important Than the Prompt
- LangGraph: The Workflow Control Layer
- LangChain: The Agent Contract Layer
- How Codex Fit Into the Workflow
- How Claude Code Fit Into the Workflow
- The Main Loop: Planner -> Codex -> Claude Code -> Fixer -> Validator
- Where the Research Swarm Fits
- Strategy Construction: Where Review Matters Most
- Deployment Decision: The Gate That Should Be Boring
- The Knowledge Layer: Memory as Auditability
- The Technical Pattern I Would Reuse
- What Worked
- What Did Not Work
- The Main LangGraph Learning
- The Main LangChain Learning
- The Main Codex and Claude Code Learning
- Final Reflection
I started the TradeX project with a wrong assumption.
I thought the hard part would be getting an AI coding assistant to write enough code fast enough.
That was not the hard part.
Codex could generate code. Claude Code could inspect the repository and reason about design choices. Both were useful. The problem appeared after the first few implementation cycles: code generation was only one part of the job.
The harder questions were:
Who checks whether the task was scoped correctly?
Who verifies that the implementation stayed inside the architecture?
Who challenges the assumptions behind a strategy or risk rule?
Who checks whether the tests prove behavior or only exercise code paths?
Who decides whether a change should move forward, loop back, or stop?
That is where the project changed direction.
TradeX is an experimental trading bot, but the real subject of this work was not trading. The real subject was multi-agent orchestration: using LangGraph and LangChain to coordinate Codex and Claude Code as specialized engineering agents inside a controlled workflow.
The bot became the testbed because it had enough pressure points to make weak orchestration visible: market data, strategy logic, risk gates, paper trading, monitoring, validation, and feedback loops.
This article is not financial advice. It is not a claim about trading performance. TradeX is a paper-first engineering environment I used to learn how to structure agentic software development with clearer boundaries, review loops, and evidence gates.
The Core Idea
Figure: Multi-Agent Engineering Workflow architecture
The workflow I wanted was not:
Ask AI -> get code -> accept code
That is too loose.
The workflow I wanted was closer to this:
Plan -> implement -> test -> review -> fix -> review again -> validate -> approve or block
Codex and Claude Code were not replacing the engineering process. They were workers inside it.
LangGraph handled the workflow: state, nodes, transitions, conditional routing, loops, and human checkpoints.
LangChain handled the agent layer: model calls, tools, retrieval, prompts, structured outputs, and memory.
The split looked like this:
LangGraph
= orchestration layer
= workflow state, routing, branches, retries, stop conditions
LangChain
= agent composition layer
= tools, prompts, model calls, retrieval, structured outputs
Codex
= implementation agent
= code changes, tests, refactoring, fixes
Claude Code
= review and architecture agent
= critique, domain reasoning, codebase review, safety checks
The important shift was moving from "better prompts" to "better control flow."
Why One Coding Assistant Was Not Enough
A single coding assistant works well when the task is small and local.
For example:
rename a field
add a unit test
fix a UI layout
refactor duplicate code
update an interface
The risk is manageable because the blast radius is small.
TradeX had a different problem. A change could look correct in one file but still break the system behavior.
A few examples:
A risk gate exists, but one promotion path bypasses it.
A backtest runs, but the test does not check for look-ahead assumptions.
A strategy has clean output, but the experiment parameters were not logged.
A monitoring screen shows "healthy," but it is not checking the failure mode that matters.
A reviewer says "tests pass," but the tests only cover the happy path.
The issue was not that Codex or Claude Code were bad. The issue was role mixing.
When one assistant plans, writes, reviews, explains, and recommends the next step, the same blind spot can travel through the full loop.
I needed separation.
One agent should plan.
Another should implement.
Another should challenge the implementation.
Another should check evidence.
Another should decide whether the workflow continues or stops.
LangChain's own multi-agent documentation makes this point in a practical way: multi-agent systems are useful for specialized components and complex workflows, but not every complex task needs multiple agents. It also highlights context management, specialization, parallelization, and sequential constraints as reasons to use multi-agent patterns. (LangChain Docs)
That matched my experience. Multi-agent orchestration was useful only when separation of responsibility actually reduced risk.
The TradeX Architecture as an Orchestration Map
Figure: TradeX multi-agent research loop for adaptive trading
The architecture image for TradeX has seven stages:
1. Data Inputs
2. Orchestrator
3. Research Swarm
4. Synthesis and Ranking
5. Strategy Construction
6. Deployment Decision
7. Paper / Live Stage
Under that sits the knowledge layer:
Research Memory
Strategy Archive
Experiment Log
Market Context History
At first glance, this looks like a trading architecture. I started reading it as an agent orchestration architecture.
The mapping became clear:
| TradeX block | Multi-agent engineering equivalent |
|---|---|
| Data Inputs | Codebase, docs, tests, logs, market data, experiment results |
| Orchestrator | LangGraph workflow controller |
| Research Swarm | Specialized agents for planning, review, risk, evidence, critique |
| Synthesis Engine | Ranking and reducing findings into decisions |
| Strategy Builder | Codex implementation node |
| Backtesting / Stress Test | Test and validation nodes |
| Risk Gate | Safety and approval node |
| Deployment Decision | Approve, revise, block, or escalate |
| Knowledge Layer | Memory, experiment history, previous findings, architecture rules |
That became the central design.
The bot was not just something agents worked on. The bot itself suggested the orchestration pattern.
The Agent Roles I Used
I did not want a swarm of vague agents.
I wanted a small number of roles with clear responsibilities.
Planner Agent
Defines the task, scope, constraints, expected behavior, and acceptance criteria.
Codex Builder
Implements the change, updates tests, refactors code, and fixes narrow findings.
Claude Code Reviewer
Reviews the change against architecture, intent, scope, and risk assumptions.
Risk Reviewer
Checks whether the change touches execution, paper promotion, risk gates, or portfolio assumptions.
Evidence Reviewer
Checks whether the behavior is measurable, reproducible, logged, and validated.
Fixer Agent
Applies only the required fixes from review findings.
Decision Node
Approves, blocks, loops back, or escalates to me.
The key rule was simple:
No agent approves its own work.
The builder builds.
The reviewer challenges.
The evidence agent asks for proof.
The orchestrator routes the workflow.
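One way to make the "no agent approves its own work" rule checkable is to record which agent authored, reviewed, and approved a change, then compare the ids. This is a minimal sketch; `WorkItem`, `violates_separation`, and the agent ids are hypothetical names for illustration, not part of TradeX.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkItem:
    """A change plus the (hypothetical) agent ids that touched it."""
    task_id: str
    authored_by: str
    reviewed_by: str
    approved_by: str


def violates_separation(item: WorkItem) -> List[str]:
    """Return human-readable violations of the no-self-approval rule."""
    problems = []
    if item.reviewed_by == item.authored_by:
        problems.append("author reviewed its own change")
    if item.approved_by == item.authored_by:
        problems.append("author approved its own change")
    return problems


ok = WorkItem("T-1", "codex_builder", "claude_reviewer", "decision_node")
bad = WorkItem("T-2", "codex_builder", "codex_builder", "decision_node")
print(violates_separation(ok))
print(violates_separation(bad))
```

A check like this is cheap enough to run at the decision node before any approval is recorded.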
The State Object Became More Important Than the Prompt
Before this project, I thought mostly in prompts.
After working with LangGraph-style orchestration, I started thinking in state.
The state object became the contract between agents.
A simplified version looked like this:
from typing import List, Literal, Optional, TypedDict


class Finding(TypedDict):
    """One structured review finding."""
    severity: Literal["critical", "major", "minor"]
    area: str
    finding: str
    required_fix: str


class AgentState(TypedDict):
    """Shared task record passed between workflow nodes."""
    task_id: str
    goal: str
    scope: List[str]
    out_of_scope: List[str]
    changed_files: List[str]
    test_status: Literal["not_run", "passed", "failed"]
    findings: List[Finding]
    risk_status: Literal["unknown", "passed", "blocked"]
    evidence_status: Literal["unknown", "passed", "insufficient"]
    decision: Literal["continue", "fix", "validate", "block", "approve"]
    human_approval_required: bool
    notes: Optional[str]
This changed the workflow.
Instead of asking:
What should I ask the model next?
I started asking:
What is the current state?
Which node owns the next step?
What output schema should this node produce?
What condition routes the workflow forward?
What condition sends it back?
What condition blocks the change?
That is the point where LangGraph became useful.
LangChain's agent documentation describes agents as systems that combine language models with tools, run iteratively toward a goal, and stop when they reach a final output or iteration limit. It also notes that create_agent uses a graph-based runtime with nodes and edges. (LangChain Docs)
For TradeX, I needed that graph thinking, but with my own workflow nodes and decision rules.
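To illustrate what "a node owns the next step" can mean in practice, here is a framework-free sketch in which a node reads the shared state and returns only the fields it is responsible for, and the caller merges that partial update. LangGraph nodes follow a similar read-state, return-update shape, but `MiniState`, `run_tests_node`, and `apply_update` are hypothetical names, and the test logic is a stand-in rather than a real test run.

```python
from typing import List, Literal, TypedDict


class MiniState(TypedDict, total=False):
    """Trimmed-down stand-in for the full state object."""
    task_id: str
    changed_files: List[str]
    test_status: Literal["not_run", "passed", "failed"]


def run_tests_node(state: MiniState) -> MiniState:
    """Hypothetical node: reads state, returns only the field it owns."""
    # Stand-in for a real test run: report "passed" only if files changed.
    status = "passed" if state.get("changed_files") else "not_run"
    return {"test_status": status}


def apply_update(state: MiniState, update: MiniState) -> MiniState:
    """Merge a node's partial update into a new state dict."""
    return {**state, **update}


state: MiniState = {
    "task_id": "T-1",
    "changed_files": ["gate.py"],
    "test_status": "not_run",
}
state = apply_update(state, run_tests_node(state))
print(state["test_status"])
```

Keeping each node's output limited to the fields it owns makes it easier to see which node changed what when a run is inspected later.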
LangGraph: The Workflow Control Layer
LangGraph was the right mental model because my workflow was not linear.
A linear workflow would be:
plan -> code -> review -> done
The real workflow was:
plan
|
implement
|
run checks
|
review
|
critical findings?
|-- yes -> fix -> review again
|-- no -> evidence review
          |
          evidence sufficient?
          |-- no -> block
          |-- yes -> approve or request human approval
That requires branching, loops, and stop conditions.
In LangGraph terms, the design was:
nodes
= planner, builder, test runner, reviewer, fixer, evidence reviewer, decision node
edges
= allowed transitions between nodes
conditional edges
= routing based on state
state
= shared task record
checkpoints
= persisted workflow progress
interrupts
= human approval points
LangGraph is positioned around state, memory, and human-in-the-loop workflows for agents. (LangChain) LangChain's memory docs also describe LangGraph short-term memory as part of agent state, persisted through a checkpointer so a thread can be resumed. (LangChain Docs)
That matters because engineering work is not always one clean run. A review may block the workflow. I may need to inspect the state. A fix may need another review. A promotion decision may require human approval.
A simplified routing function looked like this:
def route_after_review(state: AgentState) -> str:
    """Pick the next node once the reviewer has written its findings."""
    critical_findings = [
        finding for finding in state["findings"]
        if finding["severity"] == "critical"
    ]
    # Critical findings always go back to the fixer first.
    if critical_findings:
        return "fix_findings"
    if state["test_status"] != "passed":
        return "run_tests"
    return "evidence_review"


def route_after_evidence(state: AgentState) -> str:
    """Block on weak evidence or risk; otherwise approve or escalate."""
    if state["evidence_status"] == "insufficient":
        return "block"
    if state["risk_status"] == "blocked":
        return "block"
    if state["human_approval_required"]:
        return "human_review"
    return "approve"
This looks simple, but that simplicity is the value.
The complexity should be in the review and evidence collection, not in a mysterious approval step.
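The sketch below shows how routing functions like these can drive a review loop. It is a tiny framework-free runner, not LangGraph itself: `run_graph`, the stub nodes, and the `fixes_applied` field are hypothetical, and the reviewer is faked so that it reports one critical finding until a fix has been applied.

```python
from typing import Callable, Dict, List, Set, Tuple

State = dict
Node = Callable[[State], State]
Router = Callable[[State], str]


def run_graph(state: State, nodes: Dict[str, Node], routers: Dict[str, Router],
              start: str, terminal: Set[str],
              max_steps: int = 20) -> Tuple[State, List[str]]:
    """Run nodes and follow routers until a terminal name is reached."""
    current, trace = start, []
    for _ in range(max_steps):
        trace.append(current)
        if current in terminal:
            return state, trace
        state = {**state, **nodes[current](state)}  # merge partial update
        current = routers[current](state)           # conditional edge
    raise RuntimeError("step budget exhausted before reaching a terminal node")


def review(state: State) -> State:
    # Fake reviewer: one critical finding until a fix has been applied.
    findings = [] if state["fixes_applied"] > 0 else [{"severity": "critical"}]
    return {"findings": findings}


def fix_findings(state: State) -> State:
    return {"fixes_applied": state["fixes_applied"] + 1}


def evidence_review(state: State) -> State:
    return {"evidence_status": "passed"}


def route_after_review(state: State) -> str:
    critical = any(f["severity"] == "critical" for f in state["findings"])
    return "fix_findings" if critical else "evidence_review"


nodes = {"review": review, "fix_findings": fix_findings,
         "evidence_review": evidence_review}
routers = {
    "review": route_after_review,
    "fix_findings": lambda s: "review",
    "evidence_review": lambda s: "approve" if s["evidence_status"] == "passed" else "block",
}
initial = {"fixes_applied": 0, "findings": [], "evidence_status": "unknown"}
final, trace = run_graph(initial, nodes, routers, "review", {"approve", "block"})
print(" -> ".join(trace))
```

The `max_steps` budget is an added safeguard so that a routing mistake fails loudly instead of looping forever.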
LangChain: The Agent Contract Layer
LangChain helped around each node.
For me, the useful parts were:
prompt templates
tool access
retrieval
structured output
agent-specific context
memory integration
The most important one was structured output.
A reviewer should not return this:
The implementation looks mostly good, but there are some concerns.
That is hard to route.
I needed reviewers to return predictable data:
{
  "status": "blocked",
  "critical_findings": [
    {
      "area": "promotion_gate",
      "finding": "Manual paper-trading promotion can bypass the risk gate.",
      "required_fix": "Route all promotion paths through the same pre-promotion guard."
    }
  ],
  "minor_findings": [
    {
      "area": "logging",
      "finding": "experiment_id is missing from one warning log path."
    }
  ],
  "next_action": "fix_findings"
}
LangChain structured output is built for this kind of problem: agents can return data in a predictable format, such as JSON objects, Pydantic models, or dataclasses, instead of forcing the application to parse natural language. (LangChain Docs)
That one feature changes the workflow.
Once review output is structured, the orchestrator can route it.
def has_blocking_findings(review: dict) -> bool:
    # A blocked status or any critical finding stops forward progress.
    return review["status"] == "blocked" or len(review["critical_findings"]) > 0
The workflow stops depending on a model's tone.
It starts depending on explicit fields.
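Explicit fields only help if malformed output is rejected before routing. The following is a standard-library sketch of that validation step; in a LangChain setup a Pydantic model would likely play this role instead. `Review`, `parse_review`, and the set of allowed statuses are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Assumption: a reviewer may only report one of these two statuses.
ALLOWED_STATUS = {"approved", "blocked"}
REQUIRED_FIELDS = {"status", "critical_findings", "next_action"}


@dataclass
class Review:
    status: str
    next_action: str
    critical_findings: List[dict] = field(default_factory=list)
    minor_findings: List[dict] = field(default_factory=list)


def parse_review(raw: str) -> Review:
    """Parse reviewer output, raising ValueError on anything unroutable."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("review must be a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"review is missing fields: {sorted(missing)}")
    if data["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {data['status']!r}")
    for item in data["critical_findings"]:
        if "required_fix" not in item:
            raise ValueError("critical finding has no required_fix")
    return Review(
        status=data["status"],
        next_action=data["next_action"],
        critical_findings=data["critical_findings"],
        minor_findings=data.get("minor_findings", []),
    )


raw = json.dumps({
    "status": "blocked",
    "critical_findings": [{
        "area": "promotion_gate",
        "finding": "Manual paper-trading promotion can bypass the risk gate.",
        "required_fix": "Route all promotion paths through the same pre-promotion guard.",
    }],
    "next_action": "fix_findings",
})
review = parse_review(raw)
print(review.status, len(review.critical_findings))
```

Free-text output such as "The implementation looks mostly good" fails at `json.loads`, so it never reaches the router.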
How Codex Fit Into the Workflow
Codex was the implementation worker.
I used it for tasks where the desired change could be described with enough precision:
implement this gate
update this interface
add these tests
refactor this duplicated logic
fix these review findings
wire this state into the UI
The prompt quality mattered, but not in the generic "write a better prompt" sense. Codex worked best when the planner had already produced a narrow engineering task.
A weak task was:
Improve the paper trading flow.
A usable task was:
Implement a pre-promotion gate for paper trading.
Requirements:
- All paper-trading promotion paths must call the same guard.
- Promotion must be blocked if the latest validation result is missing.
- Promotion must be blocked if robustness checks failed.
- Promotion must write an audit event containing experiment_id, strategy_id, timestamp, decision, and blocking reason.
- Do not change strategy scoring.
- Do not change signal generation.
- Add tests for approved, blocked, and missing-evidence paths.
Codex was fast when the boundary was clear.
It was less reliable when the task mixed product intent, architecture judgment, risk policy, and implementation details into one prompt.
So I stopped giving it mixed tasks.
The planner created the work packet.
Codex implemented the work packet.
Claude Code reviewed the result.
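A work packet can be represented as data so that a vague task is rejected before it reaches the builder. This is a hedged sketch: `WorkPacket`, its fields, and the "at least one requirement and one required test" rule are assumptions, not the planner format used in TradeX.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkPacket:
    """Hypothetical planner output handed to the implementation agent."""
    title: str
    requirements: List[str]
    out_of_scope: List[str] = field(default_factory=list)
    required_tests: List[str] = field(default_factory=list)

    def is_actionable(self) -> bool:
        # Assumed rule: needs at least one requirement and one required test.
        return bool(self.requirements) and bool(self.required_tests)

    def render(self) -> str:
        """Render the packet as a plain-text task for the builder."""
        lines = [self.title, "Requirements:"]
        lines += [f"- {req}" for req in self.requirements]
        lines += [f"- Do not change {item}." for item in self.out_of_scope]
        if self.required_tests:
            paths = ", ".join(self.required_tests)
            lines.append(f"- Add tests for {paths} paths.")
        return "\n".join(lines)


vague = WorkPacket("Improve the paper trading flow.", requirements=[])
packet = WorkPacket(
    title="Implement a pre-promotion gate for paper trading.",
    requirements=["All paper-trading promotion paths must call the same guard."],
    out_of_scope=["strategy scoring", "signal generation"],
    required_tests=["approved", "blocked", "missing-evidence"],
)
print(vague.is_actionable(), packet.is_actionable())
print(packet.render())
```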
How Claude Code Fit Into the Workflow
Claude Code was more useful as a domain-aware reviewer than as another generic builder.
I used it to challenge the implementation against the system design.
A typical review instruction looked like this:
Review this change as an architecture and risk critic.
Focus only on:
- scope drift
- hidden execution path changes
- missing risk gates
- weak validation evidence
- mismatch between documentation and implementation
- tests that pass but do not prove the intended behavior
Do not rewrite the implementation.
Do not suggest cosmetic improvements.
Return critical findings first.
If there are no critical findings, explain what evidence supports approval.
That framing mattered.
The LangChain blog index lists "How to turn Claude Code into a domain specific coding agent" as a LangChain article under agent architecture, which matched the direction I wanted: Claude Code should not only see code; it should operate with domain-specific rules, repository context, and task constraints. (LangChain)
The better Claude Code understood the domain constraints, the better the review became.
For TradeX, those constraints included:
paper trading promotion must be gated
risk logic must not be bypassed
strategy scoring changes require evidence
experiment results must be reproducible
logs must include experiment identity
monitoring must check behavior, not only process health
Claude Code was strongest when I asked it to be narrow and strict.
Not:
Review this PR.
But:
Review whether this PR can promote a strategy to paper trading without validated evidence or risk approval.
That second prompt produces a different kind of review.
The Main Loop: Planner -> Codex -> Claude Code -> Fixer -> Validator
The workflow I kept returning to was this:
Planner
|
Codex implementation
|
Automated checks
|
Claude Code adversarial review
|
Fix findings
|
Review again
|
Evidence validation
|
Approve / block / escalate
The key is that the reviewer is adversarial.
Not rude. Not theatrical. Just strict.
The reviewer's job is not to say the change is impressive. The reviewer's job is to find the reason it should not be accepted yet.
Example reviewer finding:
{
  "severity": "critical",
  "area": "risk_gate",
  "finding": "The new promotion endpoint checks validation status, but the scheduled promotion job still calls promoteStrategy() directly.",
  "required_fix": "Move the guard into promoteStrategy() itself or enforce a shared promotion service used by all callers."
}
That is the kind of finding that matters.
It does not complain about style.
It identifies a bypass path.
Then Codex receives a narrow fix task:
Fix only this finding:
Move the pre-promotion guard into the shared promotion service so both manual and scheduled promotion paths use it.
Do not change strategy scoring.
Do not change validation criteria.
Add regression coverage for both callers.
Then Claude Code reviews again.
The loop continues until one of these conditions is true:
No critical findings remain.
The same finding repeats and needs human intervention.
Tests fail and cannot be fixed within scope.
Evidence is insufficient.
The change expanded beyond the approved scope.
A human approval point is reached.
Without those stop conditions, an agent loop can waste time and produce noise.
With stop conditions, the loop becomes usable.
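Two of these stop conditions, "no critical findings remain" and "the same finding repeats", can be checked mechanically from the history of review rounds. The sketch below assumes each critical finding is reduced to a stable string key such as `area:summary`; `stop_reason` and the `max_rounds` budget are hypothetical additions, not part of the list above.

```python
from typing import List, Optional, Set


def stop_reason(rounds: List[Set[str]], max_rounds: int = 4) -> Optional[str]:
    """Decide whether the fix/review loop should stop.

    rounds[i] holds the keys of the critical findings from review round i.
    Returns a reason string, or None if the loop may continue.
    """
    if not rounds:
        return None
    latest = rounds[-1]
    if not latest:
        return "no_critical_findings"
    # The same finding surviving a fix attempt suggests the loop is stuck.
    if len(rounds) >= 2 and latest & rounds[-2]:
        return "repeated_finding_needs_human"
    # Assumed safety budget so the loop cannot run indefinitely.
    if len(rounds) >= max_rounds:
        return "round_budget_exhausted"
    return None


history = [{"risk_gate:scheduled_job_bypass"}]
print(stop_reason(history))
history.append({"risk_gate:scheduled_job_bypass"})
print(stop_reason(history))
```

The remaining conditions (scope expansion, insufficient evidence, human approval points) depend on other state fields and would need their own checks.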
Where the Research Swarm Fits
In the architecture image, the research swarm includes:
Hypothesis Agent
Market Regime Agent
Signal Discovery Agent
Risk Research Agent
Event Interpretation Agent
Strategy Critic Agent
I did not use this as "agents magically discover profitable strategies."
That is the wrong framing.
I used it as a way to separate research questions.
The hypothesis agent asks:
What is the idea being tested?
What behavior should appear if the idea is valid?
What would falsify it?
The market regime agent asks:
Does this behavior depend on volatility, trend, liquidity, or macro conditions?
Does the result disappear in another regime?
The signal discovery agent asks:
Is the signal measurable?
Was the signal available at decision time?
Is the signal stable enough to test?
The risk research agent asks:
What can go wrong?
Does this increase concentration?
Does it behave badly during drawdown periods?
The event interpretation agent asks:
Was the move caused by a scheduled event?
Should this sample be treated differently?
The strategy critic asks:
Why should we reject this?
What assumption is weakest?
What evidence is missing?
The synthesis engine then reduces the output.
That reduction step is important. Without synthesis, "multi-agent" becomes "many agents producing many opinions."
The synthesis node should produce something like this:
{
  "candidate": "mean_reversion_intraday_v2",
  "decision": "reject",
  "reasons": [
    "Signal requires confirmation data not available at decision time.",
    "Stress window coverage is incomplete.",
    "Risk impact during high-volatility periods is not measured."
  ],
  "next_action": "archive_candidate_with_rejection_reason"
}
The goal is not more text.
The goal is a decision with evidence.
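A synthesis step can be as plain as collecting the blocking reasons from each research agent and emitting one decision. This sketch assumes every agent report carries a `blocking` flag and a `reason`; the `AgentReport` shape, the `advance` decision, and the `send_to_strategy_builder` action are hypothetical names.

```python
from typing import Dict, List, TypedDict


class AgentReport(TypedDict):
    """Assumed minimal output of one research agent."""
    agent: str
    blocking: bool
    reason: str


def synthesize(candidate: str, reports: List[AgentReport]) -> Dict[str, object]:
    """Reduce many agent reports to a single decision with its reasons."""
    reasons = [r["reason"] for r in reports if r["blocking"]]
    if reasons:
        return {
            "candidate": candidate,
            "decision": "reject",
            "reasons": reasons,
            "next_action": "archive_candidate_with_rejection_reason",
        }
    return {
        "candidate": candidate,
        "decision": "advance",
        "reasons": [],
        "next_action": "send_to_strategy_builder",
    }


reports: List[AgentReport] = [
    {"agent": "signal_discovery", "blocking": True,
     "reason": "Signal requires confirmation data not available at decision time."},
    {"agent": "market_regime", "blocking": False,
     "reason": "Behavior is consistent across tested regimes."},
    {"agent": "risk_research", "blocking": True,
     "reason": "Risk impact during high-volatility periods is not measured."},
]
result = synthesize("mean_reversion_intraday_v2", reports)
print(result["decision"], len(result["reasons"]))
```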
Strategy Construction: Where Review Matters Most
The strategy construction path in TradeX has four main blocks:
Strategy Builder
Backtesting Engine
Robustness / Stress Test
Risk Gate
This is the part of the system where I wanted the strictest control.
A strategy candidate should not move forward because the model says it looks reasonable.
It should move forward only when the workflow can answer concrete questions:
What data was used?
Was the data available at the time of decision?
Were transaction costs modeled?
Was slippage considered?
Which time windows were tested?
Which stress periods were tested?
What breaks the strategy?
What risk gate blocks it?
What experiment ID records the result?
Can another run reproduce the same result?
A weak system asks:
Did the backtest pass?
A stronger system asks:
What exactly did the backtest prove?
That distinction matters.
The evidence reviewer's job was to prevent this kind of false confidence:
The test passed, but it only checked that the function returned a result.
It did not check whether the result was based on valid historical data.
That is the type of issue a coding assistant can easily miss if it is only asked to make tests pass.
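The difference between a test that exercises code and a test that proves behavior can be shown with a look-ahead check: if changing prices after the decision time changes the signal, the signal is using data it could not have had. The signal functions below are toy examples written for this illustration, not TradeX strategies.

```python
from typing import Callable, List


def signal_at(prices: List[float], t: int, lookback: int = 3) -> float:
    """Toy signal: price at t minus the mean of the previous `lookback` prices."""
    window = prices[max(0, t - lookback):t]
    if not window:
        return 0.0
    return prices[t] - sum(window) / len(window)


def leaky_signal_at(prices: List[float], t: int, lookback: int = 3) -> float:
    """Deliberately broken variant: its window peeks one bar into the future."""
    window = prices[max(0, t - lookback + 1):t + 2]
    return prices[t] - sum(window) / len(window)


def uses_future_data(signal_fn: Callable[[List[float], int], float],
                     prices: List[float], t: int) -> bool:
    """True if altering prices after index t changes the signal at t."""
    baseline = signal_fn(prices, t)
    tampered = prices[:t + 1] + [p * 10 + 1 for p in prices[t + 1:]]
    return signal_fn(tampered, t) != baseline


prices = [100.0, 101.0, 99.0, 102.0, 103.0, 98.0]

# Weak check: passes for both versions, proves nothing about data validity.
weak_ok = isinstance(leaky_signal_at(prices, 3), float)

# Stronger check: only the version without look-ahead survives.
print(weak_ok)
print(uses_future_data(signal_at, prices, 3))
print(uses_future_data(leaky_signal_at, prices, 3))
```

The weak check accepts the leaky signal; the tampering check is the one that distinguishes the two implementations.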
Deployment Decision: The Gate That Should Be Boring
The deployment decision should not be clever.
It should be strict and understandable.
For TradeX, even paper trading promotion should require a clear gate:
tests passed
validation result exists
robustness checks passed or limitations are documented
risk gate passed
experiment ID exists
audit event written
monitoring configured
critical review findings closed
The decision function can be simple:
def decide_promotion(state: AgentState) -> str:
    """Return the promotion decision; any failed gate sends the change back."""
    if state["test_status"] != "passed":
        return "reject_revise"
    if state["risk_status"] != "passed":
        return "reject_revise"
    if state["evidence_status"] != "passed":
        return "reject_revise"
    critical = [
        finding for finding in state["findings"]
        if finding["severity"] == "critical"
    ]
    if critical:
        return "reject_revise"
    # All automated gates passed; a human may still need to sign off.
    if state["human_approval_required"]:
        return "human_review"
    return "promote_to_paper"
This is intentionally boring.
Approval logic should be easy to inspect.
If the workflow is hard to understand, I do not trust it for risk-sensitive changes.
The Knowledge Layer: Memory as Auditability
The bottom layer of the architecture contains:
Research Memory
Strategy Archive
Experiment Log
Market Context History
This was not optional.
Without memory, every agent run starts from zero.
That causes practical problems:
A rejected strategy appears again under a new name.
A known data issue is forgotten.
A previous review finding is reintroduced.
A stress-period failure is treated as new information.
A backtest result cannot be reproduced.
A decision is made without knowing why the previous decision was blocked.
For TradeX, memory was less about personalization and more about auditability.
A useful experiment log needs fields like:
{
  "experiment_id": "exp_2026_04_29_014",
  "strategy_id": "mean_reversion_intraday_v2",
  "code_version": "git_sha_here",
  "dataset_version": "market_data_snapshot_id",
  "parameters": {
    "lookback_window": 20,
    "entry_threshold": 1.5,
    "max_position_size": 0.05
  },
  "cost_model": "paper_cost_model_v1",
  "validation_status": "failed",
  "failure_reason": "stress window missing",
  "review_findings": [
    "Signal availability not proven at decision time"
  ],
  "decision": "reject"
}
This gives future agents something real to retrieve.
The next time a similar candidate appears, the system should not rediscover the same flaw from scratch.
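A small lookup over the experiment log is enough to surface earlier rejections before a similar candidate is evaluated again. This sketch assumes strategy ids end in a `_v<number>` suffix and that candidates sharing the prefix belong to one family; that convention, along with `strategy_family` and `prior_rejections`, is an assumption for illustration.

```python
import re
from typing import Dict, List


def strategy_family(strategy_id: str) -> str:
    """Strip an assumed trailing version suffix such as `_v2`."""
    return re.sub(r"_v\d+$", "", strategy_id)


def prior_rejections(log: List[Dict], strategy_id: str) -> List[Dict]:
    """Return earlier rejected experiments from the same strategy family."""
    family = strategy_family(strategy_id)
    return [
        entry for entry in log
        if entry["decision"] == "reject"
        and strategy_family(entry["strategy_id"]) == family
    ]


experiment_log = [
    {
        "experiment_id": "exp_2026_04_29_014",
        "strategy_id": "mean_reversion_intraday_v2",
        "failure_reason": "stress window missing",
        "decision": "reject",
    },
    {
        "experiment_id": "exp_example_other",  # hypothetical second entry
        "strategy_id": "momentum_breakout_v1",
        "failure_reason": "costs not modeled",
        "decision": "reject",
    },
]

for entry in prior_rejections(experiment_log, "mean_reversion_intraday_v3"):
    print(entry["experiment_id"], "->", entry["failure_reason"])
```

Matching on a naming convention is crude; a renamed strategy would slip through, so a real lookup would likely compare parameters or signals as well.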
The Technical Pattern I Would Reuse
The reusable pattern is not specific to trading.
It is this:
1. Capture task intent.
2. Convert intent into a scoped plan.
3. Route implementation to a coding agent.
4. Run deterministic checks.
5. Send the diff to an adversarial reviewer.
6. Convert review output into structured findings.
7. Fix only the findings.
8. Review again with fresh context.
9. Validate behavior with evidence.
10. Approve, block, or escalate.
In graph form:
START
|
create_plan
|
implement_with_codex
|
run_tests
|
review_with_claude_code
|
has_critical_findings?
|-- yes -> fix_findings -> run_tests -> review_with_claude_code
|-- no -> evidence_review
          |
          evidence_sufficient?
          |-- no -> block
          |-- yes -> risk_review
                     |
                     risk_passed?
                     |-- no -> block
                     |-- yes -> approve_or_human_review
That is the part I would reuse in any serious agentic software workflow.
The domain could be trading, mapping, data pipelines, internal tools, or testing infrastructure.
The workflow discipline stays the same.
What Worked
The biggest improvement came from making each agent narrower.
Codex did better when it received a scoped implementation task.
Claude Code did better when it received a strict review role.
The evidence reviewer did better when it did not have to comment on architecture.
The risk reviewer did better when it only checked risk-sensitive behavior.
The orchestrator did better when every node returned structured state.
The second improvement came from forcing review loops.
A change was not accepted just because code existed.
It had to survive:
tests
architecture review
risk review
evidence review
promotion decision
The third improvement came from making stop conditions explicit.
Agent loops are only useful when they know when to stop.
What Did Not Work
The workflow had problems.
Too many agents created noise.
Some reviews were generic until I made the reviewer scope narrower.
Some fixes changed more than they should have.
Some agents produced long reasoning but weak decisions.
Some structured schemas were too loose and still required interpretation.
Some tasks did not need orchestration at all.
That last point matters.
A full multi-agent workflow is not free. It costs time, tokens, and attention.
I ended up using different paths depending on risk:
Small UI copy change
-> Codex + light review
Normal refactor
-> planner + Codex + tests + review
Strategy logic change
-> planner + Codex + tests + Claude review + evidence review
Risk, execution, or paper-promotion change
-> full graph + risk review + human approval
This was more practical than forcing every task through the same machinery.
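The risk-based paths above can be written as a lookup so that the choice of pipeline is explicit rather than ad hoc. In this sketch the change-type labels, the directory prefixes, and the rule that unknown change types fall back to the strictest path are all assumptions.

```python
from typing import Dict, List

PIPELINES: Dict[str, List[str]] = {
    "ui_copy": ["codex", "light_review"],
    "refactor": ["planner", "codex", "tests", "review"],
    "strategy_logic": ["planner", "codex", "tests", "claude_review", "evidence_review"],
    "risk_sensitive": ["full_graph", "risk_review", "human_approval"],
}

# Hypothetical repository layout: paths under these prefixes are risk-sensitive.
RISK_SENSITIVE_PREFIXES = ("risk/", "execution/", "promotion/")


def select_pipeline(change_type: str, changed_files: List[str]) -> List[str]:
    """Pick the review path; touched risk-sensitive files override the label."""
    if any(path.startswith(RISK_SENSITIVE_PREFIXES) for path in changed_files):
        return PIPELINES["risk_sensitive"]
    # Assumed default: unknown change types take the strictest path.
    return PIPELINES.get(change_type, PIPELINES["risk_sensitive"])


print(select_pipeline("ui_copy", ["ui/labels.ts"]))
print(select_pipeline("refactor", ["promotion/service.py"]))
```

Letting the touched files override the declared change type guards against a risk-sensitive change being mislabeled as a refactor.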
The Main LangGraph Learning
The main LangGraph learning was this:
State first.
Agents second.
Routing third.
Model choice fourth.
The graph should not exist because "multi-agent" sounds advanced.
The graph should exist because the workflow has real decisions.
approve
revise
retry
block
escalate
archive
promote
If there are no real decisions, a simple chain may be enough.
If there are loops, risk gates, review stages, and human approval points, a graph becomes useful.
The Main LangChain Learning
The main LangChain learning was that agent behavior improves when the contract around the agent is clear.
For each agent, I needed to define:
What context does it get?
What tools can it use?
What should it ignore?
What output schema must it return?
What decision is it allowed to make?
What decision is it not allowed to make?
This is where LangChain concepts such as tools, retrieval, structured outputs, and agent-specific context became useful.
The goal was not to make every agent more powerful.
The goal was to make each agent more constrained.
That sounds counterintuitive, but it worked better.
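A contract like this can be enforced in code by rejecting any output whose decision falls outside what the agent is allowed to decide. The sketch below is one possible shape; `AgentContract`, `enforce`, and the specific tool and decision names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentContract:
    """Hypothetical per-agent contract: what it may use and what it may decide."""
    name: str
    allowed_tools: FrozenSet[str]
    allowed_decisions: FrozenSet[str]


def enforce(contract: AgentContract, output: dict) -> dict:
    """Pass the output through only if its decision is within the contract."""
    decision = output.get("next_action")
    if decision not in contract.allowed_decisions:
        raise PermissionError(f"{contract.name} may not decide {decision!r}")
    return output


reviewer = AgentContract(
    name="claude_code_reviewer",
    allowed_tools=frozenset({"read_diff", "read_repo"}),
    allowed_decisions=frozenset({"fix_findings", "evidence_review"}),
)

accepted = enforce(reviewer, {"status": "blocked", "next_action": "fix_findings"})
print(accepted["next_action"])

try:
    enforce(reviewer, {"status": "approved", "next_action": "promote_to_paper"})
except PermissionError as exc:
    print(exc)
```

Here the reviewer can send work back or forward to evidence review, but a promotion decision from it is refused outright.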
The Main Codex and Claude Code Learning
Codex and Claude Code worked best when I stopped treating them as the same kind of assistant.
My practical split was:
Codex
= implementation, refactoring, tests, narrow fixes
Claude Code
= architecture review, domain critique, adversarial analysis, safety review
The loop was:
Codex builds.
Claude Code challenges.
Codex fixes.
Claude Code reviews again.
LangGraph decides the next route.
LangChain keeps the agent outputs structured.
That is the workflow that made sense for me.
Final Reflection
This project did not teach me that AI agents remove the need for engineering judgment.
It taught me the opposite.
The more agentic the workflow becomes, the more important the engineering process becomes.
You need clearer scopes.
You need stricter review boundaries.
You need structured state.
You need evidence.
You need stop conditions.
You need to know when the system should block itself.
TradeX was useful because it exposed weak assumptions quickly. A trading bot has enough moving parts to make shallow automation look bad. That made it a good testbed for learning how to orchestrate Codex and Claude Code with LangGraph and LangChain.
The final result was not "a smarter bot."
The useful result was a better engineering loop:
clearer planning
faster implementation
sharper review
narrower fixes
stronger validation
more explicit promotion decisions
That is the real value I took from the project.
Not more agents.
Better orchestration.
Hussam Ahmed
Building large-scale systems by day, exploring the universe by night.