AI systems | Apr 23, 2026 | 11 min read

This Loop Keeps Finding Bugs My AI Swears Don’t Exist

Fresh sessions turned out to be a better reviewer than another round in the same chat. A small planner-reviewer-fixer loop cut my AI design blind spots faster than better prompting ever did.



Hussam Ahmed

I was three tabs deep in OpenAI Codex, arguing with my own plan.

Tab one said the design was clean. Tab two said it had five obvious gaps. Tab three pointed out three more issues that neither of the first two mentioned. Same model, same prompt skeleton, same laptop. Different answers.

At that point the plan was not the only thing with bugs. My workflow had a few too.

I kept doing the same ritual on medium-sized system tasks. I would ask for a clean design for something like a minimal orchestrator with planner -> reviewer -> implementer. Codex would return a tidy plan with numbered steps, maybe 20 to 30 lines, nothing alarming, exactly the kind of answer that makes you sit back and think, yes, this looks annoyingly competent.

Then I would open a fresh session and paste only the plan. No history. No explanation. No polite preamble. Just the plan and one instruction: attack it.

That second pass would come back with very concrete issues:

missing stop condition after iteration 3
no handling for token growth beyond 8k context
unclear separation between planner and fixer roles
no crash recovery if the process dies mid-loop

None of those showed up in the first answer.

Then I would fix the plan, open another fresh session, and run the same adversarial review again. New issues:

reviewer bias increases after seeing prior findings
implementation step is too vague for reproducibility
no test strategy for empty reviewer output
retry logic is undefined when the API times out

By iteration 3, the plan barely resembled iteration 1. It was sharper, more boring, and much more likely to survive contact with reality.

The annoying part was the workflow. I was doing manual copy-paste across tabs like it was 2005, except this time the office printer was replaced by four glowing chat windows and my own impatience.

What was actually going wrong

The plan was not the whole problem. The thread was.

A long chat session quietly pushes the model toward continuity. It keeps inheriting its own framing, its own assumptions, and its own blind spots. Even when you ask for a review, the model often behaves like an editor of its previous answer, not a hostile outsider trying to break it.

That creates three predictable failure modes:

anchoring, where the first design becomes the default truth
reviewer contamination, where later passes inherit the tone and structure of the earlier answer
context inflation, where every iteration spends tokens preserving history instead of challenging it

Once I started looking at it that way, the fix was embarrassingly simple.

The small idea that fixed it

Stop pretending a chat thread is a system.

Each AI call should behave like opening a brand new window, pasting only what matters, and closing it immediately. No shared memory, no invisible baggage, no context that survived simply because the model sounded confident last turn.

That is what gives you the fresh perspective.

Three roles. Each one isolated.

planner sees only the problem
reviewer sees only the current plan
fixer sees only the plan plus findings

Nothing else leaks in.
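A minimal sketch of that isolation, assuming each role is just a function that rebuilds its prompt from the inputs it is handed. The names make_role and fake_model are hypothetical and only here for illustration; fake_model stands in for a real API call and records what each role actually saw.

```python
from typing import Callable

ModelFn = Callable[[str], str]


def make_role(instruction: str, model: ModelFn) -> Callable[..., str]:
    """Build a role that sees only the keyword inputs passed to it."""

    def role(**inputs: str) -> str:
        # The prompt is rebuilt from scratch on every call. There is no
        # message history, so nothing from an earlier role can ride along.
        sections = [f"{name}:\n{text}" for name, text in inputs.items()]
        return model(instruction + "\n\n" + "\n\n".join(sections))

    return role


# Hypothetical stand-in for a real API call; it only records prompts.
seen_prompts: list[str] = []


def fake_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"response {len(seen_prompts)}"


planner = make_role("You are a planner.", fake_model)
reviewer = make_role("You are an adversarial reviewer.", fake_model)
fixer = make_role("You are fixing a plan.", fake_model)

plan = planner(problem="Build a stateless orchestrator.")
findings = reviewer(plan=plan)
plan = fixer(plan=plan, findings=findings)

# The reviewer never saw the original problem statement, only the plan.
print("Build a stateless orchestrator." in seen_prompts[1])  # prints False
```

The point of the design is that leaking context requires an explicit argument. If the reviewer is ever going to see the problem statement, someone has to pass it in on purpose.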

Why the loop works

The quality jump does not come from a magically wiser model on iteration 3. It comes from reducing correlated error.

If the same context produces the plan and the review, you get one mind grading its own homework. If each pass starts fresh, you force the model to rebuild its reasoning from a narrower input. That resets assumptions, exposes weak boundaries, and makes missing steps more visible.

It also makes the workflow easier to observe. Each step has one job. Each output can be saved. Each failure has a smaller blast radius.

That matters more than the cleverness of the prompt.
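One way to see why correlation matters is a toy model, not a measurement. Assume a single pass misses a given issue with probability p, and that two passes miss it with correlation rho. The function name both_miss and the numbers 0.3 and 0.5 are made up for illustration.

```python
def both_miss(p: float, rho: float) -> float:
    """Chance that two passes both miss the same issue.

    p   = chance one pass misses it (assumed equal for both passes)
    rho = correlation between the two misses (0 = independent, 1 = identical)
    """
    # For two yes/no events with the same rate p, the chance both happen is
    # p*p plus the covariance, and the covariance is rho * p * (1 - p).
    return p * p + rho * p * (1 - p)


for rho in (1.0, 0.5, 0.0):
    print(f"rho={rho:.1f}  both miss: {both_miss(0.3, rho):.3f}")
# rho=1.0  both miss: 0.300
# rho=0.5  both miss: 0.195
# rho=0.0  both miss: 0.090
```

With rho at 1.0, the second pass adds nothing: it misses exactly what the first one missed. Fresh sessions do not get rho to zero, since it is still the same model, but anything that lowers it makes the second look worth having.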

The loop

plan = planner(problem)
 
for i in range(3):
    review = reviewer(plan)
 
    if review["clean"] or review["severity"] < 2:
        break
 
    plan = fixer(plan, review)
 
implementation = implementer(plan)

That loop is the entire trick.

The actual script

This runs locally with Python 3.11 or newer. I tested the workflow on macOS. Total runtime depends on model latency and how much the reviewer decides to ruin your evening.

Step 1: Create a project folder

mkdir ai-review-loop
cd ai-review-loop

Step 2: Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate

Step 3: Install dependencies

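Use python3 -m pip instead of bare pip. It installs into whichever interpreter python3 points to, which is the virtual environment once it is activated, and outside a venv macOS often has no plain pip command on the path at all.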

python3 -m pip install --upgrade pip
python3 -m pip install openai python-dotenv

Step 4: Create your environment file

Create a file named .env:

touch .env

Add this inside it:

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5.4

OPENAI_MODEL is optional. If you do not set it, the script uses gpt-5.4.
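A small preflight check can catch a missing or placeholder key before any API call is made. This is a sketch under assumptions: resolve_config is a hypothetical helper, not part of the script below, and it takes a plain dict so it can be exercised without touching the real environment.

```python
DEFAULT_MODEL = "gpt-5.4"  # same fallback the script uses


def resolve_config(env: dict[str, str]) -> dict[str, str]:
    """Return the settings a run would use, or fail early with a clear message."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key or api_key == "your_key_here":
        raise RuntimeError("OPENAI_API_KEY is missing or still the placeholder.")

    # An empty OPENAI_MODEL is treated the same as an unset one.
    model = env.get("OPENAI_MODEL", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model": model}


print(resolve_config({"OPENAI_API_KEY": "sk-example"})["model"])  # prints gpt-5.4
```

In a real run you would pass dict(os.environ) after load_dotenv() has been called, so the values from .env are already in place.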

Step 5: Create the script

Create a file named ai_review_loop.py:

touch ai_review_loop.py

Paste this code:

import json
import os
import time
from pathlib import Path
from typing import Any
 
from dotenv import load_dotenv
from openai import OpenAI
 
 
load_dotenv()
 
client = OpenAI()
 
MODEL = os.getenv("OPENAI_MODEL", "gpt-5.4")
RUN_DIR = Path("runs")
RUN_DIR.mkdir(exist_ok=True)
 
 
def save_text(run_id: str, filename: str, content: str) -> None:
    path = RUN_DIR / run_id
    path.mkdir(parents=True, exist_ok=True)
    (path / filename).write_text(content, encoding="utf-8")
 
 
def save_json(run_id: str, filename: str, data: dict[str, Any]) -> None:
    path = RUN_DIR / run_id
    path.mkdir(parents=True, exist_ok=True)
    (path / filename).write_text(
        json.dumps(data, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
 
 
def call_model(prompt: str) -> str:
    response = client.responses.create(
        model=MODEL,
        input=prompt,
    )
 
    output = response.output_text.strip()
 
    if not output:
        raise RuntimeError("Model returned an empty response.")
 
    return output
 
 
def call_model_json(prompt: str) -> dict[str, Any]:
    response_text = call_model(prompt)
 
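    try:
        parsed = json.loads(response_text)
        # Valid JSON that is not an object (a list, a bare string) would crash
        # should_stop later, so treat it like a parse failure.
        if not isinstance(parsed, dict):
            raise ValueError("Reviewer JSON is not an object.")
        return parsed
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError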
        return {
            "clean": False,
            "severity": 5,
            "findings": [
                {
                    "title": "Reviewer returned invalid JSON",
                    "impact": "The stop condition cannot safely evaluate the review.",
                    "fix": "Tighten the reviewer prompt or add structured output enforcement.",
                    "raw_response": response_text,
                }
            ],
        }
 
 
def planner(problem: str) -> str:
    return call_model(
        f"""
You are a planner.
 
Problem:
{problem}
 
Create a concrete, step-by-step plan.
 
Rules:
- Be specific.
- Include assumptions.
- Include failure handling.
- Include testing.
- No fluff.
"""
    )
 
 
def reviewer(plan: str) -> dict[str, Any]:
    return call_model_json(
        f"""
You are an adversarial reviewer.
 
Review this plan as if it is wrong.
 
Plan:
{plan}
 
Find:
- missing steps
- incorrect assumptions
- edge cases
- operational risks
- unclear ownership
- weak stop conditions
- missing test coverage
 
Return only valid JSON using this exact structure:
 
{{
  "clean": false,
  "severity": 1,
  "findings": [
    {{
      "title": "Short issue title",
      "impact": "Why this matters",
      "fix": "Concrete fix"
    }}
  ]
}}
 
Severity scale:
1 = minor
2 = useful improvement
3 = important issue
4 = serious issue
5 = blocking issue
 
If the plan is clean, return:
 
{{
  "clean": true,
  "severity": 1,
  "findings": []
}}
 
Do not wrap the JSON in markdown.
Do not add commentary outside the JSON.
Treat this as a fresh review with no previous context.
"""
    )
 
 
def format_review(review: dict[str, Any]) -> str:
    findings = review.get("findings", [])
 
    if not findings:
        return "No findings."
 
    lines = []
    for index, finding in enumerate(findings, start=1):
        lines.append(f"{index}. {finding.get('title', 'Untitled finding')}")
        lines.append(f"   Impact: {finding.get('impact', '')}")
        lines.append(f"   Fix: {finding.get('fix', '')}")
 
    return "\n".join(lines)
 
 
def fixer(plan: str, review: dict[str, Any]) -> str:
    findings_text = format_review(review)
 
    return call_model(
        f"""
You are fixing a plan.
 
Current plan:
{plan}
 
Reviewer findings:
{findings_text}
 
Produce an improved version that resolves all findings.
 
Rules:
- Preserve good parts of the current plan.
- Fix the issues directly.
- Add missing stop conditions, persistence, retries, and tests where needed.
- Output only the improved plan.
"""
    )
 
 
def implementer(plan: str) -> str:
    return call_model(
        f"""
You are an implementer.
 
Plan:
{plan}
 
Produce working code or exact execution steps.
 
Rules:
- Be specific.
- Include file names.
- Include commands.
- Include validation steps.
- Mention assumptions only when needed.
"""
    )
 
 
def should_stop(review: dict[str, Any]) -> bool:
    clean = bool(review.get("clean", False))
    severity = int(review.get("severity", 5))
 
    return clean or severity < 2
 
 
def run(problem: str, max_iters: int = 3) -> tuple[str, str]:
    run_id = time.strftime("%Y%m%d-%H%M%S")
 
    save_text(run_id, "problem.txt", problem)
 
    plan = planner(problem)
    save_text(run_id, "plan-0.md", plan)
 
    for i in range(max_iters):
        review = reviewer(plan)
        save_json(run_id, f"review-{i + 1}.json", review)
 
        print(f"\nIteration {i + 1} review:")
        print(json.dumps(review, indent=2, ensure_ascii=False))
 
        if should_stop(review):
            print(f"\nStopping after iteration {i + 1}.")
            break
 
        plan = fixer(plan, review)
        save_text(run_id, f"plan-{i + 1}.md", plan)
 
    implementation = implementer(plan)
    save_text(run_id, "final-plan.md", plan)
    save_text(run_id, "implementation.md", implementation)
 
    return plan, implementation
 
 
if __name__ == "__main__":
    problem = "Build a stateless AI orchestrator with planner, reviewer, fixer roles."
 
    final_plan, implementation = run(problem)
 
    print("\nFinal plan:\n")
    print(final_plan)
 
    print("\nImplementation:\n")
    print(implementation)

Step 6: Run it

python3 ai_review_loop.py

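The script creates a runs/ folder with one timestamped subfolder per run. A run that stops at the third review saves the files below. If the third review still finds problems, the fixer runs once more and a plan-3.md appears as well: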

problem.txt
plan-0.md
review-1.json
plan-1.md
review-2.json
plan-2.md
review-3.json
final-plan.md
implementation.md

That matters because the loop is no longer just text flying around in the terminal. Every plan and review is saved, so you can inspect what changed between iterations.
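Inspecting what changed does not need any extra tooling. The sketch below uses difflib from the standard library; diff_plans is a hypothetical helper, and the two inline plans are made-up stand-ins for the contents of plan-0.md and plan-1.md.

```python
import difflib


def diff_plans(old: str, new: str, old_name: str = "plan-0.md", new_name: str = "plan-1.md") -> str:
    """Return a unified diff between two plan versions as one string."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # inputs have no trailing newlines, so headers should not either
    )
    return "\n".join(lines)


old_plan = "1. Call planner\n2. Call reviewer\n3. Implement"
new_plan = "1. Call planner\n2. Call reviewer\n3. Stop after 3 iterations\n4. Implement"

print(diff_plans(old_plan, new_plan))
```

Against a real run, read the two files with Path.read_text() from the run's subfolder and pass the contents in. An empty string back means the fixer changed nothing.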

One change that adds real technical value

The weakest part of the first version was its is_clean check, which string-matched the reviewer's reply. Matching on phrases like "no issues" is a polite way to invite false confidence.

The updated version forces the reviewer into a small contract:

{
  "clean": false,
  "severity": 4,
  "findings": [
    {
      "title": "No persistence between iterations",
      "impact": "A crash on iteration 2 loses the current plan and findings.",
      "fix": "Write each step to disk before moving to the next call."
    }
  ]
}

Now the stop condition is measurable:

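def should_stop(review):
    # Missing keys count as "not clean" and "blocking", same as in the full script.
    return bool(review.get("clean", False)) or int(review.get("severity", 5)) < 2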

That is still not perfect. The reviewer can still be wrong. But it is better than hoping the phrase "nothing critical" means the same thing every time.
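Parsing as JSON only proves the reply is JSON, not that it follows the contract. Here is a sketch of a stricter check, assuming the three-field shape shown above. validate_review is a hypothetical helper, not something the script already contains.

```python
from typing import Any

REQUIRED_FINDING_KEYS = {"title", "impact", "fix"}


def validate_review(review: Any) -> list[str]:
    """Return a list of contract violations; an empty list means the review is usable."""
    if not isinstance(review, dict):
        return ["review is not a JSON object"]

    problems: list[str] = []

    if not isinstance(review.get("clean"), bool):
        problems.append("'clean' must be true or false")

    severity = review.get("severity")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(severity, bool) or not isinstance(severity, int) or not 1 <= severity <= 5:
        problems.append("'severity' must be an integer from 1 to 5")

    findings = review.get("findings")
    if not isinstance(findings, list):
        problems.append("'findings' must be a list")
        return problems

    for index, finding in enumerate(findings, start=1):
        if not isinstance(finding, dict):
            problems.append(f"finding {index} is not an object")
            continue
        missing = REQUIRED_FINDING_KEYS - set(finding)
        if missing:
            problems.append(f"finding {index} is missing: {', '.join(sorted(missing))}")

    if review.get("clean") is True and findings:
        problems.append("'clean' is true but findings are listed")

    return problems
```

A non-empty result is a reason to rerun the reviewer, not to hand the complaint to the fixer. The fixer's job is to fix the plan, and a malformed review says nothing about the plan.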

It also makes it easier to log results, diff findings between iterations, and build a basic dashboard later if you want to see which classes of bugs keep showing up.

What changed in practice

Before this, I spent around 30 to 40 minutes on a medium task, mostly switching tabs, comparing versions manually, and trying to remember whether iteration 2 found the timeout issue or the role-boundary issue.

After this, the same task usually takes about 10 to 15 minutes. The script runs the review cycles, prints everything, and removes the administrative nonsense. I do not waste mental energy preserving context because the whole point is to destroy it on purpose.

More important, the failure pattern changed.

Before, I missed structural issues like missing retries, undefined ownership between components, or a review step that sounded specific but could not be repeated by another engineer.

After, the remaining issues were usually narrower:

what happens if the API times out after 12 seconds
what happens if the reviewer returns an empty string
what happens if iteration 2 produces a worse plan than iteration 1
what happens if the final implementer ignores a non-blocking warning that should have stayed attached

That is a much better class of problem.
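The "iteration 2 is worse than iteration 1" case has a cheap partial answer: keep every reviewed plan with its severity and hand the implementer the best one rather than the latest one. This is a sketch with made-up data; pick_best and the history list are hypothetical.

```python
def pick_best(history: list[tuple[str, int]]) -> str:
    """Return the plan with the lowest reviewed severity; the earliest wins ties."""
    best_plan, _ = min(history, key=lambda item: item[1])
    return best_plan


# Hypothetical run where iteration 2 made things worse than iteration 1.
history = [
    ("plan-0", 4),
    ("plan-1", 2),
    ("plan-2", 3),
]

print(pick_best(history))  # prints plan-1
```

Two caveats. Every plan in the history needs its own review, including the last one the fixer produced, and severity scores from separate fresh sessions are noisy, so this is a tiebreaker and not a ranking to trust blindly.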

Where this breaks

This is not magic.

You are still using one model unless you deliberately mix providers. The independence comes from context isolation, not from different intelligence.

Token growth becomes real once your plan and findings get wordy. Every fixer call carries both, so a 1,500-token plan plus 1,500 tokens of findings is roughly 3,000 input tokens per call before any instructions, and that number climbs every time the fixer makes the plan longer instead of clearer. You need trimming, summarization, or hard length limits if you scale this beyond toy examples.
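A hard length limit can be as blunt as a budget check before the fixer call. The sketch below assumes a rough rule of thumb of about 4 characters per token for English text, which is an approximation and not a real tokenizer. estimate_tokens and fit_findings are hypothetical helpers.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_findings(plan: str, findings: list[str], budget: int) -> list[str]:
    """Keep findings in order until plan + findings would exceed the token budget."""
    used = estimate_tokens(plan)
    if used > budget:
        raise ValueError("Plan alone exceeds the budget; condense it before fixing.")

    kept: list[str] = []
    for finding in findings:
        cost = estimate_tokens(finding)
        if used + cost > budget:
            break  # drop this finding and everything after it
        kept.append(finding)
        used += cost
    return kept


plan = "x" * 400                              # about 100 tokens
findings = ["a" * 200, "b" * 200, "c" * 200]  # about 50 tokens each

print(len(fit_findings(plan, findings, budget=200)))  # prints 2
```

Dropping findings in order only makes sense if the reviewer lists the worst ones first, which the prompt in this article does not ask for. Anything dropped should be logged so it can go back in on the next pass.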

The basic version now persists each plan and review to disk, but that still does not make it production-grade. You would still need retry handling, proper logging, cost tracking, and stronger output validation if you wanted to rely on it heavily.
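Retry handling is the easiest of those gaps to close. Below is a sketch of exponential backoff around any zero-argument call. with_retries is a hypothetical helper, and the built-in TimeoutError stands in for whatever timeout exception your client library raises, which you would pass through retry_on.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on a listed error, wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated flaky call: times out twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "plan text"


delays: list[float] = []  # record the waits instead of actually sleeping
print(with_retries(flaky, sleep=delays.append), delays)  # prints plan text [1.0, 2.0]
```

In the script this would wrap the body of call_model, for example with_retries(lambda: call_model(prompt)). Only errors listed in retry_on are retried, so a bad API key still fails immediately.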

The reviewer can also be inconsistent. The script leaves temperature unset, which avoids one common API compatibility problem, but that does not make the model deterministic. Two runs on the same input can still produce different findings, which means you should treat the loop as a review aid, not a theorem prover.
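One way to live with that inconsistency is to run the reviewer more than once on the same plan and merge the results conservatively. This is a sketch, not something the script above does. merge_reviews is a hypothetical helper, and the sample reviews are made up.

```python
from typing import Any


def merge_reviews(reviews: list[dict[str, Any]]) -> dict[str, Any]:
    """Combine several reviews of the same plan into one conservative review."""
    merged: list[dict[str, Any]] = []
    seen_titles: set[str] = set()

    for review in reviews:
        for finding in review.get("findings", []):
            key = str(finding.get("title", "")).strip().lower()
            if key in seen_titles:
                continue  # same title already reported by an earlier pass
            if key:
                seen_titles.add(key)
            merged.append(finding)  # untitled findings are always kept

    return {
        # Clean only if there was at least one pass and every pass agreed.
        "clean": bool(reviews) and all(bool(r.get("clean", False)) for r in reviews),
        # Take the worst severity any pass reported.
        "severity": max((int(r.get("severity", 5)) for r in reviews), default=5),
        "findings": merged,
    }


reviews = [
    {"clean": True, "severity": 1, "findings": []},
    {"clean": False, "severity": 3, "findings": [{"title": "No retries"}]},
    {"clean": False, "severity": 2, "findings": [{"title": "no retries "}, {"title": "No stop condition"}]},
]

merged = merge_reviews(reviews)
print(merged["clean"], merged["severity"], len(merged["findings"]))  # prints False 3 2
```

Matching on titles only catches duplicates that happen to be worded the same way. Two passes describing one bug in different words will both survive, which costs fixer tokens but loses nothing.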

One upgrade that actually matters

Use a different reviewer.

For example:

planner and implementer in OpenAI Codex
reviewer in Claude

Same loop, different failure patterns.

Claude tends to be stricter on reasoning gaps. Codex tends to give me more concrete implementation detail. Together they cover more ground than either one does alone.

When I tried this on a small orchestration project with 4 modules and 12 functions, the mixed setup caught two concurrency issues that Codex alone missed. That does not make one model superior in general. It just means blind spots are not evenly distributed, so the system gets stronger when the reviewers disagree for different reasons.
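Mixing providers is easier if the loop never knows which provider it is talking to. In this sketch the writer and the reviewer are plain callables, so each can wrap a different vendor's client. run_mixed and both stubs are hypothetical, and the stubs return canned answers in place of real API calls.

```python
from typing import Any, Callable

TextModel = Callable[[str], str]               # prompt in, text out
ReviewModel = Callable[[str], dict[str, Any]]  # plan in, parsed review out


def run_mixed(problem: str, writer: TextModel, reviewer: ReviewModel, max_iters: int = 3) -> str:
    """Planner and fixer share one backend; the reviewer can be a different one."""
    plan = writer(f"Plan this:\n{problem}")

    for _ in range(max_iters):
        review = reviewer(plan)
        if review.get("clean") or int(review.get("severity", 5)) < 2:
            break
        titles = "\n".join(f"- {f.get('title', '')}" for f in review.get("findings", []))
        plan = writer(f"Fix this plan:\n{plan}\n\nFindings:\n{titles}")

    return plan


# Stubs standing in for two different providers.
def stub_writer(prompt: str) -> str:
    return "plan v2 (locks added)" if prompt.startswith("Fix") else "plan v1"


def stub_reviewer(plan: str) -> dict[str, Any]:
    if "locks" in plan:
        return {"clean": True, "severity": 1, "findings": []}
    return {"clean": False, "severity": 4, "findings": [{"title": "Race on shared state"}]}


print(run_mixed("Build an orchestrator", stub_writer, stub_reviewer))  # prints plan v2 (locks added)
```

The reviewer callable owns its own prompt and JSON parsing, so swapping vendors means replacing one function and leaving the loop untouched.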

Adversarial check on this article

The loop assumes that resetting context improves quality. That is observable, but not guaranteed. Weak prompts still produce weak work. You just get three weak answers faster.

The roles here are intentionally simple. Real systems usually need stronger constraints:

the reviewer should output structured findings
the fixer should preserve known-good sections instead of rewriting the entire plan
the implementer should produce testable code with explicit inputs and outputs
the run should persist each iteration to disk

Once those constraints are explicit, the loop stops feeling like prompt theater and starts feeling like an actual engineering workflow.
