The Navigator · 11 min mission

Test-Driven Loops & Systematic Debugging

Give Claude a check it can run, then let the loop close itself to green.

tdddebuggingverificationFact-checked 2026-06-15

On this page

Hands-off gates: /goal and Stop hooks
Debugging: reproduce as a failing test, then fix
Independent verification

How to give Claude Code a pass/fail check it can read, so it verifies its own work instead of stopping at "looks done." Covers the four verification gates, the red→green TDD cycle, reproduce-then-fix debugging, and independent review. After this you can write a test-first task, set an automatic gate that iterates to green, reproduce a bug as a failing test, and review a diff with a model that did not write it.

Gate	How it works	Best for
In one prompt	Ask Claude to run the check and iterate in the same message: "…run the test suite and fix any failures."	A single focused task you are watching
`/goal` condition	A small fast model (Haiku by default) re-checks the condition after every turn; Claude keeps working until it holds, then the goal clears itself.	A multi-turn push to one measurable end state
`Stop` hook	A script runs the check and blocks the turn from ending until it passes (`exit 2`). Deterministic, not advice.	A guarantee that runs in every session
Second opinion	A subagent or `/code-review` refutes the result in a fresh context, so the writer is not the grader.	Catching overfitting and missed edge cases

The four verification gates (best-practices), least to most setup. Higher rows free up more attention for longer unattended runs. Use the lightest one that fits.

The red → green → verify cycle

Write the test — state it is test-first
Tell Claude what behavior to verify and that the implementation does not exist yet, with concrete cases. The docs prefer verification criteria over vague prompts: "write a validateEmail function. example test cases: user@example.com is true, invalid is false, user@.com is false. run the tests after implementing." Claude examines existing test files to match the style, frameworks, and assertion patterns already in use.
Confirm it fails — red for the right reason
Run the suite in shell mode (! npm test) or have Claude run it. You want a clean assertion failure, not a syntax or import error — that is what makes the test a real verification signal.
Commit the red test
Commit the failing test before implementing. This pins a version-controlled definition of "done" that the implementation step cannot quietly edit out from under you.
Implement to green — without editing the test
Switch out of plan mode and state the constraint: implement the behavior, run the tests after each change, and do not modify the tests to make them pass.
Verify with fresh eyes
Before treating the task as done, have a subagent (or /code-review) review the diff in a fresh context — it sees only the diff and your criteria, not the reasoning that produced the change, so it grades the result on its own terms.

red → green in one session

… scroll to run this session

Shell mode (`! npm test`) adds the command and its real output to the conversation context, so Claude reads the same red and green you do.

Hands-off gates: `/goal` and Stop hooks

When a task needs several turns — get a suite green, clean the lint, make the build pass — /goal (Claude Code v2.1.139+) sets a completion condition and Claude keeps working toward it without per-turn prompting. After each turn, a small fast model (Haiku by default) checks whether the condition holds; if not, Claude starts another turn. Setting a goal starts a turn immediately, a ◎ /goal active indicator shows, and the goal clears automatically once the condition is met.

The evaluator does not run commands or read files on its own — it only judges what Claude has already surfaced in the transcript. So the condition must be demonstrable from Claude's own output: Claude has to actually run npm test and let the result land in the conversation. Write one measurable end state, a stated check (e.g. npm test exits 0), the constraints that must not change, and a runtime bound (or stop after 20 turns). Max condition length is 4,000 characters.

Setting, inspecting, and clearing a /goal

# set a well-formed condition (one end state + a stated check + a bound)
/goal all tests in test/auth pass and "npm test" exits 0, the lint step is clean,
and the public session API is unchanged — or stop after 20 turns
 
# inspect: condition, duration, turns evaluated, token spend, evaluator's last reason
/goal
 
# clear it (aliases: stop, off, reset, none, cancel; /clear also clears it)
/goal clear
 
# non-interactive: run the loop to completion (needs workspace trust accepted)
claude -p "/goal CHANGELOG.md has an entry for every PR merged this week"

When a check must run every time — no exceptions, no relying on Claude to remember — use a Stop hook directly. A hook is a user-defined command that fires when Claude finishes responding and can block the turn from ending to keep it working. Exit 0 lets the turn end; exit 2 blocks it and feeds the stderr message back to Claude as the reason to keep going. Define hooks in ~/.claude/settings.json (all projects), .claude/settings.json (committable per-project), or .claude/settings.local.json (gitignored); browse them read-only with /hooks; disable all with "disableAllHooks": true.

.claude/settings.json — block the turn until tests pass

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "npm test --silent || { echo 'Tests are failing — fix them before stopping.' >&2; exit 2; }"
          }
        ]
      }
    ]
  }
}

Event	Fires	Can block?
`Stop`	When Claude finishes responding — not only at task completion; not on user interrupts.	Yes — `exit 2` or `{"decision":"block","reason":"…"}` keeps Claude working.
`SubagentStop`	When a subagent finishes.	Yes — same blocking capability as `Stop`.
`PreToolUse`	Before a tool call executes.	Yes — `hookSpecificOutput.permissionDecision`: `allow` / `deny` / `ask` / `defer`.
`PostToolUse`	After a tool call succeeds (it already ran).	No — but can `{"decision":"block"}` the conversation or rewrite output via `updatedToolOutput`.

Hook events relevant to verification gates (hooks-guide / hooks reference). Exit codes: `0` = no objection; `2` = block (stderr fed back to Claude); any other = non-blocking error.

Debugging: reproduce as a failing test, then fix

Debugging is the same loop pointed at a bug. The common-workflows "fix bugs efficiently" recipe: share the error ("I'm seeing an error when I run npm test"), give Claude the command to reproduce it and the stack trace, state the repro steps, and say whether the failure is intermittent or consistent. Make the bug reproducible first so "fixed" has a re-runnable meaning. Two phrasing rules, both from the best-practices prompt tables: describe the symptom, not the fix, and demand the root cause so a patch does not just silence the symptom.

Two ways to report the same bug

Symptom-then-guess

"Login is broken after a while. Add a null check in refresh()."

You hand Claude your hypothesis as the task. If the real cause is elsewhere, it patches the wrong place and the bug survives behind a green-looking change.

Reproduce-then-fix (docs prompt)

"users report that login fails after session timeout. check the auth flow in src/auth/, especially token refresh. write a failing test that reproduces the issue, then fix it. address the root cause, don't suppress the error. Keep the session API unchanged."

Supplies outcome, location, repro, verification, and constraint — Claude locates the real defect instead of patching a guess.

Work a real failure to its root cause

Symptom → fix

Pick what you're seeing. A question or two later you land on the exact command to run — every step pulled straight from the Claude Code troubleshooting docs.

Where does it hurt?

When the session itself misbehaves — a slow run, a permission denial, a hook that never fires — pick the symptom and follow it to the fix. Same reproduce-then-fix steps, applied to your setup.

Independent verification

The model that just wrote the code is biased toward it: per the docs, "a fresh context improves code review since Claude won't be biased toward code it just wrote." Two ways to review in a fresh context:

/code-review — a bundled skill that reviews the current diff for bugs in a fresh subagent and returns findings to the session.
A plan-conformance subagent — "review the diff against PLAN.md. Check that every requirement is implemented, the listed edge cases have tests, and nothing outside the task's scope changed. Report gaps, not style preferences."

A reviewer told to find gaps will usually report some even when the work is sound, and chasing every finding leads to over-engineering — extra abstraction layers, defensive code, tests for cases that cannot happen. Tell the reviewer to flag only gaps that affect correctness or the stated requirements and treat the rest as optional.

Compose an unattended TDD-to-green run

Plan first
Enter plan mode (Shift+Tab to cycle default → acceptEdits → plan, prefix one prompt with /plan, or start with claude --permission-mode plan). Claude researches and proposes without editing source. Press Ctrl+G to open the plan in your editor; refine, then approve.
Approve into a hands-off mode
Approving exits plan mode and switches the session to the mode the approve option describes. For an unattended run choose Approve and start in auto mode (auto mode requires v2.1.83+) or Approve and accept edits (acceptEdits).
Write and confirm the failing tests
Use ! npm test to land a real red result in context — a clean assertion failure, for the right reason.
Set a gate to drive to green
Set a /goal tied to npm test exiting 0 (and a turn bound), or install a Stop hook so Claude iterates without per-turn prompting.
Close with a fresh-eyes pass
Run /code-review or a plan-conformance subagent. Each stage leaves evidence — red test, green run, clean review — so you read proof, not Claude's word.

Knowledge check

You ask Claude to "make all auth tests pass" and set a /goal condition: "all tests in test/auth pass." It reports success after a few turns, but when you run the suite yourself, two tests still fail. What most likely went wrong?

Reach the end and this star joins your charted sky.

Hands-off gates: /goal and Stop hooks

Debugging: reproduce as a failing test, then fix

Independent verification

Hands-off gates: `/goal` and Stop hooks