The Navigator · 10 min mission

Cost, Limits & Performance Optimization

Get more done per token and per dollar — without losing quality.

costperformanceoptimizationFact-checked 2026-06-13

On this page

Two ways to pay, one thing being measured
How subscription limits reset
The token economics of an agentic session
Fast mode, and what "performance" really costs
Efficiency patterns that compound
Headless billing changed on June 15, 2026
Hard cost controls for print mode
1M-token economics: big context, no premium — but watch the surprise

Every Claude Code session spends something. On a subscription it spends from a usage allowance that refills on a clock; on the API it spends real dollars per token. Either way, the meter that matters is the same underlying quantity — tokens — and almost everything that makes Claude Code feel expensive or hit a wall traces back to how many tokens each turn carries.

The good news is that token spend is not random. It follows a handful of legible rules, and once you can see those rules you can cut waste without cutting capability. This guide separates the two billing models, explains how limits reset, and then teaches the economics and the habits that compound into a faster, cheaper, more reliable agent.

Two ways to pay, one thing being measured

There are two billing relationships behind Claude Code, and confusing them is the source of most cost anxiety. With a subscription — Pro or Max — your Claude Code usage is included in the plan. You do not get a per-message invoice; you get a usage allowance, and when you exhaust it you wait for it to reset or upgrade. With the Claude API, you are billed pay-as-you-go, per token, and the bill is the sum of every request your sessions made.

Both models meter the same thing. The official docs are explicit that Claude Code charges by API token consumption under the hood; the subscription simply wraps that consumption in a flat allowance instead of a line-item bill. So the skills that stretch an API dollar are the exact same skills that stretch a Max allowance — there is one efficiency game, played two ways.

Dimension	Pro / Max subscription	Claude API
How you pay	Flat monthly fee; Claude Code usage included	Pay-as-you-go, per token consumed
What runs out	A usage allowance that resets on a clock	Nothing — you keep being billed (subject to limits you set)
Max tiers	`5x` or `20x` more usage than Pro	Not applicable — usage tiers govern rate limits, not allowance
Where you watch it	`/usage` and `/status` in the CLI; Settings → Usage	`/cost` per session; the Usage page in the Claude Console
Best for	Steady individual / team developer use	Automation, CI, scripted agents, metered cost control

Subscription vs API billing at a glance. Subscription tiers include Claude Code usage; the API bills per token. Plan names and inclusions per the Claude pricing page; verify current details there before relying on them.

How subscription limits reset

Subscription limits are not one bucket — they are layered, and understanding the layers tells you when to push and when to pace yourself. Pro and Max plans carry a rolling session limit that resets every five hours, plus a weekly limit that resets at a fixed time each week assigned to your account. The five-hour window is the one you brush against during an intense afternoon; the weekly limit is the backstop that governs heavy, sustained weeks.

Max adds a wrinkle worth knowing: it carries two weekly limits — one that applies across all models, and a second that applies to Opus usage specifically. The practical reading is that the plans are tuned so heavy users spread work across models rather than running everything on the largest one. You can always see your next reset time in Settings → Usage.

When you do hit a limit, Claude Code does not crash — it warns you as you approach capacity and, at the wall, offers your choices: wait for the reset, enable usage credits to keep going, or upgrade. Knowing the five-hour window exists changes how you plan: a giant migration started ten minutes before your reset is a different decision than one started right after it.

The token economics of an agentic session

Here is the mental model that makes everything else click. A normal chat sends a short prompt and gets a short answer. An agentic session is a loop — Claude reads files, runs commands, observes output, and decides again — and every turn resends the accumulated context. The conversation so far, the files it has read, the tool results: all of it is re-billed as input tokens on the next request. Context is not a one-time cost; it is a recurring one, paid on every step of the loop.

That single fact explains why agentic work can get expensive: a session that reads a 2,000-line file early keeps paying for those 2,000 lines on every subsequent turn until that context is cleared or compacted. It also explains why the highest-leverage cost lever is not "use a cheaper model" — it is keep the context small. The docs put it plainly: token costs scale with context size, so the strategies that keep context lean are the strategies that cut cost. Output tokens matter too — and extended thinking is billed as output — but for long coding sessions, runaway input context is usually the silent budget-killer.

Fast mode, and what "performance" really costs

"Performance" pulls in two directions, and it helps to name them. One is latency — how fast answers arrive. The other is throughput per dollar — how much real work each token buys. They can trade against each other.

Fast mode is the clearest example. It is a research-preview feature that delivers significantly faster output for the top Opus models (Opus 4.8, 4.7, and 4.6) — but at premium pricing, applied across the full context window. Fast mode buys you latency with money. That is sometimes exactly the right trade (a tight interactive loop where waiting breaks your flow) and sometimes a waste (a long batch job you will read later). It is not available with the Batch API or on Claude Platform on AWS, and prompt-caching and data-residency multipliers stack on top of it.

The deeper point: raw speed is rarely the bottleneck. A session that takes the wrong approach fast is slower and costlier than one that takes the right approach deliberately. Most real performance wins come not from a faster mode but from spending fewer tokens on the wrong path — which is what the rest of this guide is about.

watching the meter

… scroll to run this session

On a subscription, /usage shows what counts against your plan; /status shows remaining allocation. On the API, /cost shows the running spend for this session — an estimate computed locally from token counts.

Efficiency patterns that compound

The biggest savings do not come from one clever flag — they come from a few habits that each shrink context a little and stack. Because every turn re-bills the context, a habit that trims tokens early pays off again on every subsequent turn of the loop. That compounding is the whole game.

Clear between tasks. Stale context from a finished task wastes tokens on every later message. /clear starts fresh when you switch to unrelated work; auto-compaction summarizes history when you approach the limit, and /compact Focus on the API changes lets you steer what survives. Fresh task, fresh context.

Scope your prompts. Vague requests like "improve this codebase" trigger broad scanning — Claude reads half the repo to figure out what you meant, and you pay for all of it. A specific request like "add input validation to the login function in auth.ts" lets Claude work with minimal reads. Specificity is a cost control, not just a clarity nicety.

Delegate verbose work to subagents. Running a noisy test suite or processing a long log can dump tens of thousands of tokens into your main conversation, where they sit and re-bill forever. Hand those to a subagent: the verbose output stays in the subagent's context, and only a short summary returns to your main thread. Per the docs, agent teams can use roughly 7x more tokens than a standard session because each teammate runs its own context window — so keep teams small and clean them up when work is done.

Keep CLAUDE.md lean. Your CLAUDE.md loads into context at the start of every session, so every line is a recurring tax paid even on unrelated work. The official guidance is to aim for under 200 lines and move specialized, situational instructions into skills, which load on demand only when invoked. A tight memory file plus on-demand skills beats one giant always-on document.

A cost-aware session, start to finish

Right-size the model first
Sonnet handles most coding well and costs less than Opus; reserve the largest models for genuinely hard reasoning, and use model: haiku for simple subagent tasks. Switch mid-session with /model. The cheapest token is the one a smaller model spends just as well.
Plan before you edit on anything complex
Press Shift+Tab to enter plan mode. Claude explores and proposes an approach for your approval before writing code — which prevents the expensive re-work of discovering, three files in, that the whole direction was wrong.
Keep context scoped as you go
Name files and functions instead of asking Claude to hunt for them. Delegate noisy operations to subagents. Glance at /context to see what is consuming space, and disable MCP servers you are not using with /mcp.
Clear or compact at task boundaries
When you finish a unit of work and move to something unrelated, /clear. For a long single task brushing the context limit, /compact with an instruction about what to preserve. Both reset the recurring per-turn cost back down.
Check the meter, then decide
Run /usage (or /status on a subscription, /cost on the API) to see where you stand. If you are near a five-hour reset on Pro/Max, that informs whether to start a big job now or wait ten minutes for the window to roll over.

Two ways to ask for the same fix

Token-hungry

"Something is wrong with our auth, can you look into it and fix it?"

Claude scans broadly to locate the problem, reads many candidate files, and every one of them re-bills on every later turn. The session is slow, expensive, and easy to derail.

Token-thrifty

"Login returns a 401 after password reset — the bug is in auth/reset.ts. Reproduce with npm test auth, fix it, and keep the session API unchanged."

Scoped target, a verification command, a constraint. Claude reads what it needs, fixes in one pass, and the context stays small — so the cache stays warm and the bill stays low.

Knowledge check

You are deep in a Claude Code session that read a large log file an hour ago. You have since moved on to an unrelated feature, but everything feels slower and your usage is climbing fast. What is the most effective first move?

Headless billing changed on June 15, 2026

Interactive Claude Code and scripted Claude Code are now billed from different buckets on a subscription. [V] Per the official docs, starting June 15, 2026, Agent SDK and claude -p usage on subscription plans draw from a new monthly Agent SDK credit that is separate from your interactive usage limits. That is a meaningful change in mental model: the /clear-and-pace discipline that protects your five-hour and weekly interactive windows no longer protects your automation, because automation no longer spends from those windows. A CI job that runs claude -p on a schedule draws down the Agent SDK credit pool instead — so a runaway script can exhaust that pool without ever touching the allowance you use for hands-on coding, and vice versa. Plan the two pools separately.

This also means the cheapest way to not get surprised by automation spend is to put a ceiling on every non-interactive run, which Claude Code supports directly through print-mode flags.

Hard cost controls for print mode

Two flags turn "I hope this script doesn't run away" into a guarantee, and both are print-mode (-p) only. [V]

--max-budget-usd N is a hard spend cap: it is the maximum dollar amount to spend on API calls before stopping. Pass claude -p --max-budget-usd 5.00 "query" and the run halts once it has spent five dollars of model usage, full stop. This is the single most direct guard against a prompt that loops forever or a tool result that explodes the context — the meter cannot run past the number you set.

--max-turns N caps the agent loop instead of the dollar figure: it limits the number of agentic turns and exits with an error when the limit is reached (there is no limit by default). Because every turn re-bills the accumulated context, bounding turns bounds the worst case before money is even involved — claude -p --max-turns 3 "query" can attempt at most three read-act cycles. Use --max-turns to fail fast on a task that should be quick, and --max-budget-usd to put an absolute floor under the bill regardless of how the turns play out. [P] In CI the robust pattern is both at once: a turn ceiling to catch logic that spins, and a dollar ceiling to catch anything the turn count alone would miss.

To measure rather than cap, ask for JSON. With --output-format json, the response payload includes a total_cost_usd field and a per-model cost breakdown, so a scripted caller can record the exact spend of each invocation without opening the usage dashboard. [V] That makes per-run cost a value you can log, assert on, or sum across a pipeline — the foundation for any real budget discipline in automation.

Flag	What it does	Example
`--max-budget-usd N`	Hard cap: maximum dollars to spend on API calls before stopping	`claude -p --max-budget-usd 5.00 "query"`
`--max-turns N`	Caps agentic turns; exits with an error at the limit. No limit by default	`claude -p --max-turns 3 "query"`
`--output-format json`	Emits `total_cost_usd` + a per-model cost breakdown for the run	`claude -p ... --output-format json \| jq '.total_cost_usd'`

Print-mode (`-p`) cost controls, verified against the Claude Code CLI reference. The first two stop a run; the third reports its cost for logging. All require print mode.

capping and measuring a scripted run

… scroll to run this session

A hard dollar cap plus a turn cap on a non-interactive run, then reading the per-invocation cost back out of the JSON payload with jq. total_cost_usd is reported by Claude Code itself.

1M-token economics: big context, no premium — but watch the surprise

The largest models can run with a 1M-token context window, and the pricing is simpler than people expect: the 1M context window uses standard model pricing with no premium for tokens beyond 200K. [V] The docs make the point concretely — a 900K-token request is billed at the same per-token rate as a 9K-token request — and prompt-caching and batch discounts still apply at standard rates across the full window. [V] There is no "long-context surcharge" the way some providers meter it; a token past the 200K mark costs exactly what a token before it costs.

That sounds like pure upside, and for capability it is. The catch is the interaction with the recurring-context rule from earlier in this guide. A 1M window does not change the price per token — but it removes the natural ceiling that a 200K window imposed on how many tokens a runaway session can accumulate. Same rate, far more rope. A session that quietly grows toward a million tokens of input, and then re-bills that input on every turn, can run up real spend without any single step looking expensive. The window is generous; the discipline still has to come from you.

For automation specifically, that rope is a hazard. A scheduled claude -p job that occasionally trips into 1M context can cost an order of magnitude more than its typical run while every per-token price stays "standard." Two environment variables give you control. [V]

Set CLAUDE_CODE_DISABLE_1M_CONTEXT=1 to disable 1M context entirely — this removes the 1M model variants from the model picker. In CI, where you want every run to cost the same predictable amount and never silently balloon into a million-token request, exporting this in the job environment caps the context window at 200K and takes the surprise off the table. [P] It is the cleanest way to make "this job cannot accidentally enter 1M-context pricing territory" a property of the environment rather than a thing you remember to check.

Set DISABLE_PROMPT_CACHING=1 to turn prompt caching off for all models (it takes precedence over the per-model DISABLE_PROMPT_CACHING_* switches). Caching normally saves money, so you would not disable it to cut a bill. Its value is billing isolation: with caching off, every run bills the full input deterministically instead of paying a variable mix of cache writes (1.25x/2x) and cache reads (0.1x) that depends on timing and what ran just before. When you need a clean, reproducible per-invocation cost — to benchmark a prompt, attribute spend precisely, or compare two approaches without cache state confounding the numbers — turning caching off removes that variable. Reach for it deliberately and temporarily, not as a default.

Reach the end and this star joins your charted sky.