First Principles · 11 min mission
Context & Prompt Engineering for Coding Agents
Treat the context window as a finite budget and learn the tool-agnostic craft that makes every coding agent sharper.
On this page
Context engineering is the practice of curating the exact set of tokens a model sees during inference — system prompt, tool definitions, memory files, retrieved files, tool results, and prior turns — not just the prompt you type. The concrete moves: budget the window, write prompts at the right altitude, design token-efficient tools, ground just-in-time over MCP, decompose into durable artifacts, and keep long sessions coherent. Every technique is tool-agnostic, with the exact commands, config keys, and version pins for Claude Code, Codex, Gemini CLI, and IDE agents.
| Term | Anthropic's definition | Scope |
|---|---|---|
| Prompt engineering | "Methods for writing and organizing LLM instructions for optimal outcomes" | The instructions you author |
| Context engineering | "The set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference" | Every token in the window, including those that arrive outside your prompt |
| What lands in the window | Cost profile | How to keep it lean |
|---|---|---|
| System prompt / instructions | Fixed — every turn | Aim for the "right altitude"; remove contradictions |
| Tool definitions | Fixed — every turn | Consolidate and namespace; cut overlapping tools |
Memory file (CLAUDE.md, AGENTS.md, GEMINI.md) | Fixed — every turn | Layered and curated, not a junk drawer |
| Retrieved files / pasted data | Variable — grows fast | Prefer just-in-time retrieval over up-front dumps |
| Tool results + prior turns | Variable — grows fastest | Clear stale results; compact long sessions |
Write the system prompt at the right altitude
Instructions fail in two directions. Too brittle: hardcoded if-this-then-that logic that breaks the moment reality deviates from the script. Too vague: lofty guidance with no concrete behavioral signal to steer on. The target is the right altitude between them — specific enough to steer, flexible enough to hand the model heuristics rather than a flowchart. Structure the prompt into sections with XML tags or Markdown headers (<background_information>, <instructions>, <context_gathering>, <persistence>, or a ## Tool guidance header) so each part is findable and weightable.
The same instruction at three altitudes
Too brittle / too vague
Brittle: "If the file ends in .test.ts, run jest; if it ends in .spec.ts, run vitest; if there is a Makefile, run make test; otherwise…" — a flowchart that breaks on the first unforeseen case.
Vague: "Write good, well-tested code." — true and unactionable. No signal to steer on.
Right altitude
"Match the existing test setup. Find how tests are run in this repo (check package.json scripts and any Makefile), run the relevant suite before claiming a fix works, and report the command you used."
Concrete enough to steer, flexible enough to survive an unanticipated layout.
Tool-definition hygiene (Anthropic, writing-tools-for-agents)
Consolidate, don't proliferate
One tool that performs several related operations under the hood beats ten near-duplicates. It shrinks the fixed-overhead definitions and removes the "which tool do I call?" ambiguity that stalls agents.
Namespace related tools under a common prefix
Group tools with a shared prefix (e.g.
issues_*) so their boundaries are obvious.Return high-signal results
Prefer human-interpretable names over low-level identifiers like
uuid,256px_image_url, ormime_type. A result the model can read is a result it can use.Add a verbosity control
Offer a
response_formatenum ("concise"vs"detailed"), plus pagination, range selection, filtering, and truncation with sensible defaults, so the agent spends tokens only when it needs detail.Write error messages that steer
Emit messages that guide the model toward correct usage, not cryptic codes.
Name semantically and design evaluation-first
Make names and arguments "as semantically correct as possible" —
semantic_searchbeats an ambiguoussearch(OpenAI). A tool wrapping a terminal command works best when its output mirrors the real command (keeps the model in-distribution). Measure tools on real tasks, then refine descriptions from what you observe.
// LOW-SIGNAL: vague name, dumps raw rows, no verbosity control, no limits.
server.registerTool(
'search',
{ description: 'Search', inputSchema: { q: z.string() } },
async ({ q }) => ({
// Returns every column, including uuids and internal mime types,
// for an unbounded number of rows — straight into the window.
content: [{ type: 'text', text: JSON.stringify(await db.searchAll(q)) }],
}),
);
// HIGH-SIGNAL: semantic name, bounded results, human-readable fields,
// and a verbosity knob the agent controls.
server.registerTool(
'semantic_search_issues',
{
description:
'Search issues by meaning. Returns the top matches with title, ' +
'status, and assignee. Use detail="full" only when you need bodies.',
inputSchema: {
query: z.string().describe('Natural-language description of the issue'),
limit: z.number().int().min(1).max(20).default(5),
detail: z.enum(['concise', 'full']).default('concise'),
},
},
async ({ query, limit, detail }) => {
const hits = await issues.semanticSearch(query, limit);
const rows = hits.map((i) =>
detail === 'full'
? `#${i.number} [${i.status}] ${i.title} — ${i.assignee}\n${i.body}`
: `#${i.number} [${i.status}] ${i.title} — ${i.assignee}`,
);
return { content: [{ type: 'text', text: rows.join('\n') }] };
},
);Ground just-in-time — don't pre-load the world
There are three retrieval postures; the modern agentic default has shifted to the second. Pre-retrieval (up-front) loads data into context before inference (classic RAG with embeddings/indexes) — fast, but risks stale indexes and window pollution. Just-in-time ("agentic search") keeps only lightweight identifiers in context (file paths, URLs, queries) and loads the data at runtime with tools like glob, grep, and file reads — turning metadata into signal and enabling progressive disclosure. Hybrid does both: retrieve a little up front, then explore. Claude Code is the canonical hybrid — CLAUDE.md is dropped in up front while glob/grep/read fetch files just-in-time, sidestepping stale indexing.
| Posture | What sits in context | Trade-off |
|---|---|---|
| Pre-retrieval (up-front RAG) | Data loaded before inference via embeddings/indexes | Fast; risks stale indexes and pollution |
| Just-in-time (agentic search) | Identifiers only (glob/grep/read at runtime) | Lean window, progressive disclosure; needs tool calls |
| Hybrid | CLAUDE.md up front + tools for the rest | "The decision boundary for the right level of autonomy depends on the task" |
Two ways to give an agent the same codebase
Pre-load everything
Paste 4,000 lines across twelve files into the prompt "so it has full context."
The relevant 40 lines drown in 3,960 irrelevant ones. Attention stretches thin, the window is mostly spent before work begins, and you pay for all of it every turn.
Just-in-time retrieval
Point the agent at the directory and let it grep for the symbol, read the two files that matter, and open more only if needed.
The window holds high-signal tokens: a query, two files, the result. Progressive disclosure keeps attention concentrated.
MCP: the wire format for what context an agent can reach
The Model Context Protocol (MCP) standardizes how an agent discovers and pulls external context, making "what context does this agent have access to" portable across clients. It defines three server-exposed primitives that map onto the ways context enters a window: Resources (server-exposed data the client loads into context), Tools (model-controlled functions the agent calls to act or retrieve), and Prompts (server-exposed, user-controlled templates — discovered via prompts/list, fetched via prompts/get with arguments, commonly surfaced as slash commands). A prompt can embed resources directly (text, image, audio, or a type:"resource" reference), injecting documentation or code samples into the conversation flow.
| Item | Current value | Notes |
|---|---|---|
| Spec revision | 2025-11-25 | Supersedes 2025-06-18; 2026-07-28 is a release candidate only |
| Standard transports | stdio, Streamable HTTP | HTTP+SSE from 2024-11-05 is deprecated (back-compat only) |
| Protocol headers | MCP-Protocol-Version: 2025-11-25 | Session id MCP-Session-Id; resume via Last-Event-ID on HTTP GET |
| Python SDK | mcp v1.27.2 | PyPI 2026-05-29; v2.0.0 in alpha (no GA date) |
| TypeScript SDK | @modelcontextprotocol/sdk v1.29.0 | npm latest; v2 pre-alpha, targeting Q3 2026 |
# pip install "mcp[cli]" — package name is "mcp", stable line v1.x (v1.27.2)
from mcp.server.fastmcp import FastMCP
mcp = FastMCP("project-context")
# A RESOURCE: server-exposed data the client can load into the window
# on demand, identified by a lightweight URI — the just-in-time pattern.
@mcp.resource("docs://coding-standards")
def coding_standards() -> str:
"""The team's coding standards, fetched only when referenced."""
return (
"- Prefer composition over inheritance.\n"
"- Every public function has a docstring and a test.\n"
"- No new dependencies without sign-off."
)
# A PROMPT: a user-controlled template, discovered via prompts/list and
# fetched via prompts/get. Clients commonly surface these as slash commands.
@mcp.prompt()
def review_diff(diff: str) -> str:
"""Structured code-review prompt the user invokes with arguments."""
return (
"<task>Review the following diff against our coding standards.</task>\n"
"<focus>Correctness, then readability. Cite the standard you apply.</focus>\n"
f"<diff>\n{diff}\n</diff>"
)
if __name__ == "__main__":
# stdio is one of the two standard transports (the other is Streamable HTTP).
mcp.run(transport="stdio")Decompose into durable artifacts, not an ever-growing chat
In a single sprawling conversation the chat history is your only memory, and it rots as it grows. Decomposition that matters for context engineering produces durable artifacts — each phase writes a high-signal file that becomes the input to the next, instead of relying on accumulated chat. Two vendor-official methodologies make this concrete: GitHub Spec Kit (a /speckit.* slash-command workflow) and AWS Kiro (three files per feature under .kiro/specs/).
Install GitHub Spec Kit and run the spec-driven flow
Install the CLI
Requires
uv+ Python 3.11+. Runuv tool install specify-cli --from git+https://github.com/github/spec-kit.git@vX.Y.Z(the docs recommend pinning a release tag; the bare.gitform works but is unpinned).Initialize the workflow
Run
specify initin the repo. The agent gains the/speckit.*slash commands (works with 30+ CLI and IDE agents).Optionally set principles
Run
/speckit.constitutionto establish governing project principles.Specify the what
Run
/speckit.specifyto define requirements / user stories. Use/speckit.clarifyto resolve underspecified requirements (recommended).Plan the how, then break into tasks
Run
/speckit.planfor the technical implementation plan, then/speckit.tasksto break it into actionable tasks./speckit.analyzecross-checks artifact consistency;/speckit.checklistgenerates quality checklists.Implement
Run
/speckit.implementto execute the tasks./speckit.taskstoissuesconverts tasks into GitHub issues. Each phase produces a durable artifact — the spec "becomes executable" and defines the what before the how.
| Methodology | Where artifacts live | The phased flow |
|---|---|---|
| GitHub Spec Kit | specify CLI output + repo | /speckit.specify → /speckit.plan → /speckit.tasks → /speckit.implement |
| AWS Kiro | .kiro/specs/<feature>/ | requirements.md (EARS) → design.md → tasks.md → Run all Tasks |
Keep long sessions sharp: four techniques
When a task outruns one window, four named techniques (Anthropic) prevent the slide into context rot. Compaction summarizes the conversation near the limit and reinitiates from the summary — maximize recall first, then iterate to improve precision; "one of the safest, lightest-touch forms of compaction is tool result clearing." Structured note-taking (agentic memory) writes notes to durable storage outside the window (a progress file, a to-do list) and pulls them back only when needed. Sub-agent architectures spin up specialized sub-agents with clean context windows that return only a condensed summary — typically ~1,000–2,000 tokens — to the coordinator ("since context is your fundamental constraint, subagents are one of the most powerful tools available"). Cross-session harnesses externalize state so a fresh session resumes cold.
| Tool | File | Load behavior |
|---|---|---|
| Claude Code | CLAUDE.md | Dropped in up front; tools fetch the rest just-in-time |
| OpenAI Codex | AGENTS.md | Files from ~/.codex + each dir from repo root to cwd; deeper overrides shallower; model "trained to closely adhere" |
| Gemini CLI | GEMINI.md | Concatenated from ~/.gemini/GEMINI.md, ./GEMINI.md, subdir files; /memory show to inspect, /memory reload to reload |
| AWS Kiro | .kiro/steering/*.md | product.md, tech.md, structure.md — "included in every interaction by default" |
Knowledge check
You are 90 minutes into a large refactor and the context window is nearly full. The agent is starting to "forget" decisions it made early in the session. What is the soundest move?
| Footgun | Fix |
|---|---|
| "A bigger context window solves it" | It does not — context rot degrades accuracy regardless of max size |
| Dumping the whole repo or giant pasted logs into the prompt | Just-in-time retrieval (paths + tools) or a curated subset |
| Contradictory instructions in the prompt or memory file | Audit for conflicts; the model burns tokens reconciling them |
| Over-stuffed tool lists with overlapping tools | Consolidate and namespace under a common prefix |
| Verbose tool results (raw JSON, uuids, base64) | Return human-readable output; paginate or truncate |
| Letting chat history be the only memory in long tasks | Externalize state to files (progress file, JSON task list) |
| Ending with only a plan | A plan is not a deliverable unless asked; reconcile every TODO first |
Reach the end and this star joins your charted sky.