First Principles · 11 min mission
Evals for Agents: Measuring Whether Your Setup Actually Works
Build a repeatable eval loop so every prompt, tool, and model change becomes a measurement instead of a leap of faith.
On this page
An eval is a repeatable test for a non-deterministic system: feed the agent an input, let it run, then apply grading logic to what it produced — over enough tasks and enough trials that a trustworthy number falls out. This guide gives you the vocabulary, the three grader families, the two metrics that survive non-determinism (pass@k and pass^k), the 2026 framework options, and a runnable loop you can ship this week.
An eval suite is the regression test for everything in your setup that isn't code — the system prompt, the AGENTS.md/CLAUDE.md rules, the tool definitions, the model choice. It lives in Foundations because the mechanics are identical whether you drive Claude Code, Codex, Gemini CLI, Cursor, or a hand-rolled MCP agent.
| Term | What it means |
|---|---|
| Eval | A test: give the AI an input, apply grading logic to its output to measure success |
| Task / problem | One test with defined inputs and success criteria |
| Trial | A single attempt at a task; run many per task (agents are non-deterministic) |
| Grader | The logic that scores a task; a task can have several |
| Transcript / trace | The full record of a run: outputs, tool calls, reasoning, intermediates |
| Outcome | The final environment state at the end of the trial — what changed, not how |
| Eval harness | Infrastructure that runs the evals end-to-end |
| Agent harness / scaffold | The loop + tools + prompts that turn a model into an agent — usually the real thing under test |
Pick a grader family
Every eval lives or dies on its grader. Anthropic groups graders into three families with sharp trade-offs. Do not pick one — run deterministic graders on every trial, LLM judges on a sample, humans on a small gold-set. Anthropic's "Swiss Cheese Model": "no single evaluation layer catches every issue," so evals run alongside production monitoring, A/B tests, and manual transcript review.
| Grader family | How it scores | Strength | Weakness |
|---|---|---|---|
| Code-based | String match, unit tests, static analysis, env-state checks | Fast, cheap, objective, reproducible | Brittle — rejects valid variations |
| Model-based (LLM-as-judge) | Rubric scoring, NL assertions, pairwise comparison | Flexible; handles nuance and open-ended output | Non-deterministic, costs tokens, must be calibrated |
| Human | Expert review, crowdsourcing, spot-checks | Gold-standard quality | Expensive, slow, hard to scale |
LLM-as-judge: three modes, one rule
When a deterministic check cannot capture "good," reach for an LLM judge in one of three modes, then calibrate it.
- Pointwise — judge rates one response on a scale (1–5 / 1–10). Simple and fast, but drifts (a model's idea of "a 7" wanders).
- Pairwise — judge picks the better of two responses. More reliable: relative judgement is an easier task for a model than absolute scoring.
- Reference-grounded — judge compares the output to a gold reference, explains the delta, scores the gap. Most accurate when a known-correct answer exists.
The non-negotiable rule: calibrate the judge against human judgment. Anthropic: "calibrate LLM graders against human expert judgment." OpenAI makes it concrete: "Start with gpt-5.5 when you need a strong LLM judge, then validate agreement against your human labels." Without that comparison you do not know your numbers.
Grade the outcome, not the path
There are two things you could grade; default to the outcome.
Outcome-based (state-based) grading checks the final environment state: did the tests pass, is the row in the database, does the file contain X. This is the preferred default — Anthropic's guidance is to design graders "emphasizing outcome over process." Trajectory-based (process) grading scores how the agent got there: which tools, in what order, whether policy was followed. It is useful for debugging and policy compliance but dangerous as the primary grade, because "agents regularly find valid approaches that eval designers didn't anticipate."
OpenAI frames the same split as trace grading, where a trace is "the end-to-end record of model calls, tool calls, guardrails, and handoffs for one run." They call it "the fastest way to identify workflow-level issues." Reach for trajectory/trace checks to diagnose and enforce compliance; keep the headline pass/fail on the outcome. On multi-step tasks, award partial credit rather than binary pass/fail.
Two ways to grade the same coding task
Outcome-based (default)
"After the agent runs, the repo’s own test suite must pass and lint must be clean."
Any path that lands in a working state scores a pass. Robust to the agent solving it a way you did not foresee.
Trajectory-based (use sparingly)
"The agent must read config.ts, then call run_tests, then edit auth.ts."
Good for debugging why a run failed or enforcing policy — but it fails a correct fix that skipped a step you assumed was mandatory.
Report both pass@k and pass^k
A single run is not an eval. Run k trials per task and report two numbers; the gap between them is the most important signal in your suite.
pass@k= probability at least one ofktrials succeeds. Measures capability:1 − (1−p)^k, rises fast toward 1.pass^k= probability allktrials succeed. Measures reliability:p^k, falls fast toward 0.
Both were introduced for agents by the original τ-bench paper; Anthropic's eval guidance reuses them. For production, pass^k is the honest number.
| Agent type | What to grade | Reference benchmark |
|---|---|---|
| Coding | Run the repo’s unit tests (deterministic) + an LLM rubric for code quality | SWE-bench Verified, Terminal-Bench |
| Conversational | State verification + LLM rubric for interaction quality; usually needs a simulated user | τ²-bench |
| Research | Groundedness + coverage + source-quality checks | BrowseComp |
| Computer-use | Verify outcomes via environment-state checks | WebArena, OSWorld |
| Tool | Lane / license | Version | Model / key feature |
|---|---|---|---|
| Inspect AI | OSS, MIT (UK AISI + Meridian Labs) | 0.3.239 (PyPI 2026-06-09) | Eval = Task = Dataset + Solver + Scorer; 200+ built-in evals |
| DeepEval | OSS, Apache-2.0 (Confident AI) | 4.0.6 (PyPI 2026-06-10) | "Pytest for LLM apps": LLMTestCase, G-Eval, DAG + Tool Correctness/Task Completion; gate via deepeval test run |
| Promptfoo | OSS, MIT | current line | Declarative YAML assertions in promptfooconfig.yaml; 100+ red-team plugins |
| Braintrust / LangSmith / Langfuse / Phoenix | Hosted + observability (Langfuse/Phoenix OSS cores) | — | Offline eval + production tracing |
| OpenAI Datasets + trace grading | Hosted (SDK/API) | — | Forward path — the hosted Evals UI is deprecating (see below) |
| Harbor | Containerized benchmark/RL infra (Terminal-Bench creators) | — | Standardized task/grader format; runs trials at scale across sandbox providers |
Run the minimal eval loop
Anthropic runs a concrete five-move loop to evaluate tool and prompt setups (from "Writing effective tools for agents"). It starts as a single script over 20–50 real failures — no platform required. Follow the steps, then run the script below.
The evaluation-driven loop (Anthropic, "Writing effective tools for agents")
Generate tasks from real use
Build tasks from genuine workflows, not toy sandboxes: "Prompts should be inspired by real-world uses and be based on realistic data sources." Strong tasks need multiple tool calls and each pairs with a verifiable outcome (exact-match or an LLM judge).
Run programmatically with direct API calls
Drive the eval with direct LLM API calls in a simple agentic loop — alternate a model call with tool execution, and have the agent emit reasoning before each tool call so the transcript is legible.
Collect more than accuracy
Log runtime per tool call and per task, total tool calls, total token consumption, and tool errors. A pass that took 40 tool calls is a different result than a pass that took 4.
Read the raw transcripts
Numbers hide behavior: "what agents omit in their feedback and responses can often be more important than what they include." This is also how you catch a grader scoring false passes or false fails.
Close the loop with the agent
Feed transcripts back to the agent — "let agents analyze your results and improve your tools for you." This dovetails with the evaluator-optimizer pattern: one model generates, a second evaluates and gives feedback in a loop, "when we have clear evaluation criteria, and when iterative refinement provides measurable value."
"""A 60-line eval harness: tasks + trials + an outcome grader + pass@k/pass^k.
No platform, no SaaS. Swap run_agent() for your own setup's entry point."""
import json
from statistics import mean
# 1. Tasks pulled from REAL failures, each with a verifiable outcome.
TASKS = [
{
"id": "fix-null-guard",
"prompt": "Add a null guard to parseConfig() so an empty file "
"returns {} instead of throwing.",
# Outcome grader: deterministic check on the FINAL state, not the path.
"grade": lambda repo: repo.run("pytest tests/test_config.py").ok,
},
{
"id": "rename-endpoint",
"prompt": "Rename the /v1/user endpoint to /v1/users and update callers.",
"grade": lambda repo: repo.run("pytest tests/test_api.py").ok
and repo.grep("/v1/user\b") == [],
},
]
K = 5 # trials per task — never trust a single greedy run
def evaluate(task):
"""Run K trials, grade the OUTCOME of each, return per-task metrics."""
results = []
for trial in range(K):
repo = fresh_sandbox() # isolated, clean state every trial
transcript = run_agent(task["prompt"], repo) # your agent setup
passed = task["grade"](repo)
results.append({"passed": passed,
"tool_calls": transcript.tool_calls,
"tokens": transcript.tokens,
"errors": transcript.tool_errors})
save_transcript(task["id"], trial, transcript) # READ these later
n_pass = sum(r["passed"] for r in results)
return {
"task": task["id"],
"pass_at_k": int(n_pass >= 1), # capability
"pass_pow_k": int(n_pass == K), # reliability
"avg_tool_calls": mean(r["tool_calls"] for r in results),
"avg_tokens": mean(r["tokens"] for r in results),
}
if __name__ == "__main__":
report = [evaluate(t) for t in TASKS]
print(json.dumps(report, indent=2))
print("suite pass@k:", mean(r["pass_at_k"] for r in report))
print("suite pass^k:", mean(r["pass_pow_k"] for r in report))Anthropic’s 8-step roadmap, condensed for your own setup
Start early
20–50 simple tasks pulled from real failures, not hypotheticals. A handful of real tasks beats a polished platform with none.
Convert manual testing into formal cases
Every time you would hand-check something, write it down as a task. Manual checks that never become tasks are checks you will forget to repeat.
Write unambiguous tasks
Pair each with a reference solution clear enough that "two domain experts would independently reach the same pass/fail verdict."
Build isolated environments
Clean state between runs — no cross-trial contamination, or one trial’s side effects silently pass or fail the next.
Design thoughtful graders
Outcome over process, with partial credit on multi-step tasks so you keep the signal binary pass/fail throws away.
Review transcripts regularly
Verify the grader is fair — catch false fails and false passes by reading the actual runs, not just the scores.
Monitor for saturation
When a suite hits ~100% pass it has stopped discriminating; that is "too easy," not "done." Add harder, real-failure-derived tasks.
Maintain the suite
Treat it as living infrastructure and design tasks to be bypass-resistant so the agent cannot game the grader instead of solving the task.
Knowledge check
Your coding agent solves a refactor task on 4 of 5 trials. You want to know whether it is safe to let it run unattended in CI. Which number answers that, and what does it tell you?
Reach the end and this star joins your charted sky.