First Principles · 11 min mission

Evals for Agents: Measuring Whether Your Setup Actually Works

Build a repeatable eval loop so every prompt, tool, and model change becomes a measurement instead of a leap of faith.

Tags: evals, testing, llm-as-judge, reliability, agents · Fact-checked 2026-06-15

An eval is a repeatable test for a non-deterministic system: feed the agent an input, let it run, then apply grading logic to what it produced — over enough tasks and enough trials that a trustworthy number falls out. This guide gives you the vocabulary, the three grader families, the two metrics that survive non-determinism (pass@k and pass^k), the 2026 framework options, and a runnable loop you can ship this week.

An eval suite is the regression test for everything in your setup that isn't code — the system prompt, the AGENTS.md/CLAUDE.md rules, the tool definitions, the model choice. It lives in Foundations because the mechanics are identical whether you drive Claude Code, Codex, Gemini CLI, Cursor, or a hand-rolled MCP agent.

Term	What it means
Eval	A test: give the AI an input, apply grading logic to its output to measure success
Task / problem	One test with defined inputs and success criteria
Trial	A single attempt at a task; run many per task (agents are non-deterministic)
Grader	The logic that scores a task; a task can have several
Transcript / trace	The full record of a run: outputs, tool calls, reasoning, intermediates
Outcome	The final environment state at the end of the trial — what changed, not how
Eval harness	Infrastructure that runs the evals end-to-end
Agent harness / scaffold	The loop + tools + prompts that turn a model into an agent — usually the real thing under test

The load-bearing vocabulary from Anthropic’s eval taxonomy ("Demystifying evals for AI agents"). Learn these before writing a single task.
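
To make the vocabulary concrete, here is a minimal sketch of how the terms could map onto data structures. The class and function names (`Trial`, `Task`, `grade`, `file_contains_guard`) are illustrative assumptions, not part of Anthropic's taxonomy or any library.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trial:
    """One attempt at a task."""
    transcript: list[str]   # full record of the run: outputs, tool calls, intermediates
    outcome: dict           # final environment state: what changed, not how


# A grader maps one finished trial to a score between 0 and 1.
Grader = Callable[[Trial], float]


@dataclass
class Task:
    """One test: a defined input plus its success criteria (the graders)."""
    task_id: str
    prompt: str
    graders: list[Grader] = field(default_factory=list)


def grade(task: Task, trial: Trial) -> dict[str, float]:
    """Apply every grader attached to the task to a single trial."""
    return {g.__name__: g(trial) for g in task.graders}


def file_contains_guard(trial: Trial) -> float:
    """Hypothetical code-based grader that inspects the outcome, not the transcript."""
    return 1.0 if "if not text" in trial.outcome.get("config.py", "") else 0.0


task = Task("fix-null-guard", "Add a null guard to parse_config().", [file_contains_guard])
trial = Trial(
    transcript=["read config.py", "edit config.py"],
    outcome={"config.py": "def parse_config(text):\n    if not text:\n        return {}\n"},
)
scores = grade(task, trial)
print(scores)
```

Note that the grader never looks at `transcript`: the record of the run is kept for reading later, while the score comes from the outcome.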

Pick a grader family

Every eval lives or dies on its grader. Anthropic groups graders into three families with sharp trade-offs. Do not pick just one: run deterministic graders on every trial, LLM judges on a sample, and humans on a small gold set. This layering follows Anthropic's "Swiss Cheese Model," in which "no single evaluation layer catches every issue," so evals run alongside production monitoring, A/B tests, and manual transcript review.

Grader family	How it scores	Strength	Weakness
Code-based	String match, unit tests, static analysis, env-state checks	Fast, cheap, objective, reproducible	Brittle — rejects valid variations
Model-based (LLM-as-judge)	Rubric scoring, NL assertions, pairwise comparison	Flexible; handles nuance and open-ended output	Non-deterministic, costs tokens, must be calibrated
Human	Expert review, crowdsourcing, spot-checks	Gold-standard quality	Expensive, slow, hard to scale

The three grader families (Anthropic, "Demystifying evals for AI agents"). Combine all three; do not pick one.
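
One way to layer the three families is to decide, per trial, which graders apply. The sketch below is an assumption about how you might wire that up: `code_grader`, `pick_for_judge`, `route`, and the 20% judge sample rate are all hypothetical choices, not values from the source.

```python
import random


def code_grader(output: str, expected: str) -> bool:
    """Deterministic check: exact match after whitespace and case normalization."""
    def normalize(s: str) -> str:
        return " ".join(s.split()).lower()
    return normalize(output) == normalize(expected)


def pick_for_judge(trial_ids: list[str], rate: float, seed: int = 0) -> list[str]:
    """Choose a reproducible random sample of trials to send to the LLM judge."""
    if not trial_ids:
        return []
    rng = random.Random(seed)
    n = min(len(trial_ids), max(1, round(len(trial_ids) * rate)))
    return sorted(rng.sample(trial_ids, n))


def route(trial_ids: list[str], gold_set: set[str], judge_rate: float = 0.2) -> dict[str, list[str]]:
    """Every trial gets the code grader; a sample gets the judge; the gold set gets humans."""
    sampled = set(pick_for_judge(trial_ids, judge_rate))
    plan = {}
    for tid in trial_ids:
        layers = ["code"]
        if tid in sampled:
            layers.append("llm_judge")
        if tid in gold_set:
            layers.append("human")
        plan[tid] = layers
    return plan


ids = [f"t{i}" for i in range(10)]
plan = route(ids, gold_set={"t0"})
print(plan)

# Brittleness in action: a valid variation with a comma is rejected.
print(code_grader("  Hello   World ", "hello world"))
print(code_grader("Hello, world", "hello world"))
```

The last line shows the weakness from the table: the code grader rejects an answer a human would accept, which is the gap the sampled judge and the gold set exist to catch.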

LLM-as-judge: three modes, one rule

When a deterministic check cannot capture "good," reach for an LLM judge in one of three modes, then calibrate it.

Pointwise — judge rates one response on a scale (1–5 / 1–10). Simple and fast, but drifts (a model's idea of "a 7" wanders).
Pairwise — judge picks the better of two responses. More reliable: relative judgement is an easier task for a model than absolute scoring.
Reference-grounded — judge compares the output to a gold reference, explains the delta, scores the gap. Most accurate when a known-correct answer exists.

The non-negotiable rule: calibrate the judge against human judgment. Anthropic: "calibrate LLM graders against human expert judgment." OpenAI makes it concrete: "Start with gpt-5.5 when you need a strong LLM judge, then validate agreement against your human labels." Without that comparison, you cannot tell whether the judge's scores mean anything.
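
A calibration check can be as small as comparing judge verdicts to human labels on the same items. The sketch below reports raw agreement plus Cohen's kappa, which corrects for agreement that would happen by chance. Kappa is one common choice and an assumption here; neither Anthropic nor OpenAI prescribes a specific statistic, and the labels are made-up sample data.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(judge)
    observed = agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:   # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for ten transcripts, for illustration only.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "fail"]

print(f"raw agreement: {agreement(judge, human):.2f}")
print(f"cohen's kappa: {cohens_kappa(judge, human):.2f}")
```

Raw agreement looks flattering when most items pass, so it helps to report both numbers and to read the disagreements themselves.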

Grade the outcome, not the path

There are two things you could grade; default to the outcome.

Outcome-based (state-based) grading checks the final environment state: did the tests pass, is the row in the database, does the file contain X. This is the preferred default; Anthropic's guidance is to design graders "emphasizing outcome over process."

Trajectory-based (process) grading scores how the agent got there: which tools, in what order, whether policy was followed. It is useful for debugging and policy compliance but dangerous as the primary grade, because "agents regularly find valid approaches that eval designers didn't anticipate."

OpenAI frames the same split as trace grading, where a trace is "the end-to-end record of model calls, tool calls, guardrails, and handoffs for one run." They call it "the fastest way to identify workflow-level issues." Reach for trajectory/trace checks to diagnose and enforce compliance; keep the headline pass/fail on the outcome. On multi-step tasks, award partial credit rather than binary pass/fail.

Two ways to grade the same coding task

Outcome-based (default)

"After the agent runs, the repo’s own test suite must pass and lint must be clean."

Any path that lands in a working state scores a pass. Robust to the agent solving it a way you did not foresee.

Trajectory-based (use sparingly)

"The agent must read config.ts, then call run_tests, then edit auth.ts."

Good for debugging why a run failed or enforcing policy — but it fails a correct fix that skipped a step you assumed was mandatory.
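
The two styles can coexist if the outcome carries the score and the trajectory only produces notes. Here is a minimal sketch under that assumption; the state keys, check weights, and function names are hypothetical, chosen to mirror the coding task above.

```python
from typing import Callable

State = dict[str, object]   # hypothetical snapshot of the final environment
Check = tuple[str, float, Callable[[State], bool]]


def outcome_score(state: State, checks: list[Check]) -> float:
    """Weighted partial credit over checks on the FINAL state."""
    total = sum(weight for _, weight, _ in checks)
    earned = sum(weight for _, weight, check in checks if check(state))
    return earned / total


def trajectory_notes(tool_calls: list[str], expected_order: list[str]) -> list[str]:
    """Diagnostic only: list expected steps that never appear in order."""
    notes = []
    pos = 0
    for step in expected_order:
        try:
            pos = tool_calls.index(step, pos) + 1
        except ValueError:
            notes.append(f"missing or out of order: {step}")
    return notes


CHECKS: list[Check] = [
    ("tests pass", 0.6, lambda s: bool(s["tests_passed"])),
    ("lint clean", 0.2, lambda s: s["lint_errors"] == 0),
    ("callers updated", 0.2, lambda s: s["stale_references"] == 0),
]

state = {"tests_passed": True, "lint_errors": 3, "stale_references": 0}
score = outcome_score(state, CHECKS)
notes = trajectory_notes(
    ["run_tests", "edit auth.ts"],
    ["read config.ts", "run_tests", "edit auth.ts"],
)
print(f"outcome score: {score:.2f}")
print("trajectory notes:", notes)
```

In this run the agent skipped reading config.ts and still earned most of the credit. The note is there for debugging; it does not change the grade.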

Report both pass@k and pass^k

A single run is not an eval. Run k trials per task and report two numbers; the gap between them is the most important signal in your suite.

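pass@k = probability that at least one of k trials succeeds. Measures capability. If each trial succeeds independently with probability p, pass@k = 1 − (1 − p)^k, which rises fast toward 1 as k grows.
pass^k = probability that all k trials succeed. Measures reliability. Under the same assumption, pass^k = p^k, which falls fast toward 0 as k grows.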

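pass@k comes from code-generation benchmarks; pass^k was introduced for agents by the original τ-bench paper; Anthropic's eval guidance uses both. For production, pass^k is the honest number, because an unattended agent has to succeed every time, not once in k attempts.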
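
A quick worked example shows how far the two numbers diverge. Assume, as a simplification, that every trial succeeds independently with the same probability p. With p = 0.8 (4 of 5 trials passing) and k = 5, pass@k is about 0.9997 while pass^k is about 0.33.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


p = 4 / 5   # per-trial success rate, estimated from 4 passes in 5 trials
for k in (1, 3, 5, 10):
    print(f"k={k:>2}  pass@k={pass_at_k(p, k):.4f}  pass^k={pass_pow_k(p, k):.4f}")
```

At k = 1 the two metrics are identical. Every extra trial pushes them apart, which is why the gap between them is worth reporting.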

Agent type	What to grade	Reference benchmark
Coding	Run the repo’s unit tests (deterministic) + an LLM rubric for code quality	SWE-bench Verified, Terminal-Bench
Conversational	State verification + LLM rubric for interaction quality; usually needs a simulated user	τ²-bench
Research	Groundedness + coverage + source-quality checks	BrowseComp
Computer-use	Verify outcomes via environment-state checks	WebArena, OSWorld

What to grade by agent type (Anthropic, "Demystifying evals"). The reference benchmarks describe *models*; your own suite describes *your setup*.

Tool	Lane / license	Version	Core abstraction / key feature
Inspect AI	OSS, MIT (UK AISI + Meridian Labs)	`0.3.239` (PyPI 2026-06-09)	Eval = `Task = Dataset + Solver + Scorer`; 200+ built-in evals
DeepEval	OSS, Apache-2.0 (Confident AI)	`4.0.6` (PyPI 2026-06-10)	"Pytest for LLM apps": `LLMTestCase`, `G-Eval`, DAG + Tool Correctness/Task Completion; gate via `deepeval test run`
Promptfoo	OSS, MIT	current line	Declarative YAML assertions in `promptfooconfig.yaml`; 100+ red-team plugins
Braintrust / LangSmith / Langfuse / Phoenix	Hosted + observability (Langfuse/Phoenix OSS cores)	—	Offline eval + production tracing
OpenAI Datasets + trace grading	Hosted (SDK/API)	—	Forward path — the hosted Evals UI is deprecating (see below)
Harbor	Containerized benchmark/RL infra (Terminal-Bench creators)	—	Standardized task/grader format; runs trials at scale across sandbox providers

Where to run evals in 2026. For most teams, start in the open-source CI lane.

Run the minimal eval loop

Anthropic runs a concrete five-move loop to evaluate tool and prompt setups (from "Writing effective tools for agents"). It starts as a single script over 20–50 real failures — no platform required. Follow the steps, then run the script below.

The evaluation-driven loop (Anthropic, "Writing effective tools for agents")

Generate tasks from real use
Build tasks from genuine workflows, not toy sandboxes: "Prompts should be inspired by real-world uses and be based on realistic data sources." Strong tasks need multiple tool calls and each pairs with a verifiable outcome (exact-match or an LLM judge).
Run programmatically with direct API calls
Drive the eval with direct LLM API calls in a simple agentic loop — alternate a model call with tool execution, and have the agent emit reasoning before each tool call so the transcript is legible.
Collect more than accuracy
Log runtime per tool call and per task, total tool calls, total token consumption, and tool errors. A pass that took 40 tool calls is a different result than a pass that took 4.
Read the raw transcripts
Numbers hide behavior: "what agents omit in their feedback and responses can often be more important than what they include." This is also how you catch a grader scoring false passes or false fails.
Close the loop with the agent
Feed transcripts back to the agent — "let agents analyze your results and improve your tools for you." This dovetails with the evaluator-optimizer pattern: one model generates, a second evaluates and gives feedback in a loop, "when we have clear evaluation criteria, and when iterative refinement provides measurable value."
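
The "collect more than accuracy" step can be sketched as a thin wrapper around each tool call. Everything here is a hypothetical illustration: `RunMetrics`, `timed_tool_call`, the fake `read_file` tool, and the token count, which in a real loop would come from your API responses.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Per-task counters beyond pass/fail."""
    tool_calls: int = 0
    tool_errors: int = 0
    tokens: int = 0
    tool_seconds: list[float] = field(default_factory=list)

    def summary(self) -> dict:
        return {
            "tool_calls": self.tool_calls,
            "tool_errors": self.tool_errors,
            "tokens": self.tokens,
            "total_tool_seconds": sum(self.tool_seconds),
        }


def timed_tool_call(metrics: RunMetrics, tool, *args, **kwargs):
    """Run one tool, recording its runtime and whether it raised."""
    start = time.perf_counter()
    metrics.tool_calls += 1
    try:
        return tool(*args, **kwargs)
    except Exception as exc:
        metrics.tool_errors += 1
        return f"tool error: {exc}"   # hand the error back to the agent as text
    finally:
        metrics.tool_seconds.append(time.perf_counter() - start)


def read_file(path: str) -> str:
    """Fake tool for the demo: only one file exists."""
    if path != "config.ts":
        raise FileNotFoundError(path)
    return "export const x = 1"


metrics = RunMetrics()
timed_tool_call(metrics, read_file, "config.ts")
timed_tool_call(metrics, read_file, "missing.ts")
metrics.tokens += 1200   # placeholder; read real usage from your API response
summary = metrics.summary()
print(summary)
```

Storing these counters next to each pass/fail result is what lets you tell a 4-call pass from a 40-call pass.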

minimal_eval.py — a runnable loop you own (illustrative)

"""A 60-line eval harness: tasks + trials + an outcome grader + pass@k/pass^k.
No platform, no SaaS. Swap run_agent() for your own setup's entry point."""
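import json
from statistics import mean

# Hooks you supply. These names are placeholders, not a library API.
def fresh_sandbox():
    """Return an isolated repo handle exposing .run(cmd).ok and .grep(pattern)."""
    raise NotImplementedError("wire this to your sandbox")

def run_agent(prompt, repo):
    """Run your agent; return a transcript with .tool_calls, .tokens, .tool_errors."""
    raise NotImplementedError("wire this to your agent setup")

def save_transcript(task_id, trial, transcript):
    """Persist the transcript so you can read it after the run."""
    raise NotImplementedError("wire this to your storage")
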
# 1. Tasks pulled from REAL failures, each with a verifiable outcome.
TASKS = [
    {
        "id": "fix-null-guard",
        "prompt": "Add a null guard to parseConfig() so an empty file "
                  "returns {} instead of throwing.",
        # Outcome grader: deterministic check on the FINAL state, not the path.
        "grade": lambda repo: repo.run("pytest tests/test_config.py").ok,
    },
    {
        "id": "rename-endpoint",
        "prompt": "Rename the /v1/user endpoint to /v1/users and update callers.",
        # Raw string: in a normal string "\b" is a backspace, not a regex word boundary.
        "grade": lambda repo: repo.run("pytest tests/test_api.py").ok
                              and repo.grep(r"/v1/user\b") == [],
    },
]
 
K = 5  # trials per task — never trust a single greedy run
 
def evaluate(task):
    """Run K trials, grade the OUTCOME of each, return per-task metrics."""
    results = []
    for trial in range(K):
        repo = fresh_sandbox()          # isolated, clean state every trial
        transcript = run_agent(task["prompt"], repo)   # your agent setup
        passed = task["grade"](repo)
        results.append({"passed": passed,
                        "tool_calls": transcript.tool_calls,
                        "tokens": transcript.tokens,
                        "errors": transcript.tool_errors})
        save_transcript(task["id"], trial, transcript)  # READ these later
    n_pass = sum(r["passed"] for r in results)
    return {
        "task": task["id"],
        "pass_at_k": int(n_pass >= 1),                 # capability
        "pass_pow_k": int(n_pass == K),                # reliability
        "avg_tool_calls": mean(r["tool_calls"] for r in results),
        "avg_tokens": mean(r["tokens"] for r in results),
    }
 
if __name__ == "__main__":
    report = [evaluate(t) for t in TASKS]
    print(json.dumps(report, indent=2))
    print("suite pass@k:",  mean(r["pass_at_k"]  for r in report))
    print("suite pass^k:",  mean(r["pass_pow_k"] for r in report))

Running the loop — capability passes, reliability does not

Both tasks reach pass@k=1 (the agent can do each), but rename-endpoint has pass^k=0 — one of its 5 trials failed. The suite’s pass^k of 0.5 is the number that tells you it is not yet safe unattended; read the saved transcripts for rename-endpoint before shipping.

Anthropic’s 8-step roadmap, condensed for your own setup

Start early
20–50 simple tasks pulled from real failures, not hypotheticals. A handful of real tasks beats a polished platform with none.
Convert manual testing into formal cases
Every time you would hand-check something, write it down as a task. Manual checks that never become tasks are checks you will forget to repeat.
Write unambiguous tasks
Pair each with a reference solution clear enough that "two domain experts would independently reach the same pass/fail verdict."
Build isolated environments
Clean state between runs — no cross-trial contamination, or one trial’s side effects silently pass or fail the next.
Design thoughtful graders
Outcome over process, with partial credit on multi-step tasks so you keep the signal that binary pass/fail throws away.
Review transcripts regularly
Verify the grader is fair — catch false fails and false passes by reading the actual runs, not just the scores.
Monitor for saturation
When a suite hits ~100% pass it has stopped discriminating; that is "too easy," not "done." Add harder, real-failure-derived tasks.
Maintain the suite
Treat it as living infrastructure and design tasks to be bypass-resistant so the agent cannot game the grader instead of solving the task.
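
For the "build isolated environments" step, one simple approach is to copy a pristine template into a throwaway directory for every trial. The sketch below assumes filesystem isolation is enough for your tasks; it does not isolate processes, network access, or databases, and `isolated_repo` is a hypothetical helper name.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_repo(template_dir: Path):
    """Copy a pristine template into a temp dir, then delete it after the trial."""
    with tempfile.TemporaryDirectory(prefix="eval-trial-") as tmp:
        workdir = Path(tmp) / "repo"
        shutil.copytree(template_dir, workdir)
        yield workdir


# Demo: two trials each mutate their copy; neither sees the other's changes.
with tempfile.TemporaryDirectory() as t:
    template = Path(t)
    (template / "config.txt").write_text("original")

    seen = []
    for trial in range(2):
        with isolated_repo(template) as repo:
            seen.append((repo / "config.txt").read_text())
            (repo / "config.txt").write_text(f"mutated by trial {trial}")

    untouched = (template / "config.txt").read_text()

print(seen)
print(untouched)
```

Each trial starts from the same bytes, so a pass or fail can only come from what the agent did in that trial.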

Knowledge check

Your coding agent solves a refactor task on 4 of 5 trials. You want to know whether it is safe to let it run unattended in CI. Which number answers that, and what does it tell you?
