First Principles · 11 min mission

Evals for Agents: Measuring Whether Your Setup Actually Works

Build a repeatable eval loop so every prompt, tool, and model change becomes a measurement instead of a leap of faith.

evalstestingllm-as-judgereliabilityagentsFact-checked 2026-06-15
On this page

An eval is a repeatable test for a non-deterministic system: feed the agent an input, let it run, then apply grading logic to what it produced — over enough tasks and enough trials that a trustworthy number falls out. This guide gives you the vocabulary, the three grader families, the two metrics that survive non-determinism (pass@k and pass^k), the 2026 framework options, and a runnable loop you can ship this week.

An eval suite is the regression test for everything in your setup that isn't code — the system prompt, the AGENTS.md/CLAUDE.md rules, the tool definitions, the model choice. It lives in Foundations because the mechanics are identical whether you drive Claude Code, Codex, Gemini CLI, Cursor, or a hand-rolled MCP agent.

TermWhat it means
EvalA test: give the AI an input, apply grading logic to its output to measure success
Task / problemOne test with defined inputs and success criteria
TrialA single attempt at a task; run many per task (agents are non-deterministic)
GraderThe logic that scores a task; a task can have several
Transcript / traceThe full record of a run: outputs, tool calls, reasoning, intermediates
OutcomeThe final environment state at the end of the trial — what changed, not how
Eval harnessInfrastructure that runs the evals end-to-end
Agent harness / scaffoldThe loop + tools + prompts that turn a model into an agent — usually the real thing under test
The load-bearing vocabulary from Anthropic’s eval taxonomy ("Demystifying evals for AI agents"). Learn these before writing a single task.

Pick a grader family

Every eval lives or dies on its grader. Anthropic groups graders into three families with sharp trade-offs. Do not pick one — run deterministic graders on every trial, LLM judges on a sample, humans on a small gold-set. Anthropic's "Swiss Cheese Model": "no single evaluation layer catches every issue," so evals run alongside production monitoring, A/B tests, and manual transcript review.

Grader familyHow it scoresStrengthWeakness
Code-basedString match, unit tests, static analysis, env-state checksFast, cheap, objective, reproducibleBrittle — rejects valid variations
Model-based (LLM-as-judge)Rubric scoring, NL assertions, pairwise comparisonFlexible; handles nuance and open-ended outputNon-deterministic, costs tokens, must be calibrated
HumanExpert review, crowdsourcing, spot-checksGold-standard qualityExpensive, slow, hard to scale
The three grader families (Anthropic, "Demystifying evals for AI agents"). Combine all three; do not pick one.

LLM-as-judge: three modes, one rule

When a deterministic check cannot capture "good," reach for an LLM judge in one of three modes, then calibrate it.

  • Pointwise — judge rates one response on a scale (1–5 / 1–10). Simple and fast, but drifts (a model's idea of "a 7" wanders).
  • Pairwise — judge picks the better of two responses. More reliable: relative judgement is an easier task for a model than absolute scoring.
  • Reference-grounded — judge compares the output to a gold reference, explains the delta, scores the gap. Most accurate when a known-correct answer exists.

The non-negotiable rule: calibrate the judge against human judgment. Anthropic: "calibrate LLM graders against human expert judgment." OpenAI makes it concrete: "Start with gpt-5.5 when you need a strong LLM judge, then validate agreement against your human labels." Without that comparison you do not know your numbers.

Grade the outcome, not the path

There are two things you could grade; default to the outcome.

Outcome-based (state-based) grading checks the final environment state: did the tests pass, is the row in the database, does the file contain X. This is the preferred default — Anthropic's guidance is to design graders "emphasizing outcome over process." Trajectory-based (process) grading scores how the agent got there: which tools, in what order, whether policy was followed. It is useful for debugging and policy compliance but dangerous as the primary grade, because "agents regularly find valid approaches that eval designers didn't anticipate."

OpenAI frames the same split as trace grading, where a trace is "the end-to-end record of model calls, tool calls, guardrails, and handoffs for one run." They call it "the fastest way to identify workflow-level issues." Reach for trajectory/trace checks to diagnose and enforce compliance; keep the headline pass/fail on the outcome. On multi-step tasks, award partial credit rather than binary pass/fail.

Two ways to grade the same coding task

Outcome-based (default)

"After the agent runs, the repo’s own test suite must pass and lint must be clean."

Any path that lands in a working state scores a pass. Robust to the agent solving it a way you did not foresee.

Trajectory-based (use sparingly)

"The agent must read config.ts, then call run_tests, then edit auth.ts."

Good for debugging why a run failed or enforcing policy — but it fails a correct fix that skipped a step you assumed was mandatory.

Report both pass@k and pass^k

A single run is not an eval. Run k trials per task and report two numbers; the gap between them is the most important signal in your suite.

  • pass@k = probability at least one of k trials succeeds. Measures capability: 1 − (1−p)^k, rises fast toward 1.
  • pass^k = probability all k trials succeed. Measures reliability: p^k, falls fast toward 0.

Both were introduced for agents by the original τ-bench paper; Anthropic's eval guidance reuses them. For production, pass^k is the honest number.

Agent typeWhat to gradeReference benchmark
CodingRun the repo’s unit tests (deterministic) + an LLM rubric for code qualitySWE-bench Verified, Terminal-Bench
ConversationalState verification + LLM rubric for interaction quality; usually needs a simulated userτ²-bench
ResearchGroundedness + coverage + source-quality checksBrowseComp
Computer-useVerify outcomes via environment-state checksWebArena, OSWorld
What to grade by agent type (Anthropic, "Demystifying evals"). The reference benchmarks describe *models*; your own suite describes *your setup*.
ToolLane / licenseVersionModel / key feature
Inspect AIOSS, MIT (UK AISI + Meridian Labs)0.3.239 (PyPI 2026-06-09)Eval = Task = Dataset + Solver + Scorer; 200+ built-in evals
DeepEvalOSS, Apache-2.0 (Confident AI)4.0.6 (PyPI 2026-06-10)"Pytest for LLM apps": LLMTestCase, G-Eval, DAG + Tool Correctness/Task Completion; gate via deepeval test run
PromptfooOSS, MITcurrent lineDeclarative YAML assertions in promptfooconfig.yaml; 100+ red-team plugins
Braintrust / LangSmith / Langfuse / PhoenixHosted + observability (Langfuse/Phoenix OSS cores)Offline eval + production tracing
OpenAI Datasets + trace gradingHosted (SDK/API)Forward path — the hosted Evals UI is deprecating (see below)
HarborContainerized benchmark/RL infra (Terminal-Bench creators)Standardized task/grader format; runs trials at scale across sandbox providers
Where to run evals in 2026. For most teams, start in the open-source CI lane.

Run the minimal eval loop

Anthropic runs a concrete five-move loop to evaluate tool and prompt setups (from "Writing effective tools for agents"). It starts as a single script over 20–50 real failures — no platform required. Follow the steps, then run the script below.

The evaluation-driven loop (Anthropic, "Writing effective tools for agents")

  1. Generate tasks from real use

    Build tasks from genuine workflows, not toy sandboxes: "Prompts should be inspired by real-world uses and be based on realistic data sources." Strong tasks need multiple tool calls and each pairs with a verifiable outcome (exact-match or an LLM judge).

  2. Run programmatically with direct API calls

    Drive the eval with direct LLM API calls in a simple agentic loop — alternate a model call with tool execution, and have the agent emit reasoning before each tool call so the transcript is legible.

  3. Collect more than accuracy

    Log runtime per tool call and per task, total tool calls, total token consumption, and tool errors. A pass that took 40 tool calls is a different result than a pass that took 4.

  4. Read the raw transcripts

    Numbers hide behavior: "what agents omit in their feedback and responses can often be more important than what they include." This is also how you catch a grader scoring false passes or false fails.

  5. Close the loop with the agent

    Feed transcripts back to the agent — "let agents analyze your results and improve your tools for you." This dovetails with the evaluator-optimizer pattern: one model generates, a second evaluates and gives feedback in a loop, "when we have clear evaluation criteria, and when iterative refinement provides measurable value."

minimal_eval.py — a runnable loop you own (illustrative)
python
"""A 60-line eval harness: tasks + trials + an outcome grader + pass@k/pass^k.
No platform, no SaaS. Swap run_agent() for your own setup's entry point."""
import json
from statistics import mean
 
# 1. Tasks pulled from REAL failures, each with a verifiable outcome.
TASKS = [
    {
        "id": "fix-null-guard",
        "prompt": "Add a null guard to parseConfig() so an empty file "
                  "returns {} instead of throwing.",
        # Outcome grader: deterministic check on the FINAL state, not the path.
        "grade": lambda repo: repo.run("pytest tests/test_config.py").ok,
    },
    {
        "id": "rename-endpoint",
        "prompt": "Rename the /v1/user endpoint to /v1/users and update callers.",
        "grade": lambda repo: repo.run("pytest tests/test_api.py").ok
                              and repo.grep("/v1/user\b") == [],
    },
]
 
K = 5  # trials per task — never trust a single greedy run
 
def evaluate(task):
    """Run K trials, grade the OUTCOME of each, return per-task metrics."""
    results = []
    for trial in range(K):
        repo = fresh_sandbox()          # isolated, clean state every trial
        transcript = run_agent(task["prompt"], repo)   # your agent setup
        passed = task["grade"](repo)
        results.append({"passed": passed,
                        "tool_calls": transcript.tool_calls,
                        "tokens": transcript.tokens,
                        "errors": transcript.tool_errors})
        save_transcript(task["id"], trial, transcript)  # READ these later
    n_pass = sum(r["passed"] for r in results)
    return {
        "task": task["id"],
        "pass_at_k": int(n_pass >= 1),                 # capability
        "pass_pow_k": int(n_pass == K),                # reliability
        "avg_tool_calls": mean(r["tool_calls"] for r in results),
        "avg_tokens": mean(r["tokens"] for r in results),
    }
 
if __name__ == "__main__":
    report = [evaluate(t) for t in TASKS]
    print(json.dumps(report, indent=2))
    print("suite pass@k:",  mean(r["pass_at_k"]  for r in report))
    print("suite pass^k:",  mean(r["pass_pow_k"] for r in report))
Running the loop — capability passes, reliability does not
… scroll to run this session
Both tasks reach pass@k=1 (the agent can do each), but rename-endpoint has pass^k=0 — one of its 5 trials failed. The suite’s pass^k of 0.5 is the number that tells you it is not yet safe unattended; read the saved transcripts for rename-endpoint before shipping.

Anthropic’s 8-step roadmap, condensed for your own setup

  1. Start early

    20–50 simple tasks pulled from real failures, not hypotheticals. A handful of real tasks beats a polished platform with none.

  2. Convert manual testing into formal cases

    Every time you would hand-check something, write it down as a task. Manual checks that never become tasks are checks you will forget to repeat.

  3. Write unambiguous tasks

    Pair each with a reference solution clear enough that "two domain experts would independently reach the same pass/fail verdict."

  4. Build isolated environments

    Clean state between runs — no cross-trial contamination, or one trial’s side effects silently pass or fail the next.

  5. Design thoughtful graders

    Outcome over process, with partial credit on multi-step tasks so you keep the signal binary pass/fail throws away.

  6. Review transcripts regularly

    Verify the grader is fair — catch false fails and false passes by reading the actual runs, not just the scores.

  7. Monitor for saturation

    When a suite hits ~100% pass it has stopped discriminating; that is "too easy," not "done." Add harder, real-failure-derived tasks.

  8. Maintain the suite

    Treat it as living infrastructure and design tasks to be bypass-resistant so the agent cannot game the grader instead of solving the task.

Knowledge check

Your coding agent solves a refactor task on 4 of 5 trials. You want to know whether it is safe to let it run unattended in CI. Which number answers that, and what does it tell you?

Reach the end and this star joins your charted sky.