Get started
A hands-on walk-through: by the end you'll have run a real evaluation and understood every line. We build it up one concept at a time — no prior eval experience assumed.
What an evaluation is
An evaluation is a repeatable test for an agent. You give it a suite of tasks, let your agent answer each one, then run scorers that grade the answers. The result is a scorecard you can track as your agent changes.
It's like unit testing, with one difference: most agent answers don't have a single right value, so instead of one pass/fail you grade several properties — did it call the right tool? is the answer correct? did it stay under budget? — and watch how the aggregate moves between versions.
Four pieces, that's all: a dataset, your agent, one or more scorers, and a call to run_agent.
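The four pieces can be sketched in plain Python to show how they fit together. This is a toy illustration under stated assumptions, not the library's implementation: `toy_agent`, `answer_contains_reference`, and `run_suite` are hypothetical names, the "japan" task is made up for the example, and no model is called.

```python
# Piece 1: a dataset. Each task has an id, inputs, and a gold answer.
dataset = [
    {"task_id": "france", "inputs": {"input": "Capital of France?"},
     "reference_outputs": {"answer": "Paris"}},
    {"task_id": "japan", "inputs": {"input": "Capital of Japan?"},
     "reference_outputs": {"answer": "Tokyo"}},
]


# Piece 2: an agent. This stand-in looks answers up instead of calling a model.
def toy_agent(prompt: str) -> str:
    known = {"France": "Paris", "Japan": "Tokyo"}
    for country, capital in known.items():
        if country in prompt:
            return f"The capital is {capital}."
    return "I don't know."


# Piece 3: a scorer. It asks one yes/no question about one answer.
def answer_contains_reference(answer: str, reference: dict) -> bool:
    return reference["answer"] in answer


# Piece 4: the runner. It answers every task, grades it, and aggregates.
def run_suite(tasks, agent, scorers) -> dict:
    passes = {s.__name__: 0 for s in scorers}
    for task in tasks:
        answer = agent(task["inputs"]["input"])
        for s in scorers:
            passes[s.__name__] += bool(s(answer, task["reference_outputs"]))
    return {name: count / len(tasks) for name, count in passes.items()}


scorecard = run_suite(dataset, toy_agent, [answer_contains_reference])
print(scorecard)  # {'answer_contains_reference': 1.0}
```

The scorecard is keyed by scorer name, which is the shape the rest of this walk-through relies on: one column per question asked.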
Step 1: your first run
Start as small as possible: two questions, an agent, one check. This calls a real model, so set a key first (export OPENAI_API_KEY=...); Step 5 below shows how to run without one.
Run it and you get a scorecard:
Run a1b2c3d4
Suite: inline (2 tasks, source: inline)
Runs: 2
Pass rates:
final_answer_matches 100.0% (2/2)
That's a complete evaluation. The next steps unpack each piece.
Step 2: read the result
run_agent returns a RunResult and saves a JSON file under store_dir. A few accessors:
Every scorer becomes a column, looked up by its key (here "final_answer_matches"). Boolean scorers give a pass rate; numeric ones give score_stats (mean / p50 / p95); categorical ones give value_counts.
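To make the three kinds of aggregate concrete, here is a minimal sketch of how one column of scorer values could be summarized by type. `summarize` is a hypothetical helper, not part of the library, and the nearest-rank percentile used here is an assumption; the library may compute p50 and p95 differently.

```python
import math
import statistics
from collections import Counter


def summarize(values):
    """Aggregate one scorer column according to the type of its values."""
    if not values:
        raise ValueError("cannot summarize an empty column")

    # Check bool first: in Python, bool is a subclass of int.
    if all(isinstance(v, bool) for v in values):
        passed = sum(values)
        return {"pass_rate": passed / len(values),
                "passed": passed, "total": len(values)}

    if all(isinstance(v, (int, float)) for v in values):
        ordered = sorted(values)

        def pct(p):
            # Nearest-rank percentile (an assumption, see the note above).
            rank = max(1, math.ceil(p / 100 * len(ordered)))
            return ordered[rank - 1]

        return {"mean": statistics.fmean(ordered),
                "p50": pct(50), "p95": pct(95)}

    # Anything else is treated as a category label.
    return {"value_counts": dict(Counter(values))}


print(summarize([True, True]))
print(summarize([1.0, 2.0, 3.0, 4.0]))
print(summarize(["short", "long", "short"]))
```

The boolean case reproduces the "100.0% (2/2)" line of the scorecard above: a pass rate plus the raw counts behind it.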
Step 3: the expected answer (reference_outputs)
Notice each task carried a reference_outputs — the gold answer, the thing you grade against:
{"task_id": "france", "inputs": {"input": "Capital of France?"}, "reference_outputs": {"answer": "Paris"}}
It's a small labelled record, not a bare string, so a scorer can pick out the field it cares about: final_answer_matches(field="answer") reads reference_outputs["answer"] and compares it to the agent's answer. inputs is what goes in (the prompt lives under "input"); reference_outputs is what should come out. Tasks graded purely from the trace ("did it call the tool?") don't need one.
Matchers
matcher="contains" passes if the gold value appears anywhere in the answer — right for free-text replies like "The capital is Paris." Use "casefold" or "exact" when the answer should be exactly the value.
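The difference between the three matchers is easiest to see side by side. The function below is a rough sketch, not the library's code: `matches` is a hypothetical name, and details such as whether "contains" is case-sensitive or whether whitespace is trimmed are assumptions made for the example.

```python
def matches(answer: str, gold: str, matcher: str = "contains") -> bool:
    """Compare an agent's answer to the gold value using the named matcher."""
    if matcher == "contains":
        # Gold value may appear anywhere in a free-text reply.
        return gold in answer
    if matcher == "casefold":
        # Whole answer must equal the gold value, ignoring case.
        return answer.casefold() == gold.casefold()
    if matcher == "exact":
        # Whole answer must equal the gold value character for character.
        return answer == gold
    raise ValueError(f"unknown matcher: {matcher!r}")


task = {"task_id": "france",
        "inputs": {"input": "Capital of France?"},
        "reference_outputs": {"answer": "Paris"}}
gold = task["reference_outputs"]["answer"]  # the field the scorer picks out

print(matches("The capital is Paris.", gold, "contains"))  # True
print(matches("paris", gold, "casefold"))                  # True
print(matches("The capital is Paris.", gold, "exact"))     # False
```

A chatty agent passes "contains" but fails "exact", which is why the matcher should follow the answer format you expect from the agent.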
Step 4: ask more questions (scorers)
One check is rarely enough. A scorer is just a function that asks one question about a run. Mix the prebuilt ones with your own — a function decorated with @scorer that declares what it needs by name (trace, outputs, reference_outputs, …; here just outputs, the final answer):
Keep each scorer to a single, specific question — three small checks tell you what broke when one fails; one big check just says "something's wrong." See Scorers for the prebuilts (including the agent_judge LLM scorer) and the return-type rules.
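One way to picture "declares what it needs by name" is a dispatcher that inspects each function's parameters and passes only those values. The sketch below uses the standard library's `inspect.signature` and does not use the library's @scorer decorator; `call_with_named_args`, the three scorer functions, and the shape of `trace` are all hypothetical.

```python
import inspect


def call_with_named_args(fn, available: dict):
    """Call fn with only the run values whose names appear in its signature."""
    wanted = inspect.signature(fn).parameters
    missing = [name for name in wanted if name not in available]
    if missing:
        raise TypeError(f"{fn.__name__} asks for unknown values: {missing}")
    return fn(**{name: available[name] for name in wanted})


# Three small scorers, each asking a single question.
def answer_is_short(outputs):
    return len(outputs) <= 80


def called_search_tool(trace):
    return any(step.get("tool") == "search" for step in trace)


def mentions_gold(outputs, reference_outputs):
    return reference_outputs["answer"] in outputs


# Everything recorded about one run (shapes assumed for illustration).
run = {
    "outputs": "The capital is Paris.",
    "reference_outputs": {"answer": "Paris"},
    "trace": [{"tool": "search", "query": "capital of France"}],
}

results = {
    fn.__name__: call_with_named_args(fn, run)
    for fn in (answer_is_short, called_search_tool, mentions_gold)
}
print(results)
```

Because each function answers one question, a failing column names the property that broke: a False under `called_search_tool` points at tool use, not at the answer text.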
Step 5: run it with no API key (for CI)
You don't want pre-merge CI calling a live model — slow, costly, flaky. Swap the model for a TestConfig cassette: a canned reply per task, so the run is deterministic and free.
Two things changed. The agent is now a factory (build) so the runner can hand each task its own config, and model_config supplies the canned replies keyed by task_id. (A single fixed config can stay a plain instance, as in Step 1; per-task configs need the factory.) Same suite, same scorers, identical output on every machine — wrap result.pass_rate(...) in a pytest assertion and you have a CI gate.
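The factory-plus-canned-replies idea can be sketched without the library. Everything here is hypothetical (`CANNED_REPLIES`, `build`, `pass_rate`, the "japan" task, and the config shape) and stands in for TestConfig and run_agent only conceptually; the point is that each task gets its own config and nothing touches the network.

```python
# Canned replies keyed by task_id, playing the role of a cassette.
CANNED_REPLIES = {
    "france": "The capital is Paris.",
    "japan": "The capital is Tokyo.",
}

TASKS = [
    {"task_id": "france", "inputs": {"input": "Capital of France?"},
     "reference_outputs": {"answer": "Paris"}},
    {"task_id": "japan", "inputs": {"input": "Capital of Japan?"},
     "reference_outputs": {"answer": "Tokyo"}},
]


def build(config: dict):
    """Factory: return a fresh agent bound to one task's config."""
    def agent(prompt: str) -> str:
        # Replays the canned reply instead of calling a model.
        return config["reply"]
    return agent


def pass_rate(tasks, configs_by_task) -> float:
    passed = 0
    for task in tasks:
        agent = build(configs_by_task[task["task_id"]])  # per-task config
        answer = agent(task["inputs"]["input"])
        passed += task["reference_outputs"]["answer"] in answer
    return passed / len(tasks)


def test_capitals_pass_rate():
    """A pytest-style gate: fails the build if the pass rate drops."""
    configs = {tid: {"reply": reply} for tid, reply in CANNED_REPLIES.items()}
    assert pass_rate(TASKS, configs) == 1.0


test_capitals_pass_rate()
```

Since the replies are fixed, the test gives the same result on every machine, so a failure means the scorers or the agent wiring changed rather than the model's mood.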
Where to next
You've seen the whole loop. From here:
- Scorers: the six prebuilts, writing custom scorers, and agent_judge for grading subjective quality.
- Runs: datasets and the RunResult in depth, grading existing traces (evaluate_traces), and observing a run live.
- Variants & Pairwise: compare builds on a leaderboard, or head-to-head.
- Persistence & tracking: save runs and compare them over time to catch regressions and confirm improvements.