Overview
Evaluation is how you measure whether your agent actually works. The framework runs your agent over a dataset of tasks, scores each run on multiple properties, and gives you aggregate metrics you can track over time.
One idea ties the whole framework together: a scorer grades the recorded trace of a run — the typed log of what the agent did. That's why the same scorers work whether you produce the trace by running the agent here, or grade a trace captured from production — and why runs persist as data you can compare over time.
Testing vs evaluation#
These sound similar but aren't.
- Testing asserts correctness. A unit test passes or fails. Either the function returned the right value or it didn't.
- Evaluation measures performance. Most agent outputs don't have a single right answer, so you grade them on many properties — did the agent call the right tool, did the final answer mention the right thing, did it stay under a token budget — and watch how the aggregate moves between releases.
The framework supports both. You can wrap eval metrics in pytest assertions ("pass rate must be ≥ 0.95"), but the underlying mechanism — scorers producing structured feedback that rolls up into aggregates — is built for measurement, not assertion.
What you'll write#
Every eval suite has four pieces. That's it.
- A dataset — a suite of tasks with inputs and (optional) expected outputs.
- An agent — an
Agentinstance, or a factory that builds a fresh one per task. - One or more scorers — functions that grade a run.
- A call to run_agent — the framework's entry point.
The rest of these docs cover each piece in depth. Read the quick-start below first.
Two ways in
run_agent runs your agent and grades it. If the traces already exist — captured from production, or produced by another tool — grade them directly with evaluate_traces, no agent run required. It reads both OpenTelemetry GenAI-semconv and OpenInference spans. Same scorers either way; see Runs.
Quick start#
The simplest plausible eval. Two tasks, one custom scorer, one prebuilt.
Output is a printed summary table plus a JSON file under ./runs/.
Run abc123def
Suite: inline (2 tasks, source: inline)
Runs: 2
Duration: 3120ms
Tokens: input=423 output=78 total=501
Pass rates:
called_get_weather 100.0% (2/2)
tool_called[get_weather] 100.0% (2/2)
That's the whole framework in 30 lines. Everything else in these docs is variations on this shape.
Tip
New to evals and this moved fast? The Get started tutorial builds the same thing up one concept at a time.
Determinism for CI#
You don't want your eval suite to depend on a live LLM in pre-merge CI — too slow, too expensive, too flaky. The framework uses TestConfig cassettes to mock the model deterministically:
Same suite, same scorers, no API key required. Output is identical on every machine.
What you get back#
Every run_agent() returns a RunResult and writes a JSON file to store_dir. The result exposes:
result.summary()— a printable text table (above)result.pass_rate("scorer_name")— float between 0 and 1 for boolean scorersresult.score_stats("scorer_name")—mean / p50 / p95 / nfor numeric scorersresult.value_counts("scorer_name")—{label: count}for categorical scorersresult.pass_rate("scorer_name", tag="hard")— any accessor takestag=to slice to one segment (result.tagslists them)result.aggregates— everything togetherresult.tasks— per-task records with fullTrace, feedback list, and budget statusresult.diff(load_run("runs/old.json"))— compare against a prior run;.regressionsgates CI (assert not diff.regressions)
The JSON file is the persistence format. It's schema-versioned ("0.1") and is what a future hosted dashboard will render. See Runs for the full shape.
Where to next#
- Get started — a paced, step-by-step first eval if this quick-start moved too fast.
- Scorers — how to write them, the six prebuilts (including
agent_judgeandfailure_attribution), return-shape rules, exception handling. - Runs —
Suite, therun_agent()signature,RunResultdeep dive,evaluate_tracesfor existing traces,repeats=, andstream=to observe a run live. - Variants & Pairwise — compare builds on a leaderboard, or head-to-head.
- Persistence & tracking — the run JSON, comparing runs for regressions, and tracking an agent across iterations.
What's in scope#
v0 ships offline evaluation (run a curated dataset with run_agent) and grading of existing traces (evaluate_traces over a directory or Grafana Tempo) — so you can already grade captured production traces with the same scorers. Still on the roadmap: continuous online evaluation and a hosted dashboard.