Scorers
A scorer is a function that grades one agent run and produces a structured feedback record. Run it across N tasks and you get N feedback records — which the framework aggregates into pass rates, score distributions, and label counts.
Writing a scorer
Decorate a function with @scorer. The framework calls it once per task; the return value becomes feedback.
That's it. Three things to notice:
- It takes only what it needs. The decorator inspects the signature and injects only the parameters you declared.
- It returns a bool. The framework turns that into a pass rate.
- It's pure. No I/O, no global state. Scorers run concurrently across tasks; impure scorers race.
What you can ask for
A scorer can declare any subset of these five parameters, by name:
| Parameter | Type | What it is |
|---|---|---|
| inputs | dict[str, Any] | The task's input payload, typically {"input": "the user's prompt"}. |
| outputs | dict[str, Any] | The agent's final answer projected from the trace, mirroring the reply API: {"body": <final text>, "content": <typed answer>}. body is the text (like reply.body); content is the parsed value when the answer is JSON (e.g. from a response_schema agent), otherwise the text (like await reply.content()). Read structured fields via outputs["content"]["answer"]. |
| reference_outputs | dict[str, Any] \| None | The task's expected output, if the dataset provided one. |
| trace | Trace | The typed events (model responses, tool calls / results, …) plus tokens, duration, and exception. |
| task | Task | The task record (id, tags, metadata). |
Declare only what you use:
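A minimal sketch of how signature-based injection could work, using only the standard library. The helper `call_with_declared` and both example scorers are hypothetical and written as plain functions so the block runs without the framework; the real `@scorer` decorator may behave differently in detail.

```python
import inspect
from typing import Any, Callable

# Parameter names a scorer may declare, per the table above.
INJECTABLE = ("inputs", "outputs", "reference_outputs", "trace", "task")


def call_with_declared(fn: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Hypothetical helper: call fn with only the parameters its signature names."""
    declared = list(inspect.signature(fn).parameters)
    unknown = [name for name in declared if name not in INJECTABLE]
    if unknown:
        raise TypeError(f"scorer {fn.__name__!r} declares unsupported parameters: {unknown}")
    return fn(**{name: available[name] for name in declared})


def answered(outputs: dict[str, Any]) -> bool:
    # Declares only `outputs`, so that is all it receives.
    return bool(outputs.get("body"))


def echoes_prompt(inputs: dict[str, Any], outputs: dict[str, Any]) -> bool:
    # Declares two parameters; the other three are never passed in.
    return inputs["input"].strip() == outputs["body"].strip()


available = {
    "inputs": {"input": "What is 2 + 2?"},
    "outputs": {"body": "4", "content": "4"},
    "reference_outputs": None,
    "trace": None,  # placeholder; a real run would supply a Trace
    "task": None,   # placeholder; a real run would supply a Task
}

answered_result = call_with_declared(answered, available)
echo_result = call_with_declared(echoes_prompt, available)
```

Rejecting unknown parameter names up front is a design choice in this sketch: a typo such as `output` then fails loudly instead of silently receiving nothing.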
Three return shapes → three aggregation behaviors
The framework routes by return type:
- bool lands in result.pass_rate("scorer_name") as passes / total.
- int / float lands in result.score_stats("scorer_name") as ScoreStats(mean, p50, p95, n).
- str lands in result.value_counts("scorer_name") as {label: count}. Useful for slicing: "of 100 runs, 95 completed, 5 errored".
- None is treated as "skip": no feedback is recorded for this task.
- A Feedback instance or list[Feedback] lets you set the key explicitly, attach a comment, or emit multiple records from one call.
Note
bool is a subclass of int in Python, so the framework checks isinstance(value, bool) first — True always becomes a pass-rate feedback, never a numeric one. On the flip side, returning 1 / 0 (an int) is a numeric score and lands in score_stats, not pass_rate — return True / False for pass/fail.
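The ordering described in the note can be illustrated with a small dispatcher. This is a sketch, not the framework's code: `route` is a hypothetical function that only names the aggregation bucket, and it leaves out the Feedback and list[Feedback] cases.

```python
from typing import Any, Optional


def route(value: Any) -> Optional[str]:
    """Name the aggregation bucket that a scorer's return value would feed."""
    if value is None:
        return None  # skip: no feedback recorded for this task
    if isinstance(value, bool):
        # Checked before int/float because bool is a subclass of int.
        return "pass_rate"
    if isinstance(value, (int, float)):
        return "score_stats"
    if isinstance(value, str):
        return "value_counts"
    raise TypeError(f"unsupported scorer return type: {type(value).__name__}")


routes = {
    "True": route(True),
    "1": route(1),
    "0.75": route(0.75),
    "'completed'": route("completed"),
    "None": route(None),
}
```

Swapping the bool and int checks would send `True` to `score_stats`, which is exactly the mistake the note warns about.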
Reference-based vs reference-free
A load-bearing distinction:
- Reference-based scorers compare what happened to what should have happened. They need reference_outputs from a labelled dataset, so they can't grade arbitrary production traffic that has no gold answer.
- Reference-free scorers judge from the trace alone, so the same code grades run_agent traces, stored traces, and live production traces (via evaluate_traces).
Write scorers reference-free whenever you can — the same code then runs everywhere.
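To make the distinction concrete, here is one scorer of each kind, sketched as plain functions (the `@scorer` decorator is omitted so the block runs without the framework). The 50-word limit and the `"answer"` field in `reference_outputs` are illustrative assumptions, not part of the framework.

```python
from typing import Any, Optional


def answer_is_concise(outputs: dict[str, Any]) -> bool:
    """Reference-free: judges the answer on its own, so it works on any traffic."""
    return len(outputs["body"].split()) <= 50


def answer_matches_gold(
    outputs: dict[str, Any],
    reference_outputs: Optional[dict[str, Any]],
) -> Optional[bool]:
    """Reference-based: needs a gold answer, so it skips when none exists."""
    if reference_outputs is None:
        return None  # no gold answer, nothing to compare against
    # "answer" is a hypothetical field name chosen for this example.
    return outputs["body"].strip() == reference_outputs["answer"].strip()


outputs = {"body": "Paris", "content": "Paris"}

concise = answer_is_concise(outputs)
labelled = answer_matches_gold(outputs, {"answer": "Paris"})
unlabelled = answer_matches_gold(outputs, None)
```

The reference-based scorer returns None on unlabelled traffic instead of raising, so a mixed dataset still produces a clean pass rate over the tasks that do have a gold answer.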
One question per scorer
Don't bundle. A god-scorer like
…tells you nothing when it fails. Three scorers, one each, give you three signals you can trace.
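A sketch of the difference, under stated assumptions: `FakeTrace` is a hypothetical stand-in with two made-up fields and is not the framework's Trace class, and the tool name, city, and token limit are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeTrace:
    """Hypothetical stand-in for Trace; the real class exposes a different API."""
    tool_names: list[str] = field(default_factory=list)
    total_tokens: int = 0


def god_scorer(outputs: dict[str, Any], trace: FakeTrace) -> bool:
    # Three questions folded into one bool: a False does not say which one failed.
    return (
        "get_weather" in trace.tool_names
        and "Paris" in outputs["body"]
        and trace.total_tokens <= 1000
    )


def called_get_weather(trace: FakeTrace) -> bool:
    return "get_weather" in trace.tool_names


def mentions_city(outputs: dict[str, Any]) -> bool:
    return "Paris" in outputs["body"]


def within_token_budget(trace: FakeTrace) -> bool:
    return trace.total_tokens <= 1000


trace = FakeTrace(tool_names=["get_weather"], total_tokens=1800)
outputs = {"body": "It is 18 degrees in Paris."}

bundled = god_scorer(outputs, trace)
signals = {
    "called_get_weather": called_get_weather(trace),
    "mentions_city": mentions_city(outputs),
    "within_token_budget": within_token_budget(trace),
}
```

Here `bundled` is False with no hint as to why, while `signals` shows that only the token budget check failed.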
Prebuilt scorers
Six ship under autogen.beta.eval.scorers. Four are simple, deterministic checks (below); two richer ones — agent_judge and failure_attribution — get their own sections after.
| Scorer | Question | Type | When to use |
|---|---|---|---|
| tool_called(name, *, exactly=None) | Did the agent call this tool? | bool | Most tool-use scenarios. exactly=N for strict count. |
| no_tool_errors() | Were there zero ToolErrorEvents? | bool | Catch tools that exploded. |
| final_answer_matches(field, matcher) | Does the answer match reference_outputs[field]? | bool | Closed-form correctness. Matcher: "exact", "casefold", "contains". |
| token_budget(max_tokens) | Did the run stay under max_tokens total? | bool | Cost discipline as a pass/fail signal. |
Each is a factory — calling it returns a Scorer. Drop them straight into the scorers= list:
Tip
Two distinct tool_called(...) calls produce distinct keys (tool_called[get_weather] vs tool_called[get_news]), so multiple instances coexist in one run.
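The table names three matcher modes for final_answer_matches without spelling out their semantics. The sketch below shows one plausible reading; it is an assumption, not the library's implementation. In particular, "contains" is taken to mean that the answer contains the reference string, and the `MATCHERS` table and `matches` function are hypothetical names.

```python
from typing import Callable

# Assumed semantics for the three matcher names; the library may differ.
MATCHERS: dict[str, Callable[[str, str], bool]] = {
    "exact": lambda answer, expected: answer == expected,
    "casefold": lambda answer, expected: answer.casefold() == expected.casefold(),
    "contains": lambda answer, expected: expected in answer,
}


def matches(answer: str, expected: str, matcher: str = "exact") -> bool:
    """Compare an answer to the reference using the named matcher."""
    try:
        compare = MATCHERS[matcher]
    except KeyError:
        raise ValueError(f"unknown matcher: {matcher!r}") from None
    return compare(answer, expected)


exact_hit = matches("Paris", "paris")
casefold_hit = matches("Paris", "paris", "casefold")
contains_hit = matches("The capital is Paris.", "Paris", "contains")
```

Under these assumptions, "exact" is the strictest mode and "contains" the most forgiving, which is worth keeping in mind when a pass rate looks suspiciously high.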
Agent-as-a-judge: agent_judge
Some properties can't be checked with ==: is the answer helpful? well-reasoned? on-brand? agent_judge hands the answer plus a criterion you write to a judge model, which returns a numeric score. This is a more capable take on an LLM-as-a-judge evaluation.
One judge = one criterion = one column. The score lands in result.score_stats(key), so a list of judges is a multi-dimension scorecard — each criterion scored and aggregated independently. The numeric range defaults to (0.0, 1.0) and is enforced (out-of-range scores are clamped); pass scale=(1, 5) for a Likert range. The judge is an ordinary Agent, so it can be made deterministic using a TestConfig in CI.
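Clamping to the scale can be pictured as follows. This is a sketch of the arithmetic only: `clamp_to_scale` is a hypothetical helper, and how agent_judge applies the range internally is not shown in this document.

```python
def clamp_to_scale(raw: float, scale: tuple[float, float] = (0.0, 1.0)) -> float:
    """Pull an out-of-range judge score back to the nearest bound of the scale."""
    low, high = scale
    if low > high:
        raise ValueError(f"invalid scale: {scale!r}")
    return max(low, min(high, raw))


default_high = clamp_to_scale(1.4)             # above the default range
default_low = clamp_to_scale(-0.2)             # below the default range
likert_high = clamp_to_scale(7, scale=(1, 5))  # above a Likert range
likert_ok = clamp_to_scale(3, scale=(1, 5))    # already in range, unchanged
```

Clamping keeps one stray judge reply from distorting the mean and percentiles in score_stats, at the cost of hiding how far out of range the raw value was.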
Failure attribution: failure_attribution
When a task fails, why? failure_attribution labels each run with a failure mode (a str, so it rolls up in result.value_counts(key)):
Out of the box it runs deterministic detectors (a crash/exception, no final answer, a tool error) — no model needed. Pass a config to add an LLM attributor that classifies the subtler, semantic failures the detectors can't see (wrong answer, gave up, looped):
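The deterministic detectors can be thought of as an ordered series of checks. The sketch below is an illustration under assumptions: `RunSummary` is a hypothetical flattened view of a run rather than the framework's Trace, and both the label strings and the check order are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    """Hypothetical flattened view of one run; not the framework's Trace."""
    exception: Optional[str] = None
    final_answer: Optional[str] = None
    tool_errors: int = 0


def attribute_failure(run: RunSummary) -> Optional[str]:
    """Return a failure label, or None when no deterministic detector fires."""
    if run.exception is not None:
        return "exception"
    if not run.final_answer:
        return "no_final_answer"
    if run.tool_errors > 0:
        return "tool_error"
    return None


runs = [
    RunSummary(final_answer="All good."),
    RunSummary(exception="TimeoutError"),
    RunSummary(final_answer="Partial result.", tool_errors=2),
    RunSummary(),
]

labels = [attribute_failure(run) for run in runs]
# Roll the string labels up the same way value_counts does: {label: count}.
counts = Counter(label for label in labels if label is not None)
```

Checking the crash first means a run that both raised and produced no answer is counted once, under the more specific cause.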
Exception handling
A scorer that raises does NOT fail the run. The framework catches the exception, records a Feedback with score=None and a comment explaining what blew up, logs a warning, and moves on:
If the trace had no final answer (outputs has no "body") this raises KeyError. The feedback for that task becomes:
This is by design — one broken scorer should never kill an eval over 100 tasks. But it also means you should treat score=None as a signal worth investigating, not noise.
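A minimal sketch of the catch-and-record behaviour described above. `FeedbackRecord` and `safe_score` are hypothetical stand-ins written for this example; the real Feedback class and the exact wording of the recorded comment may differ.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Optional

logger = logging.getLogger("eval.sketch")


@dataclass
class FeedbackRecord:
    """Hypothetical stand-in for Feedback."""
    key: str
    score: Optional[Any]
    comment: Optional[str] = None


def safe_score(
    key: str,
    fn: Callable[[dict[str, Any]], Any],
    outputs: dict[str, Any],
) -> FeedbackRecord:
    """Run one scorer; on failure, record score=None instead of propagating."""
    try:
        return FeedbackRecord(key=key, score=fn(outputs))
    except Exception as exc:
        logger.warning("scorer %s raised %r", key, exc)
        return FeedbackRecord(key=key, score=None, comment=f"{type(exc).__name__}: {exc}")


def answer_is_long(outputs: dict[str, Any]) -> bool:
    # Raises KeyError when the run produced no final answer.
    return len(outputs["body"]) > 20


ok = safe_score("answer_is_long", answer_is_long, {"body": "A sufficiently long final answer."})
failed = safe_score("answer_is_long", answer_is_long, {})
```

Counting the records where `score is None` and `comment` is set is a quick way to find scorers that are silently broken on part of a dataset.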
Tips for good scorers
- Be specific. "Did the agent call get_weather?" beats "Was the output good?"
- Be cheap. Scorers run for every task. No API calls inside a scorer.
- Prefer structured fields over text. trace.events_of(ToolCallEvent, name="get_weather") beats regex-matching outputs["body"].
- Don't share state. Two scorers grade independently. No mutable globals.
- Return None when you can't grade. A scorer that doesn't apply to a task should skip rather than fail.
Sync vs async
Either works. The framework detects async with inspect.iscoroutinefunction and awaits accordingly:
Most scorers consume an in-memory Trace and stay sync; async is there for scorers that call out — agent_judge, for instance, runs its judge model under the hood.
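The dispatch can be sketched with the standard library alone. `run_scorer` is a hypothetical helper, and `slow_judge` only simulates a call to a judge model with a zero-length sleep.

```python
import asyncio
import inspect
from typing import Any, Callable


async def run_scorer(fn: Callable[..., Any], **kwargs: Any) -> Any:
    """Await coroutine functions; call ordinary functions directly."""
    if inspect.iscoroutinefunction(fn):
        return await fn(**kwargs)
    return fn(**kwargs)


def has_body(outputs: dict[str, Any]) -> bool:
    # Sync scorer: works on data already in memory.
    return bool(outputs.get("body"))


async def slow_judge(outputs: dict[str, Any]) -> float:
    # Async scorer: the sleep stands in for a call to a judge model.
    await asyncio.sleep(0)
    return 0.5


async def main() -> tuple[Any, Any]:
    outputs = {"body": "hello"}
    sync_result = await run_scorer(has_body, outputs=outputs)
    async_result = await run_scorer(slow_judge, outputs=outputs)
    return sync_result, async_result


results = asyncio.run(main())
```

Because the check is made on the function rather than on its return value, a sync function that happens to return an awaitable would not be awaited in this sketch.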
Custom keys with Scorer directly
If you want a key independent of the function name, construct a Scorer yourself:
The prebuilts use this pattern internally — tool_called("get_weather") returns a Scorer with key="tool_called[get_weather]".
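The factory pattern can be illustrated with a stand-in class. Everything here is hypothetical: `KeyedScorer` is not the framework's Scorer, its constructor arguments are assumptions, and the inner check takes a plain list of tool names instead of a Trace so that the block runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class KeyedScorer:
    """Hypothetical stand-in for Scorer: a function plus the key its feedback uses."""
    fn: Callable[..., Any]
    key: str


def tool_called_sketch(name: str) -> KeyedScorer:
    """Factory: each call returns a scorer whose key embeds the tool name."""

    def check(tool_names: list[str]) -> bool:
        return name in tool_names

    return KeyedScorer(fn=check, key=f"tool_called[{name}]")


weather = tool_called_sketch("get_weather")
news = tool_called_sketch("get_news")

called = ["get_weather"]
results = {scorer.key: scorer.fn(called) for scorer in (weather, news)}
```

Embedding the argument in the key is what lets two instances of the same factory report side by side instead of overwriting each other.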
Where to next
- Runs: the Suite API, the run_agent() signature, RunResult aggregation methods, and the persistence format.