Runs
This page covers the core of the offline eval pipeline: building a Suite of tasks, calling run_agent(), reading the RunResult, and grading traces that already exist. (Saving runs and comparing them over time has its own page — Persistence & tracking.)
The trace is the foundation
Grading is a pure function of one thing: a Trace — the typed record of what the agent did on a task (model responses, tool calls, tool results, human input) plus token usage, duration, and any exception. Each of those steps is emitted as an OpenTelemetry span while the agent runs, and the framework reconstructs the Trace from those spans. Scorers only ever read the Trace.
It's the same reconstruction no matter where the spans came from — a run_agent call here, a folder of saved traces, or live production telemetry in Grafana Tempo. That's exactly what lets the framework's two halves — producing a trace and grading one — share one code path and the same scorers.
Three concrete consequences:
- Multi-turn just works. A reply.ask(...) continuation re-enters the agent loop, so its spans land in the same trace. Scorers see the whole conversation, not just the first turn.
- Offline vs online is just where the trace came from. Offline = run a curated dataset with run_agent. Online = grade traces captured from production with evaluate_traces (e.g. from Grafana Tempo). Same scorers, different source.
- No special primitives for trajectory scoring. Assembly-policy correctness, compaction faithfulness, sub-task tree quality: all are event-pattern questions, answered by the events already in the trace.
You don't manage any of this: run_agent attaches a telemetry middleware, collects the spans the agent emits, and reconstructs the Trace for you. But the frame matters when you write a custom scorer — you're always reading a Trace, never a live stream.
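To make that frame concrete, here is a minimal sketch of a scorer written as a pure function over a finished record. The Trace and ToolCall dataclasses below are hypothetical stand-ins, not the framework's actual types; only the idea (a scorer reads a completed trace and returns a verdict) comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical stand-in for one tool-call event in a trace."""
    name: str


@dataclass(frozen=True)
class Trace:
    """Hypothetical stand-in for the finished record of one task."""
    events: tuple = ()
    total_tokens: int = 0
    exception: Optional[str] = None


def called_tool_once(trace: Trace, tool_name: str) -> bool:
    """Pure scorer: the verdict depends only on the finished trace."""
    calls = [e for e in trace.events if isinstance(e, ToolCall) and e.name == tool_name]
    return len(calls) == 1


trace = Trace(events=(ToolCall("get_weather"),), total_tokens=120)
print(called_tool_once(trace, "get_weather"))  # True
```

Because the function takes nothing but the trace, it would give the same answer whether the trace came from a fresh run, a saved file, or production telemetry.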
Datasets — the Suite
A Suite is an immutable collection of Task records. Build one from a JSONL file or inline.
From JSONL (recommended)
Each line in the file is a JSON object:
{"task_id": "weather-001", "inputs": {"input": "What's the weather in Tokyo?"}, "reference_outputs": {"city": "Tokyo"}, "tags": ["happy-path"]}
{"task_id": "weather-002", "inputs": {"input": "Weather in Paris?"}, "reference_outputs": {"city": "Paris"}}
Fields:
| Field | Required | What it is |
|---|---|---|
| task_id | optional | Stable identifier. If absent, auto-filled as task-0000, task-0001, …. |
| inputs | required | Dict containing at least "input", the prompt passed to agent.ask(...). |
| reference_outputs | optional | Expected output for reference-based scorers. |
| tags | optional | List of labels. Slice results by them: result.pass_rate(key, tag="happy-path"). |
| metadata | optional | Free-form dict. Surfaces in the persisted run JSON. |
Tip
JSONL is the canonical format because it's grep-friendly, line-diffable, and HuggingFace-compatible. Blank lines are skipped; malformed lines raise with the line number for fast diagnosis.
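The loading rules described above (skip blank lines, report malformed lines by number, auto-fill a missing task_id) can be sketched in plain Python. This is an illustration, not the framework's loader: load_tasks is a hypothetical helper, and numbering auto-filled IDs by position among the parsed records is an assumption.

```python
import json


def load_tasks(lines):
    """Parse JSONL task records, skipping blanks and reporting bad lines by number."""
    tasks = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # blank lines are skipped
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: malformed JSON ({exc.msg})") from exc
        inputs = record.get("inputs") if isinstance(record, dict) else None
        if not isinstance(inputs, dict) or "input" not in inputs:
            raise ValueError(f"line {lineno}: 'inputs' must contain 'input'")
        # Assumption: auto-filled IDs count parsed records, starting at task-0000.
        record.setdefault("task_id", f"task-{len(tasks):04d}")
        tasks.append(record)
    return tasks


sample = [
    '{"task_id": "weather-001", "inputs": {"input": "Weather in Tokyo?"}}',
    "",
    '{"inputs": {"input": "Weather in Paris?"}}',
]
tasks = load_tasks(sample)
print([t["task_id"] for t in tasks])  # ['weather-001', 'task-0001']
```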
From a list (for quick experimentation)
Same shape as JSONL lines. Use this in notebooks or for ad-hoc suites that don't deserve a file yet.
The agent — an instance or a factory
run_agent(agent=...) accepts either a built agent instance or a factory that builds one:
- An instance is the simplest path: agent=my_agent. The runner reuses it across tasks, but each task gets a fresh stream, so conversation history never leaks. Fine for any stateless agent (i.e. most agents).
- A factory (a callable returning a fresh agent) is what you want when you need a clean instance per task: for per-task model_config injection (cassette-based CI), or for an agent that carries mutable instance state (e.g. a knowledge bootstrap). The runner calls it once per task.
The config= keyword is what the runner uses to inject either a live ModelConfig (when you pass model_config= to run_agent()) or a TestConfig cassette for deterministic CI. A factory without a config= parameter still works — the runner warns and falls through to the factory's default.
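One way a runner could implement that fallback is to inspect the factory's signature before calling it. The sketch below is a guess at the mechanism, not the framework's code: build_agent and both factories are hypothetical, and the agents are plain dicts so the example runs on its own.

```python
import inspect
import warnings


def build_agent(factory, config=None):
    """Call factory(config=...) when it accepts config; otherwise warn and use its default."""
    params = inspect.signature(factory).parameters
    accepts_config = "config" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if config is not None and accepts_config:
        return factory(config=config)
    if config is not None:
        warnings.warn("factory has no config= parameter; using its default config")
    return factory()


def factory_with_config(config="default"):
    return {"config": config}  # stand-in for a real agent


def factory_without_config():
    return {"config": "default"}  # stand-in for a real agent


print(build_agent(factory_with_config, config="cassette"))  # {'config': 'cassette'}
```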
Multi-agent flows go through evaluate_traces
run_agent is for an ask-shaped target — an Agent (or a factory of one). Multi-agent / network / workflow flows aren't driven by a single ask(prompt), so they don't go through run_agent. Run them however they run — they emit OpenTelemetry spans — and grade the reconstructed trace with evaluate_traces. Grading is decoupled from production via the Trace, so the same scorers work either way.
Running
model_config modes
The same parameter is overloaded three ways:
| Value | Behavior |
|---|---|
| None (default) | Factory's default config wins. |
| A single ModelConfig | Same config for every task. |
| A dict[task_id, ModelConfig] | Per-task config. Standard pattern for cassette-based CI. |
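The three modes amount to a small dispatch on the argument's type. A minimal sketch, assuming a hypothetical resolve_config helper and using strings in place of real ModelConfig objects; what happens when a task_id is missing from the dict is not stated on this page, so falling back to the factory default here is an assumption.

```python
def resolve_config(model_config, task_id):
    """Pick the config for one task from the three accepted shapes."""
    if model_config is None:
        return None  # the factory's default config wins
    if isinstance(model_config, dict):
        # Assumption: a task_id missing from the dict falls back to the default.
        return model_config.get(task_id)
    return model_config  # one config shared by every task


per_task = {"weather-001": "cassette-a", "weather-002": "cassette-b"}
print(resolve_config(per_task, "weather-002"))  # cassette-b
```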
Repeats — consistency
repeats=N runs each task N times — the simplest way to ask "does my agent do this consistently?". The per-key pass_rate / score_stats pool across every run (so 8 of 10 passing shows as 80%), and each run gets a distinct task_id suffix ("weather-001#1", "weather-001#2", …).
result = await run_agent(suite, agent=build_weather_agent, scorers=[...], store_dir="runs", repeats=10)
(At temperature=0 consistency is near-trivial; repeats earns its keep when there's real nondeterminism.)
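The pooling arithmetic is simple enough to check by hand. The sketch below uses two hypothetical helpers, expand_repeats and pooled_pass_rate, to mimic the suffixing and pooling described above; whether a suffix is also added when repeats is 1 is not covered here.

```python
def expand_repeats(task_ids, repeats):
    """Give every repeat of a task its own ID: weather-001#1, weather-001#2, ..."""
    return [f"{tid}#{i}" for tid in task_ids for i in range(1, repeats + 1)]


def pooled_pass_rate(outcomes):
    """Pool pass/fail outcomes across every run, regardless of which task they belong to."""
    return sum(outcomes.values()) / len(outcomes)


run_ids = expand_repeats(["weather-001"], repeats=10)
# Pretend the first 8 runs passed and the last 2 failed.
outcomes = {rid: index < 8 for index, rid in enumerate(run_ids)}
print(run_ids[:2], pooled_pass_rate(outcomes))  # ['weather-001#1', 'weather-001#2'] 0.8
```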
Labels
label="weather-eval" stamps a user-defined identifier on the run, recorded at the top of the run JSON. Unlike run_id (unique per run), a label is meant to be shared across runs of the same eval — so a sequence of runs can be grouped and trended over time. The framework never fills it in.
Concurrency
Tasks run in parallel up to concurrency, bounded by an asyncio.Semaphore. Default is 4. Raise it for I/O-bound suites against fast models; lower it when you're rate-limited.
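The semaphore pattern named above looks roughly like this in plain asyncio. It is a sketch of the general technique, not the runner's source: run_all and fake_task are hypothetical, and the sleep stands in for a model call.

```python
import asyncio


async def run_all(tasks, run_one, concurrency=4):
    """Run run_one(task) for every task with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(task):
        async with semaphore:  # waits here once `concurrency` tasks are running
            return await run_one(task)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(t) for t in tasks))


stats = {"in_flight": 0, "peak": 0}


async def fake_task(task_id):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for I/O-bound model latency
    stats["in_flight"] -= 1
    return task_id


results = asyncio.run(run_all(range(10), fake_task, concurrency=4))
print(results, stats["peak"])
```

The recorded peak never exceeds the limit, which is why lowering concurrency is an effective response to rate limiting.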
Budgets
BudgetThresholds records violations on each task's budget_violation flag but never aborts a task that goes over. The aggregate count surfaces in result.aggregates.budget_violations — useful as a CI regression signal ("zero tasks may exceed budget").
Warning
Budgets are observational in v0. If you need a hard kill switch, use observers (TokenMonitor with AlertPolicy(severity=FATAL)) — the agent halts via a HaltEvent at runtime.
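"Observational" means the check runs after the fact and only sets a flag. A minimal sketch of that behavior, assuming a hypothetical TaskRecord and a single max_tokens threshold (the real BudgetThresholds fields are not listed on this page):

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task record; only the budget_violation flag mirrors the text."""
    task_id: str
    total_tokens: int
    budget_violation: bool = False


def flag_budget_violations(records, max_tokens):
    """Mark tasks that went over budget and return the aggregate count. Nothing is aborted."""
    for record in records:
        record.budget_violation = record.total_tokens > max_tokens
    return sum(record.budget_violation for record in records)


records = [TaskRecord("weather-001", 340), TaskRecord("weather-002", 1250)]
violations = flag_budget_violations(records, max_tokens=1000)
print(violations)  # 1
```

In CI, the "zero tasks may exceed budget" rule then becomes a single assertion on that count.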
What you get back — RunResult
summary()
Returns a multi-line string suitable for a CI log:
Run 25be826dc1a94a4b9d50a4f94449139e
Schema: 0.1
Created: 2026-05-11T01:38:04.157919+00:00
Duration: 5292ms
Suite: dataset (5 tasks, source: eval/dataset.jsonl)
Runs: 5
Concurrency: 4
Errors: 0
Budget violations: 0
Tokens: input=1544 output=174 total=1718
Pass rates:
called_get_weather_once 100.0% (5/5)
final_answer_matches 100.0% (5/5)
no_tool_errors 100.0% (5/5)
token_budget 100.0% (5/5)
tool_called[get_weather] 100.0% (5/5)
Score stats:
extra_tool_calls mean=0.00 p50=0.00 p95=0.00 n=5
Value counts:
termination_reason completed=5
Grading existing traces — evaluate_traces
run_agent is the produce-and-grade path: it runs your agent, then grades the trace. evaluate_traces is the grade-only path — for traces that already exist, captured elsewhere. The grading is identical; only the source differs.
A TraceSource is anything that yields traces. Three ship:
| Source | Reads from |
|---|---|
| InMemoryTraceSource | traces you already hold in memory |
| DirectoryTraceSource | a folder of saved trace files (save_trace writes them) |
| TempoTraceSource | Grafana Tempo over OTLP, for grading real production telemetry |
Because grading depends only on the reconstructed Trace, the same scorers work whether the trace came from run_agent, a directory, or production.
Reconstruction understands two span dialects, auto-detected per trace, so traces from a range of tools and frameworks grade unchanged:
- the OpenTelemetry GenAI semantic conventions: gen_ai.* spans, what AG2's own TelemetryMiddleware emits, and
- OpenInference: openinference.span.kind + llm.* / tool.*, emitted by the Arize/Phoenix instrumentors.
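Per-trace auto-detection can be as simple as looking at which attribute keys the spans carry. The sketch below shows one plausible heuristic, not the framework's actual detection logic: detect_dialect is hypothetical, and spans are reduced to plain attribute dicts.

```python
def detect_dialect(span_attributes):
    """Classify a trace by the attribute keys found on its spans."""
    for attrs in span_attributes:
        if "openinference.span.kind" in attrs:
            return "openinference"
        if any(key.startswith("gen_ai.") for key in attrs):
            return "gen_ai"
    return "unknown"


genai_trace = [{"gen_ai.operation.name": "chat", "gen_ai.request.model": "some-model"}]
openinference_trace = [{"openinference.span.kind": "LLM", "llm.model_name": "some-model"}]

print(detect_dialect(genai_trace), detect_dialect(openinference_trace))  # gen_ai openinference
```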
Observing a run
Just like agent.ask(stream=...), run_agent, run_variants, and run_pairwise all accept a stream. Pass one and the runner publishes eval lifecycle events to it as the run unfolds — so you observe an evaluation with the same machinery you use on an agent (subscribe, where, the watch system, persistent backends). Nothing is printed for you; you attach the observer and render however you like.
The events live in autogen.beta.eval.events and are all transient — observational only, since the durable record of a run is its persisted JSON:
| Event | Emitted by | When |
|---|---|---|
| EvalStarted | run_agent | a run begins |
| TaskEvaluated | run_agent | each task finishes (carries its feedback) |
| EvalCompleted | run_agent | a run finishes (carries the RunResult) |
| VariantStarted / VariantCompleted | run_variants | around each variant (carries the variant's RunResult) |
| PairwiseStarted / PairwiseCompared / PairwiseCompleted | run_pairwise | around the run and each head-to-head comparison |
Ready-made console output
For a quick console view without writing a callback, subscribe the built-in console_reporter (from autogen.beta.eval): stream.subscribe(console_reporter). It's just an opt-in observer — no run_agent() flag — so swap in your own callback whenever you want different output.
Exceptions don't abort the run
The framework catches exceptions at three levels and records them on the task instead of aborting:
| Level | What happens |
|---|---|
| Target factory raises | Task gets trace.exception = <the error>, no events. Other tasks continue. |
| agent.ask() raises | Same: exception lands on trace.exception, other tasks continue. |
| A scorer raises | The scorer's feedback becomes Feedback(score=None, comment="scorer raised: ..."). Other scorers run normally. |
The aggregate count of agent-level errors surfaces in result.aggregates.errors. Treat it as a CI signal — a healthy suite has zero.
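The three containment levels can be sketched with ordinary try/except blocks. Everything here is hypothetical (run_task, the dict-shaped record, the toy factories and scorers); it also assumes scorers still run after an agent-level failure, which this page does not state explicitly.

```python
def run_task(factory, prompt, scorers):
    """Run one task, recording failures on the record instead of raising."""
    record = {"exception": None, "feedback": {}}
    output = None
    try:
        agent = factory()       # level 1: the factory may raise
        output = agent(prompt)  # level 2: the agent call may raise
    except Exception as exc:
        record["exception"] = exc
    for name, scorer in scorers.items():
        try:
            record["feedback"][name] = {"score": scorer(output), "comment": None}
        except Exception as exc:  # level 3: one scorer failing does not stop the others
            record["feedback"][name] = {"score": None, "comment": f"scorer raised: {exc}"}
    return record


def broken_factory():
    raise RuntimeError("no API key")


def echo_factory():
    return lambda prompt: prompt.upper()


scorers = {"non_empty": lambda out: bool(out), "flaky": lambda out: 1 / 0}
records = [run_task(f, "hi", scorers) for f in (broken_factory, echo_factory)]
errors = sum(r["exception"] is not None for r in records)
print(errors)  # 1
```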
Where to next
- Scorers: what goes into scorers=[...].
- Variants: run several builds over the suite and rank them on a leaderboard.
- Pairwise: judge two builds head-to-head.
- Persistence & tracking: compare runs over time, catch regressions, and track an agent across iterations.
- Testing: TestConfig cassettes for deterministic CI runs.
- Observers: runtime safety guards (token budgets that actually halt, loop detection).