Runs

This page covers the core of the offline eval pipeline: building a Suite of tasks, calling run_agent(), reading the RunResult, and grading traces that already exist. (Saving runs and comparing them over time has its own page — Persistence & tracking.)

The trace is the foundation#

Grading is a pure function of one thing: a Trace — the typed record of what the agent did on a task (model responses, tool calls, tool results, human input) plus token usage, duration, and any exception. Each of those steps is emitted as an OpenTelemetry span while the agent runs, and the framework reconstructs the Trace from those spans. Scorers only ever read the Trace.

It's the same reconstruction no matter where the spans came from — a run_agent call here, a folder of saved traces, or live production telemetry in Grafana Tempo. That's exactly what lets the framework's two halves — producing a trace and grading one — share one code path and the same scorers.

Three concrete consequences:

  • Multi-turn just works. A reply.ask(...) continuation re-enters the agent loop, so its spans land in the same trace. Scorers see the whole conversation, not just the first turn.
  • Offline vs online is just where the trace came from. Offline = run a curated dataset with run_agent. Online = grade traces captured from production with evaluate_traces (e.g. from Grafana Tempo). Same scorers, different source.
  • No special primitives for trajectory scoring. Assembly-policy correctness, compaction faithfulness, sub-task tree quality — all event-pattern questions, answered by the events already in the trace.

You don't manage any of this: run_agent attaches a telemetry middleware, collects the spans the agent emits, and reconstructs the Trace for you. But the frame matters when you write a custom scorer — you're always reading a Trace, never a live stream.
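
To make the "scorer reads a Trace" framing concrete, here is a minimal sketch of a scorer written as a pure function. `SketchTrace` and `ToolCall` are hypothetical stand-ins invented for this example, not the framework's own classes, which carry more fields (token usage, duration, exception) and may be shaped differently.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the framework's trace types (assumption).
@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class SketchTrace:
    events: tuple = ()        # ordered record of what the agent did
    final_answer: str = ""


def called_tool_once(trace: SketchTrace, tool_name: str) -> bool:
    """Pass when the named tool appears exactly once among the trace's events."""
    calls = [e for e in trace.events if isinstance(e, ToolCall) and e.name == tool_name]
    return len(calls) == 1


trace = SketchTrace(
    events=(ToolCall("get_weather", {"city": "Tokyo"}),),
    final_answer="It is sunny in Tokyo.",
)
print(called_tool_once(trace, "get_weather"))  # True
```

The function never touches a live agent or stream: given the same trace it always returns the same verdict, which is why the same scorer can grade a fresh run or a stored trace.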

Datasets — the Suite#

A Suite is an immutable collection of Task records. Build one from a JSONL file or inline.

from autogen.beta.eval import Suite

suite = Suite.from_jsonl("eval/dataset.jsonl")

Each line in the file is a JSON object:

{"task_id": "weather-001", "inputs": {"input": "What's the weather in Tokyo?"}, "reference_outputs": {"city": "Tokyo"}, "tags": ["happy-path"]}
{"task_id": "weather-002", "inputs": {"input": "Weather in Paris?"}, "reference_outputs": {"city": "Paris"}}

Fields:

| Field | Required | What it is |
| --- | --- | --- |
| task_id | optional | Stable identifier. If absent, auto-filled as task-0000, task-0001, …. |
| inputs | required | Dict containing at least "input", the prompt passed to agent.ask(...). |
| reference_outputs | optional | Expected output for reference-based scorers. |
| tags | optional | List of labels. Slice results by them, e.g. result.pass_rate(key, tag="happy-path"). |
| metadata | optional | Free-form dict. Surfaces in the persisted run JSON. |

Tip

JSONL is the canonical format because it's grep-friendly, line-diffable, and HuggingFace-compatible. Blank lines are skipped; malformed lines raise with the line number for fast diagnosis.
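
The loading rules described above (skip blank lines, report the line number of a malformed line, auto-fill a missing task_id) can be sketched with the standard library alone. `load_tasks` is a hypothetical helper written for illustration, not the code behind `Suite.from_jsonl`; in particular, numbering auto-filled ids by task position rather than by file line is an assumption.

```python
import json


def load_tasks(text: str) -> list[dict]:
    """Parse JSONL text into task dicts (illustrative sketch only)."""
    tasks = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: malformed JSON ({exc.msg})") from exc
        inputs = record.get("inputs") if isinstance(record, dict) else None
        if not isinstance(inputs, dict) or "input" not in inputs:
            raise ValueError(f'line {line_no}: "inputs" must be a dict containing "input"')
        # Assumption: a missing task_id is numbered by position in the suite.
        record.setdefault("task_id", f"task-{len(tasks):04d}")
        tasks.append(record)
    return tasks


sample = (
    '{"inputs": {"input": "Weather in Paris?"}}\n'
    "\n"
    '{"task_id": "weather-002", "inputs": {"input": "Weather in Oslo?"}}\n'
)
ids = [task["task_id"] for task in load_tasks(sample)]
print(ids)  # ['task-0000', 'weather-002']
```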

From a list (for quick experimentation)#

from autogen.beta.eval import Suite

suite = Suite.from_list([
    {"inputs": {"input": "What's the weather in Tokyo?"}, "reference_outputs": {"city": "Tokyo"}},
    {"inputs": {"input": "Weather in Paris?"}, "reference_outputs": {"city": "Paris"}},
])

Same shape as JSONL lines. Use this in notebooks or for ad-hoc suites that don't deserve a file yet.

The agent — an instance or a factory#

run_agent(agent=...) accepts either a built agent instance or a factory that builds one:

  • An instance is the simplest path — agent=my_agent. The runner reuses it across tasks, but each task gets a fresh stream, so conversation history never leaks. Fine for any stateless agent (i.e. most agents).
  • A factory (a callable returning a fresh agent) is what you want when you need a clean instance per task: for per-task model_config injection (cassette-based CI), or for an agent that carries mutable instance state (e.g. a knowledge bootstrap). The runner calls it once per task.

from autogen.beta import Agent, tool
from autogen.beta.config import GeminiConfig, ModelConfig

# Placeholder tool so the factory below is self-contained; swap in a real lookup.
@tool
def get_weather(city: str) -> str:
    """Return a canned forecast for the given city."""
    return f"Sunny in {city}"

def build_weather_agent(*, config: ModelConfig | None = None) -> Agent:
    # config= is the hook the runner uses to inject a model config per task.
    return Agent(
        "weather",
        prompt="You are a weather assistant. Use get_weather.",
        config=config or GeminiConfig(model="gemini-3-flash-preview"),
        tools=[get_weather],
    )

The config= keyword is what the runner uses to inject either a live ModelConfig (when you pass model_config= to run_agent()) or a TestConfig cassette for deterministic CI. A factory without a config= parameter still works — the runner warns and falls through to the factory's default.
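
One plausible way a runner could decide whether to inject a config is to inspect the factory's signature. The sketch below uses `inspect.signature` for that; `call_factory` is a hypothetical helper, and the real runner may detect the parameter differently (for instance, it might also honor `**kwargs`). Plain dicts stand in for agents.

```python
import inspect
import warnings


def call_factory(factory, config=None):
    """Build an agent from `factory`, injecting `config` only if it is accepted."""
    accepts_config = "config" in inspect.signature(factory).parameters
    if config is None:
        return factory()
    if accepts_config:
        return factory(config=config)
    # No config= parameter: warn and fall through to the factory's default.
    warnings.warn("factory has no config= parameter; using its default", stacklevel=2)
    return factory()


def with_config(*, config=None):
    return {"config": config or "factory-default"}


def without_config():
    return {"config": "factory-default"}


injected = call_factory(with_config, config="cassette")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = call_factory(without_config, config="cassette")

print(injected["config"], fallback["config"], len(caught))  # cassette factory-default 1
```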

Multi-agent flows go through evaluate_traces

run_agent is for an ask-shaped target: an Agent (or a factory of one). Multi-agent / network / workflow flows aren't driven by a single ask(prompt), so they don't go through run_agent. Run them however they normally run (they emit OpenTelemetry spans) and grade the reconstructed trace with evaluate_traces. Because scorers read only the Trace, grading is decoupled from how the trace was produced, so the same scorers work either way.

Running#

from pathlib import Path

from autogen.beta.eval import BudgetThresholds, run_agent

result = await run_agent(
    suite,                                 # Suite, str/Path (JSONL), or list[dict]
    agent=build_weather_agent,
    scorers=[...],                         # list of Scorer instances
    store_dir=Path("./runs"),              # required — where the JSON lands
    model_config=cassettes,                # optional — None / ModelConfig / dict[task_id, ModelConfig]
    budgets=BudgetThresholds(              # optional — observational, doesn't abort
        max_tokens_per_task=2_000,
        max_seconds_per_task=15.0,
    ),
    concurrency=4,                         # parallel task cap
    repeats=1,                             # run each task N times (consistency); pools the pass-rate
    run_id="2026-05-11-weather-suite",     # optional — overrides the UUID4 default
    label="weather-eval",                  # optional — groups runs of the same eval over time
    stream=my_stream,                      # optional — observe lifecycle events (see "Observing a run")
)

model_config modes#

The same parameter is overloaded three ways:

| Value | Behavior |
| --- | --- |
| None (default) | Factory's default config wins. |
| A single ModelConfig | Same config for every task. |
| A dict[task_id, ModelConfig] | Per-task config. Standard pattern for cassette-based CI. |
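
The three modes amount to a small dispatch on the value's type. The sketch below shows one way to express it; `resolve_config` is a hypothetical name, strings stand in for ModelConfig objects, and falling back to the factory default when a task_id is missing from the dict is an assumption rather than documented behavior.

```python
from collections.abc import Mapping


def resolve_config(model_config, task_id, factory_default):
    """Pick the config for one task under the three model_config modes."""
    if model_config is None:
        return factory_default              # mode 1: factory's default wins
    if isinstance(model_config, Mapping):
        # mode 3: per-task lookup (assumed fallback for unknown task ids)
        return model_config.get(task_id, factory_default)
    return model_config                     # mode 2: one config for every task


cassettes = {"weather-001": "cassette-tokyo", "weather-002": "cassette-paris"}

print(resolve_config(None, "weather-001", "default"))       # default
print(resolve_config("live", "weather-001", "default"))     # live
print(resolve_config(cassettes, "weather-002", "default"))  # cassette-paris
```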

Repeats — consistency#

repeats=N runs each task N times — the simplest way to ask "does my agent do this consistently?". The per-key pass_rate / score_stats pool across every run (so 8 of 10 passing shows as 80%), and each run gets a distinct task_id suffix ("weather-001#1", "weather-001#2", …).

result = await run_agent(suite, agent=build_weather_agent, scorers=[...], store_dir="runs", repeats=10)

(At temperature=0 consistency is near-trivial; repeats earns its keep when there's real nondeterminism.)
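
As a rough illustration of the pooling arithmetic, the sketch below expands one task into ten suffixed runs and computes a pooled pass rate. Both helpers are hypothetical; only the "#N" suffix shape and the pooled 8-of-10 example come from the text above.

```python
def expand_repeats(task_ids, repeats):
    """Give every repeat of every task its own suffixed id, e.g. weather-001#1."""
    return [f"{task_id}#{n}" for task_id in task_ids for n in range(1, repeats + 1)]


def pooled_pass_rate(outcomes):
    """Pool across all runs: passes divided by total runs, not averaged per task."""
    return sum(outcomes.values()) / len(outcomes)


run_ids = expand_repeats(["weather-001"], repeats=10)
# Pretend the first 8 runs passed and the last 2 failed.
outcomes = {run_id: index < 8 for index, run_id in enumerate(run_ids)}
rate = pooled_pass_rate(outcomes)
print(f"{rate:.0%}")  # 80%
```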

Labels#

label="weather-eval" stamps a user-defined identifier on the run, recorded at the top of the run JSON. Unlike run_id (unique per run), a label is meant to be shared across runs of the same eval — so a sequence of runs can be grouped and trended over time. The framework never fills it in.

Concurrency#

Tasks run in parallel up to concurrency, bounded by an asyncio.Semaphore. Default is 4. Raise it for I/O-bound suites against fast models; lower it when you're rate-limited.
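
The bounding pattern itself is plain asyncio. Here is a minimal sketch, assuming a hypothetical `run_bounded` helper and a fake task in place of a real agent call; it tracks the peak number of tasks in flight to show the cap holding.

```python
import asyncio

in_flight = 0
peak = 0


async def fake_task(n: int) -> int:
    """Stand-in for one agent run; records how many run at the same time."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulated I/O wait
    in_flight -= 1
    return n * 2


async def run_bounded(items, worker, concurrency: int = 4):
    """Run worker(item) for every item, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(item) for item in items))


results = asyncio.run(run_bounded(range(10), fake_task, concurrency=4))
print(results[:3], peak <= 4)  # [0, 2, 4] True
```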

Budgets#

BudgetThresholds records violations on each task's budget_violation flag but never aborts a task that goes over. The aggregate count surfaces in result.aggregates.budget_violations — useful as a CI regression signal ("zero tasks may exceed budget").

Warning

Budgets are observational in v0. If you need a hard kill switch, use observers (TokenMonitor with AlertPolicy(severity=FATAL)) — the agent halts via a HaltEvent at runtime.
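
"Observational" means the check runs after the fact and only sets a flag. A minimal sketch of that idea follows; `ThresholdsSketch` and `budget_violation` are hypothetical, and treating the limits as strict upper bounds (a task exactly at the limit does not violate) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdsSketch:
    """Hypothetical mirror of the two BudgetThresholds fields shown above."""
    max_tokens_per_task: Optional[int] = None
    max_seconds_per_task: Optional[float] = None


def budget_violation(tokens: int, seconds: float, limits: ThresholdsSketch) -> bool:
    """Flag a finished task that went over either limit; nothing is aborted."""
    over_tokens = limits.max_tokens_per_task is not None and tokens > limits.max_tokens_per_task
    over_seconds = limits.max_seconds_per_task is not None and seconds > limits.max_seconds_per_task
    return over_tokens or over_seconds


limits = ThresholdsSketch(max_tokens_per_task=2_000, max_seconds_per_task=15.0)
usage = {  # task_id -> (tokens used, seconds taken), made-up numbers
    "weather-001": (1_544, 5.3),
    "weather-002": (2_400, 4.1),
    "weather-003": (900, 21.0),
}
flags = {task_id: budget_violation(t, s, limits) for task_id, (t, s) in usage.items()}
violations = sum(flags.values())
print(violations)  # 2
```

In CI, a gate such as `assert violations == 0` turns the aggregate count into the "zero tasks may exceed budget" signal.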

What you get back — RunResult#

result = await run_agent(...)

# Run-level metadata
result.run_id              # str (UUID4 hex unless you set it) — unique per run
result.label               # str | None — user-defined; groups runs of the same eval over time
result.schema_version      # "0.1"
result.created_at          # ISO-8601 UTC
result.duration_ms         # int
result.suite               # the Suite that was executed

# Per-task records
result.tasks               # tuple[TaskResult, ...]
result.tasks[0].task       # Task
result.tasks[0].trace      # Trace
result.tasks[0].feedback   # tuple[Feedback, ...]
result.tasks[0].budget_violation  # bool

# Aggregates
result.pass_rate("scorer_name")    # float — boolean scorers
result.score_stats("scorer_name")  # ScoreStats(mean, p50, p95, n) — numeric scorers
result.value_counts("scorer_name") # dict[label, count] — categorical scorers
result.pass_rate("scorer_name", tag="hard")  # any accessor takes tag= to slice to one segment
result.tags                        # frozenset[str] — the tags present, for slicing
result.aggregates                  # Aggregates — everything together
result.aggregates.tokens           # TokenUsage(input, output, cache_creation, cache_read)
result.diff(baseline)              # RunDiff vs a prior run — see the Persistence & tracking page
result.aggregates.errors           # int — tasks where the agent raised
result.aggregates.budget_violations  # int

# Human-readable
result.summary()           # printable multi-line table
result.save(path=None)     # re-save the run JSON (path defaults to store_dir/<run_id>.json)
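
To show what the aggregate accessors compute, here is a small standalone sketch over plain dicts. The row shape, the nearest-rank p95, and returning None for an empty slice are all assumptions made for illustration; the library's ScoreStats may use a different percentile method.

```python
import math
from statistics import mean, median


def pass_rate(rows, key, tag=None):
    """Share of passing rows for one boolean scorer, optionally within one tag."""
    selected = [r for r in rows if tag is None or tag in r["tags"]]
    scores = [r["feedback"][key] for r in selected if key in r["feedback"]]
    return sum(scores) / len(scores) if scores else None


def score_stats(rows, key):
    """mean / p50 / p95 / n for one numeric scorer (needs at least one value)."""
    values = sorted(r["feedback"][key] for r in rows if key in r["feedback"])
    rank = math.ceil(0.95 * len(values))  # nearest-rank percentile (assumed)
    return {"mean": mean(values), "p50": median(values), "p95": values[rank - 1], "n": len(values)}


rows = [
    {"tags": ["happy-path"], "feedback": {"final_answer_matches": True, "extra_tool_calls": 0}},
    {"tags": ["hard"], "feedback": {"final_answer_matches": False, "extra_tool_calls": 2}},
    {"tags": ["hard"], "feedback": {"final_answer_matches": True, "extra_tool_calls": 1}},
    {"tags": [], "feedback": {"final_answer_matches": True, "extra_tool_calls": 0}},
]

print(pass_rate(rows, "final_answer_matches"))              # 0.75
print(pass_rate(rows, "final_answer_matches", tag="hard"))  # 0.5
print(score_stats(rows, "extra_tool_calls"))
```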

summary()#

Returns a multi-line string suitable for a CI log:

Run 25be826dc1a94a4b9d50a4f94449139e
  Schema:      0.1
  Created:     2026-05-11T01:38:04.157919+00:00
  Duration:    5292ms
  Suite:       dataset (5 tasks, source: eval/dataset.jsonl)
  Runs:        5
  Concurrency: 4
  Errors:      0
  Budget violations: 0
  Tokens:      input=1544 output=174 total=1718

Pass rates:
  called_get_weather_once   100.0% (5/5)
  final_answer_matches      100.0% (5/5)
  no_tool_errors            100.0% (5/5)
  token_budget              100.0% (5/5)
  tool_called[get_weather]  100.0% (5/5)

Score stats:
  extra_tool_calls  mean=0.00 p50=0.00 p95=0.00 n=5

Value counts:
  termination_reason  completed=5

Grading existing traces — evaluate_traces#

run_agent is the produce-and-grade path: it runs your agent, then grades the trace. evaluate_traces is the grade-only path — for traces that already exist, captured elsewhere. The grading is identical; only the source differs.

from autogen.beta.eval import DirectoryTraceSource, evaluate_traces

result = await evaluate_traces(
    DirectoryTraceSource("./captured-traces"),   # a folder of saved traces
    scorers=[...],
    suite=suite,            # optional — only needed for reference-based scorers (joined by task_id)
    store_dir="runs",
)

A TraceSource is anything that yields traces. Three ship:

| Source | Reads from |
| --- | --- |
| InMemoryTraceSource | traces you already hold in memory |
| DirectoryTraceSource | a folder of saved trace files (save_trace writes them) |
| TempoTraceSource | Grafana Tempo over OTLP, for grading real production telemetry |

Because grading depends only on the reconstructed Trace, the same scorers work whether the trace came from run_agent, a directory, or production.
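
The "anything that yields traces" idea can be sketched with two tiny iterables and one grading loop. Everything here is hypothetical: the class names, the dict-shaped traces, and the one-JSON-file-per-trace layout are assumptions, not the formats used by the shipped sources or by save_trace.

```python
import json
import tempfile
from pathlib import Path


class InMemorySource:
    """Yields traces already held in a list."""
    def __init__(self, traces):
        self._traces = list(traces)

    def __iter__(self):
        return iter(self._traces)


class DirectorySource:
    """Yields one trace per *.json file in a folder (assumed layout)."""
    def __init__(self, folder):
        self._folder = Path(folder)

    def __iter__(self):
        for path in sorted(self._folder.glob("*.json")):
            yield json.loads(path.read_text(encoding="utf-8"))


def grade(source, scorer):
    """Grading only iterates the source; it never asks where traces came from."""
    return [scorer(trace) for trace in source]


def used_a_tool(trace) -> bool:
    return len(trace["tool_calls"]) > 0


traces = [
    {"task_id": "a", "tool_calls": ["get_weather"]},
    {"task_id": "b", "tool_calls": []},
]
with tempfile.TemporaryDirectory() as folder:
    for trace in traces:
        Path(folder, f"{trace['task_id']}.json").write_text(json.dumps(trace), encoding="utf-8")
    from_disk = grade(DirectorySource(folder), used_a_tool)
from_memory = grade(InMemorySource(traces), used_a_tool)

print(from_memory == from_disk)  # True
```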

Reconstruction understands two span dialects, auto-detected per trace, so traces from a range of tools and frameworks grade unchanged.

Observing a run#

Just like agent.ask(stream=...), run_agent, run_variants, and run_pairwise all accept a stream. Pass one and the runner publishes eval lifecycle events to it as the run unfolds — so you observe an evaluation with the same machinery you use on an agent (subscribe, where, the watch system, persistent backends). Nothing is printed for you; you attach the observer and render however you like.

from autogen.beta.stream import MemoryStream
from autogen.beta.eval import run_variants
from autogen.beta.eval.events import VariantCompleted

stream = MemoryStream()

async def on_variant(event: VariantCompleted) -> None:
    print(f"{event.variant}: {event.result.pass_rate('final_answer_matches'):.0%}")

stream.where(VariantCompleted).subscribe(on_variant)

board = await run_variants(suite, variants=variants, scorers=[...], store_dir="runs", stream=stream)

The events live in autogen.beta.eval.events and are all transient — observational only, since the durable record of a run is its persisted JSON:

| Event | Emitted by | When |
| --- | --- | --- |
| EvalStarted | run_agent | a run begins |
| TaskEvaluated | run_agent | each task finishes (carries its feedback) |
| EvalCompleted | run_agent | a run finishes (carries the RunResult) |
| VariantStarted / VariantCompleted | run_variants | around each variant (carries the variant's RunResult) |
| PairwiseStarted / PairwiseCompared / PairwiseCompleted | run_pairwise | around the run and each head-to-head comparison |

Ready-made console output

For a quick console view without writing a callback, subscribe the built-in console_reporter (from autogen.beta.eval): stream.subscribe(console_reporter). It's just an opt-in observer — no run_agent() flag — so swap in your own callback whenever you want different output.
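
For readers who want the mechanics behind type-filtered subscriptions, here is a toy synchronous dispatcher. `TinyStream` and both event classes are hypothetical and much simpler than MemoryStream and the events in autogen.beta.eval.events; the sketch only shows how callbacks can be keyed by event type.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvaluatedSketch:
    task_id: str
    passed: bool


@dataclass(frozen=True)
class EvalCompletedSketch:
    passed: int
    total: int


class TinyStream:
    """Toy pub/sub: callbacks are registered per event type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event):
        for callback in self._subscribers[type(event)]:
            callback(event)


lines = []
stream = TinyStream()
stream.subscribe(
    TaskEvaluatedSketch,
    lambda e: lines.append(f"{e.task_id}: {'pass' if e.passed else 'fail'}"),
)
stream.subscribe(EvalCompletedSketch, lambda e: lines.append(f"done: {e.passed}/{e.total}"))

for event in (
    TaskEvaluatedSketch("weather-001", True),
    TaskEvaluatedSketch("weather-002", False),
    EvalCompletedSketch(1, 2),
):
    stream.publish(event)

print(lines)  # ['weather-001: pass', 'weather-002: fail', 'done: 1/2']
```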

Exceptions don't abort the run#

The framework catches exceptions at three levels and records them on the task instead of aborting:

| Level | What happens |
| --- | --- |
| Target factory raises | Task gets trace.exception = <the error>, no events. Other tasks continue. |
| agent.ask() raises | Same: the exception lands on trace.exception, other tasks continue. |
| A scorer raises | The scorer's feedback becomes Feedback(score=None, comment="scorer raised: ..."). Other scorers run normally. |

The aggregate count of agent-level errors surfaces in result.aggregates.errors. Treat it as a CI signal — a healthy suite has zero.
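
The containment pattern is ordinary try/except at each level. Below is a minimal sketch under simplifying assumptions: `run_task`, `run_scorers`, and `SketchFeedback` are hypothetical, a plain callable stands in for the agent, and the exact comment format is modeled loosely on the "scorer raised: ..." text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchFeedback:
    key: str
    score: Optional[bool]
    comment: str = ""


def run_task(build_agent, prompt):
    """Return (answer, exception). Factory and agent errors are recorded, not raised."""
    try:
        agent = build_agent()
        return agent(prompt), None
    except Exception as exc:
        return None, exc


def run_scorers(scorers, answer):
    """A failing scorer yields score=None; the remaining scorers still run."""
    feedback = []
    for key, scorer in scorers.items():
        try:
            feedback.append(SketchFeedback(key, scorer(answer)))
        except Exception as exc:
            feedback.append(SketchFeedback(key, None, f"scorer raised: {exc!r}"))
    return feedback


def broken_factory():
    raise RuntimeError("missing API key")


def echo_factory():
    return lambda prompt: prompt.upper()


answer, error = run_task(echo_factory, "tokyo")
_, factory_error = run_task(broken_factory, "tokyo")
feedback = run_scorers({"non_empty": lambda a: len(a) > 0, "buggy": lambda a: 1 / 0 > 0}, answer)

print(answer, error, type(factory_error).__name__)  # TOKYO None RuntimeError
print([f.score for f in feedback])                  # [True, None]
```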

Where to next#

  • Scorers — what goes into scorers=[...].
  • Variants — run several builds over the suite and rank them on a leaderboard.
  • Pairwise — judge two builds head-to-head.
  • Persistence & tracking — compare runs over time, catch regressions, and track an agent across iterations.
  • Testing — TestConfig cassettes for deterministic CI runs.
  • Observers — runtime safety guards (token budgets that actually halt, loop detection).