Persistence
Every run is saved to disk. That's what turns "run an eval once" into "track an agent over its lifetime" — comparing each version against its past selves to confirm improvements and catch regressions before they ship.
The schema-0.1 JSON#
Every run_agent() writes a JSON file to store_dir/<run_id>.json before returning. The file is the run's permanent record — comparable across days, releases, and pull requests. Shape:
{
"schema_version": "0.1",
"run_id": "01J7Z3M4QF8X9Y0K1V2N3P4Q5R",
"label": "weather-eval",
"created_at": "2026-05-11T14:23:00+00:00",
"duration_ms": 5292,
"suite": { "name": "dataset", "size": 5, "source": "eval/dataset.jsonl" },
"target": "weather_agent.agent:build_weather_agent",
"concurrency": 4,
"tasks": [
{
"task_id": "weather-001",
"inputs": { "input": "What's the weather in Tokyo?" },
"reference_outputs": { "city": "Tokyo" },
"tags": ["happy-path"],
"metadata": {},
"duration_ms": 1234,
"events": [
{ "type": "ToolCallEvent", "name": "get_weather", "arguments": "..." }
],
"exception": null,
"tokens": { "input": 423, "output": 78, "cache_creation": 0, "cache_read": 0 },
"feedback": [
{ "key": "tool_called[get_weather]", "score": true, "value": null, "comment": null }
],
"budget_violation": false
}
],
"aggregates": {
"pass_rate": { "tool_called[get_weather]": 1.0 },
"score_stats": { "extra_tool_calls": { "mean": 0.0, "p50": 0.0, "p95": 0.0, "n": 5 } },
"value_counts": { "termination_reason": { "completed": 5 } },
"tokens": { "input": 2115, "output": 390, "total": 2505 },
"errors": 0,
"budget_violations": 0
}
}
The schema is forward-compatible: future versions add fields at the end of objects; existing fields don't change name or type. Don't rely on this shape for in-tree code — use the RunResult Python API. The JSON is for cross-process / cross-time persistence: a future dashboard, a CI artifact, a diff between two releases.
Comparing two runs#
Persistence pays off when you compare two runs: "did my change make the agent better or worse?". Load a past run with load_run(path) and diff the current one against it.
diff joins the two runs by task_id and scorer key and reports per-scorer pass-rate / mean deltas plus the tasks that flipped pass↔fail. RunDiff.regressions is the (scorer, task_id) pairs that went pass → fail — the CI gate:
Runs must be comparable
By default (strict=True) diff raises RunsNotComparableError when the two runs didn't grade the same tasks with the same checks — a task or scorer present in only one run, or a task whose inputs / reference_outputs changed under the same id (content drift). The error itemizes every mismatch and tells you to pass strict=False to diff the overlap instead. With strict=False, only the genuinely comparable (task, scorer) pairs are diffed and everything excluded is reported on the RunDiff (content_changed, only_in_current / only_in_baseline, scorers_only_in_*) — so you are never silently shown an apples-to-oranges number.
Note
load_run reconstructs each run's scores and task identity — enough for diff and the pass_rate / score_stats / value_counts accessors. It does not replay event traces, so a loaded run's aggregates.tokens reads zero; read the JSON directly if you need event-level detail.
Tracking an agent across iterations#
Persistence turns a one-off eval into a record of an agent's progress. There's no separate "project" concept to set up — you get one from three habits:
- One
store_dirper agent — the folder is that agent's history. - A
run_idthat encodes the version —run_id="blog-writer-v7"saves asblog-writer-v7.json. - A stable suite — keep the eval set fixed across iterations so the runs stay comparable.
Then every iteration follows the same loop: run → compare to the last → act on the diff.
Iteration 1 — establish a baseline#
You now have runs/blog-writer/v1.json. Read the scorecard and note what's weak — say has_call_to_action sits at 40%.
Iteration 2 — change something, then confirm you moved the needle#
You tweak the prompt to always end with a call to action, and re-run against the same suite with a new run_id:
The diff confirms the fix landed and nothing else moved. Ship it.
Iteration 3 — catch a regression before it ships#
Later you swap to a cheaper model and re-run:
The cheaper model dropped the topic on one task. Now you can decide deliberately: accept −20% for the cost saving, or adjust (sharpen the prompt, or keep the old model for that case). Either way you knew before merging — and in CI, assert not diff.regressions would have blocked it for you.
The folder now holds v1.json, v2.json, v3.json — a complete, comparable history. Diff any pair to see how the agent moved between two points in its life, or hand the JSONs to a dashboard later.