Variants
To answer "which is best?" — across models, prompts, tools, middleware, or whole builds — run the same suite under several named variants and rank them on a leaderboard.
run_variants#
A Variants is a mapping of name → Agent instance, plus an axis label that records what you varied:
One axis at a time#
Variants holds prebuilt agents, so you vary whatever you like by constructing them accordingly. For a controlled comparison, vary one axis across the agents and hold the rest fixed — then set axis= to label what changed (it shows up in summary(); it defaults to "variant"):
| Axis | Vary across the agents | axis= |
|---|---|---|
| model / provider / params | config= | "config" |
| system prompt | prompt= | "prompt" |
| tool set | tools= | "tools" |
| middleware stack | middleware= | "middleware" |
Need a per-task model override within a variant? Pass model_config= to run_variants exactly as with run_agent — it reaches each variant's run_agent call.
Each variant runs via run_agent() (so repeats, persistence, everything applies) and is saved as its own <run_id>-<variant>.json. VariantRunResult.leaderboard(key) ranks variants by a scorer — pass-rate for boolean scorers, mean for numeric — best first; tied scores share a rank, and best(key) returns None when there's no unique winner. (A 3-way 100% tie usually means the eval isn't discriminating — make the task harder, or score quality with a judge — not that there's a true winner.)
Where to next#
- Pairwise — when "A vs B" is easier to judge than scoring each variant on its own.
- Persistence & tracking — every variant's run is saved, so you can diff it against past runs.