Variants
To answer "which is best?" — across models, prompts, tools, middleware, or whole builds — run the same suite under several named variants and rank them on a leaderboard.
run_variants#
A Variants is a single-axis set, built with a typed from_* constructor:
One axis at a time#
Each from_* constructor fixes one axis and types its values, so a single call only varies one thing — a controlled comparison (vary one axis, hold the rest fixed):
| Constructor | Varies | Value type |
|---|---|---|
Variants.from_configs(factory, {...}) | model / provider / params | ModelConfig |
Variants.from_prompts(factory, {...}) | system prompt | str |
Variants.from_tools(factory, {...}) | tool set | Sequence[Tool] |
Variants.from_middleware(factory, {...}) | middleware stack | Sequence[MiddlewareFactory] |
Variants.from_targets({...}) | the whole build | a factory or instance each |
Each variant runs via run_agent() (so repeats, persistence, everything applies) and is saved as its own <run_id>-<variant>.json. VariantRunResult.leaderboard(key) ranks variants by a scorer — pass-rate for boolean scorers, mean for numeric — best first; tied scores share a rank, and best(key) returns None when there's no unique winner. (A 3-way 100% tie usually means the eval isn't discriminating — make the task harder, or score quality with a judge — not that there's a true winner.)
Where to next#
- Pairwise — when "A vs B" is easier to judge than scoring each variant on its own.
- Persistence & tracking — every variant's run is saved, so you can diff it against past runs.