Skip to content

Variants

To answer "which is best?" — across models, prompts, tools, middleware, or whole builds — run the same suite under several named variants and rank them on a leaderboard.

run_variants#

A Variants is a single-axis set, built with a typed from_* constructor:

from autogen.beta.config import GeminiConfig, OpenAIConfig
from autogen.beta.eval import Variants, run_variants

board = await run_variants(
    suite,
    variants=Variants.from_configs(build_weather_agent, {
        "gpt-4o": OpenAIConfig("gpt-4o"),
        "flash":  GeminiConfig("gemini-3-flash-preview"),
    }),
    scorers=[...],
    store_dir="runs",
    repeats=5,                                # optional — N runs per variant for stability
)

print(board.summary("final_answer_matches"))  # ranked leaderboard
board.best("final_answer_matches")             # the winning variant's name
board.results["gpt-4o"]                        # each variant's full RunResult

One axis at a time#

Each from_* constructor fixes one axis and types its values, so a single call only varies one thing — a controlled comparison (vary one axis, hold the rest fixed):

Constructor Varies Value type
Variants.from_configs(factory, {...}) model / provider / params ModelConfig
Variants.from_prompts(factory, {...}) system prompt str
Variants.from_tools(factory, {...}) tool set Sequence[Tool]
Variants.from_middleware(factory, {...}) middleware stack Sequence[MiddlewareFactory]
Variants.from_targets({...}) the whole build a factory or instance each

Each variant runs via run_agent() (so repeats, persistence, everything applies) and is saved as its own <run_id>-<variant>.json. VariantRunResult.leaderboard(key) ranks variants by a scorer — pass-rate for boolean scorers, mean for numeric — best first; tied scores share a rank, and best(key) returns None when there's no unique winner. (A 3-way 100% tie usually means the eval isn't discriminating — make the task harder, or score quality with a judge — not that there's a true winner.)

Where to next#

  • Pairwise — when "A vs B" is easier to judge than scoring each variant on its own.
  • Persistence & tracking — every variant's run is saved, so you can diff it against past runs.