
run_pairwise

autogen.beta.eval.runtime.runner.run_pairwise async

run_pairwise(suite, *, variant_a, variant_b, comparators, store_dir, model_config=None, variant_a_name='A', variant_b_name='B', concurrency=4, run_id=None, label=None, stream=None)

Produce traces for two variants over a suite, then compare them.

Convenience wrapper over autogen.beta.eval.evaluate_pairwise: runs each variant across the suite (capturing a Trace per task, keyed by task_id), then compares the two sets pairwise. This mirrors how run_agent produces traces and then hands them to autogen.beta.eval.evaluate_traces for a single variant. To grade pre-existing traces without running the variants again, call evaluate_pairwise directly.

label is a shared identifier recorded on the result (as in run_agent). Pass stream to observe the PairwiseStarted / PairwiseCompared / PairwiseCompleted lifecycle events as the comparison runs.
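The produce-then-compare flow described above can be illustrated with a small standalone sketch. It does not import or call the library: Trace, produce, compare, run_pairwise_sketch, and the toy variants and comparator below are hypothetical stand-ins, and bounding concurrency with an asyncio.Semaphore is an assumption about how such a limit might be applied, not a statement about the actual implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical stand-in for the library's Trace: one record per task.
    task_id: str
    output: str


async def produce(tasks, variant, concurrency=4):
    """Run `variant` on every task; return traces keyed by task_id."""
    sem = asyncio.Semaphore(concurrency)  # assumed way to cap parallel runs

    async def run_one(task):
        async with sem:
            output = await variant(task["input"])
            return Trace(task_id=task["task_id"], output=output)

    traces = await asyncio.gather(*(run_one(t) for t in tasks))
    return {t.task_id: t for t in traces}


def compare(traces_a, traces_b, comparator):
    """Compare the traces that share a task_id across both variants."""
    shared = sorted(traces_a.keys() & traces_b.keys())
    return {tid: comparator(traces_a[tid], traces_b[tid]) for tid in shared}


async def run_pairwise_sketch(tasks, variant_a, variant_b, comparator, concurrency=4):
    # Produce one trace set per variant, one variant after the other,
    # then compare the two sets pairwise.
    traces_a = await produce(tasks, variant_a, concurrency)
    traces_b = await produce(tasks, variant_b, concurrency)
    return compare(traces_a, traces_b, comparator)


# Toy variants: plain async callables standing in for agents.
async def strip_variant(text):
    return text.strip()


async def echo_variant(text):
    return text


def shorter_wins(a, b):
    # Toy comparator: prefers the shorter output, otherwise a tie.
    if len(a.output) == len(b.output):
        return "tie"
    return "A" if len(a.output) < len(b.output) else "B"


if __name__ == "__main__":
    tasks = [
        {"task_id": "t1", "input": "  hello  "},
        {"task_id": "t2", "input": "world"},
    ]
    verdicts = asyncio.run(
        run_pairwise_sketch(tasks, strip_variant, echo_variant, shorter_wins)
    )
    print(verdicts)  # {'t1': 'A', 't2': 'tie'}
```

Keying each trace set by task_id is what lets the comparison step pair outputs from the two variants without relying on the order in which tasks finished.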

Source code in autogen/beta/eval/runtime/runner.py
async def run_pairwise(
    suite: Suite | str | os.PathLike[str] | list[dict[str, Any]],
    *,
    variant_a: Agent | Callable[..., Agent],
    variant_b: Agent | Callable[..., Agent],
    comparators: Iterable[PairwiseComparator],
    store_dir: str | os.PathLike[str],
    model_config: ModelConfig | dict[str, ModelConfig] | None = None,
    variant_a_name: str = "A",
    variant_b_name: str = "B",
    concurrency: int = 4,
    run_id: str | None = None,
    label: str | None = None,
    stream: Stream | None = None,
) -> PairwiseRunResult:
    """Produce traces for two variants over a suite, then compare them.

    Convenience over :func:`~autogen.beta.eval.evaluate_pairwise`: runs each
    variant across the suite (capturing a :class:`Trace` per task,
    keyed by ``task_id``), then pairwise-compares the two sets. Mirrors how
    :func:`run_agent` is produce-then-:func:`~autogen.beta.eval.evaluate_traces` for one
    variant. For decoupled grading of pre-existing traces, call
    ``evaluate_pairwise`` directly.

    ``label`` is a shared identifier recorded on the result (like :func:`run_agent`);
    pass ``stream`` to observe ``PairwiseStarted`` / ``PairwiseCompared`` /
    ``PairwiseCompleted`` lifecycle events as the comparison runs.
    """
    resolved_suite = _resolve_suite(suite)
    factory_a, accepts_a, _ = _normalize_target(variant_a)
    factory_b, accepts_b, _ = _normalize_target(variant_b)
    source_a = await _produce(
        resolved_suite, factory_a, accepts_config=accepts_a, model_config=model_config, concurrency=concurrency
    )
    source_b = await _produce(
        resolved_suite, factory_b, accepts_config=accepts_b, model_config=model_config, concurrency=concurrency
    )
    return await evaluate_pairwise(
        source_a,
        source_b,
        comparators=comparators,
        variant_a=variant_a_name,
        variant_b=variant_b_name,
        suite=resolved_suite,
        store_dir=store_dir,
        concurrency=concurrency,
        run_id=run_id,
        label=label,
        stream=stream,
    )