pairwise_judge

autogen.beta.eval.scorers.pairwise_judge.pairwise_judge

pairwise_judge(config, *, criterion, key, include_trace=False, include_reference=True, retries=1, swap=True, middleware=())

Build an LLM pairwise comparator for one criterion.

PARAMETER DESCRIPTION
config

Model config for the judge. Pin temperature to 0, and use a different model family than the variants being compared to avoid self-preference bias.

TYPE: ModelConfig

criterion

The single standard to compare on, in plain English.

TYPE: str

key

Result column this comparator reports under.

TYPE: str

include_trace

Render each response's tool-call trajectory into the prompt.

TYPE: bool DEFAULT: False

include_reference

When True (default), render the reference answer into the prompt as a ## Reference section whenever reference_outputs is present. Set False for dimensions that must judge the responses on their own (e.g. faithfulness / grounding), so the golden answer cannot leak into the comparison.

TYPE: bool DEFAULT: True

retries

Number of times content() re-asks the judge when its response fails schema validation.

TYPE: int DEFAULT: 1

swap

Run the dual-order position swap (default, recommended): the pair is judged in both presentation orders to counter position bias. When False, a single call is used, which is faster but position-biased.

TYPE: bool DEFAULT: True

middleware

Middleware for the judge agent (e.g. TelemetryMiddleware).

TYPE: Iterable[MiddlewareFactory] DEFAULT: ()
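
To illustrate what the dual-order position swap described for swap guards against, here is a minimal, self-contained sketch. It does not use the library: compare_with_swap, always_first, and prefers_longer are hypothetical names, and the rule "keep a winner only when both orders agree, otherwise report a tie" is an assumption, since this page does not state how the two verdicts are reconciled.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge signature: it sees two responses in display order and
# answers "A" (first shown wins), "B" (second shown wins), or "tie".
JudgeFn = Callable[[str, str], Verdict]


def compare_with_swap(judge: JudgeFn, response_a: str, response_b: str) -> Verdict:
    """Ask the judge in both orders; keep a winner only if both orders agree."""
    forward = judge(response_a, response_b)
    backward = judge(response_b, response_a)
    # The second call saw the responses swapped, so map its verdict back
    # onto the original labels before comparing.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[backward]
    # Assumption: disagreement between the two orders is treated as a tie.
    return forward if forward == unswapped else "tie"


def always_first(first: str, second: str) -> Verdict:
    """A purely position-biased stub judge: whatever is shown first wins."""
    return "A"


def prefers_longer(first: str, second: str) -> Verdict:
    """A content-based stub judge: the longer response wins."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"
```

With the position-biased stub, the two orders contradict each other and the sketch falls back to a tie. With the content-based stub, both orders point at the same response, so the verdict survives the swap. A single call (swap=False) cannot tell these two judges apart.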

Source code in autogen/beta/eval/scorers/pairwise_judge.py
def pairwise_judge(
    config: ModelConfig,
    *,
    criterion: str,
    key: str,
    include_trace: bool = False,
    include_reference: bool = True,
    retries: int = 1,
    swap: bool = True,
    middleware: Iterable[MiddlewareFactory] = (),
) -> PairwiseComparator:
    """Build an LLM pairwise comparator for one criterion.

    Args:
        config: Judge model config (pin temperature 0; use a different model
            family than the variants to avoid self-preference bias).
        criterion: The single standard to compare on, in plain English.
        key: Result column this comparator reports under.
        include_trace: Render each response's tool-call trajectory into the prompt.
        include_reference: When ``True`` (default), render the reference answer
            into the prompt as a ``## Reference`` section whenever
            ``reference_outputs`` is present. Set ``False`` for dimensions that must
            judge the responses on their own (e.g. faithfulness / grounding), so the
            golden answer cannot leak into the comparison.
        retries: ``content()`` re-asks on schema-validation failure.
        swap: Run the dual-order position-swap (default, recommended). When
            ``False``, a single call is used (faster, position-biased).
        middleware: Middleware for the judge agent (e.g. ``TelemetryMiddleware``).
    """
    judge = Agent(
        f"pairwise_judge_{key}",
        _system_prompt(criterion),
        config=config,
        response_schema=PairwiseVerdict,
        middleware=middleware,
    )
    return _PairwiseJudge(
        judge, key, include_trace=include_trace, include_reference=include_reference, retries=retries, swap=swap
    )
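
The judge agent above is built with response_schema=PairwiseVerdict, and retries controls re-asking when the reply fails schema validation. The following is a rough sketch of that retry pattern under stated assumptions, not the library's implementation: Verdict is a hypothetical stand-in for PairwiseVerdict (whose fields are not shown on this page), parse_verdict and ask_with_retries are illustrative helpers, and treating retries=1 as "one extra attempt after the first" is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical stand-in for PairwiseVerdict; the real fields may differ."""

    winner: str
    reasoning: str


def parse_verdict(raw: str) -> Verdict:
    """Validate raw judge output; raise ValueError if it does not fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or data.get("winner") not in {"A", "B", "tie"}:
        raise ValueError("'winner' must be one of 'A', 'B', 'tie'")
    return Verdict(winner=data["winner"], reasoning=str(data.get("reasoning", "")))


def ask_with_retries(ask: Callable[[], str], retries: int = 1) -> Verdict:
    """Call `ask` once, then up to `retries` more times while validation fails."""
    last_error: Optional[ValueError] = None
    for _ in range(retries + 1):
        try:
            return parse_verdict(ask())
        except ValueError as exc:
            last_error = exc  # remember the failure and ask again
    raise ValueError(f"no valid verdict after {retries + 1} attempts") from last_error
```

Here ask stands for whatever produces the judge's raw reply. Keeping the retry count small bounds the extra cost while still tolerating a single malformed reply.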