Skip to content

final_answer_matches

autogen.beta.eval.scorers.correctness.final_answer_matches #

final_answer_matches(field='answer', *, matcher='casefold')

Pass iff the agent's final answer matches the task's reference output.

Reads reference_outputs[field] for the expected value and compares it against the agent's final answer: outputs["content"][field] when the agent returned a structured (JSON-object) answer carrying that field — e.g. via response_schema — otherwise outputs["body"], the final response text.

PARAMETER DESCRIPTION
field

Key to read from reference_outputs and (when available) from the agent's structured outputs. Defaults to "answer".

TYPE: str DEFAULT: 'answer'

matcher

How to compare the two strings:

  • "exact" — strict equality (good for closed-form answers in structured outputs).
  • "casefold" — case-insensitive equality (the default; forgives capitalization drift).
  • "contains" — pass iff expected appears as a substring of actual (good when the reply is a sentence and the reference is a single field value).

TYPE: _Matcher DEFAULT: 'casefold'

The scorer is reference-based: it returns False when no reference output is available for field rather than raising, so an unreferenced task simply scores as a fail. Wrap your own scorer if you'd rather skip such tasks.

Source code in autogen/beta/eval/scorers/correctness.py
def final_answer_matches(
    field: str = "answer",
    *,
    matcher: _Matcher = "casefold",
) -> Scorer:
    """Pass iff the agent's final answer matches the task's reference output.

    Reads ``reference_outputs[field]`` for the expected value and compares it
    against the agent's final answer: ``outputs["content"][field]`` when the agent
    returned a structured (JSON-object) answer carrying that field — e.g. via
    ``response_schema`` — otherwise ``outputs["body"]``, the final response text.

    Args:
        field: Key to read from ``reference_outputs`` and (when
            available) from the agent's structured ``outputs``.
            Defaults to ``"answer"``.
        matcher: How to compare the two strings:

            * ``"exact"`` — strict equality (good for closed-form
              answers in structured outputs).
            * ``"casefold"`` — case-insensitive equality (the default;
              forgives capitalization drift).
            * ``"contains"`` — pass iff ``expected`` appears as a
              substring of ``actual`` (good when the reply is a
              sentence and the reference is a single field value).

    The scorer is reference-based: it returns ``False`` when no
    reference output is available for ``field`` rather than raising,
    so an unreferenced task simply scores as a fail. Wrap your own
    scorer if you'd rather skip such tasks.
    """

    def _check(outputs: dict[str, Any], reference_outputs: dict[str, Any] | None) -> bool:
        if reference_outputs is None or field not in reference_outputs:
            return False
        expected = reference_outputs[field]
        content = outputs.get("content")
        actual = content[field] if isinstance(content, dict) and field in content else outputs.get("body")
        if actual is None:
            return False
        if matcher == "exact":
            return actual == expected
        if matcher == "casefold":
            return str(actual).casefold() == str(expected).casefold()
        return str(expected) in str(actual)

    return Scorer(_check, key="final_answer_matches")