RunDiff

autogen.beta.eval.results.diff.RunDiff `dataclass`

RunDiff(current_run_id, baseline_run_id, comparable_tasks, pass_rate_deltas, mean_deltas, flipped_to_fail, flipped_to_pass, only_in_current, only_in_baseline, content_changed, scorers_only_in_current, scorers_only_in_baseline)

The result of comparing a run against a baseline over what they share.

pass_rate_deltas / mean_deltas map a scorer key to (baseline, current) over the comparable tasks. flipped_to_fail / flipped_to_pass are (scorer_key, task_id) pairs whose boolean verdict changed. The remaining tuples list everything excluded from the comparison.
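The relationship between the deltas and the flips can be sketched as follows. This is a minimal illustration, not the library's implementation: the `sketch_diff` function and the `{scorer_key: {task_id: passed}}` input shape are assumptions made for the example. Only the field names come from the description above.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: {scorer_key: {task_id: passed}} for each run.
Verdicts = Dict[str, Dict[str, bool]]


def sketch_diff(
    baseline: Verdicts, current: Verdicts, comparable_tasks: List[str]
) -> Tuple[Dict[str, Tuple[float, float]], List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Illustrative only: derive pass-rate pairs and verdict flips."""
    pass_rate_deltas: Dict[str, Tuple[float, float]] = {}
    flipped_to_fail: List[Tuple[str, str]] = []
    flipped_to_pass: List[Tuple[str, str]] = []
    # Only scorers present in both runs are compared.
    for key in sorted(set(baseline) & set(current)):
        tasks = [t for t in comparable_tasks if t in baseline[key] and t in current[key]]
        if not tasks:
            continue
        base_rate = sum(baseline[key][t] for t in tasks) / len(tasks)
        cur_rate = sum(current[key][t] for t in tasks) / len(tasks)
        # Stored as (baseline, current), matching the documented order.
        pass_rate_deltas[key] = (base_rate, cur_rate)
        for t in tasks:
            if baseline[key][t] and not current[key][t]:
                flipped_to_fail.append((key, t))
            elif current[key][t] and not baseline[key][t]:
                flipped_to_pass.append((key, t))
    return pass_rate_deltas, flipped_to_fail, flipped_to_pass


baseline = {"exact_match": {"t1": True, "t2": True, "t3": False}}
current = {"exact_match": {"t1": True, "t2": False, "t3": True}}
deltas, to_fail, to_pass = sketch_diff(baseline, current, ["t1", "t2", "t3"])
```

In this sample the pass rate is identical in both runs, yet one task flipped to fail and another flipped to pass, which is why the flips are reported separately from the aggregate deltas.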

current_run_id `instance-attribute`

current_run_id

baseline_run_id `instance-attribute`

baseline_run_id

comparable_tasks `instance-attribute`

comparable_tasks

pass_rate_deltas `instance-attribute`

pass_rate_deltas

mean_deltas `instance-attribute`

mean_deltas

flipped_to_fail `instance-attribute`

flipped_to_fail

flipped_to_pass `instance-attribute`

flipped_to_pass

only_in_current `instance-attribute`

only_in_current

only_in_baseline `instance-attribute`

only_in_baseline

content_changed `instance-attribute`

content_changed

scorers_only_in_current `instance-attribute`

scorers_only_in_current

scorers_only_in_baseline `instance-attribute`

scorers_only_in_baseline

regressions `property`

regressions

(scorer_key, task_id) pairs that flipped from pass to fail. This is the CI gate: assert not diff.regressions.
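A CI check built on this property might look like the sketch below. `DiffStandIn` is a hypothetical stand-in so the example runs without the library, and the assumption that `regressions` mirrors `flipped_to_fail` is inferred from the descriptions on this page rather than from the source. The `gate` helper is likewise illustrative.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DiffStandIn:
    """Hypothetical stand-in exposing only the fields this example needs."""

    flipped_to_fail: Tuple[Tuple[str, str], ...] = ()

    @property
    def regressions(self) -> Tuple[Tuple[str, str], ...]:
        # Assumption: regressions are the pass -> fail flips.
        return self.flipped_to_fail


def gate(diff) -> None:
    """Raise AssertionError naming each regression; return None when clean."""
    assert not diff.regressions, "regressions: " + ", ".join(
        f"{k}:{t}" for k, t in diff.regressions
    )


gate(DiffStandIn())  # no regressions, so the gate passes silently

try:
    gate(DiffStandIn(flipped_to_fail=(("exact_match", "t2"),)))
    gate_message = None
except AssertionError as exc:
    gate_message = str(exc)
```

Attaching the offending pairs to the assertion message keeps the CI log actionable. Note that `assert` statements are skipped when Python runs with `-O`, so a gate that must always fire may prefer an explicit `raise`.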

summary

summary()

A printable diff: per-scorer deltas, the flips, and everything excluded.

Source code in autogen/beta/eval/results/diff.py

def summary(self) -> str:
    """A printable diff: per-scorer deltas, the flips, and everything excluded."""
    lines = [
        f"Diff {self.current_run_id} vs {self.baseline_run_id}  —  {len(self.comparable_tasks)} comparable task(s)"
    ]
    for key in sorted(self.pass_rate_deltas):
        base, cur = self.pass_rate_deltas[key]
        mark = "   REGRESSION" if cur < base else ""
        lines.append(f"  {key:<24} {base * 100:5.1f}% -> {cur * 100:5.1f}%   {(cur - base) * 100:+5.1f}{mark}")
    for key in sorted(self.mean_deltas):
        base, cur = self.mean_deltas[key]
        lines.append(f"  {key:<24} mean {base:.2f} -> {cur:.2f}   {cur - base:+.2f}")
    if self.flipped_to_fail:
        lines.append(f"  flipped pass->fail: {[f'{k}:{t}' for k, t in self.flipped_to_fail]}")
    if self.flipped_to_pass:
        lines.append(f"  flipped fail->pass: {[f'{k}:{t}' for k, t in self.flipped_to_pass]}")
    excluded = _excluded_lines(self)
    if excluded:
        lines.append("  — excluded (not comparable) —")
        lines.extend(excluded)
    return "\n".join(lines)
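To see what one pass-rate row of the summary looks like, the format string from the loop above can be exercised on its own. The `pass_rate_line` helper and the sample values are made up for illustration. Only the format spec and the `cur < base` rule are taken from the source shown.

```python
def pass_rate_line(key: str, base: float, cur: float) -> str:
    """Format one row the same way the pass-rate loop in summary() does."""
    # A row is marked only when the aggregate pass rate dropped.
    mark = "   REGRESSION" if cur < base else ""
    return f"  {key:<24} {base * 100:5.1f}% -> {cur * 100:5.1f}%   {(cur - base) * 100:+5.1f}{mark}"


dropped = pass_rate_line("exact_match", 0.75, 0.5)
steady = pass_rate_line("exact_match", 0.5, 0.5)
print(dropped)
print(steady)
```

The REGRESSION mark tracks the aggregate rate per scorer, so a scorer whose rate is unchanged gets no mark even if individual tasks flipped. Those flips appear in the separate pass->fail line of the summary.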
