RunDiff(current_run_id, baseline_run_id, comparable_tasks, pass_rate_deltas, mean_deltas, flipped_to_fail, flipped_to_pass, only_in_current, only_in_baseline, content_changed, scorers_only_in_current, scorers_only_in_baseline)
The result of comparing a run against a baseline over what they share.
pass_rate_deltas and mean_deltas each map a scorer key to a (baseline, current) pair computed over the comparable tasks. flipped_to_fail and flipped_to_pass hold the (scorer_key, task_id) pairs whose boolean verdict changed. The remaining tuples list everything excluded from the comparison.
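To make these shapes concrete, here is a minimal sketch using a hypothetical stand-in dataclass, `RunDiffShape`, that mirrors only four of the fields described above. It is not the library class. The container types, the use of strings for task ids, and the scorer keys and numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-in mirroring a few RunDiff fields; not the library class.
# Field types are assumptions inferred from the description above.
@dataclass(frozen=True)
class RunDiffShape:
    pass_rate_deltas: dict[str, tuple[float, float]]  # scorer key -> (baseline, current)
    mean_deltas: dict[str, tuple[float, float]]       # scorer key -> (baseline, current)
    flipped_to_fail: tuple[tuple[str, str], ...]      # (scorer_key, task_id) pairs
    flipped_to_pass: tuple[tuple[str, str], ...]      # (scorer_key, task_id) pairs


# Made-up values, for illustration only.
diff = RunDiffShape(
    pass_rate_deltas={"exact_match": (0.75, 0.5)},
    mean_deltas={"latency_s": (1.25, 1.5)},
    flipped_to_fail=(("exact_match", "task-3"),),
    flipped_to_pass=(),
)

# Each value is a (baseline, current) pair, so the change is current - baseline.
changes = {key: cur - base for key, (base, cur) in diff.pass_rate_deltas.items()}
print(changes)
```

Note the ordering: the baseline value comes first in each pair, so a negative difference means the current run scored lower than the baseline.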
current_run_id instance-attribute
baseline_run_id instance-attribute
comparable_tasks instance-attribute
pass_rate_deltas instance-attribute
mean_deltas instance-attribute
flipped_to_fail instance-attribute
flipped_to_pass instance-attribute
only_in_current instance-attribute
only_in_baseline instance-attribute
content_changed instance-attribute
scorers_only_in_current instance-attribute
scorers_only_in_baseline instance-attribute
regressions property
(scorer_key, task_id) pairs that flipped pass -> fail. This is the CI gate: assert not diff.regressions.
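One way to wire that gate into a test or CI script is sketched below. The `gate` helper is hypothetical, and the `SimpleNamespace` objects are stand-ins that expose only a `regressions` attribute. A real RunDiff would come from the library, and how it is obtained is not shown here.

```python
from types import SimpleNamespace


def gate(diff) -> None:
    """Fail if any (scorer_key, task_id) pair flipped from pass to fail."""
    regressions = diff.regressions
    assert not regressions, f"{len(regressions)} regression(s): {list(regressions)}"


# Hypothetical stand-ins exposing only .regressions; values are made up.
clean = SimpleNamespace(regressions=())
broken = SimpleNamespace(regressions=(("exact_match", "task-3"),))

gate(clean)  # no regressions, so nothing is raised

try:
    gate(broken)
except AssertionError as exc:
    message = str(exc)
    print(message)
```

Plain `assert` statements are skipped when Python runs with `-O`, so a CI script that may run optimized could raise an explicit exception or call `sys.exit(1)` instead.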
summary
A printable diff: per-scorer deltas, the flips, and everything excluded.
Source code in autogen/beta/eval/results/diff.py
    def summary(self) -> str:
        """A printable diff: per-scorer deltas, the flips, and everything excluded."""
        lines = [
            f"Diff {self.current_run_id} vs {self.baseline_run_id} — {len(self.comparable_tasks)} comparable task(s)"
        ]
        for key in sorted(self.pass_rate_deltas):
            base, cur = self.pass_rate_deltas[key]
            mark = " REGRESSION" if cur < base else ""
            lines.append(f" {key:<24} {base * 100:5.1f}% -> {cur * 100:5.1f}% {(cur - base) * 100:+5.1f}{mark}")
        for key in sorted(self.mean_deltas):
            base, cur = self.mean_deltas[key]
            lines.append(f" {key:<24} mean {base:.2f} -> {cur:.2f} {cur - base:+.2f}")
        if self.flipped_to_fail:
            lines.append(f" flipped pass->fail: {[f'{k}:{t}' for k, t in self.flipped_to_fail]}")
        if self.flipped_to_pass:
            lines.append(f" flipped fail->pass: {[f'{k}:{t}' for k, t in self.flipped_to_pass]}")
        excluded = _excluded_lines(self)
        if excluded:
            lines.append(" — excluded (not comparable) —")
            lines.extend(excluded)
        return "\n".join(lines)
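To see what a single pass-rate line looks like, the sketch below lifts the format specifiers from the listing into a standalone, hypothetical helper, `pass_rate_line`. The exact spacing between fields is an assumption, and the scorer key and rates are made up.

```python
def pass_rate_line(key: str, base: float, cur: float) -> str:
    # Same format specs as the pass-rate loop in summary();
    # the literal spacing between fields is an assumption.
    mark = " REGRESSION" if cur < base else ""
    return f"  {key:<24} {base * 100:5.1f}% -> {cur * 100:5.1f}% {(cur - base) * 100:+5.1f}{mark}"


line = pass_rate_line("exact_match", 0.75, 0.5)
print(line)

improved = pass_rate_line("exact_match", 0.5, 0.75)
print(improved)
```

The `<24` spec left-aligns the scorer key in a 24-character column, and `+5.1f` always prints the sign of the change in percentage points. The REGRESSION mark is appended only when the current pass rate is strictly below the baseline.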