Pairwise
Sometimes "which answer is better?" is easier and more reliable than scoring each answer in isolation — especially for subjective quality. run_pairwise runs two variants over the same suite and asks a comparator to pick a winner per task, then reports B's win-rate with a confidence interval.
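To make "win-rate with a confidence interval" concrete, here is a minimal standalone sketch. It assumes a Wilson score interval and assumes ties are excluded from the denominator; the source does not say which interval run_pairwise uses or how it treats ties, and `win_rate_with_ci` is a hypothetical helper, not part of the library.

```python
import math


def win_rate_with_ci(b_wins: int, a_wins: int, z: float = 1.96):
    """Return (rate, low, high) for B's win-rate over decided pairs.

    Assumptions (not taken from the library): ties are dropped before
    calling this, and the interval is a Wilson score interval at the
    confidence level implied by z (1.96 is roughly 95%).
    """
    n = b_wins + a_wins
    if n == 0:
        raise ValueError("need at least one decided pair")
    p = b_wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half_width, centre + half_width


rate, low, high = win_rate_with_ci(b_wins=14, a_wins=6)
print(f"B wins {rate:.0%} of decided pairs (interval {low:.2f} to {high:.2f})")
```

If the interval still straddles 0.5, the suite is probably too small to call B the better variant.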
run_pairwise
PairwiseRunResult reports B's win-rate, ties, position flips, and — when you supply both a model and a human comparator — their agreement (Cohen's κ).
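For readers unfamiliar with Cohen's κ, the sketch below computes it for two lists of per-task verdicts. It is a plain-Python illustration of the statistic, not the library's code; `cohens_kappa` and the "a" / "b" / "tie" label values are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(labels_x, labels_y):
    """Cohen's kappa for two raters who labelled the same items in order."""
    if len(labels_x) != len(labels_y) or not labels_x:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_x)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(x == y for x, y in zip(labels_x, labels_y)) / n
    # Expected agreement if each rater labelled independently at
    # their own marginal rates.
    count_x, count_y = Counter(labels_x), Counter(labels_y)
    expected = sum(count_x[k] * count_y[k] for k in count_x.keys() | count_y.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


model_verdicts = ["a", "b", "b", "tie", "b", "a"]
human_verdicts = ["a", "b", "a", "tie", "b", "a"]
print(round(cohens_kappa(model_verdicts, human_verdicts), 3))
```

κ corrects raw agreement for chance: 1.0 means perfect agreement, and values near 0 mean the judge agrees with the human no more often than chance would predict.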
Comparators
A comparator decides the winner of one pair. Two kinds ship — an LLM judge, and a human:
pairwise_judge(config, criterion=, key=): an LLM. It judges each pair in both orders (A-then-B and B-then-A) and counts a win only when the two verdicts agree, which cancels position bias.
human_pairwise(...) / human_labels(...): a person decides. Covered next.
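The both-orders check can be sketched independently of any LLM. In the example below the `judge` callable, its "first" / "second" / "tie" return values, and the choice to record an inconsistent pair as a tie are all assumptions for illustration; the source only states that a win is counted when the verdict is consistent.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (task, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def consistent_verdict(judge: Judge, task: str, answer_a: str, answer_b: str) -> Tuple[str, bool]:
    """Judge a pair in both orders; return (winner, flipped)."""
    forward = judge(task, answer_a, answer_b)  # A shown first
    reverse = judge(task, answer_b, answer_a)  # B shown first
    # Translate position-based verdicts back to variant names.
    forward_winner = {"first": "a", "second": "b"}.get(forward, "tie")
    reverse_winner = {"first": "b", "second": "a"}.get(reverse, "tie")
    # A flip means the two orders named opposite winners.
    flipped = {forward_winner, reverse_winner} == {"a", "b"}
    # Assumption: any disagreement between the orders is scored as a tie.
    winner = forward_winner if forward_winner == reverse_winner else "tie"
    return winner, flipped


def always_first(task, first, second):
    """A maximally position-biased judge, used to show the check working."""
    return "first"


print(consistent_verdict(always_first, "capital of France?", "Paris.", "It is Paris."))
```

A judge that always prefers whichever answer it sees first never produces a win under this scheme, which is the point of running both orders.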
Human comparison
A human is often the ground truth for "which is better." There are two ways to collect those judgements.
Inline: judge as the run goes
human_pairwise prompts a reviewer for each task. The default prompt prints the question and both answers blinded — Response 1 / Response 2, in random order so there's no position bias — and reads 1 / 2 / tie:
Task: What's the capital of France?
[1] The capital of France is Paris.
[2] Paris.
Which is better? 1 / 2 / tie: 2
Pass your own ask callback to collect the choice from a UI or notebook instead of the terminal. It receives (task, response_1, response_2) (still blinded) and returns "1", "2", or "tie":
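A minimal sketch of such a callback follows. Only the (task, response_1, response_2) arguments and the "1" / "2" / "tie" return values come from the description above; `make_ask`, the `read_choice` parameter, and formatting `task` with `str()` are assumptions, and how the callback is handed to human_pairwise is not shown because the exact parameter is not given here.

```python
from typing import Callable

VALID_CHOICES = {"1", "2", "tie"}


def make_ask(read_choice: Callable[[str], str]):
    """Build an ask callback around any function that shows a prompt
    and returns the reviewer's raw answer (input(), a widget, a queue...).
    """
    def ask(task, response_1, response_2) -> str:
        # The responses arrive already blinded; never add model names here.
        prompt = (
            f"Task: {task}\n"
            f"[1] {response_1}\n"
            f"[2] {response_2}\n"
            "Which is better? 1 / 2 / tie: "
        )
        while True:
            choice = read_choice(prompt).strip().lower()
            if choice in VALID_CHOICES:
                return choice
            # Anything else is rejected and the reviewer is asked again.
    return ask


# Example with scripted answers standing in for a UI: the first reply is
# invalid, so the callback asks again and accepts the second.
scripted = iter(["3", " TIE "])
ask = make_ask(lambda prompt: next(scripted))
print(ask("What's the capital of France?", "The capital of France is Paris.", "Paris."))
```

Validating inside the callback keeps malformed input from reaching the run, whatever front end collects it.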
At scale: blinded offline labeling
Past a handful of cases, or with several labelers, prompting at a terminal stops being practical. Export a blinded manifest, have people label it in any tool, then import the results. This is the workflow for a real human-eval pass.
Each manifest line is blinded — the labeler sees the two answers but not which model produced which:
{"case_id": "task-1::more helpful", "task_id": "task-1", "criterion": "more helpful",
"task_input": "What's the capital of France?",
"response_1": "Paris.", "response_2": "The capital of France is Paris.", "first_variant": "b"}
first_variant is the de-blinding key — it records which model is Response 1. Keep it out of the labeler's view; human_labels uses it to map their "1" / "2" back to the right variant. Export several criteria at once and add one human_labels(criterion=…, key=…) comparator per criterion.
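The de-blinding step can be sketched as follows. The field names are taken from the manifest line shown earlier; `labeler_view` and `deblind` are hypothetical helpers written for illustration, not the library's functions.

```python
def labeler_view(record: dict) -> dict:
    """Copy of a manifest record that is safe to show a labeler."""
    return {key: value for key, value in record.items() if key != "first_variant"}


def deblind(record: dict, label: str) -> str:
    """Map a labeler's "1" / "2" / "tie" back to variant "a", "b", or "tie"."""
    if label == "tie":
        return "tie"
    if label not in ("1", "2"):
        raise ValueError(f"unexpected label: {label!r}")
    first = record["first_variant"]        # variant shown as Response 1
    second = "b" if first == "a" else "a"  # variant shown as Response 2
    return first if label == "1" else second


record = {
    "case_id": "task-1::more helpful",
    "task_id": "task-1",
    "criterion": "more helpful",
    "task_input": "What's the capital of France?",
    "response_1": "Paris.",
    "response_2": "The capital of France is Paris.",
    "first_variant": "b",
}

print(sorted(labeler_view(record)))  # no first_variant key
print(deblind(record, "1"))          # Response 1 was variant b
```

Keeping the key in the record but out of the labeler's view means the labels can be joined back without a separate lookup table.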
Grading existing pairs: evaluate_pairwise
evaluate_pairwise is to run_pairwise what evaluate_traces is to run_agent: the grade-only version. Given two trace sources you already have, it pairs them by task_id and runs the comparators, with no agent invocation. The offline-labeling flow above is one use; it works just as well with pairwise_judge to re-grade captured champion/challenger traces with an LLM.
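Pairing by task_id can be pictured with a small standalone sketch. The trace shape used here (a dict with "task_id" and "output") and the `pair_by_task_id` helper are assumptions for illustration; the library's real trace objects and its handling of unmatched tasks may differ.

```python
def pair_by_task_id(traces_a, traces_b):
    """Return (pairs, unmatched_ids) for two lists of trace dicts.

    pairs holds (task_id, output_a, output_b) for tasks present in both
    sources; unmatched_ids lists tasks found in only one of them.
    """
    b_by_id = {trace["task_id"]: trace for trace in traces_b}
    a_ids = set()
    pairs = []
    for trace in traces_a:
        task_id = trace["task_id"]
        a_ids.add(task_id)
        if task_id in b_by_id:
            pairs.append((task_id, trace["output"], b_by_id[task_id]["output"]))
    # Symmetric difference: ids that appear in exactly one source.
    unmatched_ids = sorted(a_ids ^ set(b_by_id))
    return pairs, unmatched_ids


champion = [
    {"task_id": "task-1", "output": "Paris."},
    {"task_id": "task-2", "output": "4"},
]
challenger = [
    {"task_id": "task-1", "output": "The capital of France is Paris."},
    {"task_id": "task-3", "output": "Blue."},
]

pairs, unmatched = pair_by_task_id(champion, challenger)
print(pairs)
print(unmatched)
```

Surfacing the unmatched ids is a design choice in this sketch: silently dropping them would hide the fact that the two captures did not cover the same suite.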