# Arena
The `ag2 arena` command lets you systematically compare agent implementations, models, or strategies using evaluation suites.
## Basic Comparison
Compare two agent files against an eval suite:
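A hypothetical invocation (the positional agent-file syntax and the paths shown are illustrative assumptions, not syntax confirmed by this page):

```shell
# Run both agent implementations against every case in the suite.
# agent_a.py, agent_b.py, and evals/support.yaml are placeholder paths.
ag2 arena agent_a.py agent_b.py --eval evals/support.yaml
```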
This runs both agents against every test case and reports:
- Pass rate per agent
- Quality scores (from `llm_judge` assertions)
- Cost comparison
- Latency comparison
- Statistical significance of differences
## Model Comparison
Compare the same agent across different LLM backends:
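For example (the model names here are placeholders for whatever backends your providers expose):

```shell
# Same agent file, two backends; --models takes a comma-separated list.
ag2 arena agent.py --models gpt-4o,claude-sonnet-4 --eval evals/
```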
## Interactive Mode
Run head-to-head comparisons where you vote on the better output:
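A sketch of starting an interactive session (the pairing syntax is an assumption):

```shell
# Outputs are shown side by side; you vote on the better one per test case.
ag2 arena agent_a.py agent_b.py --eval evals/ --interactive
```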
## Tournament Mode
Run multiple agents against multiple benchmarks with a leaderboard:
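For instance (the glob-of-agents form is an assumption):

```shell
# Every agent runs against every eval; results are ranked on a leaderboard.
ag2 arena agents/*.py --eval evals/ --tournament
```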
## ELO Leaderboard
Arena maintains ELO ratings across sessions in `~/.ag2/arena/leaderboard.json`.
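Per the `--leaderboard` flag, the current standings can be printed without running any evals:

```shell
# Show the persisted ELO standings from ~/.ag2/arena/leaderboard.json.
ag2 arena --leaderboard
```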
## Options
| Flag | Description |
|---|---|
| `--eval` | Path to eval YAML file or directory |
| `--models` | Comma-separated list of models to compare |
| `--interactive` | Interactive voting mode |
| `--tournament` | Tournament mode with leaderboard |
| `--leaderboard` | Show current ELO leaderboard |
| `--budget` | Maximum cost budget for the run |
| `--dry-run` | Estimate cost without running |
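Combining the cost flags, a cautious workflow might look like this (the budget unit is an assumption; check your provider's pricing):

```shell
# Preview the estimated cost first, then run with a hard spending cap.
ag2 arena agent_a.py agent_b.py --eval evals/ --dry-run
ag2 arena agent_a.py agent_b.py --eval evals/ --budget 10
```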