Arena

The ag2 arena command lets you systematically compare agent implementations, models, or strategies using evaluation suites.

Basic Comparison

Compare two agent files against an eval suite:

ag2 arena agent_v1.py agent_v2.py --eval tests/eval.yaml
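
The structure of the eval file itself isn't shown on this page. As a rough illustration only, a minimal suite might pair an input with an llm_judge assertion; the field names below are hypothetical, not the documented schema — consult the eval suite documentation for the real format.

```yaml
# Hypothetical layout -- field names are illustrative, not the real schema.
cases:
  - name: refund-policy
    input: "What is your refund policy?"
    assertions:
      - type: llm_judge
        criteria: "Answer mentions the 30-day window"
```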

This runs both agents against every test case and reports:

  • Pass rate per agent
  • Quality scores (from llm_judge assertions)
  • Cost comparison
  • Latency comparison
  • Statistical significance of differences
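
The page doesn't specify which statistical test Arena applies. One plausible approach to comparing pass rates is a two-proportion z-test; the sketch below is an assumption about the method, not Arena's actual implementation.

```python
from math import sqrt, erf

def pass_rate_significance(passes_a: int, passes_b: int, n: int) -> float:
    """Two-proportion z-test on pass rates (illustrative only; the
    test Arena actually runs may differ). Returns a two-sided p-value."""
    p_a, p_b = passes_a / n, passes_b / n
    pooled = (passes_a + passes_b) / (2 * n)          # pooled proportion
    se = sqrt(pooled * (1 - pooled) * 2 / n)          # pooled standard error
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Agent A passes 42/50 cases, agent B passes 31/50
p_value = pass_rate_significance(42, 31, 50)
```

A small p-value (conventionally below 0.05) suggests the gap in pass rates is unlikely to be noise from the test cases alone.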

Model Comparison

Compare the same agent across different LLM backends:

ag2 arena my_agent.py --models gpt-4o,claude-sonnet-4-20250514 --eval tests/eval.yaml

Interactive Mode

Run head-to-head comparisons where you vote on the better output:

ag2 arena agent_a.py agent_b.py --interactive

Tournament Mode

Run multiple agents against multiple benchmarks with a leaderboard:

ag2 arena agent_*.py --eval tests/ --tournament

ELO Leaderboard

Arena maintains ELO ratings across sessions in ~/.ag2/arena/leaderboard.json:

ag2 arena --leaderboard
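
Arena's exact rating parameters aren't documented here, but the standard ELO update gives a feel for how leaderboard scores move after each head-to-head result. The K-factor of 32 below is an assumption, not a confirmed Arena setting.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """Standard ELO update. score_a is 1.0 if agent A wins, 0.5 for a
    draw, 0.0 for a loss. K-factor of 32 is a common default, assumed
    here; Arena's actual parameters may differ."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected score
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta                   # zero-sum exchange

# Two equally rated agents; A wins, so A gains 16 points and B loses 16
new_a, new_b = elo_update(1000, 1000, 1.0)
```

Because the exchange is zero-sum, an upset win against a higher-rated agent moves both ratings much further than a win over an equal.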

Options

Flag           Description
--eval         Path to eval YAML file or directory
--models       Comma-separated list of models to compare
--interactive  Interactive voting mode
--tournament   Tournament mode with leaderboard
--leaderboard  Show current ELO leaderboard
--budget       Maximum cost budget for the run
--dry-run      Estimate cost without running
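
The page doesn't say how --dry-run computes its estimate; a back-of-envelope version is runs × tokens × price. Everything in the sketch below — function name, token counts, and per-million-token prices — is hypothetical; take real prices from your provider's pricing page.

```python
def estimate_cost(num_cases: int, num_agents: int,
                  avg_input_tokens: int, avg_output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Rough cost estimate for an arena run, in dollars. Prices are
    per million tokens. Illustrative math only, not Arena's actual
    --dry-run logic."""
    runs = num_cases * num_agents
    per_run = (avg_input_tokens * price_in_per_mtok
               + avg_output_tokens * price_out_per_mtok) / 1e6
    return runs * per_run

# 100 cases x 2 agents, ~1500 input / 500 output tokens per run,
# at hypothetical prices of $2.50 in / $10.00 out per million tokens
cost = estimate_cost(100, 2, 1500, 500, 2.50, 10.00)
```

Pairing --dry-run with --budget lets you sanity-check an estimate like this before committing to a full tournament.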