Building low-latency voice agents in 3 lines of code with GPT Realtime 2

OpenAI recently announced advanced voice intelligence in the API, including GPT Realtime 2 for lower-latency, more natural spoken interactions. AG2 Beta wraps that class of model behind LiveAgent: one bidirectional session, continuous audio in and out, and provider-side voice activity detection so users can speak and interrupt like on a phone call—not like a walkie-talkie app.
In this post we walk through why that matters, how LiveAgent compares to the classic STT → Agent → TTS stack, and how to add tools and subagent-style delegation without giving up the realtime voice surface.
Why voice agents matter#
Voice is often the lowest-friction channel for hands-busy or eyes-busy tasks: driving, cooking, field work, accessibility, and quick “just tell me” moments on a phone. Users expect low latency, natural turn-taking, and the ability to cut in when the model gets something wrong or when new information arrives. If every utterance waits for a full record → transcribe → reason → synthesize cycle, it feels like filling out a form, not a conversation.
Provider realtime APIs push much of that rhythm into the model and transport layer: streaming audio, built-in VAD (Voice Activity Detection), and barge-in semantics (where supported) so the application is not re-implementing half of a telephony stack in user space.
Basic code sample#
Once you've installed AG2 with the OpenAI and sounddevice extras (see the pip command under "What's next"), import LiveAgent, SoundDevicePlayer, SoundDeviceRecorder, and the openai config helpers from autogen.beta.live. The running session is then just three lines: construct the agent, open the session and shared I/O in one async with block, then block until you cancel.
That is the whole “wire microphone and speaker to GPT Realtime 2” loop. A self-contained script (imports and asyncio.run) looks like this:
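Here's a hedged, self-contained sketch. The import path and class names come from the post above; the constructor arguments and the session() / recorder / player names are assumptions, so check the LiveAgent reference for exact signatures.

```python
# Sketch only: LiveAgent(config=...), agent.session(...), and the recorder/player
# keyword names are assumptions; only the import path and class names come from
# the docs referenced in this post.
import asyncio

from autogen.beta.live import (
    LiveAgent,
    SoundDevicePlayer,
    SoundDeviceRecorder,
    openai,  # provider config helpers
)


async def main() -> None:
    # 1. Construct the agent with an OpenAI realtime config.
    agent = LiveAgent(config=openai.RealtimeConfig(model="gpt-realtime-2"))

    # 2. Open the session and shared audio I/O in one async with.
    async with agent.session(
        recorder=SoundDeviceRecorder(),  # microphone in
        player=SoundDevicePlayer(),      # speaker out
    ):
        # 3. Block until cancelled (Ctrl+C); audio streams both ways meanwhile.
        await asyncio.Event().wait()


if __name__ == "__main__":
    asyncio.run(main())
```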
Beyond the defaults, you can tune how the assistant sounds and when it speaks: OpenAI realtime AudioOutput accepts voice and speed, and InputConfig carries VAD / turn-detection options (for example semantic VAD with interruption). See the Providers (OpenAI) documentation for voices and configs.
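As a rough sketch, that tuning might look like the snippet below; the field names follow the prose above, while the voice name and turn-detection value are illustrative rather than documented defaults.

```python
# Sketch: voice/speed/turn-detection fields mirror the prose above; the
# "marin" voice and "semantic_vad" value are illustrative, not guaranteed names.
from autogen.beta.live import LiveAgent, openai

config = openai.RealtimeConfig(
    model="gpt-realtime-2",
    output=openai.AudioOutput(voice="marin", speed=1.1),      # how the assistant sounds
    input=openai.InputConfig(turn_detection="semantic_vad"),  # when a turn counts as finished
)
agent = LiveAgent(config=config)
```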
How it works with tools and subagents#
Tools#
LiveAgent uses the same @agent.tool decorator as a text Agent. Calls go through AG2's normal tool executor, and results are sent back into the realtime session automatically, so middleware and human-in-the-loop (HITL) patterns you use elsewhere stay aligned with the rest of your stack.
More detail: Tools in a realtime session and Tools.
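For illustration, a minimal sketch of a tool on a LiveAgent; only the @agent.tool decorator comes from the docs above, and the weather function is a made-up example.

```python
# Sketch: @agent.tool is described above; get_weather is a made-up example
# and would normally call a real API instead of returning a canned string.
from autogen.beta.live import LiveAgent, openai

agent = LiveAgent(config=openai.RealtimeConfig(model="gpt-realtime-2"))


@agent.tool
def get_weather(city: str) -> str:
    """Return a short weather summary for a city."""
    # The realtime model receives this return value as the tool result
    # and speaks its answer from it.
    return f"It is sunny and 22°C in {city}."
```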
Subagents (delegation)#
LiveAgent is built around a realtime audio session, not the same ask() loop as a text Agent. When the voice surface needs deeper reasoning, longer context work, or tooling you prefer to keep on the text side, expose a separate Agent as a callable tool with Agent.as_tool() and pass it into LiveAgent via tools=[...]. The realtime model issues a tool call; AG2 runs that nested Agent and sends the result back into the live session.
That gives you subagent-style delegation without pretending the realtime session is a full Agent.ask loop. See LiveAgent vs Agent.
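A hedged sketch of that pattern: Agent.as_tool() and tools=[...] come from the text above, while the Agent import path and constructor arguments are assumptions.

```python
# Sketch: Agent.as_tool() and tools=[...] come from the post; the Agent
# import path and constructor arguments shown here are assumptions.
from autogen.beta import Agent
from autogen.beta.live import LiveAgent, openai

# A text-side Agent that handles deeper reasoning or long-context work.
researcher = Agent(
    name="researcher",
    instructions="Answer research questions and return a concise summary.",
    # configure the model / tools here as you would for any text Agent
)

# Expose it to the realtime session as a callable tool.
voice_agent = LiveAgent(
    config=openai.RealtimeConfig(model="gpt-realtime-2"),
    tools=[researcher.as_tool()],
)
```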
Supported providers#
LiveAgent accepts any RealtimeConfig. AG2 Beta ships OpenAI and Gemini implementations today.
- OpenAI — gpt-realtime-2 with AudioOutput for voice and speed, or TextOutput for text-only replies; InputConfig controls VAD and turn detection (defaults lean toward semantic VAD with interruption).
- Gemini — e.g. gemini-3.1-flash-live-preview with live audio.
OpenAI and Gemini voices and config snippets: Providers.
Broader install and I/O notes: Voice & Realtime overview.
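Switching providers keeps the same shape. The sketch below assumes a gemini config helper that mirrors the openai one; only the model string and the fact that LiveAgent accepts any RealtimeConfig come from this post.

```python
# Sketch: the gemini helper module and class name are assumptions that mirror
# the openai helpers; only the model string comes from the post.
from autogen.beta.live import LiveAgent, gemini

agent = LiveAgent(
    config=gemini.RealtimeConfig(model="gemini-3.1-flash-live-preview"),
)
```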
LiveAgent or STT–TTS#
AG2 Beta supports two voice patterns:
| | STT → Agent → TTS | LiveAgent (realtime) |
|---|---|---|
| Shape | Discrete turns: transcribe, call a text Agent, synthesize | One full-duplex session for the whole conversation |
| Typical latency | on the order of 1–3 s per turn | sub-500 ms in many setups |
| Turn detection | Your app decides when a “turn” ends | Provider-driven (e.g. semantic VAD, interruption) |
| Best when… | You need every Agent feature each turn (structured output, rich middleware, arbitrary model routing) | You need phone-call-like experience: continuous audio, interruption, minimal glue code |
The two are complementary, not competing. Reach for STT–TTS when the text agent must be the source of truth for each step. Reach for LiveAgent when latency and conversational flow matter most and you are happy to drive the session through a realtime model such as gpt-realtime-2.
What's next#
- Read the docs — start with LiveAgent — Realtime Voice Sessions and the Voice & Realtime overview.
- Install and try — pip install "ag2[openai]" "sounddevice[numpy]", then run the sketches under examples/live_playground/ in the AG2 repo.
- Star the repo — if LiveAgent is useful, a star on github.com/ag2ai/ag2 helps others find the project.
- Join the community — share builds, ask questions, and compare notes on the AG2 Discord.
We are excited to get your feedback on LiveAgent.