STT & TTS
The STT → Agent → TTS flow turns any existing Agent into a voice agent without changing the agent itself. Speech-to-text is added as a pipeline wrapper; text-to-speech is added as an observer that listens to the model's streamed message chunks.
Audio I/O primitives
SoundDeviceRecorder captures microphone input and SoundDevicePlayer plays synthesized speech. Both are thin wrappers around the sounddevice library and share the same event stream.
The recorder produces a VoiceInput containing 16-bit PCM bytes plus the sample rate and channel count. The player subscribes to SynthesizedAudioEvent on its context's stream and plays each chunk on a background thread.
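The description above pins down the `VoiceInput` payload (raw 16-bit PCM bytes plus sample rate and channel count), which is enough to sanity-check the size math. The dataclass below is a hypothetical stand-in for illustration, not the library's real class:

```python
import struct
from dataclasses import dataclass

# Hypothetical stand-in for the library's VoiceInput: raw 16-bit PCM
# bytes plus the sample rate and channel count, as described above.
@dataclass
class VoiceInput:
    pcm: bytes          # little-endian signed 16-bit samples
    sample_rate: int    # e.g. 16_000 Hz
    channels: int       # 1 = mono

    @property
    def duration(self) -> float:
        # 2 bytes per sample, per channel
        return len(self.pcm) / (2 * self.channels * self.sample_rate)

# One second of mono silence at 16 kHz: 16_000 samples * 2 bytes = 32_000 bytes
silence = VoiceInput(pcm=struct.pack("<16000h", *([0] * 16000)),
                     sample_rate=16_000, channels=1)
```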
Note
Recorder.record(duration=...) is a one-shot, blocking helper for the turn-by-turn flow. For continuous streaming (used by LiveAgent), use the recorder as an async context manager — see LiveAgent.
Speech-to-Text
OpenAITranscriber implements the STTConfig protocol and exposes a .pipe(agent) method that wraps an Agent in a VoicePipeline. Calling pipeline.ask(voice) transcribes the audio and forwards the text to the agent's normal ask() flow.
pipeline.ask(...) returns a VoiceReply that exposes the same surface as AgentReply (.body, .response, .history, .ask(...)) plus .ask(voice_input) for the next voice turn. The agent's history is preserved across turns.
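The wrap-then-ask flow can be sketched with stubs. The stub classes and the pipeline internals below are assumptions for illustration, not the library's implementation:

```python
# Sketch of the wrapper pattern: .pipe(agent) returns a pipeline whose
# .ask(voice) transcribes first, then delegates to the agent's normal
# text ask(). Stubs stand in for the real OpenAITranscriber and Agent.
class StubTranscriber:
    def transcribe(self, voice: bytes) -> str:
        return "what is 2 + 2?"          # pretend STT result

    def pipe(self, agent) -> "VoicePipeline":
        return VoicePipeline(self, agent)

class StubAgent:
    def __init__(self):
        self.history: list[str] = []

    def ask(self, text: str) -> str:
        self.history.append(text)        # history survives across turns
        return f"You said: {text}"

class VoicePipeline:
    def __init__(self, transcriber, agent):
        self.transcriber, self.agent = transcriber, agent

    def ask(self, voice: bytes) -> str:
        return self.agent.ask(self.transcriber.transcribe(voice))

reply = StubTranscriber().pipe(StubAgent()).ask(b"\x00\x00")
```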
Tip
The transcriber emits TranscriptionChunkEvent and TranscriptionCompletedEvent on the agent's stream as soon as the transcription server starts producing tokens. Subscribe to them to display live captions.
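A minimal sketch of the live-captions idea, assuming a `subscribe(event_type, handler)`-style stream API; the real stream interface may differ:

```python
from collections import defaultdict

# Toy pub/sub stream standing in for the agent's event stream. The
# subscribe() signature here is an assumption, not the library's API.
class Stream:
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subs[event_type].append(handler)

    def emit(self, event):
        for handler in self._subs[type(event)]:
            handler(event)

class TranscriptionChunkEvent:
    def __init__(self, text: str):
        self.text = text

captions: list[str] = []
stream = Stream()
# Append each partial transcript as it arrives, e.g. to drive a caption UI.
stream.subscribe(TranscriptionChunkEvent, lambda e: captions.append(e.text))
stream.emit(TranscriptionChunkEvent("Hello, "))
stream.emit(TranscriptionChunkEvent("world."))
```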
Translation
If you want the user's speech transcribed into English regardless of input language, swap in OpenAITranslationTranscriber. It has the same API as OpenAITranscriber but uses OpenAI's translation endpoint.
Text-to-Speech
TTSObserver is an observer that listens to ModelMessageChunk events as the agent streams its response, batches them into sentence-sized chunks, calls a TTS provider, and emits SynthesizedAudioEvents onto the stream. A SoundDevicePlayer attached to the same stream then plays them.
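The sentence-sized batching can be sketched as a small generator. The boundary heuristic below (flush on `.`, `!`, or `?` followed by whitespace) is an assumption about the observer's internals:

```python
import re

# Accumulate streamed chunks in a buffer and flush a batch whenever a
# sentence boundary appears, so synthesis can start before the reply ends.
def batch_sentences(chunks):
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # flush every complete sentence ending in ., ! or ? plus whitespace
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()   # trailing partial sentence

batches = list(batch_sentences(["Hel", "lo there. How ", "are you? Bye"]))
# → ["Hello there.", "How are you?", "Bye"]
```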
Warning
The agent's config must be set up for streaming output (e.g., streaming=True). TTSObserver works at the ModelMessageChunk granularity — if the model emits a single non-streaming ModelMessage, the observer will still synthesize it, but you lose the sentence-level pipelining that keeps latency low.
Voice and speed
OpenAITTSConfig accepts the standard OpenAI TTS parameters, including the voice and the playback speed.
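A hedged example: the keyword names below mirror OpenAI's speech API (`model`, `voice`, `speed`), but whether `OpenAITTSConfig` spells them exactly this way is an assumption.

```python
# Parameter names assumed from OpenAI's speech API; verify against the
# actual OpenAITTSConfig signature.
tts = OpenAITTSConfig(
    model="tts-1",    # or "tts-1-hd" for higher quality
    voice="nova",     # one of OpenAI's preset voices
    speed=1.0,        # playback rate multiplier, 0.25-4.0
)
```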
Combining STT and TTS
The full round-trip — voice in, voice out — is just both halves wired up at once: pipe the agent through the transcriber, attach a TTSObserver, and share a stream with the player.
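The round trip can be sketched end to end with stand-ins. Every class and function below is a stub for illustration; the real components are the recorder, transcriber pipeline, TTSObserver, and player described above:

```python
# Voice bytes in, "synthesized" per-sentence audio out, via stubs.
class Agent:
    def ask(self, text: str):
        # stream the reply as chunks, as a streaming model would
        return iter(["Four. ", "That ", "was ", "easy."])

def transcribe(voice: bytes) -> str:
    return "what is 2 + 2?"              # pretend STT result

def tts(sentence: str) -> bytes:
    return sentence.encode()             # pretend synthesis

def round_trip(voice: bytes, agent: Agent) -> list[bytes]:
    audio, buffer = [], ""
    for chunk in agent.ask(transcribe(voice)):
        buffer += chunk
        # flush each complete sentence so playback can start early
        if buffer.rstrip().endswith((".", "!", "?")):
            audio.append(tts(buffer.strip()))
            buffer = ""
    if buffer.strip():
        audio.append(tts(buffer.strip()))
    return audio

played = round_trip(b"\x00\x00", Agent())
```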
Tip
player.join() blocks the main task until the synthesized audio queue has drained. Use it between turns when you want the assistant to finish speaking before the next recording starts — otherwise the recorder will capture the tail of the assistant's voice.
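The drain-then-return behaviour can be sketched with a plain queue and a worker thread; this is a toy model, not the player's actual implementation:

```python
import queue
import threading
import time

class Player:
    """Toy player: a background thread drains an audio queue."""
    def __init__(self):
        self._q = queue.Queue()
        self.played: list[bytes] = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            chunk = self._q.get()
            time.sleep(0.01)          # stand-in for actual audio playback
            self.played.append(chunk)
            self._q.task_done()

    def enqueue(self, chunk: bytes):
        self._q.put(chunk)

    def join(self):
        self._q.join()                # blocks until every chunk is played

player = Player()
for c in (b"one", b"two", b"three"):
    player.enqueue(c)
player.join()                         # returns only after all three played
```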