Multimodal Inputs#
AG2 agents can process images, audio, video, and documents alongside text. The input event system provides a unified API across providers — you create inputs the same way regardless of which model you use.
Input Types#
| Factory Function | Creates | Description |
|---|---|---|
| ImageInput(...) | Image input | JPEG, PNG, GIF, WebP |
| AudioInput(...) | Audio input | WAV, MP3, OGG, FLAC, AAC |
| VideoInput(...) | Video input | MP4, WebM, MOV, MKV, MPEG |
| DocumentInput(...) | Document input | PDF, TXT, HTML, Markdown, CSV, JSON, Office formats |
Each factory function accepts the data in several forms, such as a URL, raw binary data, or a local file path.
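The exact parameters of the factory functions are not shown in this section, so the sketch below uses only the Python standard library. It illustrates how the same image might be prepared as a URL string, as raw bytes read from a local path, and as base64 text. The helper `load_binary`, the file name, and the example URL are hypothetical and not part of AG2.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def load_binary(path: str) -> tuple[bytes, str]:
    """Read a local file and guess its media type from the extension.

    Hypothetical helper for illustration; not an AG2 function.
    """
    media_type, _ = mimetypes.guess_type(path)
    if media_type is None:
        raise ValueError(f"cannot determine media type for {path!r}")
    return Path(path).read_bytes(), media_type


# Form 1: a URL is just a string; nothing is downloaded locally.
image_url = "https://example.com/photo.png"

# Form 2: raw bytes plus a media type, read from a local path.
# A temporary file stands in for a real image here.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "photo.png"
    sample.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only, not a full image
    data, media_type = load_binary(str(sample))

# Form 3: base64 text, a common encoding for inline binary payloads.
encoded = base64.b64encode(data).decode("ascii")
```

Whichever form is used, the value would then be handed to the matching factory function. The parameter names for that step are not confirmed here.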
Using Inputs with Agents#
Pass inputs directly to agent.ask() as positional arguments alongside text.
You can pass multiple inputs in a single request.
Provider Support#
Not all providers support all input types. The table below shows what each provider accepts:
| Input Type | OpenAI | OpenAI Responses | Gemini | Anthropic |
|---|---|---|---|---|
| Text | Yes | Yes | Yes | Yes |
| Image (URL) | Yes | Yes | Yes | Yes |
| Image (binary) | Yes | Yes | Yes | Yes |
| Audio (URL) | - | - | Yes | - |
| Audio (binary) | Yes | - | Yes | - |
| Video (URL) | - | - | Yes | - |
| Video (binary) | - | - | Yes | - |
| Document (URL) | - | Yes | Yes | Yes |
| Document (binary) | - | - | Yes | Yes |
| File ID | - | Yes | - | Yes |
If you pass an unsupported input type to a provider, an UnsupportedInputError is raised with a clear message indicating what is not supported and by which provider.
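The support table can also be read as a lookup structure. The sketch below transcribes it into a dictionary and shows one way a pre-flight check could work. The `UnsupportedInputError` class defined here is a local stand-in, and `check_support` is a hypothetical helper; AG2's own error class and validation logic may differ.

```python
class UnsupportedInputError(Exception):
    """Local stand-in for the error described above; not the AG2 class."""


ALL = {"OpenAI", "OpenAI Responses", "Gemini", "Anthropic"}

# Support matrix transcribed from the table above.
SUPPORT = {
    "Text": ALL,
    "Image (URL)": ALL,
    "Image (binary)": ALL,
    "Audio (URL)": {"Gemini"},
    "Audio (binary)": {"OpenAI", "Gemini"},
    "Video (URL)": {"Gemini"},
    "Video (binary)": {"Gemini"},
    "Document (URL)": {"OpenAI Responses", "Gemini", "Anthropic"},
    "Document (binary)": {"Gemini", "Anthropic"},
    "File ID": {"OpenAI Responses", "Anthropic"},
}


def check_support(provider: str, input_type: str) -> None:
    """Raise if the provider does not accept the given input type."""
    if input_type not in SUPPORT:
        raise KeyError(f"unknown input type: {input_type!r}")
    if provider not in SUPPORT[input_type]:
        raise UnsupportedInputError(
            f"{provider} does not support {input_type} input"
        )


def providers_for(input_type: str) -> list[str]:
    """List the providers that accept the given input type, sorted by name."""
    return sorted(SUPPORT[input_type])
```

A check like this is only useful for choosing a provider up front; the library still raises its own error when an unsupported input reaches a provider.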
Provider-Specific Details#
Gemini#
Gemini has the broadest multimodal support — it accepts images, audio, video, and documents in all forms (URL, binary, and local file path).
YouTube URLs are supported directly.
Google Files API — for large files (>20MB), upload via the Google Files API first and pass the returned URI:
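A minimal sketch of the size check implied above, using only the standard library. The helper `needs_files_api` is hypothetical, and treating 20MB as 20 × 1024 × 1024 bytes is an assumption; the upload call itself is left out because its signature is not shown in this section.

```python
import os
import tempfile

# 20 MB threshold from the text above; binary megabytes are an assumption.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def needs_files_api(path: str, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """Return True when the file is larger than the inline limit."""
    return os.path.getsize(path) > limit


# A small placeholder file stands in for a real video.
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as handle:
    handle.write(b"\0" * 1024)  # 1 KiB, far below the limit
    small_file = handle.name

use_upload = needs_files_api(small_file)
os.remove(small_file)
```

When the check returns True, the file would be uploaded first and the returned URI passed in place of the local data.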
Vendor Metadata#
Gemini supports provider-specific settings via vendor_metadata on binary inputs. These map to Gemini Part fields:
| Key | Type | Description |
|---|---|---|
| media_resolution | str | Controls token allocation per image/video frame |
| video_metadata | dict | Video clipping (start_offset, end_offset) and frame rate (fps) |
| display_name | str | Display name for the file |
Media resolution — control quality vs cost tradeoff for images and video frames:
Available values: MEDIA_RESOLUTION_LOW, MEDIA_RESOLUTION_MEDIUM, MEDIA_RESOLUTION_HIGH, MEDIA_RESOLUTION_ULTRA_HIGH.
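The four values above can be validated before they are placed in vendor_metadata. The sketch below shows one possible approach; the helper `media_resolution_metadata` is hypothetical, and only the key name and the allowed values come from this section.

```python
# Allowed values as listed above.
MEDIA_RESOLUTIONS = (
    "MEDIA_RESOLUTION_LOW",
    "MEDIA_RESOLUTION_MEDIUM",
    "MEDIA_RESOLUTION_HIGH",
    "MEDIA_RESOLUTION_ULTRA_HIGH",
)


def media_resolution_metadata(level: str) -> dict[str, str]:
    """Build a vendor_metadata dict from a short level name such as "low"."""
    value = f"MEDIA_RESOLUTION_{level.upper()}"
    if value not in MEDIA_RESOLUTIONS:
        raise ValueError(f"unknown media resolution: {level!r}")
    return {"media_resolution": value}


# Lower resolution spends fewer tokens per image or video frame.
low_cost = media_resolution_metadata("low")
```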
Video clipping and frame rate — process only a portion of a video or adjust the sampling rate:
Display name — attach a name to the file for reference:
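As an illustration of the video and display-name keys, the sketch below assembles a vendor_metadata dictionary with basic sanity checks. The helper `video_vendor_metadata` is hypothetical. Writing offsets as duration strings such as "30s" follows the Gemini API's duration format and is an assumption about what AG2 expects.

```python
def video_vendor_metadata(
    start_s: int, end_s: int, fps: float, display_name: str
) -> dict:
    """Build vendor_metadata for clipping a video and naming the file."""
    if not 0 <= start_s < end_s:
        raise ValueError("start must be non-negative and earlier than end")
    if fps <= 0:
        raise ValueError("fps must be positive")
    return {
        "video_metadata": {
            # Duration strings are an assumption; check the expected format.
            "start_offset": f"{start_s}s",
            "end_offset": f"{end_s}s",
            "fps": fps,
        },
        "display_name": display_name,
    }


# Process only seconds 30 to 90, sampling one frame per second.
clip = video_vendor_metadata(30, 90, 1.0, "demo-clip.mp4")
```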
OpenAI#
OpenAI supports images via both the Chat Completions and Responses APIs. Binary audio input (WAV, MP3) is supported only in the Chat Completions API. The Responses API additionally supports file IDs and document URLs.
Vendor Metadata#
OpenAI supports vendor_metadata for image detail control.
Anthropic#
Anthropic supports images (JPEG, PNG, GIF, WebP) and documents (PDF) via URL, base64, or File ID. Audio and video are not supported.
File ID: upload files via the Anthropic Files API (beta) and reference them by ID.
Vendor Metadata#
Anthropic supports vendor_metadata for prompt caching on content blocks:
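A minimal sketch of what such metadata might look like. In the Anthropic Messages API, a content block is marked for caching with a `cache_control` field of type `ephemeral`; that AG2's vendor_metadata uses the same key is an assumption, and `with_prompt_cache` is a hypothetical helper.

```python
from typing import Optional


def with_prompt_cache(metadata: Optional[dict] = None) -> dict:
    """Return a copy of the metadata with a prompt-caching marker added.

    The key name mirrors the Anthropic Messages API and is an assumption
    about what AG2 expects.
    """
    merged = dict(metadata or {})
    merged["cache_control"] = {"type": "ephemeral"}
    return merged


# Mark a large, frequently reused document for caching.
cached = with_prompt_cache()
```

Caching pays off mainly for large inputs that are reused across several requests, such as a long PDF queried repeatedly.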