Multimodal Inputs
AG2 agents can process images, audio, video, and documents alongside text. The input event system provides a unified API across providers — you create inputs the same way regardless of which model you use.
Input Types
| Factory Function | Creates | Description |
|---|---|---|
| ImageInput(...) | Image input | JPEG, PNG, GIF, WebP |
| AudioInput(...) | Audio input | WAV, MP3, OGG, FLAC, AAC |
| VideoInput(...) | Video input | MP4, WebM, MOV, MKV, MPEG |
| DocumentInput(...) | Document input | PDF, TXT, HTML, Markdown, CSV, JSON, Office formats |
Each factory function accepts the data in several forms, including a URL (url=), a local file path (path=), and raw bytes.
Using Inputs with Agents
Pass inputs directly to agent.ask() as positional arguments alongside the text. A single request can carry multiple inputs.
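The call shape described above can be sketched without the library installed. Everything below is a stand-in: RecordingAgent is hypothetical, the two input classes only model the url= and path= forms mentioned in this page, and the real agent.ask() may differ (for example, it may be asynchronous).

```python
from dataclasses import dataclass, field


@dataclass
class ImageInput:
    """Stand-in for the real factory; only the url= form is modelled."""
    url: str


@dataclass
class DocumentInput:
    """Stand-in for the real factory; only the path= form is modelled."""
    path: str


@dataclass
class RecordingAgent:
    """Hypothetical agent that records what ask() receives instead of calling a model."""
    calls: list = field(default_factory=list)

    def ask(self, text, *inputs):
        # Text comes first; every further positional argument is one input.
        self.calls.append((text, inputs))
        return f"received {len(inputs)} input(s)"


agent = RecordingAgent()

# One input alongside the text.
single = agent.ask(
    "What is in this picture?",
    ImageInput(url="https://example.com/cat.png"),
)

# Several inputs in the same request.
multi = agent.ask(
    "Does the chart match the report?",
    ImageInput(url="https://example.com/chart.png"),
    DocumentInput(path="report.pdf"),
)
```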
Provider Support
Not all providers support all input types. The table below shows what each provider accepts:
| Input Type | OpenAI | OpenAI Responses | Gemini | Anthropic | xAI | Bedrock |
|---|---|---|---|---|---|---|
| Text | Yes | Yes | Yes | Yes | Yes | Yes |
| Image (URL) | Yes | Yes | Yes | Yes | Yes | - |
| Image (binary) | Yes | Yes | Yes | Yes | Yes | Yes |
| Audio (URL) | - | - | Yes | - | - | - |
| Audio (binary) | Yes | - | Yes | - | - | - |
| Video (URL) | - | - | Yes | - | - | - |
| Video (binary) | - | - | Yes | - | - | Yes |
| Document (URL) | - | Yes | Yes | Yes | Yes | - |
| Document (binary) | - | - | Yes | Yes | Yes | Yes |
| File ID | - | Yes | - | Yes | Yes | - |
If you pass an unsupported input type to a provider, an UnsupportedInputError is raised with a clear message indicating what is not supported and by which provider.
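As a planning aid, the table above can be transcribed into a lookup so that a provider is checked before a request is built. This is only a sketch: SUPPORT, check_supported, providers_for, and the provider keys are illustrative names, and UnsupportedInputError is redefined locally as a stand-in for the library's exception.

```python
class UnsupportedInputError(Exception):
    """Local stand-in for the library's exception of the same name."""


# Rows transcribed from the provider table; absence means "not supported".
SUPPORT = {
    "image_url": {"openai", "openai_responses", "gemini", "anthropic", "xai"},
    "image_binary": {"openai", "openai_responses", "gemini", "anthropic", "xai", "bedrock"},
    "audio_url": {"gemini"},
    "audio_binary": {"openai", "gemini"},
    "video_url": {"gemini"},
    "video_binary": {"gemini", "bedrock"},
    "document_url": {"openai_responses", "gemini", "anthropic", "xai"},
    "document_binary": {"gemini", "anthropic", "xai", "bedrock"},
    "file_id": {"openai_responses", "anthropic", "xai"},
}


def check_supported(provider, kind):
    """Raise if the table lists no support for this provider/input pair."""
    if provider not in SUPPORT.get(kind, set()):
        raise UnsupportedInputError(f"{provider} does not support {kind} inputs")


def providers_for(*kinds):
    """Return the providers that accept every requested input kind."""
    return sorted(set.intersection(*(SUPPORT[k] for k in kinds)))
```

For instance, providers_for("image_binary", "video_binary") narrows the choice to the providers that can take both in one request.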
Provider-Specific Details
Gemini
Gemini has the broadest multimodal support — it accepts images, audio, video, and documents in all forms (URL, binary, and local file path).
YouTube URLs are supported directly:
Google Files API — for large files (>20MB), upload via the Google Files API first and pass the returned URI:
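The size rule above can be expressed as a small pre-flight check. This is a sketch under one stated assumption: "20MB" is read as 20 * 1024 * 1024 bytes. The helper names are illustrative and not part of the library.

```python
import os

# Assumption: 20MB is interpreted as binary megabytes; adjust if decimal is intended.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def exceeds_inline_limit(size_bytes, limit=INLINE_LIMIT_BYTES):
    """True when a payload of this size should be uploaded via the Files API first."""
    return size_bytes > limit


def file_exceeds_inline_limit(path, limit=INLINE_LIMIT_BYTES):
    """Same check, reading the size of a local file from disk."""
    return exceeds_inline_limit(os.path.getsize(path), limit)
```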
Vendor Metadata
Gemini supports provider-specific settings via vendor_metadata on binary inputs. These map to Gemini Part fields:
| Key | Type | Description |
|---|---|---|
| media_resolution | str | Controls token allocation per image/video frame |
| video_metadata | dict | Video clipping (start_offset, end_offset) and frame rate (fps) |
| display_name | str | Display name for the file |
Media resolution — control quality vs cost tradeoff for images and video frames:
Available values: MEDIA_RESOLUTION_LOW, MEDIA_RESOLUTION_MEDIUM, MEDIA_RESOLUTION_HIGH, MEDIA_RESOLUTION_ULTRA_HIGH.
Video clipping and frame rate — process only a portion of a video or adjust the sampling rate:
Display name — attach a name to the file for reference:
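The three settings above can be assembled into one vendor_metadata dictionary. The helper below is a hypothetical convenience, not a library function. It uses only the keys and resolution values listed in this section, and the "30s"-style offsets are an assumed duration-string format.

```python
MEDIA_RESOLUTIONS = {
    "MEDIA_RESOLUTION_LOW",
    "MEDIA_RESOLUTION_MEDIUM",
    "MEDIA_RESOLUTION_HIGH",
    "MEDIA_RESOLUTION_ULTRA_HIGH",
}


def gemini_vendor_metadata(media_resolution=None, start_offset=None,
                           end_offset=None, fps=None, display_name=None):
    """Build a vendor_metadata dict, skipping any setting left as None."""
    meta = {}
    if media_resolution is not None:
        if media_resolution not in MEDIA_RESOLUTIONS:
            raise ValueError(f"unknown media_resolution: {media_resolution!r}")
        meta["media_resolution"] = media_resolution

    # Clipping and frame rate share the nested video_metadata dict.
    video = {
        key: value
        for key, value in (
            ("start_offset", start_offset),
            ("end_offset", end_offset),
            ("fps", fps),
        )
        if value is not None
    }
    if video:
        meta["video_metadata"] = video

    if display_name is not None:
        meta["display_name"] = display_name
    return meta


clip = gemini_vendor_metadata(
    media_resolution="MEDIA_RESOLUTION_LOW",
    start_offset="30s",  # assumed duration-string format
    end_offset="90s",
    fps=1,
)
```

The resulting dictionary would then be passed as vendor_metadata on a binary input.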
OpenAI
OpenAI supports images via both the Completions and Responses APIs. Audio binary input (WAV, MP3) is supported in the Completions API. The Responses API additionally supports file IDs and document URLs.
Vendor Metadata
OpenAI supports vendor_metadata for image detail control:
Anthropic
Anthropic supports images (JPEG, PNG, GIF, WebP) and documents (PDF) via URL, base64, or File ID. Audio and video are not supported.
File ID — upload files via the Anthropic Files API (beta) and reference by ID:
Vendor Metadata
Anthropic supports vendor_metadata for prompt caching on content blocks:
xAI
xAI supports images (URL and binary), documents (URL and binary), and pre-uploaded file IDs. Audio and video are not currently supported — passing them raises UnsupportedInputError.
File ID — reference a file previously uploaded via the xAI Files API:
Vendor Metadata
xAI reads detail for image quality control from two different attributes depending on the input source — vendor_metadata for binary, metadata for URL. Mixing them up means the value is silently ignored and xAI falls back to "auto".
Binary image — set detail via vendor_metadata:
URL image — set detail via metadata (not vendor_metadata):
Note
The factory ImageInput(url=...) does not forward metadata. To configure detail on a URL image, construct UrlInput directly as shown above.
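The silent-fallback pitfall is easier to see in a toy model of the lookup. The classes and resolve_detail below are stand-ins written for illustration, not the library's implementation. The value "high" is an assumed example, since only "auto" is named in this page.

```python
from dataclasses import dataclass, field


@dataclass
class BinaryImage:
    """Stand-in for a binary image input; detail is read from vendor_metadata."""
    data: bytes
    vendor_metadata: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)


@dataclass
class UrlImage:
    """Stand-in for a URL image input; detail is read from metadata."""
    url: str
    vendor_metadata: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)


def resolve_detail(image):
    """Mimic the described lookup: one attribute per source, default "auto"."""
    if isinstance(image, BinaryImage):
        source = image.vendor_metadata
    else:
        source = image.metadata
    return source.get("detail", "auto")


# Correct attribute for each source.
ok_binary = resolve_detail(BinaryImage(b"...", vendor_metadata={"detail": "high"}))
ok_url = resolve_detail(UrlImage("https://example.com/a.png", metadata={"detail": "high"}))

# Wrong attribute on a URL image: the setting is ignored without an error.
ignored = resolve_detail(UrlImage("https://example.com/a.png", vendor_metadata={"detail": "high"}))
```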
Document filename — xAI requires a filename for binary documents. When sending raw bytes, either provide one via vendor_metadata={"filename": ...}, or rely on the auto-derived fallback (file.<subtype> from the media type, e.g. file.pdf for application/pdf):
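The filename rule above amounts to a short fallback chain. The function below is an illustrative sketch of that rule, not the provider's code, and its name is hypothetical.

```python
def xai_document_filename(media_type, vendor_metadata=None):
    """Prefer an explicit filename; otherwise derive file.<subtype> from the media type."""
    explicit = (vendor_metadata or {}).get("filename")
    if explicit:
        return explicit
    # "application/pdf" -> "pdf"
    subtype = media_type.split("/", 1)[-1]
    return f"file.{subtype}"
```

Supplying a filename explicitly is the safer choice when the media type has an unusual subtype, because the derived name follows the subtype verbatim.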
Amazon Bedrock
The Bedrock Converse API accepts binary sources only: images (JPEG, PNG, GIF, WebP), documents (PDF, CSV, DOC, DOCX, XLS, XLSX, HTML, TXT, Markdown), and video (MP4, WebM, MOV, MKV, and more, on Amazon Nova models). URL inputs and file IDs raise UnsupportedInputError. Bedrock has no Files API, so data that lives at a URL must be downloaded first and passed as bytes.
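One way to do the download step is with the standard library. The helper below is a minimal sketch: fetch_bytes is a hypothetical name, and error handling, retries, and size limits are left out.

```python
import urllib.request


def fetch_bytes(url, timeout=30.0):
    """Download url and return (raw bytes, media type reported in the response headers)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
        media_type = resp.headers.get_content_type()
    return data, media_type
```

The returned bytes and media type can then be supplied to the binary form of the relevant input factory instead of the URL.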
Note
Modality support also depends on the model behind the Converse API: Amazon Nova models accept images, documents, and video; many others (e.g. DeepSeek) are text-only and return a ValidationException from AWS for non-text blocks. The provider raises UnsupportedInputError only for inputs the Converse API itself cannot carry.
Document name — Converse requires a name for document blocks. It is taken from vendor_metadata={"filename": ...} (set automatically when using path=), sanitized to the characters Converse allows (alphanumerics, single spaces, hyphens, parentheses, brackets), and falls back to "document" when absent:
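The naming rule can be approximated as follows. This is a sketch, not the provider's sanitizer: it assumes disallowed characters are replaced with a space rather than dropped, treats "alphanumerics" as ASCII letters and digits, and the function name is hypothetical.

```python
import re

# Anything outside alphanumerics, whitespace, hyphens, parentheses, and brackets.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s\-()\[\]]")
_WHITESPACE_RUN = re.compile(r"\s+")


def converse_document_name(vendor_metadata=None):
    """Derive a Converse-safe document name, falling back to "document"."""
    raw = (vendor_metadata or {}).get("filename") or ""
    cleaned = _DISALLOWED.sub(" ", raw)  # assumption: replace, do not drop
    cleaned = _WHITESPACE_RUN.sub(" ", cleaned).strip()  # only single spaces allowed
    return cleaned or "document"
```

Under these assumptions a filename such as "Q3_report  (final).pdf" becomes "Q3 report (final) pdf", and a missing or fully disallowed name yields "document".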