Multimodal Inputs#

AG2 agents can process images, audio, video, and documents alongside text. The input event system provides a unified API across providers — you create inputs the same way regardless of which model you use.

Input Types#

| Factory Function | Creates | Description |
|---|---|---|
| ImageInput(...) | Image input | JPEG, PNG, GIF, WebP |
| AudioInput(...) | Audio input | WAV, MP3, OGG, FLAC, AAC |
| VideoInput(...) | Video input | MP4, WebM, MOV, MKV, MPEG |
| DocumentInput(...) | Document input | PDF, TXT, HTML, Markdown, CSV, JSON, Office formats |

Each factory function supports multiple ways to provide the data:

from autogen.beta.events import ImageInput, AudioInput, VideoInput, DocumentInput

# From a URL
image = ImageInput("https://example.com/photo.jpg")

# From a local file path
image = ImageInput(path="photo.jpg")

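# From raw bytes (here read from a local file; media_type describes the bytes)
with open("photo.png", "rb") as f:
    raw_bytes = f.read()
image = ImageInput(data=raw_bytes, media_type="image/png")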

# From a pre-uploaded file ID (provider-specific)
image = ImageInput(file_id="file-abc123")
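
When you pass raw bytes, you supply media_type yourself. If the bytes came from a file whose name you know, one option is to derive the type with Python's standard mimetypes module. The following is a minimal sketch under that assumption; guess_media_type is a hypothetical helper and not part of AG2.

```python
import mimetypes


def guess_media_type(filename: str) -> str:
    """Guess a MIME type from a file name (hypothetical helper, not part of AG2)."""
    media_type, _encoding = mimetypes.guess_type(filename)
    if media_type is None:
        # No known extension: fail loudly instead of sending a wrong type
        raise ValueError(f"Cannot determine media type for {filename!r}")
    return media_type


print(guess_media_type("photo.png"))   # image/png
print(guess_media_type("report.pdf"))  # application/pdf
```

The returned string could then be passed as media_type= next to data=.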

Using Inputs with Agents#

Pass inputs directly to agent.ask() as positional arguments alongside text:

from autogen.beta import Agent
from autogen.beta.config import GeminiConfig
from autogen.beta.events import ImageInput

agent = Agent(
    "vision_agent",
    "You are a helpful assistant that describes images.",
    config=GeminiConfig(model="gemini-3-flash-preview"),
)

image = ImageInput("https://example.com/photo.jpg")
reply = await agent.ask("Describe this image in detail.", image)
print(reply.body)
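
The snippets on this page use top-level await, which works in a notebook or an async REPL but raises a SyntaxError in a plain script. A minimal sketch of wrapping the call in asyncio.run follows. StubAgent and StubReply are stand-ins invented for this sketch so that it runs without an API key; they only mimic the ask(...) and .body shape shown above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubReply:
    body: str


class StubAgent:
    """Stand-in for an AG2 agent (hypothetical); it only echoes what it was given."""

    async def ask(self, text: str, *inputs: object) -> StubReply:
        await asyncio.sleep(0)  # yield to the event loop, as a real network call would
        return StubReply(body=f"{text} [{len(inputs)} input(s)]")


async def main() -> str:
    agent = StubAgent()  # in real code: the Agent(...) constructed above
    reply = await agent.ask(
        "Describe this image in detail.",
        "https://example.com/photo.jpg",  # in real code: an ImageInput
    )
    return reply.body


if __name__ == "__main__":
    print(asyncio.run(main()))
```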

You can pass multiple inputs in a single request:

image1 = ImageInput("https://example.com/before.jpg")
image2 = ImageInput("https://example.com/after.jpg")

reply = await agent.ask("Compare these two images.", image1, image2)

Provider Support#

Not all providers support all input types. The table below shows what each provider accepts:

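| Input Type | OpenAI | OpenAI Responses | Gemini | Anthropic |
|---|---|---|---|---|
| Text | Yes | Yes | Yes | Yes |
| Image (URL) | Yes | Yes | Yes | Yes |
| Image (binary) | Yes | Yes | Yes | Yes |
| Audio (URL) | - | - | Yes | - |
| Audio (binary) | Yes | - | Yes | - |
| Video (URL) | - | - | Yes | - |
| Video (binary) | - | - | Yes | - |
| Document (URL) | - | Yes | Yes | Yes |
| Document (binary) | - | - | Yes | Yes |
| File ID | - | Yes | - | Yes |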

If you pass an unsupported input type to a provider, an UnsupportedInputError is raised with a clear message indicating what is not supported and by which provider.
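
If you want to check support before building a request, for example to choose a provider per input, the table can be transcribed into a small lookup. This is a sketch only: the provider labels ("openai", "openai_responses", "gemini", "anthropic") and both helper functions are this example's own naming, not AG2 identifiers, and the data is copied from the table above rather than queried from the library.

```python
# Support matrix transcribed from the provider table (hypothetical helper, not AG2 API).
# Keys are (input kind, form); values are the providers that accept that combination.
SUPPORT = {
    ("image", "url"): {"openai", "openai_responses", "gemini", "anthropic"},
    ("image", "binary"): {"openai", "openai_responses", "gemini", "anthropic"},
    ("audio", "url"): {"gemini"},
    ("audio", "binary"): {"openai", "gemini"},
    ("video", "url"): {"gemini"},
    ("video", "binary"): {"gemini"},
    ("document", "url"): {"openai_responses", "gemini", "anthropic"},
    ("document", "binary"): {"gemini", "anthropic"},
}
# The table lists file IDs once, independent of the input kind.
FILE_ID_PROVIDERS = {"openai_responses", "anthropic"}


def is_supported(provider: str, kind: str, form: str) -> bool:
    """Return True if the table marks this provider/kind/form combination as supported."""
    if form == "file_id":
        return provider in FILE_ID_PROVIDERS
    return provider in SUPPORT.get((kind, form), set())


def providers_for(kind: str, form: str) -> list:
    """List the providers that accept the given kind and form, sorted by name."""
    if form == "file_id":
        return sorted(FILE_ID_PROVIDERS)
    return sorted(SUPPORT.get((kind, form), set()))


print(is_supported("anthropic", "audio", "binary"))  # False
print(providers_for("document", "binary"))           # ['anthropic', 'gemini']
```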


Provider-Specific Details#

Gemini#

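Gemini has the broadest multimodal support: it accepts images, audio, video, and documents as URLs, binary data, or local file paths. As the provider table shows, it does not accept file IDs; files uploaded through the Google Files API are passed by URI instead.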

YouTube URLs are supported directly:

from autogen.beta.events import VideoInput

video = VideoInput("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
reply = await agent.ask("Summarize this video.", video)

Google Files API — for large files (>20MB), upload via the Google Files API first and pass the returned URI:

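import time

from google import genai
from autogen.beta.events import VideoInput

client = genai.Client()  # reads the API key from the environment
uploaded = client.files.upload(file="large_video.mp4")

# Poll until the Files API has finished processing the upload
while uploaded.state.name == "PROCESSING":
    time.sleep(2)
    uploaded = client.files.get(name=uploaded.name)

# Stop here if processing did not succeed
if uploaded.state.name == "FAILED":
    raise RuntimeError(f"Upload failed: {uploaded.name}")

video = VideoInput(uploaded.uri)
reply = await agent.ask("Describe this video.", video)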
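
The polling loop above runs until the state changes, however long that takes. A minimal sketch of the same loop with a timeout is shown below. wait_until_processed is a hypothetical helper, not part of AG2 or the Google SDK, and the demo uses a fake state source in place of real Files API calls.

```python
import time
from typing import Callable


def wait_until_processed(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state != "PROCESSING":
            return state  # caller decides what to do with the final state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        time.sleep(interval_s)


# Demo: a fake state sequence standing in for repeated Files API lookups
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_processed(lambda: next(states), timeout_s=5.0, interval_s=0.01))  # ACTIVE
```

With the client from the example above, the state source would presumably be `lambda: client.files.get(name=uploaded.name).state.name`.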

Vendor Metadata#

Gemini supports provider-specific settings via vendor_metadata on binary inputs. These map to Gemini Part fields:

| Key | Type | Description |
|---|---|---|
| media_resolution | str | Controls token allocation per image/video frame |
| video_metadata | dict | Video clipping (start_offset, end_offset) and frame rate (fps) |
| display_name | str | Display name for the file |

Media resolution — control quality vs cost tradeoff for images and video frames:

from autogen.beta.events import ImageInput

# Lower resolution = fewer tokens = lower cost
image = ImageInput(
    data=raw_bytes,
    media_type="image/jpeg",
    vendor_metadata={"media_resolution": "MEDIA_RESOLUTION_LOW"},
)

Available values: MEDIA_RESOLUTION_LOW, MEDIA_RESOLUTION_MEDIUM, MEDIA_RESOLUTION_HIGH, MEDIA_RESOLUTION_ULTRA_HIGH.

Video clipping and frame rate — process only a portion of a video or adjust the sampling rate:

from autogen.beta.events import VideoInput

video = VideoInput(
    path="lecture.mp4",
    vendor_metadata={
        "video_metadata": {
            "start_offset": "60s",
            "end_offset": "120s",
            "fps": 0.5,
        },
    },
)
reply = await agent.ask("Summarize this section of the video.", video)
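
If clip boundaries are computed in code, it can help to build the video_metadata dict from numbers instead of hand-writing the offset strings. The sketch below assumes offsets are whole seconds formatted as in the example above ("60s"). clip_metadata is a hypothetical helper, not part of AG2.

```python
from typing import Optional


def clip_metadata(start_s: int, end_s: int, fps: Optional[float] = None) -> dict:
    """Build a vendor_metadata dict for a video clip (hypothetical helper, not AG2 API)."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("Expected 0 <= start_s < end_s")
    video_metadata = {"start_offset": f"{start_s}s", "end_offset": f"{end_s}s"}
    if fps is not None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        video_metadata["fps"] = fps
    return {"video_metadata": video_metadata}


print(clip_metadata(60, 120, fps=0.5))
# {'video_metadata': {'start_offset': '60s', 'end_offset': '120s', 'fps': 0.5}}
```

The result can be passed as vendor_metadata= when constructing the VideoInput.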

Display name — attach a name to the file for reference:

from autogen.beta.events import DocumentInput

doc = DocumentInput(
    path="report.pdf",
    vendor_metadata={"display_name": "Q4 Financial Report"},
)

OpenAI#

OpenAI supports images via both the Chat Completions and Responses APIs. Binary audio input (WAV, MP3) is supported only in the Chat Completions API. The Responses API additionally supports file IDs and document URLs.

Vendor Metadata#

OpenAI supports vendor_metadata for image detail control:

from autogen.beta.events import ImageInput

image = ImageInput(
    data=raw_bytes,
    media_type="image/png",
    vendor_metadata={"detail": "low"},  # "low", "high", or "auto"
)
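
The key for image quality differs by provider: detail here, media_resolution for Gemini. This page does not say how keys meant for another provider are handled, so code that switches providers may want to select the metadata per provider. A minimal sketch follows; the provider labels and the helper are this example's own naming, not AG2 identifiers.

```python
# Low-cost image settings per provider, using the keys documented on this page.
# The provider labels are this sketch's own naming, not AG2 identifiers.
LOW_COST_IMAGE_METADATA = {
    "gemini": {"media_resolution": "MEDIA_RESOLUTION_LOW"},
    "openai": {"detail": "low"},
}


def low_cost_image_metadata(provider: str) -> dict:
    """Return a fresh vendor_metadata dict for the provider, or {} if none is known."""
    return dict(LOW_COST_IMAGE_METADATA.get(provider, {}))


print(low_cost_image_metadata("openai"))     # {'detail': 'low'}
print(low_cost_image_metadata("anthropic"))  # {}
```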

Anthropic#

Anthropic supports images (JPEG, PNG, GIF, WebP) and documents (PDF) via URL, base64, or File ID. Audio and video are not supported.

File ID — upload files via the Anthropic Files API (beta) and reference by ID:

import anthropic
from autogen.beta.events import ImageInput, DocumentInput

client = anthropic.Anthropic()

# Upload an image (the with block closes the file handle after the upload)
with open("photo.jpg", "rb") as f:
    uploaded = client.beta.files.upload(
        file=("photo.jpg", f, "image/jpeg"),
    )

# Reference by file_id — filename determines block type (image vs document)
image = ImageInput(file_id=uploaded.id, filename="photo.jpg")
reply = await agent.ask("Describe this image.", image)

Vendor Metadata#

Anthropic supports vendor_metadata for prompt caching on content blocks:

from autogen.beta.events import DocumentInput

doc = DocumentInput(
    path="report.pdf",
    vendor_metadata={"cache_control": {"type": "ephemeral"}},
)