autogen.agentchat.contrib.capabilities.vision_capability.VisionCapability
VisionCapability
We can add vision capability to a regular ConversableAgent even if the agent's underlying model is not multimodal, e.g., a GPT-3.5-turbo, Llama, Orca, or Mistral agent. This vision capability invokes a Large Multimodal Model (LMM) client to describe each image (captioning) before the information is sent to the agent's actual client.
The vision capability hooks into the ConversableAgent's process_last_received_message.
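As a quick illustration, here is a minimal sketch of attaching the capability to a text-only agent. The model names and the OPENAI_API_KEY environment variable are assumptions; substitute your own configuration:

```python
import os

import autogen
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# Assumption: model names and OPENAI_API_KEY are placeholders for your own setup.
llm_config = {
    "config_list": [{"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"]}]
}
lmm_config = {
    "config_list": [{"model": "gpt-4-vision-preview", "api_key": os.environ["OPENAI_API_KEY"]}]
}

# A text-only agent whose own client cannot interpret images.
agent = autogen.ConversableAgent(name="assistant", llm_config=llm_config)

# Attach the vision capability: incoming images are captioned by the LMM
# before the message reaches the agent's text-only client.
VisionCapability(lmm_config=lmm_config).add_to_agent(agent)
```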
Some technical details:
When an agent that has the vision capability receives a message, it will:
1. _process_received_message:
a. _append_oai_message
2. generate_reply: if the agent is a MultimodalAgent, it will also use the image tag.
a. hook process_last_received_message (NOTE: this is where the vision capability is hooked in; see the sketch after this list.)
b. hook process_all_messages_before_reply
3. send:
a. hook process_message_before_send
b. _append_oai_message
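As a sketch of step 2a, this is roughly how a capability attaches itself to that hook point via register_hook (the mechanism add_to_agent implementations typically use, per the method docs below). The class and pass-through stub are illustrative, not the library's implementation:

```python
from autogen import ConversableAgent


class PassthroughVisionCapability:
    """Illustrative capability skeleton only; not the library's implementation."""

    def add_to_agent(self, agent: ConversableAgent) -> None:
        # Register our processor on the hookable method from step 2a, so the
        # agent invokes it on the last received message before replying.
        agent.register_hook(
            hookable_method="process_last_received_message",
            hook=self.process_last_received_message,
        )

    def process_last_received_message(self, content):
        # VisionCapability would caption images here; this stub passes through.
        return content
```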
Initializes a new instance, setting up the configuration for interacting with
a Large Multimodal Model (LMM) client and specifying optional parameters for image
description and captioning.
Name | Description |
---|---|
lmm_config | Configuration for the LMM client used to generate image captions. Type: dict[str, typing.Any] |
description_prompt | The prompt used to instruct the LMM when captioning an image. Type: str or None. Default: "Write a detailed caption for this image. Pay special attention to any details that might be useful or relevant to the ongoing conversation." |
custom_caption_func | An optional custom function for generating image captions. Type: Callable. Default: None |
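A minimal construction sketch; the LMM model name is an assumption, and the custom description_prompt simply overrides the default shown in the table above:

```python
import os

from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

vision = VisionCapability(
    # Assumption: any valid LMM client config works here; this one targets GPT-4V.
    lmm_config={
        "config_list": [
            {"model": "gpt-4-vision-preview", "api_key": os.environ["OPENAI_API_KEY"]}
        ]
    },
    # Overrides the default description_prompt listed above.
    description_prompt=(
        "Describe this image in detail, paying special attention to any text, "
        "charts, or diagrams relevant to the ongoing conversation."
    ),
)
```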
Instance Methods
add_to_agent
Adds a particular capability to the given agent. Must be implemented by the capability subclass.
An implementation will typically call agent.register_hook() one or more times. See teachability.py as an example.
Name | Description |
---|---|
agent | The agent to which the capability is added. Type: ConversableAgent |
process_last_received_message
Processes the last received message content by normalizing and augmenting it
with descriptions of any included images. The function supports input content
as either a string or a list of dictionaries, where each dictionary represents
a content item (e.g., text, image). If the content contains image URLs, it
fetches the image data, generates a caption for each image, and inserts the
caption into the augmented content.
The function aims to transform the content into a format compatible with GPT-4V
multimodal inputs, specifically by formatting strings into PIL-compatible
images if needed and appending text descriptions for images. This allows for
a more accessible presentation of the content, especially in contexts where
images cannot be displayed directly.
Name | Description |
---|---|
content | The last received message content, which can be a plain text string or a list of dictionaries representing different types of content items (e.g., text, image_url). Type: list[dict[str, typing.Any]] or str |
Type | Description |
---|---|
str | The augmented message content as a plain string. |

Raises:
AssertionError: If an item in the content list is not a dictionary.

Examples (assuming self._get_image_caption(img_data) returns "A beautiful sunset over the mountains" for the image):

- Input as a string without an image:
```python
content = "Check out this cool photo!"
```
Output: "Check out this cool photo!" (no image, so the content remains unchanged)

- Input as a string with an image location:
```python
content = "What's the weather in this cool photo: <img http://example.com/photo.jpg>"
```
Output: "What's the weather in this cool photo: <img http://example.com/photo.jpg> in case you can not see, the caption of this image is: A beautiful sunset over the mountains" (the caption is appended after the image)

- Input as a list with text only:
```python
content = [{"type": "text", "text": "Here's an interesting fact."}]
```
Output: "Here's an interesting fact." (no images in the content, so it remains unchanged)

- Input as a list with an image URL:
```python
content = [
    {"type": "text", "text": "What's the weather in this cool photo:"},
    {"type": "image_url", "image_url": {"url": "http://example.com/photo.jpg"}},
]
```
Output: "What's the weather in this cool photo: <img http://example.com/photo.jpg> in case you can not see, the caption of this image is: A beautiful sunset over the mountains" (the caption is appended after the image)
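For completeness, a minimal sketch of calling the hook directly, e.g., to test a configuration; `vision` is the capability instance from the constructor sketch above, and the URL is a placeholder:

```python
# Hypothetical direct invocation for testing; in normal operation the agent
# runs this hook automatically before generating a reply (step 2a above).
content = [
    {"type": "text", "text": "Summarize this chart:"},
    {"type": "image_url", "image_url": {"url": "http://example.com/chart.png"}},
]
augmented = vision.process_last_received_message(content)
print(augmented)  # plain string: original text, the <img ...> tag, and the generated caption
```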