VisionCapability

VisionCapability(
    lmm_config: dict[str, Any],
    description_prompt: str | None = 'Write a detailed caption for this image. Pay special attention to any details that might be useful or relevant to the ongoing conversation.',
    custom_caption_func: Callable = None
)

We can add vision capability to a regular ConversableAgent, even if the agent's underlying model is not multimodal, such as a GPT-3.5-turbo, Llama, Orca, or Mistral agent. The vision capability invokes an LMM client to describe (caption) each image before sending the information to the agent's actual client.
The vision capability hooks into the ConversableAgent's process_last_received_message.
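For example, here is a minimal sketch of attaching the capability to a text-only agent. The model names and API-key placeholders are illustrative assumptions, not prescriptions:

from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# A text-only agent: its own model cannot interpret images.
agent = ConversableAgent(
    name="assistant",
    llm_config={"config_list": [{"model": "gpt-3.5-turbo", "api_key": "sk-..."}]},
)

# The capability carries its own LMM config, used only for captioning.
vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]},
)
vision.add_to_agent(agent)  # incoming images are now captioned before the agent replies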
Some technical details:
When the agent (which has the vision capability) receives a message, it will:

1. _process_received_message:
    a. _append_oai_message
2. generate_reply: if the agent is a MultimodalAgent, it will also use the image tag.
    a. hook process_last_received_message (NOTE: this is where the vision capability is hooked in.)
    b. hook process_all_messages_before_reply
3. send:
    a. hook process_message_before_send
    b. _append_oai_message

Initializes a new instance, setting up the configuration for interacting with a Large Multimodal Model (LMM) client and specifying optional parameters for image description and captioning.

Parameters:

lmm_config
    Type: dict[str, typing.Any]

description_prompt
    Type: str | None
    Default: 'Write a detailed caption for this image. Pay special attention to any details that might be useful or relevant to the ongoing conversation.'

custom_caption_func
    Type: Callable
    Default: None
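For illustration, a custom caption function can replace the default LMM captioning. The three-parameter signature sketched below (image URL, decoded PIL image, LMM client) is an assumption for this sketch; verify it against the docstring of your installed version:

from PIL.Image import Image

def my_caption_func(image_url: str, image_data: Image, lmm_client) -> str:
    # Assumed signature: caption an image without calling the LMM.
    width, height = image_data.size
    return f"An image of size {width}x{height}, fetched from {image_url}."

vision = VisionCapability(
    lmm_config={"config_list": [{"model": "gpt-4-vision-preview", "api_key": "sk-..."}]},
    custom_caption_func=my_caption_func,  # used instead of the default LMM call
)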

Instance Methods

add_to_agent

add_to_agent(self, agent: ConversableAgent) -> None

Adds a particular capability to the given agent. Must be implemented by the capability subclass.
An implementation will typically call agent.register_hook() one or more times. See teachability.py as an example.

Parameters:

agent
    Type: ConversableAgent
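As a rough sketch of what such an implementation looks like for the vision capability (the hook name comes from the pipeline described above; the body is illustrative, not the actual source):

def add_to_agent(self, agent: ConversableAgent) -> None:
    # Run the captioning logic on each message the agent receives.
    agent.register_hook(
        hookable_method="process_last_received_message",
        hook=self.process_last_received_message,
    )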

process_last_received_message

process_last_received_message(self, content: list[dict[str, Any]] | str) -> str

Processes the last received message content by normalizing and augmenting it with descriptions of any included images. The function supports input content as either a string or a list of dictionaries, where each dictionary represents a content item (e.g., text, image). If the content contains image URLs, it fetches the image data, generates a caption for each image, and inserts the caption into the augmented content.
The function aims to transform the content into a format compatible with GPT-4V multimodal inputs, specifically by formatting strings into PIL-compatible images if needed and appending text descriptions for images. This allows for a more accessible presentation of the content, especially in contexts where images cannot be displayed directly.

Parameters:

content
    The last received message content, which can be a plain text string or a list of dictionaries representing different types of content items (e.g., text, image_url).
    Type: list[dict[str, typing.Any]] | str
Returns:

str
    The augmented message content.

Raises:

AssertionError: If an item in the content list is not a dictionary.

Examples:

Assuming self._get_image_caption(img_data) returns "A beautiful sunset over the mountains" for the image.

- Input as String:
    content = "Check out this cool photo!"
    Output: "Check out this cool photo!"
    (Content is a string without an image; it remains unchanged.)

- Input as String, with image location:
    content = "What's the weather in this cool photo: <img http://example.com/photo.jpg>"
    Output: "What's the weather in this cool photo: <img http://example.com/photo.jpg> in case you can not see, the caption of this image is: A beautiful sunset over the mountains"
    (Caption added after the image.)

- Input as List with Text Only:
    content = [{"type": "text", "text": "Here's an interesting fact."}]
    Output: "Here's an interesting fact."
    (No images in the content; it remains unchanged.)

- Input as List with Image URL:
    content = [
        {"type": "text", "text": "What's the weather in this cool photo:"},
        {"type": "image_url", "image_url": {"url": "http://example.com/photo.jpg"}},
    ]
    Output: "What's the weather in this cool photo: <img http://example.com/photo.jpg> in case you can not see, the caption of this image is: A beautiful sunset over the mountains"
    (Caption added after the image.)
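To see the augmentation end to end, here is a small usage sketch that calls the hook directly (reusing the vision instance from the earlier sketch; the URL is the docstring's placeholder, so a real run needs a reachable image and a valid lmm_config):

content = [
    {"type": "text", "text": "What's the weather in this cool photo:"},
    {"type": "image_url", "image_url": {"url": "http://example.com/photo.jpg"}},
]
augmented = vision.process_last_received_message(content)
print(augmented)
# Prints the text, an <img ...> tag for the photo, and the generated caption.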