autogen.agentchat.contrib.capabilities.vision_capability.VisionCapability

VisionCapability

VisionCapability(
    lmm_config: dict[str, Any],
    description_prompt: str | None = 'Write a detailed caption for this image. Pay special attention to any details that might be useful or relevant to the ongoing conversation.',
    custom_caption_func: Callable = None
)

Adds vision capability to a regular ConversableAgent, even if the agent’s own model is not multimodal, such as a GPT-3.5-turbo, Llama, Orca, or Mistral agent. The capability invokes an LMM client to describe each image (captioning) before the information is sent to the agent’s actual client.
The vision capability hooks into the ConversableAgent’s process_last_received_message.
Some technical details:
When an agent that has the vision capability receives a message, it will:
1. _process_received_message:
   a. _append_oai_message
2. generate_reply: if the agent is a MultimodalAgent, it will also use the image tag.
   a. hook process_last_received_message (NOTE: this is where the vision capability is hooked in.)
   b. hook process_all_messages_before_reply
3. send:
   a. hook process_message_before_send
   b. _append_oai_message
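The hook step in the sequence above can be illustrated with a toy model. This is a minimal sketch and not AutoGen's implementation: `ToyAgent`, `receive`, and `fake_vision_hook` are hypothetical names, and the "caption" is a fixed placeholder rather than the output of an LMM client.

```python
from typing import Callable


class ToyAgent:
    """A stand-in for an agent with hookable methods; not the AutoGen class."""

    def __init__(self) -> None:
        # Maps a hookable method name to the hooks registered for it.
        self.hooks: dict[str, list[Callable[[str], str]]] = {
            "process_last_received_message": [],
        }
        self.history: list[str] = []

    def register_hook(self, hookable_method: str, hook: Callable[[str], str]) -> None:
        self.hooks[hookable_method].append(hook)

    def receive(self, message: str) -> str:
        # Step 1: store the raw incoming message.
        self.history.append(message)
        # Step 2a: run every registered hook on the last received message.
        processed = message
        for hook in self.hooks["process_last_received_message"]:
            processed = hook(processed)
        # A text-only model would now be given `processed` instead of `message`.
        return processed


def fake_vision_hook(content: str) -> str:
    # Hypothetical captioner: a real one would call an LMM client here.
    if "<img " in content:
        return content + " [caption: placeholder description]"
    return content


agent = ToyAgent()
agent.register_hook("process_last_received_message", fake_vision_hook)
seen = agent.receive("Look: <img http://example.com/photo.jpg>")
```

The point of the design is that the stored history keeps the original message, while the text-only model is shown the augmented version.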

Initializes a new instance, setting up the configuration for interacting with a large multimodal model (LMM) client and specifying optional parameters for image description and captioning.

Parameters:

Name	Description
`lmm_config`	Type: dict[str, typing.Any]
`description_prompt`	Type: str | None. Default: 'Write a detailed caption for this image. Pay special attention to any details that might be useful or relevant to the ongoing conversation.'
`custom_caption_func`	Type: Callable. Default: None
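This page does not document the signature expected of `custom_caption_func`. The sketch below assumes it is called with three positional arguments (the image URL or path, the decoded image, and the LMM client) and returns the caption as a string; check that assumption against the installed version before relying on it. `filename_caption` is a hypothetical helper that needs no model at all.

```python
from typing import Any, Optional


def filename_caption(
    image_url: str,
    image_data: Optional[Any] = None,  # assumed: the decoded image, unused here
    lmm_client: Optional[Any] = None,  # assumed: the LMM client, unused here
) -> str:
    """Hypothetical captioner that describes an image by its file name only."""
    # Strip any query string, then take the last path component.
    name = image_url.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    return f"An image file named '{name}'."
```

Under that assumption, the function would be passed as `custom_caption_func=filename_caption` when constructing the capability.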

Instance Methods

add_to_agent

add_to_agent(self, agent: ConversableAgent) -> None

Adds a particular capability to the given agent. Must be implemented by the capability subclass.
An implementation will typically call agent.register_hook() one or more times. See teachability.py as an example.

Parameters:

Name	Description
`agent`	Type: ConversableAgent
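A minimal usage sketch, using only the constructor and `add_to_agent` signatures shown on this page. The helper name `add_vision` is hypothetical, and the model name and API key in the example configuration are placeholders, not library defaults.

```python
from typing import Any


def add_vision(agent: Any, lmm_config: dict[str, Any]) -> Any:
    """Attach a VisionCapability to an existing ConversableAgent and return the capability."""
    # Imported lazily so this module still loads where AutoGen is not installed.
    from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

    capability = VisionCapability(lmm_config=lmm_config)
    capability.add_to_agent(agent)
    return capability


# Placeholder configuration: the model name and key are assumptions.
EXAMPLE_LMM_CONFIG: dict[str, Any] = {
    "config_list": [{"model": "gpt-4o", "api_key": "YOUR_API_KEY"}],
}
```

Calling `add_vision(my_agent, EXAMPLE_LMM_CONFIG)` on a text-only agent would then route incoming images through the captioning hook described above.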

process_last_received_message

process_last_received_message(self, content: list[dict[str, Any]] | str) -> str

Processes the last received message content by normalizing and augmenting it with descriptions of any included images. The function supports input content as either a string or a list of dictionaries, where each dictionary represents a content item (e.g., text, image). If the content contains image URLs, it fetches the image data, generates a caption for each image, and inserts the caption into the augmented content.
The function aims to transform the content into a format compatible with GPT-4V multimodal inputs, specifically by formatting strings into PIL-compatible images if needed and appending text descriptions for images. This allows for a more accessible presentation of the content, especially in contexts where images cannot be displayed directly.

Parameters:

Name	Description
`content`	The last received message content, which can be a plain text string or a list of dictionaries representing different types of content items (e.g., text, image_url). Type: list[dict[str, typing.Any]] | str

Returns:

Type	Description
str	The augmented message content.

Raises:
AssertionError: If an item in the content list is not a dictionary.

Examples:
Assuming `self._get_image_caption(img_data)` returns “A beautiful sunset over the mountains” for the image.

- Input as String:
  content = “Check out this cool photo!”
  Output: “Check out this cool photo!” (Content is a string without an image, remains unchanged.)
- Input as String, with image location:
  content = “What’s weather in this cool photo: `<img http://example.com/photo.jpg>`”
  Output: “What’s weather in this cool photo: `<img http://example.com/photo.jpg>` in case you can not see, the caption of this image is: A beautiful sunset over the mountains ” (Caption added after the image)
- Input as List with Text Only:
  content = `[{"type": "text", "text": "Here's an interesting fact."}]`
  Output: “Here’s an interesting fact.” (No images in the content, it remains unchanged.)
- Input as List with Image URL:
  content = [
      {"type": "text", "text": "What's weather in this cool photo:"},
      {"type": "image_url", "image_url": {"url": "http://example.com/photo.jpg"}},
  ]
  Output: “What’s weather in this cool photo: `<img http://example.com/photo.jpg>` in case you can not see, the caption of this image is: A beautiful sunset over the mountains ” (Caption added after the image)
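The transformation in the examples above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the library's code: `augment` and `caption_func` are hypothetical names, no image is fetched or decoded, and the exact whitespace of the real output may differ.

```python
import re
from typing import Any, Callable, Union

IMG_TAG = re.compile(r"<img ([^>]+)>")
CAPTION_TEMPLATE = " in case you can not see, the caption of this image is: {caption}"


def augment(content: Union[str, list[dict[str, Any]]], caption_func: Callable[[str], str]) -> str:
    """Flatten message content to text and add a caption after each image tag."""
    if isinstance(content, list):
        parts = []
        for item in content:
            assert isinstance(item, dict), "each content item must be a dict"
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image_url":
                # Normalize image items to the same <img URL> tag used in strings.
                parts.append(f"<img {item['image_url']['url']}>")
        content = " ".join(parts)

    def add_caption(match: re.Match) -> str:
        url = match.group(1).strip()
        # Keep the original tag and append the caption right after it.
        return match.group(0) + CAPTION_TEMPLATE.format(caption=caption_func(url))

    return IMG_TAG.sub(add_caption, content)
```

Here `caption_func` takes only the URL to keep the sketch self-contained; in the real capability the caption comes from the LMM client or from `custom_caption_func`.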
