
Agent Chat with Multimodal Models: LLaVA#


This notebook uses LLaVA as an example of the multimodal feature. More information about LLaVA can be found on its GitHub page.

This notebook contains the following information and examples:

  1. Setup LLaVA Model
  2. Application 1: Image Chat
  3. Application 2: Figure Creator

Before everything starts, install AG2 with the lmm option#

pip install "autogen[lmm]>=0.3.0"
# We use this variable to control where to host LLaVA: locally or remotely.
# More details in the two setup options below.
import os

import matplotlib.pyplot as plt
from PIL import Image

import autogen
from autogen import Agent, AssistantAgent, LLMConfig
from autogen.agentchat.contrib.llava_agent import LLaVAAgent, llava_call

LLAVA_MODE = "remote"  # Either "local" or "remote"
assert LLAVA_MODE in ["local", "remote"]

## (Option 1, preferred) Use API Calls from Replicate [Remote]

We can also use Replicate to run LLaVA directly; Replicate will host the model for you.

  1. Run pip install replicate to install the package
  2. You need to get an API key from Replicate from your account settings page
  3. Next, copy your API token and authenticate by setting it as an environment variable: export REPLICATE_API_TOKEN=<paste-your-token-here>
  4. You need to enter your credit card information for Replicate 🥲
# pip install replicate
# import os
# Alternatively, you can set the environment variable directly here:
# os.environ["REPLICATE_API_TOKEN"] = "r8_xyz your api key goes here~"
if LLAVA_MODE == "remote":
    llava_config_list = [
        {
            "model": "whatever, will be ignored for remote",  # The model name doesn't matter here right now.
            "api_key": "None",  # Note that you have to setup the API key with os.environ["REPLICATE_API_TOKEN"]
            "base_url": "yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
        }
    ]
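
As a quick sanity check (a minimal sketch, not part of the original setup), you can verify that the Replicate token is present in the environment before making remote calls:

# Sanity check (assumption: remote calls require REPLICATE_API_TOKEN to be set).
if LLAVA_MODE == "remote" and not os.environ.get("REPLICATE_API_TOKEN"):
    raise OSError("REPLICATE_API_TOKEN is not set. Export it in your shell or set it via os.environ above.")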

## [Option 2] Setup LLaVA Locally

Install the LLaVA library#

Please follow the LLaVA GitHub page to install LLaVA.

Download the package#

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

Install the inference package#

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Some helpful packages and dependencies:

conda install -c nvidia cuda-toolkit

Launch#

In one terminal, start the controller first:

python -m llava.serve.controller --host 0.0.0.0 --port 10000

Then, in another terminal, start the worker, which will load the model to the GPU:

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b

# Run this code block only if you want to run LLaVA locally
if LLAVA_MODE == "local":
    llava_config_list = [
        {
            "model": "llava-v1.5-13b",
            "api_key": "None",
            "base_url": "http://0.0.0.0:10000",
        }
    ]
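
Optionally, before pointing AG2 at the local endpoint, you can confirm that the controller port is accepting connections (a minimal sketch; it assumes the default host and port from the launch commands above):

import socket

# Reachability check (assumes the controller from the launch step listens on localhost:10000).
if LLAVA_MODE == "local":
    try:
        with socket.create_connection(("localhost", 10000), timeout=5):
            print("LLaVA controller is reachable on port 10000.")
    except OSError as e:
        print(f"Could not reach the LLaVA controller: {e}")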


Multimodal Functions#

We can test the llava_call function with the following AG2 image.

rst = llava_call(
    "Describe this AG2 framework <img /static/img/autogen_agentchat.png> with bullet points.",
    llm_config=LLMConfig(config_list=llava_config_list, temperature=0),
)

print(rst)
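
If you plan to query several images in the same way, you can wrap the call in a small helper (an illustrative sketch; describe_image is not part of the AG2 API, it simply reuses the llava_call signature shown above):

def describe_image(image_url, question="Describe this image with bullet points."):
    """Send a single multimodal prompt to LLaVA and return the text reply."""
    # The <img ...> tag embeds the image reference in the prompt, as in the example above.
    return llava_call(
        f"{question} <img {image_url}>",
        llm_config=LLMConfig(config_list=llava_config_list, temperature=0),
    )

# Example usage (hypothetical URL):
# print(describe_image("https://example.com/sample_figure.png"))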

## Application 1: Image Chat

In this section, we present a straightforward dual-agent architecture that enables a user to chat with a multimodal agent.

First, we show an image and ask a question about it.

Within the user proxy agent, we can decide whether to activate the human input mode (here, we use human_input_mode="NEVER" for conciseness). Activating it allows you to interact with LLaVA in a multi-round dialogue and provide feedback as the conversation unfolds.

image_agent = LLaVAAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config=LLMConfig(config_list=llava_config_list, temperature=0.5, max_new_tokens=1000),
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "groupchat",
        "use_docker": False,
    },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.
    human_input_mode="NEVER",  # Try ALWAYS or NEVER
    max_consecutive_auto_reply=0,
)

# Ask the question with an image
user_proxy.initiate_chat(
    image_agent,
    message="""What's the breed of this dog?
<img https://th.bing.com/th/id/R.422068ce8af4e15b0634fe2540adea7a?rik=y4OcXBE%2fqutDOw&pid=ImgRaw&r=0>.""",
)

Now, input another image and ask a follow-up question.

# Ask the question with an image
user_proxy.send(
    message="""What is this breed?
<img https://th.bing.com/th/id/OIP.29Mi2kJmcHHyQVGe_0NG7QHaEo?pid=ImgDet&rs=1>

Among the breeds, which one barks less?""",
    recipient=image_agent,
)

## Application 2: Figure Creator

Here, we define a FigureCreator agent, which contains three child agents: commander, coder, and critics.

  • Commander: interacts with users, runs code, and coordinates the flow between the coder and critics.
  • Coder: writes code for visualization.
  • Critics: LLaVA-based agent that provides comments and feedback on the generated image.

import logging

logger = logging.getLogger(__name__)


class FigureCreator(AssistantAgent):
    def __init__(self, n_iters=2, **kwargs):
        """Initializes a FigureCreator instance.

        This agent facilitates the creation of visualizations through a collaborative effort among its child agents: commander, coder, and critics.

        Parameters:
            - n_iters (int, optional): The number of "improvement" iterations to run. Defaults to 2.
            - **kwargs: keyword arguments for the parent AssistantAgent.
        """
        super().__init__(**kwargs)
        self.register_reply([Agent, None], reply_func=FigureCreator._reply_user, position=0)
        self._n_iters = n_iters

    def _reply_user(self, messages=None, sender=None, config=None):
        if all((messages is None, sender is None)):
            error_msg = f"Either {messages=} or {sender=} must be provided."
            logger.error(error_msg)
            raise AssertionError(error_msg)

        if messages is None:
            messages = self._oai_messages[sender]

        user_question = messages[-1]["content"]

        # Define the agents
        commander = AssistantAgent(
            name="Commander",
            human_input_mode="NEVER",
            max_consecutive_auto_reply=10,
            system_message="Help me run the code, and tell other agents it is in the <img result.jpg> file location.",
            is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
            code_execution_config={
                "last_n_messages": 3,
                "work_dir": ".",
                "use_docker": False,
            },  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.
            llm_config=self.llm_config,
        )

        critics = LLaVAAgent(
            name="Critics",
            system_message="""Criticize the input figure. How could the figure be replotted to make it better? Find bugs and issues in the figure.
            Pay attention to the color, format, and presentation. Keep reader-friendliness in mind.
            If you think the figure is good enough, then simply say NO_ISSUES""",
            llm_config=LLMConfig(config_list=llava_config_list),
            human_input_mode="NEVER",
            max_consecutive_auto_reply=1,
            #     use_docker=False,
        )

        coder = AssistantAgent(
            name="Coder",
            llm_config=self.llm_config,
        )

        coder.update_system_message(
            coder.system_message
            + "ALWAYS save the figure in `result.jpg` file. Tell other agents it is in the <img result.jpg> file location."
        )

        # Data flow begins
        commander.initiate_chat(coder, message=user_question)
        img = Image.open("result.jpg")
        plt.imshow(img)
        plt.axis("off")  # Hide the axes
        plt.show()

        for i in range(self._n_iters):
            commander.send(message="Improve <img result.jpg>", recipient=critics, request_reply=True)

            feedback = commander._oai_messages[critics][-1]["content"]
            if "NO_ISSUES" in feedback:
                break
            commander.send(
                message="Here is the feedback to your figure. Please improve! Save the result to `result.jpg`\n"
                + feedback,
                recipient=coder,
                request_reply=True,
            )
            img = Image.open("result.jpg")
            plt.imshow(img)
            plt.axis("off")  # Hide the axes
            plt.show()

        return True, "result.jpg"
gpt4_llm_config = autogen.LLMConfig.from_json(path="OAI_CONFIG_LIST", cache_seed=42).where(
    model=["gpt-4", "gpt-4-0314", "gpt4", "gpt-4-32k", "gpt-4-32k-0314", "gpt-4-32k-v0314"]
)

# gpt35_llm_config = autogen.LLMConfig.from_json(
#     path="OAI_CONFIG_LIST", cache_seed=42
# ).where(model=["gpt-35-turbo", "gpt-3.5-turbo"])

creator = FigureCreator(name="Figure Creator~", llm_config=gpt4_llm_config)

user_proxy = autogen.UserProxyAgent(
    name="User", human_input_mode="NEVER", max_consecutive_auto_reply=0, code_execution_config={"use_docker": False}
)  # Please set use_docker=True if docker is available to run the generated code. Using docker is safer than running the generated code directly.

user_proxy.initiate_chat(
    creator,
    message="""
Plot a figure by using the data from:
https://raw.githubusercontent.com/vega/vega/main/docs/data/seattle-weather.csv

I want to show both temperature high and low.
""",
)
if os.path.exists("result.jpg"):
    os.remove("result.jpg")  # clean up