Engaging Image Input/Output with OpenAI's Responses API in AG2#

Author: Yixuan Zhai

This notebook demonstrates how to handle image input and image generation in a two-agent chat using OpenAI's Responses API with the GPT-4o model.

Note: Current support for the OpenAI Responses API is limited to initiate_chat with a two-agent chat. Future releases will include expanded support for group chat and the run interfaces.

As an example, using image inputs with the OpenAI Responses model provider in AG2, we can generate stylized versions of an image:

(Example generated images: Style 1, Style 2)

Set LLM config to use the OpenAI Responses API#

For image generation, we need to add the built-in tool image_generation.

Visit the OpenAI Responses API Documentation for more information.

Install AG2 and dependencies#

To run this notebook, you will need to install AG2 with the openai extra.

Requirements

Install ag2 with 'openai' extra:

pip install ag2[openai]

For more information, please refer to the installation guide.
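The LLM config below reads your API key from the OPENAI_API_KEY environment variable. If it isn't set in your shell, you can set it in-process before building the config (the key value is a placeholder):

import os

# Illustrative only: prefer setting the key in your shell or a secrets manager
# os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, never commit real keys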

import base64
import os
import textwrap

from autogen import AssistantAgent

# LLM config
llm_cfg = {
    "config_list": [
        {
            "api_type": "responses",  # use 'responses' for OpenAI Responses API
            "model": "gpt-4o",  # supports vision + images
            "api_key": os.getenv("OPENAI_API_KEY"),
            "built_in_tools": ["image_generation"],
        }
    ]
}
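The built_in_tools entry corresponds to the tools parameter of the Responses API. For reference, a minimal sketch of the equivalent raw call with the openai client (not needed when using AG2, and assuming an openai package version that supports the Responses API):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Raw Responses API call with the image_generation built-in tool enabled
raw_response = client.responses.create(
    model="gpt-4o",
    input="Generate a picture of a black standard schnauzer",
    tools=[{"type": "image_generation"}],
)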

Create an assistant for image processing#

assistant = AssistantAgent(
    name="ArtBot",
    llm_config=llm_cfg,
    system_message=textwrap.dedent("""
        You are an assistant that can reason over images and
        use the built-in image_generation tool. When generating
        an image, return ONLY the tool call result you receive.
    """).strip(),
)

# Initial image (URL or data URI)
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/3/3b/BlkStdSchnauzer2.jpg"
response = assistant.run(message=f"Describe this image <{IMAGE_URL}> in one sentence", user_input=True)

response.process()
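With user_input=True, the chat pauses for human input at each turn. A non-interactive variant might look like this (the max_turns value is illustrative):

# Non-interactive variant: disable human input and cap the number of turns
auto_response = assistant.run(
    message=f"Describe this image <{IMAGE_URL}> in one sentence",
    user_input=False,
    max_turns=2,
)
auto_response.process()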

Use formal image input to reduce hallucination#

Embedding image links directly in natural-language text can sometimes cause the model to hallucinate in follow-up questions.

The OpenAI Responses API provides a formal way to supply image input; visit the Image Input documentation for more information.

In this example, ask the assistant to generate different variations of the image:

  • “Give me a Ghibli-style version of the image”
  • “Give me a version of the image in a Matrix style”

# Initialize the chat message with structured image input
chat = {
    "role": "user",
    "content": [
        {"type": "input_text", "text": "Describe this image in one sentence."},
        {"type": "input_image", "image_url": IMAGE_URL},
    ],
}

response = assistant.run(message=chat, user_input=True)

response.process()
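The input_image block also accepts a base64 data URI, which is useful for local files. A minimal sketch, assuming an illustrative local file dog.png:

# Sketch: pass a local image as a base64 data URI instead of a URL
with open("dog.png", "rb") as f:  # illustrative local file
    encoded = base64.b64encode(f.read()).decode("utf-8")

local_chat = {
    "role": "user",
    "content": [
        {"type": "input_text", "text": "Describe this image in one sentence."},
        {"type": "input_image", "image_url": f"data:image/png;base64,{encoded}"},
    ],
}

# response = assistant.run(message=local_chat, user_input=True)
# response.process()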

Save generated images#

Run the following cell to save the generated images from the previous conversation.

An image will be saved for each message in the chat history that had an image generated.

# ---- helper function to save an image from a base64 string ----
def save_b64_png(b64_str, fname="generated.png"):
    with open(fname, "wb") as f:
        f.write(base64.b64decode(b64_str))
    print(f"image saved → {fname}")


# Save an image for every image_generation tool call in the chat history
for i, message in enumerate(response.messages):
    if message.get("name") != "ArtBot":
        continue
    contents = message.get("content") or []
    for content in contents:
        if (
            isinstance(content, dict)
            and content.get("type") == "tool_call"
            and content.get("name") == "image_generation"
            and content.get("content")  # non-empty base64 payload
        ):
            print("Saving image!")
            save_b64_png(content["content"], f"image{i}.png")

Image costs#

Image costs are not provided by OpenAI's API; instead, they need to be calculated. In AG2 this is done automatically and included in the chat result's cost attribute.

print(f"The cost of the conversation, including image generation, is: {response.cost}")