
Multimodal

Multimodal with GPT-4V and LLaVA

Deprecated

LLaVAAgent is deprecated as of v0.12 and will be removed in v0.14. v1.0 will contain native multimodal support on Agent. This blog post and associated notebooks will also be removed in v0.14.

Figure: LMM Teaser

In Brief:

* Introducing the Multimodal Conversable Agent and the LLaVA Agent to enhance LMM functionalities.
* Users can input text and images simultaneously using the `<img img_path>` tag to specify image loading (see the sketch after this list).
* Demonstrated through the GPT-4V notebook.
* Demonstrated through the LLaVA notebook.
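As a rough sketch of how the `<img img_path>` tag combines text and an image in a single prompt (the model name, placeholder API key, and image URL below are illustrative assumptions rather than values from the notebooks):

```python
import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Placeholder config for a GPT-4V-capable endpoint; replace the API key with your own.
config_list_4v = [{"model": "gpt-4-vision-preview", "api_key": "<your OpenAI API key>"}]

# Multimodal agent that understands messages containing <img ...> tags.
image_agent = MultimodalConversableAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_4v, "temperature": 0.5, "max_tokens": 300},
)

# A user proxy that just relays the question; no code execution needed here.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

# The <img ...> tag embeds an image (URL or local path) directly in the text prompt.
user_proxy.initiate_chat(
    image_agent,
    message="""What breed is the dog in this picture?
<img https://example.com/dog.jpg>""",
)
```

The agent parses the `<img ...>` tag out of the message, loads the referenced image, and passes both the text and the image to the underlying LMM; see the GPT-4V notebook for the canonical end-to-end example.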

Introduction

Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.

This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs. We currently support the gpt-4-vision-preview model from OpenAI and the LLaVA model from Microsoft.

Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity. GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model, fine-tuned from LLaMA-2.
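For comparison, a minimal sketch of constructing the LLaVA Agent (deprecated, as noted above) might look like the following; the model name, local worker endpoint, and generation settings are assumptions for illustration, and the LLaVA notebook remains the reference example:

```python
from autogen.agentchat.contrib.llava_agent import LLaVAAgent

# Placeholder config for a locally hosted LLaVA worker; adjust model and endpoint to your setup.
llava_config_list = [
    {
        "model": "llava-v1.5-13b",
        "api_key": "None",                    # local LLaVA workers typically require no key
        "base_url": "http://localhost:8000",  # placeholder endpoint for a LLaVA worker
    }
]

# LLaVA-backed agent; it accepts the same <img ...> tagged messages as the
# Multimodal Conversable Agent shown earlier.
image_agent = LLaVAAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": llava_config_list, "temperature": 0.5, "max_new_tokens": 500},
)
```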