

Multimodal with GPT-4V and LLaVA

*Figure: LMM Teaser*

In Brief:

* Introducing the Multimodal Conversable Agent and the LLaVA Agent to enhance LMM functionalities.
* Users can input text and images simultaneously using the `<img img_path>` tag to specify image loading.
* Demonstrated through the GPT-4V notebook.
* Demonstrated through the LLaVA notebook.

Introduction

Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.

This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs. We currently support the `gpt-4-vision-preview` model from OpenAI and the LLaVA model from Microsoft.
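As a rough sketch of how the GPT-4V backend is wired up, the snippet below uses AutoGen's usual `config_list` format; the `OPENAI_API_KEY` environment variable is an assumption, and your key management may differ.

```python
import os

# Config entry for the GPT-4V endpoint; assumes the API key is exposed
# via the OPENAI_API_KEY environment variable.
config_list_4v = [
    {
        "model": "gpt-4-vision-preview",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]
```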

Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity. GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model fine-tuned from LLaMA-2.
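The sketch below illustrates the Multimodal Conversable Agent with the config above. The image URL is a placeholder and argument values are illustrative; the GPT-4V notebook has the exact setup.

```python
from autogen import UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Agent backed by GPT-4V, reusing config_list_4v from above.
image_agent = MultimodalConversableAgent(
    name="image-explainer",
    llm_config={"config_list": config_list_4v, "temperature": 0.5},
    max_consecutive_auto_reply=1,
)

# A user proxy that only relays the question; no code execution needed here.
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

# The <img ...> tag embeds an image by path or URL; the agent loads it and
# sends it to the LMM alongside the text.
user_proxy.initiate_chat(
    image_agent,
    message="""What breed is the dog in this picture?
<img https://example.com/dog.jpg>""",
)
```

The LLaVA Agent follows the same pattern with a LLaVA endpoint in the config list; see the LLaVA notebook for its setup.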