Multimodal with GPT-4V and LLaVA
In Brief:

* Introducing the Multimodal Conversable Agent and the LLaVA Agent to enhance LMM functionalities.
* Users can input text and images simultaneously, using the <img img_path> tag to specify image loading.
* Demonstrated through the GPT-4V notebook.
* Demonstrated through the LLaVA notebook.
Introduction
Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data.
This blog post and the latest AutoGen update concentrate on visual comprehension: users can input images, pose questions about them, and receive text-based responses from these LMMs. We now support the gpt-4-vision-preview model from OpenAI and the LLaVA model from Microsoft.
Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity. GPT-4V represents the forefront of image comprehension, while LLaVA is an efficient model fine-tuned from Llama-2.
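As a concrete illustration, the sketch below shows how text and an image can be combined in one message via the <img ...> tag and sent to a Multimodal Conversable Agent. It assumes the pyautogen package with the contrib MultimodalConversableAgent, an OAI_CONFIG_LIST file containing a gpt-4-vision-preview entry, and a placeholder image URL; exact import paths and constructor arguments may vary across AutoGen versions.

```python
import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Load a config list filtered to the vision-capable model
# (assumes an OAI_CONFIG_LIST file with a gpt-4-vision-preview entry).
config_list_4v = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["gpt-4-vision-preview"]},
)

# The multimodal agent answers questions about images embedded in messages.
image_agent = MultimodalConversableAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_4v, "temperature": 0.5, "max_tokens": 300},
)

# A user proxy that relays the question and collects the reply.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

# Text and image are mixed in one message; the <img ...> tag tells the agent
# which image (local path or URL) to load. The URL here is a placeholder.
user_proxy.initiate_chat(
    image_agent,
    message="""What breed is the dog in this picture?
<img https://example.com/dog.jpg>""",
)
```

The same tag-based message format applies when chatting with the LLaVA Agent; only the agent class and the backing model configuration change.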