
Blog

AgentEval: A Developer Tool to Assess Utility of LLM-powered Applications

Fig. 1: The general flow of AgentEval with the verification step.

TL;DR:

  • As a developer, how can you assess the utility and effectiveness of an LLM-powered application in helping end users with their tasks?
  • To shed light on this question, we previously introduced AgentEval, a framework for assessing the multi-dimensional utility of any LLM-powered application crafted to assist users in specific tasks. We have now embedded it in the AutoGen library to ease developer adoption.
  • Here, we introduce an updated version of AgentEval that includes a verification process to estimate the robustness of the QuantifierAgent. More details can be found in this paper.

Introduction

The previously introduced AgentEval is a comprehensive framework designed to bridge the gap in assessing the utility of LLM-powered applications. It leverages recent advancements in LLMs to offer a scalable and cost-effective alternative to traditional human evaluations. The framework comprises three main agents: CriticAgent, QuantifierAgent, and VerifierAgent, each playing a crucial role in assessing the task utility of an application.
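To make the flow concrete, here is a minimal sketch of how a critic-then-quantifier loop could be wired up with plain AutoGen agents. The agent names, system messages, and the OAI_CONFIG_LIST file are illustrative assumptions; this is not the AgentEval module's actual API.

```python
# Illustrative only: approximates a critic -> quantifier flow with plain AutoGen
# agents. Names, prompts, and OAI_CONFIG_LIST are assumptions, not AgentEval's API.
import autogen

llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

critic = autogen.AssistantAgent(
    name="critic",
    system_message=(
        "Suggest a list of criteria (name, description, accepted values) for "
        "evaluating how well a solution accomplishes the given task."
    ),
    llm_config=llm_config,
)

quantifier = autogen.AssistantAgent(
    name="quantifier",
    system_message=(
        "Given the criteria and one sample execution of the task, rate the "
        "execution against each criterion."
    ),
    llm_config=llm_config,
)

user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

# 1) Ask the critic to propose evaluation criteria for a task description.
user.initiate_chat(critic, message="Task: solve grade-school math word problems.")
criteria = user.last_message(critic)["content"]

# 2) Ask the quantifier to score one sample execution against those criteria.
user.initiate_chat(
    quantifier,
    message=f"Criteria:\n{criteria}\n\nSample execution:\n<paste an agent transcript here>",
)
```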

Agents in AutoGen


TL;DR

  • AutoGen agents unify different agent definitions.
  • When talking about multi vs. single agents, it is beneficial to clarify whether we refer to the interface or the architecture.

I often get asked two common questions:

  1. What's an agent?
  2. What are the pros and cons of multi- vs. single-agent?

This blog collects my thoughts from several interviews and recent learnings.

What's an agent?

There are many different definitions of an agent. When building AutoGen, I was looking for the most generic notion that could incorporate all of them, and doing so requires thinking about the minimal set of concepts an agent needs.

In AutoGen, we think of an agent as an entity that can act on behalf of human intent: it can send messages, receive messages, respond to other agents after taking actions, and interact with other agents. We consider this the minimal set of capabilities an agent needs. Agents can have different backends that support them in performing actions and generating replies: some agents use AI models to generate replies, others use functions to generate tool-based replies, and still others relay human input as their way of replying. An agent can also mix these backends, or be a more complex agent that holds internal conversations among multiple agents. On the surface, other agents still perceive it as a single entity to communicate with.

With this definition, we can cover both very simple agents that solve simple tasks with a single backend and agents composed of multiple simpler agents; one can recursively build up more powerful agents. The agent concept in AutoGen covers all these different complexities.
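A minimal sketch (assuming an OAI_CONFIG_LIST file with model credentials): two agents with different backends, one LLM-backed and one human-backed, communicate through the same message interface, and either could be swapped for a more complex, internally multi-agent entity without changing how the other perceives it.

```python
import autogen

# Assumes an OAI_CONFIG_LIST file with model credentials is available.
llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

# An agent whose backend is an LLM.
assistant = autogen.AssistantAgent(name="assistant", llm_config=llm_config)

# An agent whose backend is a human: it relays console input as its replies.
human = autogen.UserProxyAgent(
    name="human",
    human_input_mode="ALWAYS",
    code_execution_config=False,
)

# Each agent perceives the other only as an entity that sends and receives messages.
human.initiate_chat(assistant, message="Summarize the idea of an agent in one sentence.")
```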

AutoDefense - Defend against jailbreak attacks with AutoGen


TL;DR

  • We propose AutoDefense, a multi-agent defense framework using AutoGen to protect LLMs from jailbreak attacks.
  • AutoDefense employs a response-filtering mechanism with specialized LLM agents collaborating to analyze potentially harmful responses.
  • Experiments show that our three-agent defense agency (consisting of an intention analyzer, a prompt analyzer, and a judge) with LLaMA-2-13B effectively reduces the jailbreak attack success rate while maintaining a low false-positive rate on normal user requests.
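For illustration only, a defense agency with this shape can be assembled as an AutoGen group chat; the agent names and system messages below are placeholders, not AutoDefense's actual prompts or implementation.

```python
import autogen

llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

# Placeholder system messages; AutoDefense's real prompts differ.
intention_analyzer = autogen.AssistantAgent(
    name="intention_analyzer",
    system_message="Analyze the intention behind the response under review.",
    llm_config=llm_config,
)
prompt_analyzer = autogen.AssistantAgent(
    name="prompt_analyzer",
    system_message=(
        "Infer what original prompt could have produced this response and "
        "whether it violates the usage policy."
    ),
    llm_config=llm_config,
)
judge = autogen.AssistantAgent(
    name="judge",
    system_message="Given both analyses, output VALID or INVALID for the response.",
    llm_config=llm_config,
)
coordinator = autogen.UserProxyAgent(
    name="coordinator",
    human_input_mode="NEVER",
    code_execution_config=False,
)

groupchat = autogen.GroupChat(
    agents=[coordinator, intention_analyzer, prompt_analyzer, judge],
    messages=[],
    max_round=5,
    speaker_selection_method="round_robin",
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

# Filter a candidate LLM response before it reaches the end user.
coordinator.initiate_chat(manager, message="Response to review: <candidate LLM output>")
```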

What's New in AutoGen?


TL;DR

  • AutoGen has received tremendous interest and recognition.
  • AutoGen has many exciting new features and ongoing research.

Five months have passed since the initial spinoff of AutoGen from FLAML. What have we learned since then? What are the milestones achieved? What's next?

Background

AutoGen was motivated by two big questions:

  • What are future AI applications like?
  • How do we empower every developer to build them?

Last year, I worked with my colleagues and collaborators from Penn State University and the University of Washington on a new multi-agent framework to enable the next generation of applications powered by large language models. We have been building AutoGen as a programming framework for agentic AI, much as PyTorch is for deep learning. We developed AutoGen within the open-source project FLAML, a fast library for AutoML and tuning. After a few studies such as EcoOptiGen and MathChat, we published a technical report about the multi-agent framework in August. In October, we moved AutoGen from FLAML to a standalone repo on GitHub and published an updated technical report.

StateFlow - Build State-Driven Workflows with Customized Speaker Selection in GroupChat

TL;DR: We introduce StateFlow, a task-solving paradigm that conceptualizes complex, LLM-backed task-solving processes as state machines, and show how to use GroupChat to realize the idea with a customized speaker selection function.

Introduction

It is a notable trend to use Large Language Models (LLMs) to tackle complex tasks, e.g., tasks that require a sequence of actions and dynamic interaction with tools and external environments. In this paper, we propose StateFlow, a novel LLM-based task-solving paradigm that conceptualizes complex task-solving processes as state machines. In StateFlow, we distinguish between "process grounding" (via states and state transitions) and "sub-task solving" (through actions within a state), enhancing control and interpretability of the task-solving procedure. A state represents the status of a running process. The transitions between states are controlled by heuristic rules or decisions made by the LLM, allowing for a dynamic and adaptive progression. Upon entering a state, a series of actions is executed, involving not only calls to LLMs guided by different prompts but also the use of external tools as needed.
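A minimal sketch of the mechanism (the agents and states below are placeholders): GroupChat accepts a callable as its speaker selection method, so the state machine can be encoded as a function that inspects the last speaker and returns the next agent, or None to end the chat.

```python
import autogen

llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

# Placeholder agents standing in for the roles of a StateFlow-style workflow.
initializer = autogen.UserProxyAgent(
    name="initializer", human_input_mode="NEVER", code_execution_config=False
)
retriever = autogen.AssistantAgent(name="retriever", llm_config=llm_config)
solver = autogen.AssistantAgent(name="solver", llm_config=llm_config)
verifier = autogen.AssistantAgent(name="verifier", llm_config=llm_config)


def state_transition(last_speaker, groupchat):
    """Encode the state machine: the last speaker determines the next state/agent."""
    if last_speaker is initializer:
        return retriever   # Init -> Retrieve
    if last_speaker is retriever:
        return solver      # Retrieve -> Solve
    if last_speaker is solver:
        return verifier    # Solve -> Verify
    return None            # Verify -> End (None terminates the group chat)


groupchat = autogen.GroupChat(
    agents=[initializer, retriever, solver, verifier],
    messages=[],
    max_round=10,
    speaker_selection_method=state_transition,  # customized speaker selection
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
initializer.initiate_chat(manager, message="Solve the task described below ...")
```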

FSM Group Chat -- User-specified agent transitions


Finite State Machine (FSM) Group Chat allows the user to constrain agent transitions.

TL;DR

Recently, FSM Group Chat was released, which allows the user to input a transition graph to constrain agent transitions. This is useful as the number of agents increases, because the number of possible transition pairs (N choose 2) grows quadratically, increasing the risk of sub-optimal transitions, which wastes tokens and/or leads to poor outcomes. A minimal sketch with placeholder agents follows: the transition graph is expressed as a dictionary mapping each agent to the agents allowed to speak after it, and passed to GroupChat.
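```python
import autogen

llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

# Placeholder agents; in practice each would carry a task-specific system message.
planner = autogen.AssistantAgent(name="planner", llm_config=llm_config)
engineer = autogen.AssistantAgent(name="engineer", llm_config=llm_config)
reviewer = autogen.AssistantAgent(name="reviewer", llm_config=llm_config)
user = autogen.UserProxyAgent(
    name="user", human_input_mode="NEVER", code_execution_config=False
)

# Allowed transitions: key = current speaker, value = agents that may speak next.
allowed_transitions = {
    user: [planner],
    planner: [engineer],
    engineer: [reviewer],
    reviewer: [planner, user],
}

groupchat = autogen.GroupChat(
    agents=[user, planner, engineer, reviewer],
    messages=[],
    max_round=8,
    allowed_or_disallowed_speaker_transitions=allowed_transitions,
    speaker_transitions_type="allowed",
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
user.initiate_chat(manager, message="Plan, implement, and review a small script.")
```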

AutoGen with Custom Models: Empowering Users to Use Their Own Inference Mechanism

TL;DR

AutoGen now supports custom models! This feature empowers users to define and load their own models, allowing for a more flexible and personalized inference mechanism. By adhering to a specific protocol, you can integrate your custom model with AutoGen and respond to prompts however you need, using any model, API call, or even a hardcoded response.

NOTE: Depending on which model you use, you may need to adjust the Agent's default prompts.
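A condensed sketch of the idea, assuming the custom model client protocol described in the AutoGen docs (create, message_retrieval, cost, get_usage) and using a canned echo reply in place of a real model or API call:

```python
from types import SimpleNamespace

import autogen


class CustomModelClient:
    """Minimal client following the ModelClient protocol; the backend below just
    returns a canned reply in place of a real model or API call."""

    def __init__(self, config, **kwargs):
        self.model_name = config.get("model", "custom-model")

    def create(self, params):
        # Return an OpenAI-style response object: choices[i].message.content.
        response = SimpleNamespace(model=self.model_name, choices=[])
        for _ in range(params.get("n", 1)):
            message = SimpleNamespace(content="Hello from my custom model!", function_call=None)
            response.choices.append(SimpleNamespace(message=message))
        return response

    def message_retrieval(self, response):
        return [choice.message.content for choice in response.choices]

    def cost(self, response) -> float:
        return 0.0

    @staticmethod
    def get_usage(response):
        return {}


config_list = [{"model": "custom-model", "model_client_cls": "CustomModelClient"}]
assistant = autogen.AssistantAgent("assistant", llm_config={"config_list": config_list})
assistant.register_model_client(model_client_cls=CustomModelClient)

user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)
user.initiate_chat(assistant, message="Say hello.")
```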

AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents


AutoGenBench is a standalone tool for evaluating AutoGen agents and workflows on common benchmarks.

TL;DR

Today we are releasing AutoGenBench - a tool for evaluating AutoGen agents and workflows on established LLM and agentic benchmarks.

AutoGenBench is a standalone command line tool, installable from PyPI, which handles downloading, configuring, running, and reporting supported benchmarks. AutoGenBench works best when run alongside Docker, since it uses Docker to isolate tests from one another.

Code execution is now by default inside a Docker container

TL;DR

AutoGen 0.2.8 enhances operational safety by making code execution inside a Docker container the default setting. The goal is to keep users informed about how their code is run and to empower them to make informed decisions about code execution.

The new release introduces a breaking change where the use_docker argument is set to True by default in code executing agents. This change underscores our commitment to prioritizing security and safety in AutoGen.
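A minimal sketch of the new default and of opting out explicitly (the work_dir value is just an example):

```python
import autogen

# With AutoGen >= 0.2.8, use_docker defaults to True for code-executing agents,
# so code runs inside a Docker container when Docker is available.
executor = autogen.UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding"},  # use_docker=True is implied
)

# To opt out and run code locally instead, state the choice explicitly.
local_executor = autogen.UserProxyAgent(
    name="local_executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)
```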