
Can you create effective agents with small, 8B-parameter models? Yes, you can.

Small LLMs, like the IBM Granite 8B, are advantageous because they can run locally on a powerful laptop, eliminating the need for API calls to external service providers. With the right techniques, they can also power effective agents.

This notebook demonstrates how to optimize the performance of small language models, such as the IBM Granite 8B, in a Retrieval-Augmented Generation (RAG) workflow. By combining document and web search tools with a high-level planner and reflection-driven agents, it achieves effective, dynamic problem-solving. The notebook also shows how this process can be executed locally using Ollama, with examples provided later.

Workflow Overview:

  1. Plan Creation (Planner):
  • The Planner generates an initial plan based on user instructions.
  2. Plan Execution (Iterative Process):
  • The plan is carried out step-by-step through multiple iterations:
    • Step Selection: If no prior outputs exist, the first step of the plan is executed. Otherwise, the system evaluates the results of previous steps using the Critic before advancing.
    • Information Retrieval: Each step is executed by the Research Assistant, which performs web and document searches to gather relevant data.
  3. Reflection and Adjustment (Reflection Assistant):
  • After each step, the Reflection Assistant evaluates its success and determines how to adapt the plan, if needed. This ensures the workflow remains flexible and responsive to intermediate results.
  4. Final Output:
  • The accumulated contextual information is synthesized into a final response, directly addressing the user’s original query.

By leveraging these tactics, this notebook showcases how smaller models, equipped with effective planning and reflection mechanisms, can deliver impactful results even on resource-constrained systems.

Requirements

Ensure the AG2 (autogen) Python package is installed, along with a few LangChain libraries for retrieval.

! pip install -U ag2[openai] ag2[ollama] langchain_community langchain-milvus langchain_huggingface

Embeddings Model

Here, we need to download an embeddings model for the vector database we will be setting up for document retrieval. In this example, we are using the IBM Granite embeddings model. It comes in a few different flavors, which can be found on Hugging Face. You can also substitute any embeddings model of your choice.

from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
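
As a quick sanity check that the embeddings model loaded correctly, you can embed a sample string and inspect the resulting vector dimension (the query text below is an arbitrary example):

# Embed a sample query and check the vector size (the sample text is arbitrary)
sample_vector = embeddings_model.embed_query("What projects am I working on?")
print(f"Embedding dimension: {len(sample_vector)}")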

Establish the vector database

In this example, we are using Milvus as our vector database.

import tempfile

from langchain_milvus import Milvus

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name  # noqa: SIM115
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

Identify documents to ingest

In the following code, you provide a list of folders and document extensions that you would like to ingest into your vector database for future retrieval. Be sure to personalize the identified directories and file extensions.

from pathlib import Path

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Specify the file extensions you would like to include (e.g., '.txt', '.md')
allowed_extensions = {".txt", ".md"}

# Update this to include the directories you want to scan
source_directories = [
    Path("~/Downloads/").expanduser(),
    Path("~/Documents/").expanduser(),  # Add more directories as needed
]

# Collect files from all specified directories
sources = []
for source_directory in source_directories:
    sources.extend(
        file
        for file in source_directory.glob("**/*")  # Includes all files and subdirectories recursively
        if file.is_file() and file.suffix in allowed_extensions  # Include only files with specific extensions
    )

# Load and process the files
documents = []
for file in sources:
    loader = TextLoader(file)  # Initialize TextLoader for each file
    documents.extend(loader.load())  # Load and extend the documents list

# Split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Now `texts` contains the processed and split documents
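
Before ingesting, you may want to confirm how many files were found and how many chunks were produced:

# Optional: report how many files were loaded and how many chunks were created
print(f"Loaded {len(documents)} documents from {len(sources)} files, split into {len(texts)} chunks")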

Ingest the documents

Finally, after chunking the documents, load them into the vector database.

vector_db.add_documents(texts)
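
Optionally, verify the ingestion with a quick similarity search. The query string below is just a placeholder; substitute something relevant to your own documents:

# Quick check: retrieve the top matches for a sample query (placeholder text)
for doc in vector_db.similarity_search("topics covered in my project notes", k=3):
    print(doc.metadata.get("source"), "->", doc.page_content[:120])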

Agent Prompts

The following are the prompts for the agents that we will define later. Try running with these prompts initially, and update them later if needed for your use case.

PLANNER_MESSAGE = """You are a task planner. You will be given some information your job is to think step by step and enumerate the steps to complete a performance assessment of a given user, using the provided context to guide you.
    You will not execute the steps yourself, but provide the steps to a helper who will execute them. Make sure each step consists of a single operation, not a series of operations. The helper has the following capabilities:
    1. Search through a collection of documents provided by the user. These are the user's own documents and will likely not have latest news or other information you can find on the internet.
    2. Synthesize, summarize and classify the information received.
    3. Search the internet
    Please output the step using a properly formatted python dictionary and list. It must be formatted exactly as below:
    ```{"plan": ["Step 1", "Step 2"]}```

    Respond only with the plan json with no additional fields and no additional text. Here are a few examples:
    Example 1:
    User query: Write a performance self-assessment for Joe, consisting of a high-level overview of achievements for the year, a listing of the business impacts for each of these achievements, a list of skills developed and ways he's collaborated with the team.
    Your response:
    ```{"plan": ["Query documents for all contributions involving Joe this year", "Quantify the business impact for Joe's contributions", "Enumerate the skills Joe has developed this year", "List several examples of how Joe's work has been accomplished via team collaboration", "Formulate the performance review based on collected information"]}```

    Example 2:
    User query: Find the latest news about the technologies I'm working on.
    Your response:
    ```{"plan": ["Query documents for technologies used", "Search the internet for the latest news about each technology"]}```
    """

ASSISTANT_PROMPT = """You are an AI assistant.
    When you receive a message, figure out a solution and provide a final answer. The message will be accompanied with contextual information. Use the contextual information to help you provide a solution.
    Make sure to provide a thorough answer that directly addresses the message you received.
    The context may contain extraneous information that does not apply to your instruction. If so, just extract whatever is useful or relevant and use it to complete your instruction.
    When the context does not include enough information to complete the task, use your available tools to retrieve the specific information you need.
    When you are using knowledge and web search tools to complete the instruction, answer the instruction only using the results from the search; do not supplement with your own knowledge.
    Be persistent in finding the information you need before giving up.
    If the task is able to be accomplished without using tools, then do not make any tool calls.
    When you have accomplished the instruction posed to you, you will reply with the text: ##SUMMARY## - followed by an answer.
    If you are using knowledge and web search tools, make sure to provide the URL for the page you are using as your source or the document name.
    Important: If you are unable to accomplish the task, whether it's because you could not retrieve sufficient data, or any other reason, reply only with ##TERMINATE##.

    # Tool Use
    You have access to the following tools. Only use these available tools and do not attempt to use anything not listed - this will cause an error.
    Respond in the format: <function_call> {"name": function name, "arguments": dictionary of argument name and its value}. Do not use variables.
    Only call one tool at a time.
    When suggesting tool calls, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.
    """

REFLECTION_ASSISTANT_PROMPT = """You are an assistant. Please tell me what is the next step that needs to be taken in a plan in order to accomplish a given task.
    You will receive json in the following format, and will respond with a single line of instruction.

    {
        "Goal": The original query from the user. Every time you create a reply, it must be guided by the task of fulfilling this goal. Do not veer off course.,
        "Plan": An array that enumerates every step of the plan,
        "Previous Step": The step taken immediately prior to this message.
        "Previous Output": The output generated by the last step taken.
        "Steps Taken": A sequential array of steps that have already been executed prior to the last step,

    }

    Instructions:
        1. If the very last step of the plan has already been executed, or the goal has already been achieved regardless of what step is next, then reply with the exact text: ##TERMINATE##
        2. Look at the "Previous Step". If the previous step was not successful and it is integral to achieving the goal, think of how it can be retried with better instructions. Inspect why the previous step was not successful, and modify the instruction to find another way to achieve the step's objective in a way that won't repeat the same error.
        3. If the previous step was successful, determine what the next step will be. Always prefer to execute the next sequential step in the plan unless the previous step was unsuccessful and you need to re-run the previous step using a modified instruction.
        4. When determining the next step, you may use the "Previous Step", "Previous Output", and "Steps Taken" to give you contextual information to decide what next step to take.

    Be persistent and resourceful to make sure you reach the goal.
    """

CRITIC_PROMPT = """The previous instruction was {last_step} \nThe following is the output of that instruction.
    If the output of the instruction completely satisfies the instruction, then reply with ##YES##.
    For example, if the instruction is to list companies that use AI, then the output contains a list of companies that use AI.
    If the output contains the phrase 'I'm sorry but...' then it is likely not fulfilling the instruction. \n
    If the output of the instruction does not properly satisfy the instruction, then reply with ##NO## and the reason why.
    For example, if the instruction was to list companies that use AI but the output does not contain a list of companies, or states that a list of companies is not available, then the output did not properly satisfy the instruction.
    If it does not satisfy the instruction, please think about what went wrong with the previous instruction and give me an explanation along with the text ##NO##. \n
    Previous step output: \n {last_output}"""

SEARCH_TERM_ASSISTANT_PROMPT = """You are an expert at creating precise, complete, and accurate web search queries. When given a description of what a user is looking for, you will generate a fully formed, optimized search query that can be used directly in a search engine to find the most relevant information.

    Key Requirements:

        Stick to the Description: Use only the information explicitly provided in the description. Do not add, assume, or invent details that are not stated.
        Be Complete and Concise: The search query must contain all necessary keywords and phrases required to fulfill the description without being unnecessarily verbose.
        Avoid Vague or Placeholder Terms: Do not include incomplete terms (e.g., no placeholder variables or references to unspecified concepts).
        Use Proper Context and Refinement: Include context, if applicable (e.g., location, date, format). Utilize search modifiers like quotes, "site:", "filetype:", or Boolean operators (AND, OR, NOT) to refine the query when appropriate.
        Avoid Hallucination: Do not make up or fabricate any details that are not explicitly stated in the description.

    Example Input:
    "Find the latest research papers about AI-driven medical imaging published in 2023."

    Example Output:
    "latest research papers on AI-driven medical imaging 2023"

    Another Example Input:
    "Find a website that lists the top restaurants in Paris with outdoor seating."

    Example Output:
    "top restaurants in Paris with outdoor seating"

    Incorrect Example Input:
    "Find the population of Atlantis."

    Incorrect Example Output:
    "population of Atlantis 2023" (This is incorrect because the existence or details about Atlantis are not explicitly stated in the input and must not be assumed.)

    Your Turn:
    Generate a complete, accurate, and optimized search query based on the description provided below:
    """

Ollama Setup

If you are running inference locally and do not have Ollama set up already, install the Ollama application from ollama.com, then use the following commands to install the Python client and start the server.

! pip install ollama
! ollama serve

Pull the Ollama model

If you are using Ollama, you will need to pull the relevant LLM. In this example, we are using the Granite 3.1 dense model with 8B parameters. It is capable enough to perform well in this flow but small enough to run on a beefy personal machine.

! ollama pull granite3.1-dense:8b

Configuration Variables

Customize the following LLM configuration variables as needed.

# Ollama URL
base_url = "http://localhost:11434"

# API key for LLM inferencing. This default value can be used when running Ollama locally.
api_key = "ollama"

# LLM to use for all agents. The value below corresponds to its name in Ollama.
default_model = "granite3.1-dense:8b"

# Model temperature. A lower temperature gives more predictable results.
model_temp = 0

# Maximum number of steps that are allowed to be executed in a plan (prevents a never-ending loop)
max_plan_steps = 6

AG2 Agent Setup

Initializes the LLM config and all of the agents of the workflow.

from autogen import ConversableAgent, LLMConfig, coding

##################
# AG2 Config
##################
# LLM Config
llm_config = LLMConfig(
    config_list=[
        {
            "model": default_model,
            "client_host": base_url,
            "api_type": "ollama",
            "cache_seed": None,
            "price": [0.0, 0.0],
        }
    ],
    temperature=model_temp,
)

# Generic Assistant - Used for general inquiry. Does not call tools.
generic_assistant = ConversableAgent(name="Generic_Assistant", llm_config=llm_config, human_input_mode="NEVER")

# Search Term Assistant - Used for finding relevant search terms for a user's query
web_search_assistant = ConversableAgent(
    name="Web_Search_Term_Assistant",
    system_message=SEARCH_TERM_ASSISTANT_PROMPT,
    llm_config=llm_config,
    human_input_mode="NEVER",
)

# Provides the initial high level plan
planner = ConversableAgent(
    name="Planner", system_message=PLANNER_MESSAGE, llm_config=llm_config, human_input_mode="NEVER"
)

# The assistant agent is responsible for executing each step of the plan, including calling tools
assistant = ConversableAgent(
    name="Research_Assistant",
    system_message=ASSISTANT_PROMPT,
    llm_config=llm_config,
    human_input_mode="NEVER",
    is_termination_msg=lambda msg: "tool_response" not in msg and msg["content"] == "",
)

# Critiques the output of other agents. Prompt will be fed in for each call.
critic = ConversableAgent(name="Critic", llm_config=llm_config, human_input_mode="NEVER")

# Summary agent clarifies another agent's reply in context to the original instruction. Prompt will be fed in for each call.
summary_agent = ConversableAgent(name="SummaryAssistant", llm_config=llm_config, human_input_mode="NEVER")

# Reflection Assistant: Reflect on plan progress and give the next step
reflection_assistant = ConversableAgent(
    name="ReflectionAssistant",
    system_message=REFLECTION_ASSISTANT_PROMPT,
    llm_config=llm_config,
    human_input_mode="NEVER",
)

# User Proxy chats with assistant on behalf of user and executes tools
code_exec = coding.LocalCommandLineCodeExecutor(
    timeout=10,
    work_dir="code_exec",
)
user_proxy = ConversableAgent(
    name="User",
    human_input_mode="NEVER",
    code_execution_config={"executor": code_exec},
    is_termination_msg=lambda msg: "##SUMMARY##" in msg["content"]
    or "## Summary" in msg["content"]
    or "##TERMINATE##" in msg["content"]
    or ("tool_calls" not in msg and msg["content"] == ""),
)

Document Search Tool

The following function is registered as a tool for searching the previously created vector database for relevant documents.

from typing import Annotated


@assistant.register_for_llm(
    name="personal_knowledge_search", description="Searches personal documents according to a given query"
)
@user_proxy.register_for_execution(name="personal_knowledge_search")
def do_knowledge_search(search_instruction: Annotated[str, "search instruction"]) -> str:
    """Given an instruction on what knowledge you need to find, search the user's documents for information particular to them, their projects, and their domain.
    This is simple document search, it cannot perform any other complex tasks.
    This will not give you any results from the internet. Do not assume it can retrieve the latest news pertaining to any subject."""
    if not search_instruction:
        return "Please provide a search query."

    messages = ""
    # docs = vector_db.similarity_search(search_instruction)
    retriever = vector_db.as_retriever(search_kwargs={"fetch_k": 10, "max_tokens": 500})

    docs = retriever.invoke(search_instruction)
    print(f"{len(docs)} documents returned")
    for d in docs:
        print(d)
        print(d.page_content)
        messages += d.page_content + "\n"

    return messages

Web Search Tool

The following is a tool that can be registered for web search usage. For a quick and easy demonstration, the implementation below uses the googlesearch-python library to perform the search, which is not officially supported by Google. If you require the web search tool beyond a PoC, you should use the official library and APIs of your favorite search provider, or a multi-engine service such as Tavily.

! pip install googlesearch-python
import traceback
from datetime import date
from typing import Annotated

from googlesearch import search


@assistant.register_for_llm(name="web_search", description="Searches the web according to a given query")
@user_proxy.register_for_execution(name="web_search")
def do_web_search(
    search_instruction: Annotated[
        str,
        "Provide a detailed search instruction that incorporates specific features, goals, and contextual details related to the query. \
                                                Identify and include relevant aspects from any provided context, such as key topics, technologies, challenges, timelines, or use cases. \
                                                Construct the instruction to enable a targeted search by specifying important attributes, keywords, and relationships within the context.",
    ],
) -> str:
    """This function is used for searching the web for information that can only be found on the internet, not in the users personal notes."""
    if not search_instruction:
        return "Please provide a search query."

    # First, we convert the incoming query into a search term.
    today = date.today().strftime("%Y-%m-%d")

    chat_result = user_proxy.initiate_chat(
        recipient=web_search_assistant,
        message="Today's date is " + today + ". " + search_instruction,
        max_turns=1,
    )
    summary = chat_result.chat_history[-1]["content"]

    results = []

    try:
        response = search(summary, advanced=True)
        for result in response:
            entry = {}
            if type(result) is not str:
                entry["title"] = result.title
                entry["url"] = result.url
                entry["description"] = result.description
                results.append(entry)
            else:
                results.append(result)

    except Exception as e:
        print(e)
        print(traceback.format_exc())
        return f"Unable to execute search query due to the following exception: {e}"

    return str(results)

Plan parser

When the initial plan is formed, the LLM should output the plan in JSON format. The following parses the response from the planner and returns the response as a dictionary.

import json
from typing import Any


def parse_response(message: str) -> dict[str, Any]:
    """
    Parse the response from the planner and return the response as a dictionary.
    """
    # Parse the response content
    json_response = {}
    # if message starts with ``` and ends with ``` then remove them
    if message.startswith("```"):
        message = message[3:]
    if message.endswith("```"):
        message = message[:-3]
    if message.startswith("json"):
        message = message[4:]
    if message.startswith("python"):
        message = message[6:]
    message = message.strip()
    try:
        json_response: dict[str, Any] = json.loads(message)
    except Exception as e:
        # If the response is not valid JSON, try to parse it using string matching.
        # This should seldom be triggered
        print(
            f'LLM response was not properly formed JSON. Will try to use it as is. LLM response: "{message}". Error: {e}'
        )
        message = message.replace("\\n", "\n")
        message = message.replace("\n", " ")  # type: ignore
        if "plan" in message and "next_step" in message:
            start = message.index("plan") + len("plan")
            end = message.index("next_step")
            json_response["plan"] = message[start:end].replace('"', "").strip()

    return json_response
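
For example, a planner reply wrapped in a JSON code fence parses into a plain dictionary (the sample plan is taken from the prompt examples above):

# Example: parse a code-fenced planner reply
example_reply = '```json\n{"plan": ["Query documents for technologies used", "Search the internet for the latest news about each technology"]}\n```'
print(parse_response(example_reply))
# {'plan': ['Query documents for technologies used', 'Search the internet for the latest news about each technology']}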

The Workflow

Here is the meat of the agentic workflow. The key point is that, to extract maximum performance, the agents are called in a curated sequence, and the data passed between them is trimmed to what is relevant rather than passing around the full chat history.

#########################
# Begin Agentic Workflow
#########################


def run_agentic_workflow(user_message: str) -> str:
    """
    Run the agentic workflow.
    """

    # Make a plan
    raw_plan = user_proxy.initiate_chat(message=user_message, max_turns=1, recipient=planner).chat_history[-1][
        "content"
    ]
    plan_dict = parse_response(raw_plan)

    # Start executing plan
    answer_output = []  # This variable tracks the output of previous successful steps as context for executing the next step
    steps_taken = []  # A list of steps already executed
    last_output = ""  # Output of the single previous step gets put here
    last_step = ""

    for _ in range(max_plan_steps):
        if last_output == "":
            # This is the first step of the plan since there's no previous output
            instruction = plan_dict["plan"][0]
        else:
            # Previous steps in the plan have already been executed.
            reflection_message = last_step
            # Ask the critic if the previous step was properly accomplished
            was_job_accomplished = user_proxy.initiate_chat(
                recipient=critic,
                max_turns=1,
                message=CRITIC_PROMPT.format(last_step=last_step, last_output=last_output),
            ).chat_history[-1]["content"]
            # If it was not accomplished, make sure an explanation is provided for the reflection assistant
            if "##NO##" in was_job_accomplished:
                reflection_message = f"The previous step was {last_step} but it was not accomplished satisfactorily due to the following reason: \n {was_job_accomplished}."

            # Then, ask the reflection agent for the next step
            message = {
                "Goal": user_message,
                "Plan": str(plan_dict),
                "Last Step": reflection_message,
                "Last Step Output": str(last_output),
                "Steps Taken": str(steps_taken),
            }
            instruction = user_proxy.initiate_chat(
                recipient=reflection_assistant, max_turns=1, message=str(message)
            ).chat_history[-1]["content"]

            # Only append the previous step and its output to the record if it accomplished its task successfully.
            # It was found that storing information about unsuccessful steps causes more confusion than help to the agents
            if "##NO##" not in was_job_accomplished:
                answer_output.append(last_output)
                steps_taken.append(last_step)

            if "##TERMINATE##" in instruction:
                # A termination message means there are no more steps to take. Exit the loop.
                break

        # Now that we have determined the next step to take, execute it
        prompt = instruction
        if answer_output:
            prompt += f"\n Contextual Information: \n{answer_output}"
        output = user_proxy.initiate_chat(recipient=assistant, max_turns=3, message=prompt)

        # Sort through the chat history and extract out replies from the assistant (We don't need the full results of the tool calls, just the assistant's summary)
        previous_output = []
        for chat_item in output.chat_history:
            if chat_item["content"] and chat_item["name"] == "Research_Assistant":
                previous_output.append(chat_item["content"])

        # It was found in testing that the output of the assistant will often contain the right information, but it will not be formatted in a manner that directly answers the instruction
        # Therefore, the summary assistant will take the assistant's output and reformat it to more directly answer the instruction that was given to the assistant
        summary_output = user_proxy.initiate_chat(
            recipient=summary_agent,
            max_turns=1,
            message=f"The instruction is: {instruction} Please directly answer the instruction given the following data: {previous_output}",
        )

        # The previous instruction and its output will be recorded for the next iteration to inspect before determining the next step of the plan
        last_output = summary_output.chat_history[-1]["content"]
        last_step = instruction

    # Now that we've gathered all the information we need, we will summarize it to directly answer the original prompt
    final_prompt = (
        f"Answer the user's query: {user_message}. Using the following contextual information only: {answer_output}"
    )
    final_output = user_proxy.initiate_chat(
        message=final_prompt, max_turns=1, recipient=generic_assistant
    ).chat_history[-1]["content"]

    return final_output

Run a query

Below is an example query. You can replace it with your own query based upon the types of information you have loaded into the vector database.

answer = run_agentic_workflow(
    user_message="Identify the key technical features of my projects and for each feature, fetch me the latest news articles in the tech industry related to that feature."
)
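
The returned string is the final synthesized answer; print it to inspect the result:

print(answer)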