Can you create effective agents with small, 8B-parameter models? Yes,
you can - with the right techniques.
Small LLMs, like the IBM Granite 8B, are advantageous because they can
run locally on a powerful laptop, eliminating the need for API calls to
external service providers.
This notebook demonstrates how to optimize the performance of small
language models, such as the IBM Granite 8B, in a Retrieval-Augmented
Generation (RAG) workflow. By combining document and web search tools
with a high-level planner and reflection-driven agents, it achieves
effective, dynamic problem-solving. The notebook also shows how this
process can be executed locally using Ollama, with examples provided
later.
Workflow Overview:
- Plan Creation (Planner): The Planner generates an initial plan based on user instructions.
- Plan Execution (Iterative Process): The plan is carried out step-by-step through multiple iterations:
  - Step Selection: If no prior outputs exist, the first step of the plan is executed. Otherwise, the system evaluates the results of previous steps using the Generic Assistant before advancing.
  - Information Retrieval: Each step is executed by the Research Assistant, which performs web and document searches to gather relevant data.
- Reflection and Adjustment (Reflection Assistant): After each step, the Reflection Assistant evaluates its success and determines how to adapt the plan, if needed. This ensures the workflow remains flexible and responsive to intermediate results.
- Final Output: The accumulated contextual information is synthesized into a final response, directly addressing the user’s original query.
By leveraging these tactics, this notebook showcases how smaller models,
equipped with effective planning and reflection mechanisms, can deliver
impactful results even on resource-constrained systems.
Requirements
Ensure the autogen Python package is installed, along with a few
LangChain libraries for retrieval.
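For example, a notebook cell along these lines installs them (the exact package set is an assumption based on the libraries referenced later; adjust to your environment):

```python
# Package names are assumptions based on the libraries used in this
# walkthrough -- adjust to match your environment.
%pip install autogen langchain-community langchain-huggingface langchain-milvus langchain-text-splitters googlesearch-python
```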
Embeddings Model
Here, we need to download an embeddings model for the vector database we
will be setting up for document retrieval. In this example, we are using
the IBM Granite embeddings model. It comes in a few different flavors,
which can be found on Hugging
Face. You can also substitute any
embeddings model of your choice.
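A minimal sketch, assuming the LangChain Hugging Face integration and the 30M English Granite embeddings variant (the model ID is an assumption; substitute your preferred one):

```python
from langchain_huggingface import HuggingFaceEmbeddings

# The model ID is an assumption -- any Granite embeddings variant (or
# another embeddings model entirely) can be substituted here.
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
```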
Establish the vector database
In this example, we are using Milvus as our vector database.
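A minimal setup, assuming the langchain-milvus integration with Milvus Lite (the URI and collection name are placeholders):

```python
from langchain_milvus import Milvus

# Milvus Lite keeps the collection in a local file, which suits a laptop
# demo. The URI and collection name are placeholders.
vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": "./milvus_rag.db"},
    collection_name="documents",
    auto_id=True,
)
```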
Identify documents to ingest
In the following code, you provide a list of folders and document
extensions that you would like to ingest into your vector database for
future retrieval. Be sure to personalize the identified directories and
file extensions.
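For example (the paths and extensions below are placeholders):

```python
# Placeholders -- point these at your own directories and file types.
folders = ["./docs", "./notes"]
extensions = [".md", ".txt"]
```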
Ingest the documents
Finally, after chunking the documents, load them into the vector
database.
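A sketch of that ingestion, assuming plain-text files and LangChain's recursive splitter (chunk sizes are untuned defaults; binary formats such as PDF need their own loaders):

```python
import os

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Overlapping chunks keep each embedding focused on a small span of text;
# the sizes below are reasonable defaults, not tuned values.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

docs = []
for folder in folders:
    for root, _, files in os.walk(folder):
        for name in files:
            if os.path.splitext(name)[1].lower() in extensions:
                docs.extend(TextLoader(os.path.join(root, name)).load())

vector_db.add_documents(splitter.split_documents(docs))
```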
Agent Prompts
The following are the prompts for the agents that we will define later.
Try running with these prompts initially, and update them later if
needed for your use case.
PLANNER_MESSAGE = """You are a task planner. You will be given some information; your job is to think step by step and enumerate the steps to complete a performance assessment of a given user, using the provided context to guide you.
You will not execute the steps yourself, but provide the steps to a helper who will execute them. Make sure each step consists of a single operation, not a series of operations. The helper has the following capabilities:
1. Search through a collection of documents provided by the user. These are the user's own documents and will likely not have latest news or other information you can find on the internet.
2. Synthesize, summarize and classify the information received.
3. Search the internet
Please output the plan using a properly formatted python dictionary and list. It must be formatted exactly as below:
```{"plan": ["Step 1", "Step 2"]}```
Respond only with the plan JSON, with no additional fields and no additional text. Here are a few examples:
Example 1:
User query: Write a performance self-assessment for Joe, consisting of a high-level overview of achievements for the year, a listing of the business impacts for each of these achievements, a list of skills developed and ways he's collaborated with the team.
Your response:
```{"plan": ["Query documents for all contributions involving Joe this year", "Quantify the business impact for Joe's contributions", "Enumerate the skills Joe has developed this year", "List several examples of how Joe's work has been accomplished via team collaboration", "Formulate the performance review based on collected information"]}```
Example 2:
User query: Find the latest news about the technologies I'm working on.
Your response:
```{"plan": ["Query documents for technologies used", "Search the internet for the latest news about each technology"]}```
"""
ASSISTANT_PROMPT = """You are an AI assistant.
When you receive a message, figure out a solution and provide a final answer. The message will be accompanied with contextual information. Use the contextual information to help you provide a solution.
Make sure to provide a thorough answer that directly addresses the message you received.
The context may contain extraneous information that does not apply to your instruction. If so, just extract whatever is useful or relevant and use it to complete your instruction.
When the context does not include enough information to complete the task, use your available tools to retrieve the specific information you need.
When you are using knowledge and web search tools to complete the instruction, answer the instruction only using the results from the search; do not supplement with your own knowledge.
Be persistent in finding the information you need before giving up.
If the task can be accomplished without using tools, then do not make any tool calls.
When you have accomplished the instruction posed to you, you will reply with the text:
If you are using knowledge and web search tools, make sure to provide the URL for the page you are using as your source or the document name.
Important: If you are unable to accomplish the task, whether it's because you could not retrieve sufficient data, or any other reason, reply only with
You have access to the following tools. Only use these available tools and do not attempt to use anything not listed - this will cause an error.
Respond in the format: <function_call> {"name": function name, "arguments": dictionary of argument name and its value}. Do not use variables.
Only call one tool at a time.
When suggesting tool calls, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.
"""
REFLECTION_ASSISTANT_PROMPT = """You are an assistant. Please tell me what is the next step that needs to be taken in a plan in order to accomplish a given task.
You will receive JSON in the following format, and will respond with a single line of instruction.
{
  "Goal": The original query from the user. Every time you create a reply, it must be guided by the task of fulfilling this goal; do not veer off course,
  "Plan": An array that enumerates every step of the plan,
  "Previous Step": The step taken immediately prior to this message,
  "Previous Output": The output generated by the last step taken,
  "Steps Taken": A sequential array of steps that have already been executed prior to the last step
}
Instructions:
1. If the very last step of the plan has already been executed, or the goal has already been achieved regardless of what step is next, then reply with the exact text:
2. Look at the "Previous Step". If the previous step was not successful and it is integral to achieving the goal, think of how it can be retried with better instructions. Inspect why the previous step was not successful, and modify the instruction to find another way to achieve the step's objective in a way that won't repeat the same error.
3. If the previous step was successful, determine what the next step will be. Always prefer to execute the next sequential step in the plan, unless the previous step was unsuccessful and you need to re-run it with a modified instruction.
4. When determining the next step, you may use the "Previous Step", "Previous Output", and "Steps Taken" to give you contextual information to decide what next step to take.
Be persistent and resourceful to make sure you reach the goal.
"""
CRITIC_PROMPT = """The previous instruction was {last_step} \nThe following is the output of that instruction.
If the output of the instruction completely satisfies the instruction, then reply with
For example, if the instruction is to list companies that use AI, then the output contains a list of companies that use AI.
If the output contains the phrase 'I'm sorry but...' then it is likely not fulfilling the instruction. \n
If the output of the instruction does not properly satisfy the instruction, then reply with
For example, if the instruction was to list companies that use AI but the output does not contain a list of companies, or states that a list of companies is not available, then the output did not properly satisfy the instruction.
If it does not satisfy the instruction, please think about what went wrong with the previous instruction and give me an explanation along with the text
Previous step output: \n {last_output}"""
SEARCH_TERM_ASSISTANT_PROMPT = """You are an expert at creating precise, complete, and accurate web search queries. When given a description of what a user is looking for, you will generate a fully formed, optimized search query that can be used directly in a search engine to find the most relevant information.
Key Requirements:
Stick to the Description: Use only the information explicitly provided in the description. Do not add, assume, or invent details that are not stated.
Be Complete and Concise: The search query must contain all necessary keywords and phrases required to fulfill the description without being unnecessarily verbose.
Avoid Vague or Placeholder Terms: Do not include incomplete terms (e.g., no placeholder variables or references to unspecified concepts).
Use Proper Context and Refinement: Include context, if applicable (e.g., location, date, format). Utilize search modifiers like quotes, "site:", "filetype:", or Boolean operators (AND, OR, NOT) to refine the query when appropriate.
Avoid Hallucination: Do not make up or fabricate any details that are not explicitly stated in the description.
Example Input:
"Find the latest research papers about AI-driven medical imaging published in 2023."
Example Output:
"latest research papers on AI-driven medical imaging 2023"
Another Example Input:
"Find a website that lists the top restaurants in Paris with outdoor seating."
Example Output:
"top restaurants in Paris with outdoor seating"
Incorrect Example Input:
"Find the population of Atlantis."
Incorrect Example Output:
"population of Atlantis 2023" (This is incorrect because the existence or details about Atlantis are not explicitly stated in the input and must not be assumed.)
Your Turn:
Generate a complete, accurate, and optimized search query based on the description provided below:
"""
Ollama Setup
If you are running inference locally and do not have Ollama set up
already, use the following commands to do so.
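For example, on Linux (this is Ollama's official install script; see https://ollama.com/download for other platforms, and run ollama serve in a separate terminal if the server is not already running as a service):

```python
# Ollama's official install script (Linux). Run `ollama serve` afterward
# if the server is not already running as a background service.
!curl -fsSL https://ollama.com/install.sh | sh
```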
Pull the Ollama model
If you are using Ollama, you will need to pull the relevant LLM. In this
example, we are using the Granite 3.1 dense model with 8B parameters. It
is capable enough to perform well in this flow but small enough to run
on a beefy personal machine.
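For example (the tag is an assumption; check the Ollama library for the exact Granite 3.1 dense 8B tag available to you):

```python
# The tag is an assumption -- confirm it against the Ollama model library.
!ollama pull granite3.1-dense:8b
```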
Configuration Variables
Customize the following LLM configuration variables as needed.
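For example (all values below are placeholders for a local Ollama setup):

```python
# Placeholders for a local Ollama setup -- adjust for your environment.
MODEL_NAME = "granite3.1-dense:8b"        # the model tag pulled above
BASE_URL = "http://localhost:11434/v1"    # Ollama's OpenAI-compatible endpoint
API_KEY = "ollama"                        # Ollama ignores the key, but clients require one
```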
AG2 Agent Setup
This initializes the LLM config and all of the agents in the workflow.
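A minimal sketch of what that setup might look like, assuming AG2's AssistantAgent and an OpenAI-compatible client config pointed at Ollama (agent names are illustrative; the system messages are the prompts defined above):

```python
from autogen import AssistantAgent

# Point AG2's OpenAI-compatible client at the local Ollama endpoint.
llm_config = {
    "config_list": [
        {"model": MODEL_NAME, "base_url": BASE_URL, "api_key": API_KEY}
    ]
}

# Agent names are illustrative; the system messages come from the prompts above.
planner = AssistantAgent("planner", system_message=PLANNER_MESSAGE, llm_config=llm_config)
research_assistant = AssistantAgent("research_assistant", system_message=ASSISTANT_PROMPT, llm_config=llm_config)
reflection_assistant = AssistantAgent("reflection_assistant", system_message=REFLECTION_ASSISTANT_PROMPT, llm_config=llm_config)
generic_assistant = AssistantAgent("generic_assistant", llm_config=llm_config)
```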
The following function is registered as a tool for searching the
previously created vector database for relevant documents.
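A sketch of such a tool, assuming the vector_db object from earlier and AG2's register_for_llm decorator (in a full setup, a matching register_for_execution() call on whichever agent executes tools is also needed):

```python
from typing import Annotated

def doc_search(query: Annotated[str, "Natural-language query for the document store"]) -> str:
    """Return the most relevant chunks from the vector database."""
    results = vector_db.similarity_search(query, k=5)
    return "\n\n".join(doc.page_content for doc in results)

# Expose the tool to the model so it can be suggested in tool calls.
research_assistant.register_for_llm(description="Search the user's documents")(doc_search)
```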
The following is a tool that can be registered for web search usage. For
a quick and easy demonstration, the implementation below uses the
googlesearch-python
library to perform the search, which is not
officially supported by Google. If you require the web search tool
beyond a PoC, you should use the official library and APIs of your
favorite search provider, or a multi-engine service such as Tavily.
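A sketch of that tool, assuming googlesearch-python's advanced result objects (the field names reflect recent versions of the library):

```python
from typing import Annotated

from googlesearch import search

def web_search(query: Annotated[str, "Web search query"]) -> str:
    """Return titles, URLs, and snippets for the top web results."""
    # advanced=True yields result objects with url/title/description fields
    # in recent versions of googlesearch-python.
    results = search(query, num_results=5, advanced=True)
    return "\n\n".join(f"{r.title}\n{r.url}\n{r.description}" for r in results)
```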
Plan parser
When the initial plan is formed, the LLM should output the plan in JSON
format. The following parses the planner's response and returns it as a
dictionary.
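A sketch of such a parser, tolerant of the code fences the planner is prompted to emit:

```python
import json
import re

def parse_plan(response: str) -> dict:
    """Extract the plan JSON from the planner's reply, tolerating code fences."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in planner response: {response!r}")
    return json.loads(match.group(0))
```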
The Workflow
Here is the meat of the agentic workflow. The key point is that, to
extract maximum performance, a set of agents is called in sequence, and
the data passed between them is curated and trimmed rather than passing
around the full chat history.
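A simplified sketch of that loop, assuming the agents and parser defined above (the generate_reply call pattern and the "DONE" check are assumptions -- match the exact termination text your reflection prompt specifies -- and the Generic Assistant's critic evaluation is omitted for brevity):

```python
import json

def run_workflow(goal: str, max_iterations: int = 10) -> str:
    """Simplified control loop: plan, execute, reflect, synthesize."""
    # 1. The Planner produces the initial plan from the user's goal.
    plan_reply = planner.generate_reply(messages=[{"role": "user", "content": goal}])
    plan = parse_plan(plan_reply)["plan"]

    steps_taken = []
    step, last_output = plan[0], ""
    for _ in range(max_iterations):
        # 2. The Research Assistant executes one step, using its tools,
        #    with only the trimmed prior output passed along as context.
        last_output = research_assistant.generate_reply(
            messages=[{"role": "user", "content": f"{step}\n\nContext:\n{last_output}"}]
        )
        steps_taken.append(step)

        # 3. The Reflection Assistant inspects progress and emits the next step.
        state = {"Goal": goal, "Plan": plan, "Previous Step": step,
                 "Previous Output": last_output, "Steps Taken": steps_taken}
        step = reflection_assistant.generate_reply(
            messages=[{"role": "user", "content": json.dumps(state)}]
        )
        # "DONE" is a placeholder -- use the exact termination text your
        # reflection prompt instructs the model to emit.
        if not step or "DONE" in step:
            break

    # 4. Synthesize a final answer from the accumulated context.
    return generic_assistant.generate_reply(
        messages=[{"role": "user", "content": f"{goal}\n\nContext:\n{last_output}"}]
    )
```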
Run a query
Below is an example query. You can replace it with your own query based
upon the types of information you have loaded into the vector database.
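For example, reusing the sample query from the planner prompt (hypothetical; swap in a question your ingested documents can actually answer):

```python
# Hypothetical example -- replace with a query grounded in your own documents.
answer = run_workflow("Find the latest news about the technologies I'm working on.")
print(answer)
```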