Cerebras has developed the world’s largest and fastest AI processor, the Wafer-Scale Engine-3 (WSE-3). Notably, the CS-3 system can run large language models like Llama-3.1-8B and Llama-3.1-70B at extremely fast speeds, making it an ideal platform for demanding AI workloads.
While it’s technically possible to adapt AG2 to work with Cerebras’ API simply by updating the `base_url`, that approach may not fully account for minor differences in parameter support. Using the dedicated Cerebras client class also enables tracking of API costs based on actual token usage.
For more information about Cerebras Cloud, visit cloud.cerebras.ai. Their API reference is available at inference-docs.cerebras.ai.
## Requirements

To use Cerebras with AG2, install the `ag2[cerebras]` extra.
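For example, with pip:

```bash
pip install "ag2[cerebras]"
```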
## Getting Started

Cerebras provides a number of models to use. See the list of available models in the Cerebras API reference at inference-docs.cerebras.ai.

The sample `OAI_CONFIG_LIST` below shows how the Cerebras client class is used, by specifying the `api_type` as `cerebras`.
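A minimal sketch (the model name is illustrative; use any model Cerebras serves):

```python
[
    {
        "model": "llama-3.3-70b",
        "api_key": "your Cerebras API Key goes here",
        "api_type": "cerebras"
    }
]
```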
## Credentials

Get an API Key from cloud.cerebras.ai and add it to your environment variables:
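For example, in a Unix-like shell:

```bash
export CEREBRAS_API_KEY="your_api_key_here"
```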
## API parameters

The following parameters can be added to your config for the Cerebras API. See the Cerebras API reference at inference-docs.cerebras.ai for further information on them and their default values.

- `max_tokens` (integer >= 0, or null)
- `seed` (number)
- `stream` (True or False)
- `temperature` (number, between 0 and 1.5)
- `top_p` (number)
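Example, with illustrative (not default) values:

```python
[
    {
        "model": "llama-3.3-70b",
        "api_key": "your Cerebras API Key goes here",
        "api_type": "cerebras",
        "max_tokens": 10000,
        "seed": 1234,
        "stream": True,
        "temperature": 0.5
    }
]
```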
## Two-Agent Coding Example

In this example, we run a two-agent chat with an AssistantAgent (primarily a coding agent) that generates code to count the number of prime numbers between 1 and 10,000; the code is then executed by the user proxy agent.

We’ll use Meta’s Llama-3.3-70B model, which is well suited to coding.
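A minimal sketch of the setup, assuming a local command-line code executor (the agent names, system message, and `coding` working directory are illustrative):

```python
import os
from pathlib import Path

from autogen import AssistantAgent, UserProxyAgent
from autogen.coding import LocalCommandLineCodeExecutor

config_list = [
    {
        "model": "llama-3.3-70b",
        "api_key": os.environ.get("CEREBRAS_API_KEY"),
        "api_type": "cerebras",
    }
]

# Executor that runs the generated code locally in a working directory
workdir = Path("coding")
workdir.mkdir(exist_ok=True)
code_executor = LocalCommandLineCodeExecutor(work_dir=workdir)

# The user proxy executes the code and terminates the chat when it sees "FINISH"
user_proxy_agent = UserProxyAgent(
    name="User",
    code_execution_config={"executor": code_executor},
    is_termination_msg=lambda msg: "FINISH" in (msg.get("content") or ""),
)

# The coding agent, backed by Llama-3.3-70B on Cerebras
assistant_agent = AssistantAgent(
    name="Cerebras Assistant",
    system_message="""You are a helpful AI assistant who writes code that the user executes.
Reply with the word "FINISH" once the user has provided the result.""",
    llm_config={"config_list": config_list},
)

chat_result = user_proxy_agent.initiate_chat(
    assistant_agent,
    message="Provide code to count the number of prime numbers from 1 to 10000.",
)
```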
When run, the assistant supplies the code and asks the user proxy to execute it (“Please execute this code. I will respond with ‘FINISH’ after you provide the result.”); since no human input is received, the user proxy auto-replies with the execution output and the chat terminates on “FINISH”.
## Tool Call Example
In this example, instead of writing code, we will show how Meta's Llama-3.3-70B model can perform parallel tool calling, where it recommends calling more than one tool at a time.
We'll use a simple travel agent assistant program where we have a couple of tools for weather and currency conversion.
We start by importing libraries and setting up our configuration to use Llama-3.3-70B and the `cerebras` client class.
```python
import json
import os
from typing import Literal

from typing_extensions import Annotated

import autogen

config_list = [
    {
        "model": "llama-3.3-70b",
        "api_key": os.environ.get("CEREBRAS_API_KEY"),
        "api_type": "cerebras",
    }
]

# Create the agent for tool calling
chatbot = autogen.AssistantAgent(
    name="chatbot",
    system_message="""
For currency exchange and weather forecasting tasks,
only use the functions you have been provided with.
When you summarize, make sure you've considered ALL previous instructions.
Output 'HAVE FUN!' when an answer has been provided.
""",
    llm_config={"config_list": config_list},
)

# Note that we have changed the termination string to be "HAVE FUN!"
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    is_termination_msg=lambda x: x.get("content", "") and "HAVE FUN!" in x.get("content", ""),
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
)

# Create the two functions, annotating them so that those descriptions can be passed through to the LLM.
# We associate them with the agents using `register_for_execution` for the user_proxy so it can execute
# the function, and `register_for_llm` for the chatbot (powered by the LLM) so it can pass the function
# definitions to the LLM.

# Currency Exchange function

CurrencySymbol = Literal["USD", "EUR"]


# Define our function that we expect to call
def exchange_rate(base_currency: CurrencySymbol, quote_currency: CurrencySymbol) -> float:
    if base_currency == quote_currency:
        return 1.0
    elif base_currency == "USD" and quote_currency == "EUR":
        return 1 / 1.1
    elif base_currency == "EUR" and quote_currency == "USD":
        return 1.1
    else:
        raise ValueError(f"Unknown currencies {base_currency}, {quote_currency}")


# Register the function with the agent
@user_proxy.register_for_execution()
@chatbot.register_for_llm(description="Currency exchange calculator.")
def currency_calculator(
    base_amount: Annotated[float, "Amount of currency in base_currency"],
    base_currency: Annotated[CurrencySymbol, "Base currency"] = "USD",
    quote_currency: Annotated[CurrencySymbol, "Quote currency"] = "EUR",
) -> str:
    quote_amount = exchange_rate(base_currency, quote_currency) * base_amount
    return f"{format(quote_amount, '.2f')} {quote_currency}"


# Weather function

# Example function to make available to model
def get_current_weather(location, unit="fahrenheit"):
    """Get the weather for some location"""
    if "chicago" in location.lower():
        return json.dumps({"location": "Chicago", "temperature": "13", "unit": unit})
    elif "san francisco" in location.lower():
        return json.dumps({"location": "San Francisco", "temperature": "55", "unit": unit})
    elif "new york" in location.lower():
        return json.dumps({"location": "New York", "temperature": "11", "unit": unit})
    else:
        return json.dumps({"location": location, "temperature": "unknown"})


# Register the function with the agent
@user_proxy.register_for_execution()
@chatbot.register_for_llm(description="Weather forecast for US cities.")
def weather_forecast(
    location: Annotated[str, "City name"],
) -> str:
    weather_details = get_current_weather(location=location)
    weather = json.loads(weather_details)
    return f"{weather['location']} will be {weather['temperature']} degrees {weather['unit']}"


import time

start_time = time.time()

# start the conversation
res = user_proxy.initiate_chat(
    chatbot,
    message="What's the weather in New York and can you tell me how much is 123.45 EUR in USD so I can spend it on my holiday? Throw a few holiday tips in as well.",
    summary_method="reflection_with_llm",
)

end_time = time.time()

print(f"LLM SUMMARY: {res.summary['content']}\n\nDuration: {(end_time - start_time) * 1000}ms")
```
We can see that the Cerebras Wafer-Scale Engine-3 (WSE-3) completed the query in 74ms — faster than the blink of an eye!