Supercharging Web Crawling with Crawl4AI#


Installation#

To get started with the crawl4ai integration in AG2, follow these steps:

  1. Install AG2 with the crawl4ai extra:

    pip install -U ag2[openai,crawl4ai]
    

    Note: If you have been using autogen or pyautogen, all you need to do is upgrade it using:

    pip install -U autogen[openai,crawl4ai]
    

    or

    pip install -U pyautogen[openai,crawl4ai]
    

    as pyautogen, autogen, and ag2 are aliases for the same PyPI package.

  2. Set up Playwright:

    # Install the Playwright browsers (all operating systems)
    playwright install
    # Additional command, required on Linux only
    playwright install-deps
    
  3. For running the code in Jupyter, use nest_asyncio to allow nested event loops:

    pip install nest_asyncio

You’re all set! Now you can start using browsing features in AG2.
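Before wiring the tool into AG2, you can optionally verify the setup with a minimal standalone sketch that crawls a page directly with crawl4ai's AsyncWebCrawler. This is just a smoke test under the assumption that `playwright install` has completed; run it as a plain Python script (or after `nest_asyncio.apply()` in Jupyter).

# Optional smoke test: crawl a page directly with crawl4ai (outside AG2).
import asyncio

from crawl4ai import AsyncWebCrawler


async def main() -> None:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # Print the first few hundred characters of the extracted markdown
        print(str(result.markdown)[:300])


asyncio.run(main())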

Imports#

import os

import nest_asyncio
from pydantic import BaseModel

from autogen import AssistantAgent, UserProxyAgent
from autogen.tools.experimental import Crawl4AITool

nest_asyncio.apply()

LLM-Free Crawl4AI#

config_list = [{"api_type": "openai", "model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]

llm_config = {
    "config_list": config_list,
}

user_proxy = UserProxyAgent(name="user_proxy", human_input_mode="NEVER")
assistant = AssistantAgent(name="assistant", llm_config=llm_config)
crawlai_tool = Crawl4AITool()

crawlai_tool.register_for_execution(user_proxy)
crawlai_tool.register_for_llm(assistant)
result = user_proxy.initiate_chat(
    recipient=assistant,
    message="Get info from https://docs.ag2.ai/docs/Home",
    max_turns=2,
)

Crawl4AI with LLM#

Note: Crawl4AI is built on top of LiteLLM and supports the same models as LiteLLM.

We have had a great experience with OpenAI, Anthropic, Gemini, and Ollama. However, as of this writing, DeepSeek is encountering some issues.
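Because the same llm_config format applies regardless of provider, switching the extraction model is only a matter of changing the config list. Below is a hypothetical sketch for Anthropic; the model name and the ANTHROPIC_API_KEY environment variable are assumptions, so adjust them to your account.

# Hypothetical example: drive the LLM-based extraction with an Anthropic model.
anthropic_config_list = [
    {
        "api_type": "anthropic",
        "model": "claude-3-5-sonnet-20240620",
        "api_key": os.environ["ANTHROPIC_API_KEY"],
    }
]

crawlai_tool = Crawl4AITool(llm_config={"config_list": anthropic_config_list})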

config_list = [{"api_type": "openai", "model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]

llm_config = {
    "config_list": config_list,
}

user_proxy = UserProxyAgent(name="user_proxy", human_input_mode="NEVER")
assistant = AssistantAgent(name="assistant", llm_config=llm_config)
# Pass llm_config to Crawl4AITool
crawlai_tool = Crawl4AITool(llm_config=llm_config)

crawlai_tool.register_for_execution(user_proxy)
crawlai_tool.register_for_llm(assistant)
result = user_proxy.initiate_chat(
    recipient=assistant,
    message="Get info from https://docs.ag2.ai/docs/Home",
    max_turns=2,
)

Crawl4AI with LLM & Schema for Structured Data#

config_list = [{"api_type": "openai", "model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}]

llm_config = {
    "config_list": config_list,
}

user_proxy = UserProxyAgent(name="user_proxy", human_input_mode="NEVER")
assistant = AssistantAgent(name="assistant", llm_config=llm_config)


class Blog(BaseModel):
    title: str
    url: str

# Pass llm_config and extraction_model to Crawl4AITool
crawlai_tool = Crawl4AITool(llm_config=llm_config, extraction_model=Blog)

crawlai_tool.register_for_execution(user_proxy)
crawlai_tool.register_for_llm(assistant)
message = "Extract all blog posts from https://docs.ag2.ai/blog"
result = user_proxy.initiate_chat(
    recipient=assistant,
    message=message,
    max_turns=2,
)
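
The value returned by initiate_chat is a ChatResult, so you can inspect what the tool extracted after the chat finishes. A minimal sketch (the exact content depends on the crawled page):

# Inspect the conversation result: the chat summary and the raw messages,
# including the tool's extraction output.
print(result.summary)
for message in result.chat_history:
    print(message["role"], ":", str(message.get("content", ""))[:200])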