
Firecrawl Tool Example#


This notebook demonstrates how to use the FirecrawlTool to scrape, crawl, and map websites.

Setup#

First, make sure you have the required dependencies:

pip install firecrawl-py

You’ll also need a Firecrawl API key from firecrawl.dev.

import os

from autogen.tools.experimental.firecrawl import FirecrawlTool

# Set your Firecrawl API key
# Option 1: Set as environment variable
# os.environ["FIRECRAWL_API_KEY"] = "fc-your-api-key-here"

# Option 2: Pass directly to the tool
firecrawl_api_key = "fc-your-api-key-here"  # Replace with your actual API key

# Initialize the tool
firecrawl_tool = FirecrawlTool(firecrawl_api_key=firecrawl_api_key)

Example 1: Scraping a Single URL#

By default, calling the FirecrawlTool directly scrapes a single URL and returns a list of result dictionaries.

# Scrape a single URL
results = firecrawl_tool(url="https://firecrawl.dev")

print(f"Number of results: {len(results)}")
if results:
    result = results[0]
    print(f"Title: {result['title']}")
    print(f"URL: {result['url']}")
    print(f"Content preview: {result['content'][:200]}...")
    print(f"Metadata: {result['metadata']}")

Example 2: Scraping with Options#

You can customize scraping with options such as output formats, tag filters, page-load wait time, and timeout.

# Scrape with custom options
results = firecrawl_tool(
    url="https://firecrawl.dev",
    formats=["markdown", "html"],  # Get both markdown and HTML
    include_tags=["h1", "h2", "p"],  # Only include specific HTML tags
    exclude_tags=["script", "style"],  # Exclude these tags
    wait_for=2000,  # Wait 2 seconds for page to load
    timeout=10000,  # 10 second timeout
)

if results:
    result = results[0]
    print(f"Title: {result['title']}")
    print(f"Content preview: {result['content'][:300]}...")

Example 3: Crawling a Website#

Use the crawl method to recursively crawl multiple pages of a website.

# Crawl a website
crawl_results = firecrawl_tool.crawl(
    url="https://firecrawl.dev",
    limit=3,  # Crawl maximum 3 pages
    formats=["markdown"],
    max_depth=2,  # Maximum depth of 2 levels
    include_paths=["/docs/*"],  # Only crawl documentation pages
)

print(f"Number of crawled pages: {len(crawl_results)}")
for i, page in enumerate(crawl_results):
    print(f"\nPage {i + 1}:")
    print(f"  Title: {page['title']}")
    print(f"  URL: {page['url']}")
    print(f"  Content preview: {page['content'][:150]}...")

Example 4: Mapping a Website#

Use the map method to discover URLs from a website without scraping content.

# Map a website to get URLs
map_results = firecrawl_tool.map(
    url="https://firecrawl.dev",
    search="docs",  # Search for URLs containing "docs"
    include_subdomains=False,
    limit=10,  # Get maximum 10 URLs
)

print(f"Number of URLs found: {len(map_results)}")
for i, url_info in enumerate(map_results):
    print(f"  {i + 1}. {url_info['url']}")

Example 5: Using with AG2 Agents#

The FirecrawlTool can be easily integrated with AG2 agents.

from autogen import AssistantAgent, UserProxyAgent

# Create an assistant agent with the Firecrawl tool
assistant = AssistantAgent(
    name="web_scraper",
    system_message="You are a helpful web scraping assistant. Use the Firecrawl tool to scrape content from websites when asked.",
    llm_config={
        "model": "gpt-4o-mini",
        "api_key": "your-openai-api-key",  # Replace with your OpenAI API key
    },
)

# Register the Firecrawl tool with the assistant
firecrawl_tool.register_for_llm(assistant)

# Create a user proxy agent to execute the tool calls
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Register the Firecrawl tool for execution so the user proxy can run the calls
firecrawl_tool.register_for_execution(user)

# Example conversation
response = user.initiate_chat(
    assistant,
    message="Please scrape the content from https://firecrawl.dev and summarize what Firecrawl is.",
    max_turns=3,
)

print("Chat completed!")

Use Cases#

The FirecrawlTool is useful for:

  1. Content Extraction: Scrape clean, formatted content from web pages
  2. Website Discovery: Map websites to understand their structure
  3. Documentation Crawling: Crawl entire documentation sites (a combined sketch follows this list)
  4. Data Collection: Gather data from multiple pages automatically
  5. Research: Extract information from various web sources
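
As a hedged illustration of use cases 2 and 3, the sketch below first maps a documentation site to discover URLs and then scrapes each one. It assumes the return shapes shown in Examples 1 and 4 above (lists of dictionaries with "url", "title", and "content" keys).

# Sketch: discover documentation URLs, then scrape each page.
# Assumes the map and scrape return shapes shown in Examples 1 and 4.
doc_urls = firecrawl_tool.map(
    url="https://firecrawl.dev",
    search="docs",  # only URLs that look like documentation
    limit=5,
)

doc_pages = []
for url_info in doc_urls:
    pages = firecrawl_tool(url=url_info["url"], formats=["markdown"])
    if pages:  # scrape returns a list; keep the first result per URL
        doc_pages.append(pages[0])

for page in doc_pages:
    print(f"{page['title']}: {len(page['content'])} characters")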

Features#

  • Multiple Formats: Get content in markdown, HTML, or other formats
  • Flexible Filtering: Include/exclude specific HTML tags
  • Path Control: Control which paths to crawl or exclude
  • Rate Limiting: Built-in rate limiting and timeout controls
  • JavaScript Support: Handles JavaScript-rendered pages
  • Clean Output: Returns clean, structured content