Firecrawl Tool#
The FirecrawlTool provides powerful web scraping and crawling capabilities using the Firecrawl API. It can scrape single pages, crawl entire websites, and map website structures to discover URLs.
Features#
- Single Page Scraping: Extract clean content from individual web pages
- Website Crawling: Recursively crawl multiple pages from a website
- URL Mapping: Discover URLs from a website without scraping content
- Multiple Formats: Get content in markdown, HTML, or other formats
- Advanced Filtering: Include/exclude specific HTML tags and paths
- JavaScript Support: Handles JavaScript-rendered pages
- Rate Limiting: Built-in timeout and rate limiting controls
Installation#
Install the required dependency:
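The command below is a sketch assuming AG2's usual optional-dependency naming; the underlying Firecrawl Python SDK is published separately as firecrawl-py:

pip install "ag2[firecrawl]"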
Setup#
- Get an API key from firecrawl.dev
- Set it as an environment variable or pass it directly to the tool:
import os
from autogen.tools.experimental.firecrawl import FirecrawlTool

# Option 1: Environment variable
os.environ["FIRECRAWL_API_KEY"] = "fc-your-api-key-here"
firecrawl_tool = FirecrawlTool()

# Option 2: Direct parameter
firecrawl_tool = FirecrawlTool(firecrawl_api_key="fc-your-api-key-here")
Usage#
Basic Scraping#
Calling the tool directly scrapes a single URL (the default behavior):
# Scrape a single page
results = firecrawl_tool(url="https://example.com")
print(f"Title: {results[0]['title']}")
print(f"Content: {results[0]['content']}")
print(f"Metadata: {results[0]['metadata']}")
Advanced Scraping Options#
Customize the scraping process:
results = firecrawl_tool(
    url="https://example.com",
    formats=["markdown", "html"],            # Output formats
    include_tags=["h1", "h2", "p"],          # Only these HTML tags
    exclude_tags=["script", "style"],        # Exclude these tags
    headers={"User-Agent": "Custom Agent"},  # Custom headers
    wait_for=2000,                           # Wait 2 seconds for page load
    timeout=10000                            # 10 second timeout
)
Website Crawling#
Crawl multiple pages from a website:
crawl_results = firecrawl_tool.crawl(
    url="https://example.com",
    limit=5,                     # Maximum pages to crawl
    formats=["markdown"],
    max_depth=2,                 # Maximum crawl depth
    include_paths=["/docs/*"],   # Only crawl documentation
    exclude_paths=["/admin/*"],  # Exclude admin pages
    allow_backward_crawling=False,
    allow_external_content_links=False
)

for page in crawl_results:
    print(f"Title: {page['title']}")
    print(f"URL: {page['url']}")
    print(f"Content: {page['content'][:200]}...")
URL Mapping#
Discover URLs from a website:
map_results = firecrawl_tool.map(
    url="https://example.com",
    search="docs",             # Filter URLs containing "docs"
    include_subdomains=False,
    ignore_sitemap=False,
    limit=100                  # Maximum URLs to return
)

for url_info in map_results:
    print(f"Found URL: {url_info['url']}")
AG2 Agent Integration#
Use the FirecrawlTool with AG2 agents:
from autogen import AssistantAgent
# Create assistant with Firecrawl capabilities
assistant = AssistantAgent(
    name="web_scraper",
    system_message="You are a web scraping assistant. Use Firecrawl to extract content from websites.",
    llm_config={"model": "gpt-4o-mini"},
)

# Register the tool
firecrawl_tool.register_for_llm(assistant)

# The assistant can now use Firecrawl in conversations
response = assistant.run(
    message="Please scrape the content from https://example.com and summarize it",
    tools=assistant.tools
)
Tool Methods#
The FirecrawlTool provides three main methods:
scrape() (Default)#
Scrapes a single URL and returns structured content.
Parameters:
- url (str): The URL to scrape
- formats (list[str], optional): Output formats (default: ["markdown"])
- include_tags (list[str], optional): HTML tags to include
- exclude_tags (list[str], optional): HTML tags to exclude
- headers (dict[str, str], optional): HTTP headers
- wait_for (int, optional): Page load wait time in milliseconds
- timeout (int, optional): Request timeout in milliseconds
crawl()#
Recursively crawls a website starting from a URL.
Parameters:
- url (str): Starting URL to crawl
- limit (int): Maximum pages to crawl (default: 5)
- formats (list[str], optional): Output formats
- include_paths (list[str], optional): URL patterns to include
- exclude_paths (list[str], optional): URL patterns to exclude
- max_depth (int, optional): Maximum crawl depth
- allow_backward_crawling (bool): Allow crawling backward links
- allow_external_content_links (bool): Allow external links
map()#
Discovers URLs from a website without scraping content.
Parameters:
- url (str): Website URL to map
- search (str, optional): Filter URLs by search term
- ignore_sitemap (bool): Whether to ignore the sitemap
- include_subdomains (bool): Include subdomain URLs
- limit (int): Maximum URLs to return (default: 5000)
Response Format#
All methods return a list of dictionaries with the following structure:
{
    "title": "Page Title",
    "url": "https://example.com/page",
    "content": "Page content in requested format",
    "metadata": {
        "title": "Page Title",
        "sourceURL": "https://example.com/page",
        # Additional metadata from Firecrawl
    }
}
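Because every method returns this same list-of-dicts shape, results from scrape, crawl, and map calls can be post-processed uniformly. A minimal sketch (the summarize_results helper is hypothetical, not part of the tool):

def summarize_results(results: list[dict]) -> None:
    # Works on the output of any FirecrawlTool method
    for page in results:
        title = page.get("title", "<untitled>")
        content = page.get("content", "")
        print(f"{title} ({page.get('url', '')}): {len(content)} chars")

summarize_results(firecrawl_tool(url="https://example.com"))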
Use Cases#
- Documentation Extraction: Scrape and process documentation sites (see the sketch after this list)
- Content Research: Gather information from multiple web sources
- Website Analysis: Map and analyze website structures
- Data Collection: Automated content extraction for analysis
- Knowledge Base Creation: Build knowledge bases from web content
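For example, the documentation-extraction case can combine URL mapping with scraping. A sketch, assuming only the methods documented on this page and keeping limits small:

# Discover documentation URLs first, then scrape each one
doc_urls = firecrawl_tool.map(url="https://example.com", search="docs", limit=10)

knowledge_base = []
for url_info in doc_urls:
    pages = firecrawl_tool(url=url_info["url"], formats=["markdown"])
    knowledge_base.extend(pages)  # failed scrapes return empty lists, so extend is safe

print(f"Collected {len(knowledge_base)} pages")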
Error Handling#
The tool logs errors and returns empty lists on failure instead of raising exceptions:
import logging

# Enable logging to see detailed error information
logging.basicConfig(level=logging.INFO)

# Failed operations return empty lists
results = firecrawl_tool(url="https://invalid-url.com")
if not results:
    print("Scraping failed - check logs for details")
Example Notebook#
For a complete working example, see the Firecrawl Tool notebook.