Firecrawl Tool Example
This notebook demonstrates how to use the FirecrawlTool to scrape, crawl, and map websites.
Setup
First, make sure you have the required dependencies:
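The tool relies on the Firecrawl Python SDK. A typical install is sketched below; this assumes your AG2 version ships a firecrawl optional extra — if not, install the firecrawl-py package alongside ag2:

pip install ag2[firecrawl]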
You’ll also need a Firecrawl API key from firecrawl.dev.
import os

from autogen.tools.experimental.firecrawl import FirecrawlTool
# Set your Firecrawl API key
# Option 1: Set as environment variable
# os.environ["FIRECRAWL_API_KEY"] = "fc-your-api-key-here"
# Option 2: Pass directly to the tool
firecrawl_api_key = "fc-your-api-key-here" # Replace with your actual API key
# Initialize the tool
firecrawl_tool = FirecrawlTool(firecrawl_api_key=firecrawl_api_key)
Example 1: Scraping a Single URL
By default, calling the FirecrawlTool scrapes a single URL and returns a list of result dictionaries.
# Scrape a single URL
results = firecrawl_tool(url="https://firecrawl.dev")
print(f"Number of results: {len(results)}")
if results:
    result = results[0]
    print(f"Title: {result['title']}")
    print(f"URL: {result['url']}")
    print(f"Content preview: {result['content'][:200]}...")
    print(f"Metadata: {result['metadata']}")
Example 2: Scraping with Options
You can customize scraping with options such as output formats, tag filters, page-load waits, and timeouts.
# Scrape with custom options
results = firecrawl_tool(
    url="https://firecrawl.dev",
    formats=["markdown", "html"],  # Get both markdown and HTML
    include_tags=["h1", "h2", "p"],  # Only include specific HTML tags
    exclude_tags=["script", "style"],  # Exclude these tags
    wait_for=2000,  # Wait 2 seconds for the page to load
    timeout=10000,  # 10-second timeout
)
if results:
    result = results[0]
    print(f"Title: {result['title']}")
    print(f"Content preview: {result['content'][:300]}...")
Example 3: Crawling a Website
Use the crawl method to recursively crawl multiple pages of a website.
# Crawl a website
crawl_results = firecrawl_tool.crawl(
    url="https://firecrawl.dev",
    limit=3,  # Crawl at most 3 pages
    formats=["markdown"],
    max_depth=2,  # Follow links at most 2 levels deep
    include_paths=["/docs/*"],  # Only crawl documentation pages
)
print(f"Number of crawled pages: {len(crawl_results)}")
for i, page in enumerate(crawl_results):
    print(f"\nPage {i + 1}:")
    print(f"  Title: {page['title']}")
    print(f"  URL: {page['url']}")
    print(f"  Content preview: {page['content'][:150]}...")
Example 4: Mapping a Website
Use the map method to discover URLs from a website without scraping content.
# Map a website to get URLs
map_results = firecrawl_tool.map(
    url="https://firecrawl.dev",
    search="docs",  # Only return URLs containing "docs"
    include_subdomains=False,
    limit=10,  # Return at most 10 URLs
)
print(f"Number of URLs found: {len(map_results)}")
for i, url_info in enumerate(map_results):
    print(f"  {i + 1}. {url_info['url']}")
Example 5: Using with AG2 Agents
The FirecrawlTool can be easily integrated with AG2 agents.
from autogen import AssistantAgent, UserProxyAgent
# Create an assistant agent with the Firecrawl tool
assistant = AssistantAgent(
    name="web_scraper",
    system_message="You are a helpful web scraping assistant. Use the Firecrawl tool to scrape content from websites when asked.",
    llm_config={
        "model": "gpt-4o-mini",
        "api_key": "your-openai-api-key",  # Replace with your OpenAI API key
    },
)
# Register the Firecrawl tool with the assistant
firecrawl_tool.register_for_llm(assistant)
# Create a user proxy agent to execute the tool calls
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Register the tool for execution so the user proxy can actually run it
firecrawl_tool.register_for_execution(user)
# Example conversation
response = user.initiate_chat(
    assistant,
    message="Please scrape the content from https://firecrawl.dev and summarize what Firecrawl is.",
    max_turns=3,
)
print("Chat completed!")
Use Cases
The FirecrawlTool is useful for:
- Content Extraction: Scrape clean, formatted content from web pages
- Website Discovery: Map websites to understand their structure (see the map-then-scrape sketch after this list)
- Documentation Crawling: Crawl entire documentation sites
- Data Collection: Gather data from multiple pages automatically
- Research: Extract information from various web sources
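As a concrete illustration of combining website discovery with content extraction, the sketch below maps a site for documentation URLs and then scrapes each hit individually. It reuses the firecrawl_tool instance from the Setup section and the result shapes shown in Examples 1 and 4; the firecrawl.dev URL is just a stand-in for your target site.

# Map-then-scrape pipeline: discover documentation URLs, then extract their content
doc_urls = firecrawl_tool.map(url="https://firecrawl.dev", search="docs", limit=5)

pages = []
for url_info in doc_urls:
    results = firecrawl_tool(url=url_info["url"])  # Scrape each discovered URL
    if results:
        pages.append(results[0])

for page in pages:
    print(f"{page['title']}: {len(page['content'])} characters")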
Features
- Multiple Formats: Get content in markdown, HTML, or other formats
- Flexible Filtering: Include/exclude specific HTML tags
- Path Control: Control which paths to crawl or exclude
- Rate Limiting: Built-in rate limiting and timeout controls
- JavaScript Support: Handles JavaScript-rendered pages
- Clean Output: Returns clean, structured content