Web Scraping using Oxylabs Web Scraper API#

This notebook shows how to use Oxylabs Web Scraper API with AutoGen agents to scrape data and generate automated reports.

First, you need to have Python installed on your system and access to an LLM provider. For this tutorial, we’ll use OpenAI’s API and the gpt-5-nano model. Start by creating a virtual environment:

! python -m venv venv
! source venv/bin/activate

Install the required dependencies:

! pip install aiohttp autogen-agentchat "autogen-ext[openai]"

As an example, let’s use the Amazon source to search for items on Amazon based on a provided query using your credentials from the Oxylabs dashboard.

Import aiohttp components and define an AmazonScraper class with a constructor, Web Scraper API endpoint URL, and credentials:

from aiohttp import BasicAuth, ClientSession

class AmazonScraper:
    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

NOTE: Don’t forget to replace placeholders with your credentials.
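
Hardcoding credentials works for a quick test, but a safer pattern is to load them from environment variables. A minimal sketch, assuming you export the (hypothetical) variables OXYLABS_USERNAME and OXYLABS_PASSWORD:

```python
import os

from aiohttp import BasicAuth


def load_oxylabs_auth() -> BasicAuth:
    """Builds BasicAuth from environment variables instead of hardcoded strings."""
    # OXYLABS_USERNAME / OXYLABS_PASSWORD are example names;
    # use whatever naming your deployment prefers.
    return BasicAuth(
        os.environ["OXYLABS_USERNAME"],
        os.environ["OXYLABS_PASSWORD"],
    )
```

In the constructor, self._auth = load_oxylabs_auth() would then replace the hardcoded call.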

Define an asynchronous get_amazon_search_data method with a query parameter. Use type hints (str → list[dict]) for better readability.

async def get_amazon_search_data(self, query: str) -> list[dict]:
    """Gets search data for provided query from Amazon."""
    ...

Next, define your API payload together with the API call:

print(f"Fetching data for query: {query}")
payload = {
    "source": "amazon_search",
    "domain": "com",
    "query": query,
    "start_page": 1,
    "pages": 1,
    "parse": True,
}
session = ClientSession()

try:
    response = await session.post(
        self._base_url,
        auth=self._auth,
        json=payload,
    )
    response.raise_for_status()
    data = await response.json()
finally:
    await session.close()

The self._auth parameter provides Web Scraper API authentication, and the finally clause ensures the session is closed regardless of outcome. You can also lightly parse the response to make it easier for AI processing. Insert the following part to return a list of dictionaries representing the Amazon data:

results = data["results"][0]["content"]["results"]
return [*results.values()]
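
Note that results here is a dict grouping listings by type, so [*results.values()] returns those groups as-is. If you would rather hand the agents one flat list of product dicts, a small helper like this would work (a sketch, assuming the nested layout shown above, where each group is a list of product dicts):

```python
def flatten_results(data: dict) -> list[dict]:
    """Flattens grouped Amazon results (e.g. organic, paid) into one list.

    Assumes data["results"][0]["content"]["results"] maps listing
    types to lists of product dicts.
    """
    grouped = data["results"][0]["content"]["results"]
    flat: list[dict] = []
    for items in grouped.values():
        if isinstance(items, list):  # skip any non-list metadata fields
            flat.extend(items)
    return flat
```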

You now have a class for scraping Amazon search results, which you can later expand to other sources or multiple pages. The full class so far should look like this:

from aiohttp import BasicAuth, ClientSession

class AmazonScraper:
    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

    async def get_amazon_search_data(self, query: str) -> list[dict]:
        """Gets search data for provided query from Amazon."""
        print(f"Fetching data for query: {query}")
        payload = {
            "source": "amazon_search",
            "domain": "com",
            "query": query,
            "start_page": 1,
            "pages": 1,
            "parse": True,
        }
        session = ClientSession()

        try:
            response = await session.post(
                self._base_url,
                auth=self._auth,
                json=payload,
            )
            response.raise_for_status()
            data = await response.json()
        finally:
            await session.close()

        results = data["results"][0]["content"]["results"]
        return [*results.values()]
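
Real-world scraping runs occasionally hit transient network errors. As an optional hardening step, a small retry helper (a sketch, not part of the Oxylabs API) could wrap the POST call with exponential backoff:

```python
import asyncio

from aiohttp import ClientError


async def post_with_retries(session, url, *, auth, json, attempts=3):
    """Retries failed aiohttp requests with exponential backoff (sketch)."""
    for attempt in range(attempts):
        try:
            response = await session.post(url, auth=auth, json=json)
            response.raise_for_status()
            return await response.json()
        except ClientError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            await asyncio.sleep(2**attempt)  # wait 1s, 2s, ... between attempts
```

Inside get_amazon_search_data, the direct session.post call could then be replaced with data = await post_with_retries(session, self._base_url, auth=self._auth, json=payload).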

Test it in main.py by initializing and calling the get_amazon_search_data method:

import asyncio
from pprint import pprint

from scraper import AmazonScraper

async def main():
    scraper = AmazonScraper()
    pprint(await scraper.get_amazon_search_data("laptop"))

if __name__ == "__main__":
    asyncio.run(main())

Running python main.py shows a list of Amazon laptop results. Now, create the AmazonDataSummarizer class to implement AutoGen agents:

1. Build AI agents to summarize the scraped data by defining the AmazonDataSummarizer class and constructor variables for later use.
2. Use dependency injection to link code parts in a structured way.
3. Import and define the OpenAI client for AutoGen to communicate with OpenAI models. Use your OpenAI API key here:

from autogen_ext.models.openai import OpenAIChatCompletionClient

class AmazonDataSummarizer:
    def __init__(self, scraper: AmazonScraper) -> None:
        self._client = OpenAIChatCompletionClient(
            model="gpt-5-nano",
            api_key="YOUR_API_KEY",
        )
        self._scraper = scraper

Define AI agent names using an Enum class for easier tracking:

from enum import Enum

class AgentName(str, Enum):
    """Enum for AI agent names."""

    PRICE_SUMMARIZER = "Price_Summarizer"
    DEAL_FINDER = "Deal_Finder"
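
Because AgentName subclasses str, its members compare equal to their plain-string values, which is what makes matching message sources against the enum work later. A quick illustration:

```python
from enum import Enum


class AgentName(str, Enum):
    """Enum for AI agent names."""

    PRICE_SUMMARIZER = "Price_Summarizer"
    DEAL_FINDER = "Deal_Finder"


# str-based members compare equal to their string values, so a plain
# string such as an agent's message source matches the member directly.
print(AgentName.PRICE_SUMMARIZER == "Price_Summarizer")  # True
print("Deal_Finder" in {AgentName.PRICE_SUMMARIZER, AgentName.DEAL_FINDER})  # True
```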

This makes agent tracking easier. Import AssistantAgent and define the _initialize_agents method with agent configuration:

from autogen_agentchat.agents import AssistantAgent

def _initialize_agents(self) -> list[AssistantAgent]:
    """Initializes the agents."""
    price_summarizer_agent = AssistantAgent(
        name=AgentName.PRICE_SUMMARIZER,
        model_client=self._client,
        reflect_on_tool_use=True,
        tools=[self._scraper.get_amazon_search_data],
        system_message="You are an expert in analyzing prices from online shopping data. Summarize the key price statistics, including average, min, max, and any interesting price patterns. Share your summary with the group",
    )

    deal_finder_agent = AssistantAgent(
        name=AgentName.DEAL_FINDER,
        model_client=self._client,
        tools=[self._scraper.get_amazon_search_data],
        reflect_on_tool_use=True,
        system_message="You are a skilled deal finder in online shopping data. Find the best possible deals based on price, availability, and general value. Share your findings with the group. Respond with 'SUMMARY_COMPLETE' when you've shared your findings.",
    )

    return [price_summarizer_agent, deal_finder_agent]

AssistantAgent needs:

* Agent name
* OpenAI client
* Scraping tools
* System message (clear instructions, like ChatGPT prompts)

The system message defines what each agent should do. SUMMARY_COMPLETE signals when to stop running agents, while the reflect_on_tool_use=True flag makes agents use the tool output as context for their responses.

This integrates data sources with AutoGen agents. Define an async function in tools, and agents can use it. Prompts handle everything else.
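
As a sketch of that pattern, any self-contained async function with type hints and a docstring has the right shape for a tool; AutoGen can derive the tool's schema from the signature and its description from the docstring. The function below is purely hypothetical (name, fields, and placeholder values are illustrative stand-ins for a real data source):

```python
import asyncio


async def get_price_history(product_id: str) -> list[dict]:
    """Gets price history for the given product ID (hypothetical tool)."""
    # A real implementation would call an API here; this stub returns
    # placeholder data so the sketch stays self-contained.
    return [{"product_id": product_id, "price": 99.99}]


# The function can be exercised directly, outside any agent:
print(asyncio.run(get_price_history("B0EXAMPLE")))
```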

Next, make the agents work together using AutoGen teams, which enable collaboration on shared tasks. RoundRobinGroupChat runs the agents sequentially, so they take turns analyzing Amazon data and sharing findings.

Finally, the SUMMARY_COMPLETE termination condition tells AutoGen when agents finish and the team should stop. Here’s how it should look:

from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.messages import BaseChatMessage
from autogen_agentchat.teams import RoundRobinGroupChat

async def generate_summary(self, query: str) -> None:
    """Generates a summary using AI agents based on the given query"""
    agents = self._initialize_agents()

    text_termination = TextMentionTermination("SUMMARY_COMPLETE")
    team = RoundRobinGroupChat(
        participants=agents,
        termination_condition=text_termination,
    )

    task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
    messages = []

    async for message in team.run_stream(task=task):
        if isinstance(message, BaseChatMessage) and message.source in {
            AgentName.PRICE_SUMMARIZER,
            AgentName.DEAL_FINDER,
        }:
            messages.append(message.to_text())

Set up the team with agents and a termination condition. Pass the task to run_stream and collect agent messages. This produces a complete price and deal summary in Markdown format.

Save results to a Markdown file using this method:

def _write_to_md(self, messages: list[str]) -> None:
    """Writes the messages to a Markdown file."""
    with open("summary.md", "w") as f:
        for message in messages:
            f.write(f"{message}\n\n")

Call this at the end of generate_summary to save results. The complete method:

async def generate_summary(self, query: str) -> None:
    """Generates a summary using AI agents based on the given query"""
    agents = self._initialize_agents()

    text_termination = TextMentionTermination("SUMMARY_COMPLETE")
    team = RoundRobinGroupChat(
        participants=agents,
        termination_condition=text_termination,
    )

    task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
    messages = []

    async for message in team.run_stream(task=task):
        if isinstance(message, BaseChatMessage) and message.source in {
            AgentName.PRICE_SUMMARIZER,
            AgentName.DEAL_FINDER,
        }:
            messages.append(message.to_text())

    self._write_to_md(messages)

Now you have the complete tool for generating summaries using AutoGen agents with Web Scraper API data. Combine everything in the main file:

import asyncio

from scraper import AmazonScraper
from summary import AmazonDataSummarizer

async def main():
    scraper = AmazonScraper()
    summarizer = AmazonDataSummarizer(scraper=scraper)
    await summarizer.generate_summary(query="laptop")

if __name__ == "__main__":
    asyncio.run(main())

Running python main.py creates a summary.md file in your working directory; open it in a Markdown-capable editor to view the results.