# Guardrails
Guardrails are safety mechanisms that monitor agent behavior and enforce operational boundaries in your multi-agent system. They provide automatic oversight and intervention when agents encounter potentially problematic situations or violate predefined rules.
## Introduction to Guardrails
Guardrails act as protective barriers that ensure your agents operate within safe, expected parameters. They check agent inputs and outputs, triggering specific actions when defined conditions are met.
### Why Guardrails Matter
In any automated system, you need safeguards to prevent unintended behavior. Guardrails provide this protection by:
- Detecting potentially harmful or inappropriate content
- Enforcing business rules and compliance requirements
- Redirecting conversations when agents go off-track
- Providing automatic escalation to human oversight
- Maintaining consistent quality standards across interactions
### The Traffic Light Analogy
Think of guardrails like traffic lights and safety systems on a busy road. Just as traffic lights prevent accidents by controlling the flow of vehicles, guardrails prevent issues by controlling the flow of conversation:
- Red light: Block harmful content from proceeding
- Yellow light: Flag concerning patterns for review
- Speed cameras: Monitor for violations of established rules
- Emergency services: Escalate serious incidents to human operators
Each guardrail monitors specific conditions and takes appropriate action when those conditions are detected.
## Types of Guardrails
AG2 provides two main types of guardrails:
- Regex (Regular Expression) Guardrails: Use pattern matching to detect specific text patterns
- LLM Guardrails: Use large language models to understand context and meaning
### Regex Guardrails

Regex guardrails use regular expressions to match specific patterns in text. They're fast, reliable, and perfect for detecting known patterns like:

- Phone numbers or social security numbers
- Specific keywords or phrases
- URL patterns or email addresses
- Numeric patterns or codes

```python
from autogen.agentchat.group.guardrails import RegexGuardrail
from autogen.agentchat.group import AgentTarget

# Create a regex guardrail to detect numbers
number_guardrail = RegexGuardrail(
    name="number_detector",
    condition=r".*\d.*",  # Matches any text containing digits
    target=AgentTarget(security_agent),
    activation_message="Number detected - routing to security review"
)
```
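Because the `condition` is an ordinary regular expression, you can sanity-check it with Python's built-in `re` module before wiring it into a guardrail (a standalone test, independent of AG2):

```python
import re

# The same pattern used by the guardrail above
pattern = re.compile(r".*\d.*")

print(bool(pattern.match("My order number is 12345")))  # True - contains a digit
print(bool(pattern.match("Hello, how are you?")))       # False - no digits
```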
### LLM Guardrails

LLM guardrails use large language models to understand the meaning and context of messages. They're ideal for detecting:

- Inappropriate language or sentiment
- Off-topic conversations
- Requests for restricted information
- Complex policy violations

```python
from autogen.agentchat.group.guardrails import LLMGuardrail
from autogen.agentchat.group import AgentTarget

# Create an LLM guardrail to detect requests for personal information
privacy_violation_guardrail = LLMGuardrail(
    name="privacy_violation_detector",
    condition="Is this message asking for or attempting to share personal information like passwords, SSNs, or private account details?",
    target=AgentTarget(security_agent),
    llm_config=llm_config,
    activation_message="Privacy violation detected - routing to security review"
)
```
## Guardrail Placement

Guardrails can be placed at two key points in the conversation flow:

### Input Guardrails

Input guardrails monitor messages before they reach an agent. They're registered using `register_input_guardrail()`:

```python
# Monitor messages coming INTO the support agent for privacy violations
support_agent.register_input_guardrail(privacy_violation_guardrail)
```

Input guardrails are useful for:

- Filtering inappropriate content before processing
- Blocking harmful or policy-violating requests
- Detecting edge cases that require special handling
### Output Guardrails

Output guardrails monitor messages after an agent generates them. They're registered using `register_output_guardrail()`:

```python
# Monitor messages coming OUT OF the general agent
general_agent.register_output_guardrail(number_guardrail)
```

Output guardrails are useful for:

- Quality control on agent responses
- Detecting when agents include sensitive information
- Preventing harmful or inappropriate outputs
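Both registration points can be combined on a single agent. A minimal sketch, reusing the guardrails defined earlier (and assuming a `support_agent` like the one created in the setup below):

```python
# Screen incoming messages before the agent processes them
support_agent.register_input_guardrail(privacy_violation_guardrail)

# Screen the agent's generated replies before they are delivered
support_agent.register_output_guardrail(number_guardrail)
```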
## Setting Up Guardrails

Let's create a practical example that demonstrates both types of guardrails handling edge cases in a customer service scenario.

### Basic Setup

First, let's create our agents and configure the basic structure:

```python
from autogen import ConversableAgent, LLMConfig
from autogen.agentchat import initiate_group_chat
from autogen.agentchat.group.patterns import AutoPattern
from autogen.agentchat.group.guardrails import LLMGuardrail, RegexGuardrail
from autogen.agentchat.group import AgentTarget

# Configure LLM
llm_config = LLMConfig(api_type="openai", model="gpt-4o-mini")

with llm_config:
    # Main support agent
    support_agent = ConversableAgent(
        name="support_agent",
        system_message="You provide general customer support. Keep responses helpful and professional."
    )

    # Compliance agent for handling sensitive content
    compliance_agent = ConversableAgent(
        name="compliance_agent",
        system_message="You handle messages that violate company policies or contain sensitive information. You ensure all responses comply with privacy regulations."
    )

    # Escalation agent for handling inappropriate requests
    escalation_agent = ConversableAgent(
        name="escalation_agent",
        system_message="You handle inappropriate or harmful requests by politely declining and offering appropriate alternatives."
    )

# User agent
user = ConversableAgent(name="user", human_input_mode="ALWAYS")
```
### Creating and Registering Guardrails

Now let's create guardrails to handle edge cases and policy violations:

```python
# Regex guardrail to detect sensitive information (SSN, credit card patterns);
# the (?i) inline flag makes the match case-insensitive so "SSN" is caught as well as "ssn"
sensitive_info_guardrail = RegexGuardrail(
    name="sensitive_info_detector",
    condition=r"(?i).*(ssn|social security|\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}).*",
    target=AgentTarget(compliance_agent),
    activation_message="Sensitive information detected - routing to compliance review"
)

# LLM guardrail to detect inappropriate requests
inappropriate_request_guardrail = LLMGuardrail(
    name="inappropriate_request_detector",
    condition="Does this message contain inappropriate, harmful, or unethical requests?",
    target=AgentTarget(escalation_agent),
    llm_config=llm_config,
    activation_message="Inappropriate request detected - routing to escalation"
)

# LLM guardrail to detect if agent responses accidentally include personal data
data_leak_guardrail = LLMGuardrail(
    name="data_leak_detector",
    condition="Does this response contain personal information, account numbers, or sensitive data?",
    target=AgentTarget(compliance_agent),
    llm_config=llm_config,
    activation_message="Potential data leak detected - routing to compliance"
)

# Register input guardrails (monitor incoming messages for edge cases)
support_agent.register_input_guardrail(sensitive_info_guardrail)
support_agent.register_input_guardrail(inappropriate_request_guardrail)

# Register output guardrails (monitor responses for accidental data exposure)
support_agent.register_output_guardrail(data_leak_guardrail)
```
### Running the System

```python
# Set up the conversation pattern
pattern = AutoPattern(
    initial_agent=support_agent,
    agents=[support_agent, compliance_agent, escalation_agent],
    user_agent=user,
    group_manager_args={"llm_config": llm_config}
)

# Start the conversation with an edge case
result, context, last_agent = initiate_group_chat(
    pattern=pattern,
    messages="I need help with my account. I want you to help me hack into my ex-partner's email.",
    max_rounds=10
)
```
### Example Output

If you run the complete example, you should see a conversation flow similar to this:

```
user (to chat_manager):

I need help with my account. I want you to help me hack into my ex-partner's email.

--------------------------------------------------------------------------------

Next speaker: support_agent

support_agent (to chat_manager):

Inappropriate request detected - routing to escalation
Justification: The request to hack into someone's email is unethical and illegal, as it involves unauthorized access to another person's private information.

--------------------------------------------------------------------------------

Next speaker: escalation_agent

>>>>>>>> USING AUTO REPLY...
escalation_agent (to chat_manager):

I'm sorry, but I can't assist with that. If you need help with your own account, I'd be happy to assist you with any issues you're experiencing. If you're concerned about security or privacy, I recommend reaching out to the appropriate service provider for guidance.

--------------------------------------------------------------------------------
```
## Guardrail Activation Flow

When a guardrail detects its condition:

1. **Detection**: The guardrail condition matches the message content
2. **Activation**: The guardrail triggers and displays its activation message
3. **Routing**: Control transfers to the specified target agent
4. **Processing**: The target agent handles the flagged content appropriately
```mermaid
flowchart LR
    Message[Incoming Message] --> Guard{Guardrail Check}
    Guard -->|Condition Met| Activate[Activation Message]
    Guard -->|No Match| Continue[Normal Processing]
    Activate --> Route[Route to Target Agent]
    Route --> Handle[Target Handles Message]
```
## Best Practices

### Choosing Guardrail Types

- Use Regex guardrails for:
  - Known patterns (phone numbers, emails, IDs)
  - Fast, deterministic matching
  - Simple keyword detection
- Use LLM guardrails for:
  - Context-dependent detection
  - Sentiment analysis
  - Complex policy enforcement
  - Nuanced content understanding
### Placement Strategy

- Input guardrails: Use for preprocessing, content filtering, and routing
- Output guardrails: Use for quality control, compliance checking, and post-processing

### Performance Considerations
- Regex guardrails are faster and more predictable
- LLM guardrails provide better accuracy but use more resources
- Consider using regex for initial filtering, then LLM for nuanced decisions, as sketched below
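One way to apply this layered approach is to register a cheap, deterministic regex check ahead of an LLM check on the same agent. This is a sketch under the assumption that input guardrails are evaluated in registration order; the `card_pattern_guardrail` and `fraud_intent_guardrail` names are hypothetical:

```python
# Fast first pass: obvious card-number patterns (hypothetical example guardrail)
card_pattern_guardrail = RegexGuardrail(
    name="card_pattern_detector",
    condition=r".*\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}.*",
    target=AgentTarget(compliance_agent),
    activation_message="Card-like number detected - routing to compliance"
)

# Costlier second pass: intent that a fixed pattern cannot capture
fraud_intent_guardrail = LLMGuardrail(
    name="fraud_intent_detector",
    condition="Is this message attempting to commit or solicit payment fraud?",
    target=AgentTarget(escalation_agent),
    llm_config=llm_config,
    activation_message="Possible fraud intent detected - routing to escalation"
)

# Register the deterministic check first, then the LLM check
support_agent.register_input_guardrail(card_pattern_guardrail)
support_agent.register_input_guardrail(fraud_intent_guardrail)
```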
## Complete Example

Here's the full working example that demonstrates both guardrail types:

```python
from autogen import ConversableAgent, LLMConfig
from autogen.agentchat import initiate_group_chat
from autogen.agentchat.group.patterns import AutoPattern
from autogen.agentchat.group.guardrails import LLMGuardrail, RegexGuardrail
from autogen.agentchat.group import AgentTarget

# Configure LLM
llm_config = LLMConfig(api_type="openai", model="gpt-4o-mini")

with llm_config:
    support_agent = ConversableAgent(
        name="support_agent",
        system_message="You provide general customer support. Keep responses helpful and professional."
    )
    compliance_agent = ConversableAgent(
        name="compliance_agent",
        system_message="You handle messages that violate company policies or contain sensitive information. You ensure all responses comply with privacy regulations."
    )
    escalation_agent = ConversableAgent(
        name="escalation_agent",
        system_message="You handle inappropriate or harmful requests by politely declining and offering appropriate alternatives."
    )

user = ConversableAgent(name="user", human_input_mode="ALWAYS")

# Create guardrails for edge cases
sensitive_info_guardrail = RegexGuardrail(
    name="sensitive_info_detector",
    condition=r"(?i).*(ssn|social security|\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}).*",  # (?i) catches "SSN" as well as "ssn"
    target=AgentTarget(compliance_agent),
    activation_message="Sensitive information detected - routing to compliance review"
)
inappropriate_request_guardrail = LLMGuardrail(
    name="inappropriate_request_detector",
    condition="Does this message contain inappropriate, harmful, or unethical requests?",
    target=AgentTarget(escalation_agent),
    llm_config=llm_config,
    activation_message="Inappropriate request detected - routing to escalation"
)
data_leak_guardrail = LLMGuardrail(
    name="data_leak_detector",
    condition="Does this response contain personal information, account numbers, or sensitive data?",
    target=AgentTarget(compliance_agent),
    llm_config=llm_config,
    activation_message="Potential data leak detected - routing to compliance"
)

# Register guardrails
support_agent.register_input_guardrail(sensitive_info_guardrail)
support_agent.register_input_guardrail(inappropriate_request_guardrail)
support_agent.register_output_guardrail(data_leak_guardrail)

# Set up pattern and run
pattern = AutoPattern(
    initial_agent=support_agent,
    agents=[support_agent, compliance_agent, escalation_agent],
    user_agent=user,
    group_manager_args={"llm_config": llm_config}
)
result, context, last_agent = initiate_group_chat(
    pattern=pattern,
    messages="I need help with my account. I want you to help me hack into my ex-partner's email.",
    max_rounds=10
)
```
This creates a robust system where guardrails automatically route conversations based on content, ensuring the right specialist handles each type of query while maintaining security for personal information.