| layout | default | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| title | Chapter 4: User Mode: Deep Research System | ||||||||
| nav_order | 4 | ||||||||
| parent | AutoAgent Tutorial | ||||||||
| format_version | v2 | ||||||||
| why | User Mode is the most frequently used AutoAgent capability. Understanding how SystemTriageAgent routes between WebSurferAgent, FileSurferAgent, and ProgrammingAgent — and how case_resolved signals termination — lets you write better prompts and diagnose when the system gets stuck in routing loops. | ||||||||
| mental_model | SystemTriageAgent is a dispatcher: it analyzes your request, transfers control to the right specialist, waits for the specialist to return, then decides whether the task is done or needs another specialist. The transfer_to_X() functions are the routing mechanism. | ||||||||
| learning_outcomes |
|
||||||||
| snapshot |
|
||||||||
| chapter_map |
|
||||||||
| sources |
General-purpose research tasks don't fit neatly into a single tool. A question like "What is the latest Python performance benchmark and how does it compare to the 2023 results?" requires:
- Web search and browsing to find current benchmarks
- Document reading to parse PDFs or DOCX files
- Code execution to run statistical comparisons
- File writing to save the final report
A single agent trying to do all of this becomes confused about which tool to use when. AutoAgent solves this with a triage + specialist architecture: SystemTriageAgent handles routing, and four specialist agents each do one thing well.
flowchart TD
U[User Request] --> STA[SystemTriageAgent\nOrchestrator]
STA -->|web browsing / search| WSA[WebSurferAgent\nPlaywright + Screenshots]
STA -->|file reading / parsing| FSA[FileSurferAgent\nMarkdownBrowser]
STA -->|code / computation| PA[ProgrammingAgent\nDockerEnv]
WSA -->|transfer_back| STA
FSA -->|transfer_back| STA
PA -->|transfer_back| STA
STA -->|task complete| CR[case_resolved]
STA -->|task failed| CNR[case_not_resolved]
CR --> End([Return Response])
CNR --> End
Each specialist agent has access only to the tools it needs. This keeps tool schemas small and reduces LLM confusion about which tool to call.
SystemTriageAgent is the entry point for all User Mode interactions. It:
- Analyzes the user's request
- Decides which specialist(s) are needed
- Transfers control via
transfer_to_X()functions - Receives results back and synthesizes a final answer
- Calls
case_resolvedwhen the task is complete
The transfer functions are injected into SystemTriageAgent's function list at initialization:
# autoagent/system_triage_agent.py
from autoagent.types import Agent, Result
def transfer_to_websurfer(context_variables: dict) -> Result:
"""Transfer to WebSurferAgent for web browsing and search tasks.
Use when: the task requires browsing websites, searching the web,
or extracting information from online sources.
"""
return Result(
value="Transferring to WebSurferAgent for web research",
agent=websurfer_agent,
)
def transfer_to_filesurfer(context_variables: dict) -> Result:
"""Transfer to FileSurferAgent for reading local files and documents.
Use when: the task involves reading PDFs, DOCX, or other local files.
"""
return Result(
value="Transferring to FileSurferAgent for document reading",
agent=filesurfer_agent,
)
def transfer_to_programming(context_variables: dict) -> Result:
"""Transfer to ProgrammingAgent for code execution and data analysis.
Use when: the task requires writing or running Python code.
"""
return Result(
value="Transferring to ProgrammingAgent",
agent=programming_agent,
)
def case_resolved(context_variables: dict, summary: str) -> Result:
"""Signal that the task has been successfully completed."""
return Result(value=f"CASE_RESOLVED: {summary}")
def case_not_resolved(context_variables: dict, reason: str) -> Result:
"""Signal that the task could not be completed."""
return Result(value=f"CASE_NOT_RESOLVED: {reason}")
system_triage_agent = Agent(
name="SystemTriageAgent",
model="gpt-4o",
instructions="""You are a research coordinator. Analyze user requests and
route them to the appropriate specialist. After the specialist completes
their work, synthesize the results and call case_resolved with a summary.
Always route to specialists rather than attempting the task directly.
""",
functions=[
transfer_to_websurfer,
transfer_to_filesurfer,
transfer_to_programming,
case_resolved,
case_not_resolved,
],
)When SystemTriageAgent calls transfer_to_websurfer(), the MetaChain run loop detects the Result.agent field and switches the active agent:
sequenceDiagram
participant MC as MetaChain
participant STA as SystemTriageAgent
participant WSA as WebSurferAgent
MC->>STA: LLM call: "Research Python benchmarks"
STA-->>MC: tool_call: transfer_to_websurfer()
MC->>MC: handle_tool_calls() → Result(agent=websurfer_agent)
MC->>MC: active_agent = websurfer_agent
MC->>WSA: LLM call with same conversation history
WSA-->>MC: tool_calls: browse_web(), scroll_page(), etc.
MC->>MC: Execute browser tools, append results
WSA-->>MC: tool_call: transfer_back_to_triage()
MC->>MC: active_agent = system_triage_agent
MC->>STA: LLM call with accumulated results
STA-->>MC: tool_call: case_resolved(summary=...)
MC->>MC: Terminate loop
The key insight: the conversation history persists across handoffs. When control returns to SystemTriageAgent, it sees all the messages from WebSurferAgent's work and can synthesize them.
WebSurferAgent controls the BrowserEnv (Playwright) to navigate websites and extract information:
# autoagent/websurfer_agent.py (tool functions)
def browse_web(url: str, context_variables: dict) -> str:
"""Navigate to a URL and return page content with screenshot reference."""
web_env: BrowserEnv = context_variables["web_env"]
obs = web_env.navigate(url)
return f"URL: {obs.url}\n\nContent:\n{obs.content[:4000]}"
def search_web(query: str, context_variables: dict) -> str:
"""Search the web using the browser."""
web_env: BrowserEnv = context_variables["web_env"]
search_url = f"https://www.google.com/search?q={quote(query)}"
obs = web_env.navigate(search_url)
return obs.content[:4000]
def scroll_down(context_variables: dict) -> str:
"""Scroll down on the current page."""
web_env: BrowserEnv = context_variables["web_env"]
web_env.page.keyboard.press("PageDown")
obs = web_env._get_observation()
return obs.content[:2000]
def click_element(selector: str, context_variables: dict) -> str:
"""Click an element on the current page."""
web_env: BrowserEnv = context_variables["web_env"]
obs = web_env.click(selector)
return obs.content[:2000]
def transfer_back_to_triage(context_variables: dict, summary: str) -> Result:
"""Return to SystemTriageAgent with research results."""
return Result(
value=f"WebSurfer completed: {summary}",
agent=system_triage_agent,
)For visual navigation tasks, WebSurferAgent uses GPT-4V-style message construction:
# autoagent/websurfer_agent.py
def get_visual_observation(context_variables: dict) -> list[dict]:
"""Return current page screenshot as a multimodal message part."""
web_env: BrowserEnv = context_variables["web_env"]
obs = web_env._get_observation()
# Encode screenshot as base64 for vision models
screenshot_b64 = base64.b64encode(obs.screenshot).decode()
return [
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{screenshot_b64}",
"detail": "high",
}
},
{
"type": "text",
"text": f"Current URL: {obs.url}\n\nPage content summary:\n{obs.content[:1000]}"
}
]This allows WebSurferAgent to navigate pages that require visual understanding (CAPTCHA-free sites, pages with complex layouts, image-heavy content).
FileSurferAgent uses RequestsMarkdownBrowser for document reading and file operations:
# autoagent/filesurfer_agent.py (tool functions)
def read_file(file_path: str, context_variables: dict) -> str:
"""Read a file from the workspace, converting to Markdown."""
file_env: RequestsMarkdownBrowser = context_variables["file_env"]
return file_env.visit_page(file_path)
def page_down_file(context_variables: dict) -> str:
"""Scroll to the next page of the current document."""
file_env: RequestsMarkdownBrowser = context_variables["file_env"]
return file_env.page_down()
def list_workspace_files(context_variables: dict) -> str:
"""List all files in the workspace directory."""
workspace = context_variables.get("workspace", "./workspace")
files = []
for path in Path(workspace).rglob("*"):
if path.is_file():
files.append(str(path.relative_to(workspace)))
return "\n".join(files)
def write_file(file_path: str, content: str, context_variables: dict) -> str:
"""Write content to a file in the workspace."""
workspace = context_variables.get("workspace", "./workspace")
full_path = Path(workspace) / file_path
full_path.parent.mkdir(parents=True, exist_ok=True)
full_path.write_text(content)
return f"Written to {full_path}"Users can upload files for analysis via the workspace directory:
# Copy a file into the workspace before starting the session
cp my_research_paper.pdf workspace/
# Then in AutoAgent:
# AutoAgent> Summarize the PDF in workspace/my_research_paper.pdfFileSurferAgent uses RequestsMarkdownBrowser._convert_pdf() to extract text and then processes it page by page within the LLM's context window.
ProgrammingAgent writes and executes Python code in the Docker sandbox:
# autoagent/programming_agent.py (tool functions)
def execute_python(code: str, context_variables: dict) -> str:
"""Execute Python code in the Docker sandbox.
The sandbox maintains state between calls — variables and imports
persist within a session.
"""
code_env: DockerEnv = context_variables["code_env"]
stdout, stderr, result = code_env.execute_code(code)
output = ""
if stdout:
output += f"STDOUT:\n{stdout}"
if stderr:
output += f"\nSTDERR:\n{stderr}"
if result:
output += f"\nRESULT: {result}"
return output or "Code executed successfully (no output)"
def install_package(package: str, context_variables: dict) -> str:
"""Install a Python package in the Docker sandbox."""
code_env: DockerEnv = context_variables["code_env"]
install_code = f"import subprocess; subprocess.run(['pip', 'install', '{package}'], capture_output=True)"
stdout, stderr, _ = code_env.execute_code(install_code)
return f"Installed {package}"
def list_workspace_contents(context_variables: dict) -> str:
"""List files in the mounted workspace directory."""
code_env: DockerEnv = context_variables["code_env"]
stdout, _, _ = code_env.execute_code("import os; print(os.listdir('/workspace'))")
return stdoutWhen code fails, ProgrammingAgent retries with error context in the conversation history:
[ProgrammingAgent] Writing code to parse CSV...
[Tool: execute_python]
STDERR: ImportError: No module named 'pandas'
[ProgrammingAgent] Need to install pandas first
[Tool: install_package] package=pandas
[Tool: execute_python] (retry with same code)
STDOUT: Parsed 1000 rows successfully
This happens naturally through the conversation history — no special retry logic is needed in the agent code itself.
The @AgentName syntax in inner.py allows bypassing SystemTriageAgent:
# autoagent/inner.py (simplified)
def parse_user_input(message: str, registered_agents: dict) -> tuple[Agent, str]:
"""Check if the message starts with @AgentName and route directly."""
if message.startswith("@"):
parts = message.split(" ", 1)
agent_name = parts[0][1:] # Strip the @
actual_message = parts[1] if len(parts) > 1 else ""
if agent_name in registered_agents:
return registered_agents[agent_name], actual_message
# Default: route through SystemTriageAgent
return system_triage_agent, messageExamples:
# Route directly to WebSurferAgent
AutoAgent> @WebSurferAgent find the latest PyPI release of litellm
# Route directly to ProgrammingAgent
AutoAgent> @ProgrammingAgent run this code: import sys; print(sys.version)
# Route directly to a custom registered agent
AutoAgent> @SalesAgent recommend a product for a $50 budget in electronics
The academic paper (arxiv:2502.05957) evaluates AutoAgent on the GAIA benchmark, which tests general AI assistants on real-world tasks requiring multi-step reasoning across web, file, and code capabilities:
| GAIA Level | Task Type | AutoAgent Performance |
|---|---|---|
| Level 1 | Simple factual lookups | ~85% |
| Level 2 | Multi-step reasoning with tools | ~67% |
| Level 3 | Complex multi-source synthesis | ~40% |
GAIA Level 1 tasks are single-step (e.g., "What is the capital of France?"). Level 3 tasks require chaining 5-10 tool calls across multiple sources with complex reasoning.
The benchmark is run via evaluation/gaia/run_infer.py — Chapter 8 covers the evaluation infrastructure in detail.
| Component | File | Role |
|---|---|---|
SystemTriageAgent |
system_triage_agent.py |
Orchestrator: routes to specialists, synthesizes results |
WebSurferAgent |
websurfer_agent.py |
Web browsing via Playwright + multimodal screenshots |
FileSurferAgent |
filesurfer_agent.py |
Document reading via MarkdownBrowser + file writing |
ProgrammingAgent |
programming_agent.py |
Python code execution via DockerEnv |
transfer_to_X() |
All agent files | Agent handoff via Result(agent=next_agent) |
case_resolved |
system_triage_agent.py |
Task completion signal |
case_not_resolved |
system_triage_agent.py |
Task failure signal |
@mention routing |
inner.py |
Bypass triage, route directly to named agent |
| GAIA benchmark | evaluation/gaia/ |
Multi-level task evaluation (Levels 1-3) |
Continue to Chapter 5: Agent Editor: From NL to Deployed Agents to learn how the 4-phase pipeline generates, tests, and registers new agents from natural language descriptions.