Name	Name	Last commit message	Last commit date
parent directory ..
tests	tests
README.md	README.md
agent.py	agent.py
pyproject.toml	pyproject.toml

LangGraph Multi-Agent Supervisor Example

A multi-agent supervisor workflow built with LangGraph and langgraph-supervisor. A central supervisor coordinates a research expert and a math expert to answer a compound question. Tested with agentverify.

The example mirrors the official langgraph-supervisor quickstart. Given the question "What's the combined headcount of the FAANG companies in 2024?" the supervisor delegates: the research expert fetches the per-company numbers via a web_search tool, then the math expert adds them up with an add tool. The cassette captures ~12 observable steps, which agentverify asserts against step-by-step.

Why this example?

This is the one place in agentverify where the step-level + data-flow testing story shines the brightest:

Multi-agent handoff: the supervisor's routing decisions are distinct steps, separate from the sub-agents' tool work.
Parallel tool calls: the research expert issues multiple web_search calls in a single step; assert_step with expected_tools + OrderMode.ANY_ORDER handles this cleanly.
Cross-step numeric data flow: each add call's running total feeds the next add's input. assert_step_uses_result_from verifies the chain didn't drop a value.

Prerequisites

Python 3.10+
OpenAI API key (only for re-recording cassettes; not needed for replay)

Setup

See the main README for setup instructions (git clone, venv, pip install).

Running the Agent

cd examples/langgraph-multi-agent-supervisor

export OPENAI_API_KEY=sk-your_key_here
python agent.py

The agent prints each message in the conversation. You'll see the supervisor route to research_expert, the researcher call web_search and summarise, the supervisor hand off to math_expert, and finally the math expert chain several add calls into the final answer.

Running Tests

Cassette Replay (No API Key Required)

Tests ship with a pre-recorded cassette under tests/cassettes/. Just run:

pytest

The cassette replays deterministically. No OpenAI API calls, zero cost.

What the Tests Verify

Test Class	Test	agentverify Assertion	Description
`TestSupervisorFlat`	`test_tool_sequence_in_order`	`assert_tool_calls()`	End-to-end sequence: `transfer_to_research_expert` → `web_search` → `transfer_to_math_expert` → `add` (IN_ORDER)
`TestSupervisorFlat`	`test_safety_no_dangerous_tools`	`assert_no_tool_call()`	The supervisor never hands off to non-existent agents and never runs destructive tools
`TestSupervisorFlat`	`test_budget_and_output`	`assert_cost()` + `assert_final_output()` via `assert_all`	Token budget ≤ 20k and final answer contains a numeric total
`TestSupervisorStepLevel`	`test_supervisor_routes_to_research_first`	`assert_step()`	Step 0 is the supervisor handing off to the research expert, with no domain tool calls yet
`TestSupervisorStepLevel`	`test_research_agent_uses_web_search`	`assert_step(order=ANY_ORDER)`	The research step contains at least one `web_search` call whose query mentions FAANG / headcount / employees. Uses `ANY_ORDER` to handle parallel tool calls in the same step
`TestSupervisorStepLevel`	`test_math_agent_adds_with_research_derived_numbers`	`assert_step_uses_result_from()`	The second `add` step consumes the first `add` step's result. Catches the "agent dropped the running total" bug
`TestSupervisorStepLevel`	`test_final_answer_includes_a_number`	`assert_step_output()`	The final summary step produces text matching `\d[\d,]*`

Step-level testing works on cassette replay because agentverify backfills tool results from the next step's LLM input. See Step-Level Assertions in the main README.

Recording Mode (Real OpenAI API)

To re-record the cassette with real LLM calls:

Set the API key:
```
export OPENAI_API_KEY=sk-your_key_here
```
Run the tests with --cassette-mode=record:
```
pytest --cassette-mode=record
```
The cassette file is overwritten with a fresh recording.
Commit the updated cassette.

Note: the supervisor's exact routing and the number of web_search calls can vary between recordings because gpt-5.4-mini is a reasoning-style model whose internal sampling is not exposed via the temperature parameter (the API rejects any non-default value, so the example omits temperature for the GPT-5 series). The test assertions are written to tolerate that variance: OrderMode.IN_ORDER for the flat sequence, ANY_ORDER for in-step parallel tool calls, and helper functions that locate steps by tool name rather than hard-coded indices.

Framework Adapter

This example uses the built-in adapter:

from agentverify.frameworks.langgraph import from_langgraph

raw = app.invoke({"messages": [{"role": "user", "content": query}]})
execution_result = from_langgraph(raw)

The adapter walks the messages list, promoting each AIMessage to a Step and attaching subsequent ToolMessage content as the step's tool_results. See Framework Integration in the main README for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

LangGraph Multi-Agent Supervisor Example

Why this example?

Prerequisites

Setup

Running the Agent

Running Tests

Cassette Replay (No API Key Required)

What the Tests Verify

Recording Mode (Real OpenAI API)

Framework Adapter

FilesExpand file tree

langgraph-multi-agent-supervisor

Directory actions

More options

Directory actions

More options

Latest commit

History

langgraph-multi-agent-supervisor

Folders and files

parent directory

README.md

LangGraph Multi-Agent Supervisor Example

Why this example?

Prerequisites

Setup

Running the Agent

Running Tests

Cassette Replay (No API Key Required)

What the Tests Verify

Recording Mode (Real OpenAI API)

Framework Adapter