Commit 85547ae

Add tau 2 bench (#16)
* Implement tau2-bench benchmark for evaluating LLM agents on customer service tasks across airline, retail, and telecom domains
* Add Tau2Benchmark, Tau2Environment, Tau2User, Tau2Evaluator components
* Add DefaultAgentTau2Benchmark with reference agent implementation
* Add ModelAdapter.chat() method with ChatResponse dataclass
* Add AnthropicModelAdapter for direct Claude integration
* Add AgenticUser class for tool-using user simulations
* Add AgenticUserLLMSimulator for LLM-based user simulation with tools
* Support multiple stop tokens in User class with stop reason tracking
* Change Task.id from UUID to str for human-readable IDs
* Make Benchmark.agent_data parameter optional
* Fix task reports to use task.id directly
1 parent eda3c05 commit 85547ae

83 files changed

Lines changed: 17324 additions & 235 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,7 @@
11
# Custom
22
.idea/
33
.DS_Store
4+
.devcontainer/
45

56
# Byte-compiled / optimized / DLL files
67
__pycache__/

AGENTS.md

Lines changed: 70 additions & 6 deletions
@@ -55,6 +55,75 @@ uv run pytest -m smolagents -v
5555
uv run pytest -m interface -v
5656
```
5757

58+
## Coverage
59+
60+
View coverage by feature area (auto-discovers benchmarks/interfaces):
61+
62+
```bash
63+
uv run python scripts/coverage_by_feature.py
64+
```
65+
66+
Manual coverage for specific modules:
67+
68+
```bash
69+
pytest --cov=maseval.core.agent --cov-report=term-missing
70+
```
71+
72+
## Typing
73+
74+
### Type Checker
75+
76+
The project uses the `ty` type checker.
77+
78+
```bash
79+
# Check types across the project
80+
uv run ty check
81+
82+
# View documentation
83+
uv run ty --help
84+
```
85+
86+
Documentation: [https://docs.astral.sh/ty/](https://docs.astral.sh/ty/)
87+
88+
### Philosophy & Priorities
89+
90+
**Types exist to help users and catch bugs—not to satisfy theoretical purity.**
91+
92+
This library uses type hinting to:
93+
94+
- Provide better IDE autocomplete and error detection
95+
- Document expected types clearly
96+
- Catch real errors before runtime
97+
98+
However, **pragmatism over pedantry**: if a typing pattern improves usability and robustness, use it—even if it technically violates some typing rule.
99+
100+
**Example:** `MACSBenchmark` narrows its environment type to `MACSEnvironment` (instead of generic `Environment`). This violates strict subtyping rules but provides users with:
101+
102+
- Precise autocomplete for MACS-specific methods
103+
- Clear documentation that MACS requires its own environment
104+
- Better error messages during development
105+
- Prevents mixing incompatible components (e.g., `Tau2Environment` cannot be passed to `MACSBenchmark`)
106+
107+
Unless there's a graceful alternative that preserves usability, choose the pattern that helps users most.
108+
109+
**Guiding principle:** Orient yourself on existing patterns in the codebase. Consistency matters more than theoretical correctness.
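
A minimal sketch of the narrowing pattern described above, assuming the narrowing happens on a method parameter. All class and method names here are hypothetical stand-ins, not the actual `MACSBenchmark` or `Environment` definitions. A strict type checker may flag the override, since a subclass is expected to accept every argument its base class accepts, but callers of the subclass get precise autocomplete.

```python
from typing import List


class SketchEnvironment:
    """Hypothetical generic environment."""

    def describe(self) -> str:
        return "generic"


class SketchMACSEnvironment(SketchEnvironment):
    """Hypothetical benchmark-specific environment with extra methods."""

    def macs_tools(self) -> List[str]:
        return ["macs_tool"]


class SketchBenchmark:
    """Hypothetical base benchmark that accepts any environment."""

    def tool_names(self, environment: SketchEnvironment) -> List[str]:
        return []


class SketchMACSBenchmark(SketchBenchmark):
    # The parameter type is narrowed from SketchEnvironment to
    # SketchMACSEnvironment. This breaks strict substitutability,
    # but the body can call macs_tools() without casts.
    def tool_names(self, environment: SketchMACSEnvironment) -> List[str]:
        return environment.macs_tools()
```

The code runs unchanged either way; the trade-off only shows up in the type checker's output and in IDE completion.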
110+
111+
### Syntax Rules
112+
113+
- **Unions:** Use `A | B` notation (not `Union[A, B]`)
114+
- **Optional:** Prefer `Optional[X]` over `X | None` for explicitness
115+
- **Collections:** Use `List[...]`, `Dict[..., ...]`, `Sequence[...]` instead of `list`, `dict`, `sequence`
116+
117+
**Example:**
118+
119+
```python
120+
def process_data(
121+
    items: List[str],
122+
    config: Optional[Dict[str, Any]] = None
123+
) -> str | int:
124+
    ...
125+
```
126+
58127
## Dependency Management
59128

60129
Three types of dependencies:
@@ -220,7 +289,7 @@ Example workflow:
220289
uv sync --all-extras --all-groups
221290

222291
# Before committing
223-
uv run ruff format . && uv run ruff check . --fix && uv run pytest -v
292+
uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
224293

225294
# Run example
226295
uv run python examples/amazon_collab.py
@@ -235,11 +304,6 @@ uv add --optional <extra-name> <package-name>
235304
uv run pytest tests/test_core/test_agent.py -v
236305
```
237306

238-
## Type Hinting
239-
240-
This repository uses proper type hinting. For unions use the `A | B` notation. For optional imports, prefer `Optional[...]` as it is more explicit.
241-
For lists and dictionaries, use `Dict[...,...]`, `List[...]`, `Sequence[...]` etc. instead of `list`, `dict`.
242-
243307
## Security and Confidentiality
244308

245309
**IMPORTANT:** This project contains confidential research material.

CHANGELOG.md

Lines changed: 33 additions & 0 deletions
@@ -20,10 +20,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2020
- Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
2121
- Added `anthropic` optional dependency: `pip install maseval[anthropic]`
2222

23+
**Benchmarks**
24+
25+
- Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
26+
- `Tau2Benchmark`, `Tau2Environment`, `Tau2User`, `Tau2Evaluator` components for framework-agnostic evaluation (PR: #16)
27+
- `DefaultAgentTau2Benchmark` using an agent setup closely resembling the original tau2-bench implementation (PR: #16)
28+
- Data loading utilities: `load_tasks()`, `ensure_data_exists()`, `configure_model_ids()` (PR: #16)
29+
- Metrics: `compute_benchmark_metrics()`, `compute_pass_at_k()`, `compute_pass_hat_k()` for tau2-style scoring (PR: #16)
30+
- Domain implementations with tool kits: `AirlineTools`, `RetailTools`, `TelecomTools` with full database simulation (PR: #16)
31+
32+
**User**
33+
34+
- `AgenticUser` class for users that can use tools during conversations (PR: #16)
35+
- Multiple stop token support: `User` now accepts `stop_tokens` (list) instead of single `stop_token`, enabling different termination reasons (PR: #16)
36+
- Stop reason tracking: `User` traces now include `stop_reason`, `max_turns`, `turns_used`, and `stopped_by_user` for detailed termination analysis (PR: #16)
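
A minimal sketch of how several stop tokens can map to distinct termination reasons, as the entries above describe. The helper name and the token strings are hypothetical and chosen for illustration only; the actual `User` implementation in this commit may detect and record stop reasons differently.

```python
from typing import List, Optional


def find_stop_token(message: str, stop_tokens: List[str]) -> Optional[str]:
    """Return the first configured stop token contained in message, else None."""
    for token in stop_tokens:
        if token in message:
            return token
    return None


# Hypothetical tokens: one for a finished task, one for a handoff.
tokens = ["###STOP###", "###TRANSFER###"]

reason = find_stop_token("Thanks, that is all. ###STOP###", tokens)
no_reason = find_stop_token("Can you check my order?", tokens)
```

Returning the matched token rather than a boolean is what makes a `stop_reason` field possible: the caller can record which token ended the conversation.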
37+
38+
**Simulator**
39+
40+
- `AgenticUserLLMSimulator` for LLM-based user simulation with tool use capabilities (PR: #16)
41+
42+
**Examples**
43+
44+
- Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)
45+
2346
### Changed
2447

48+
**Benchmark**
49+
50+
- `Benchmark.agent_data` parameter is now optional (defaults to empty dict) (PR: #16)
51+
52+
**Task**
53+
54+
- `Task.id` is now `str` type instead of `UUID`. Benchmarks can provide human-readable IDs directly (e.g., `Task(id="retail_001", ...)`). Auto-generates UUID string if not provided. (PR: #16)
55+
2556
### Fixed
2657

58+
- Task reports now use `task.id` directly instead of `metadata["task_id"]` (PR: #16)
59+
2760
### Removed
2861

2962
## [0.2.0] - 2025-12-05
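
The `Task.id` change listed under "Changed" above (string IDs, with a UUID string generated when none is given) can be sketched with a dataclass default factory. `SketchTask` is a hypothetical stand-in and not the real `Task` class, whose fields and construction logic may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict
from uuid import uuid4


@dataclass
class SketchTask:
    """Hypothetical task with a human-readable or auto-generated string id."""

    query: str
    # Falls back to a fresh UUID rendered as a string when no id is passed.
    id: str = field(default_factory=lambda: str(uuid4()))
    metadata: Dict[str, Any] = field(default_factory=dict)


named = SketchTask(query="Return my order", id="retail_001")
auto = SketchTask(query="Return my order")
```

With this shape, reports can read `task.id` directly in both cases instead of looking up `metadata["task_id"]`.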

docs/benchmark/tau2.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
1+
# Tau2: Tool-Agent-User Interaction Benchmark
2+
3+
The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.
4+
5+
## Overview
6+
7+
[Tau2-bench](https://github.com/sierra-research/tau2-bench) (Tool-Agent-User) is designed to evaluate single-agent customer service systems. The benchmark features:
8+
9+
- **Real tool implementations** that modify actual database state
10+
- **Deterministic evaluation** via database state comparison
11+
- **Three domains**: airline (50 tasks), retail (114 tasks), telecom (114 tasks)
12+
- **Pass@k metrics** for robust evaluation with multiple runs
13+
14+
Reference Paper: [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)
15+
16+
Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
17+
18+
## Quick Start
19+
20+
```python
21+
from maseval.benchmark.tau2 import (
22+
    Tau2Benchmark, Tau2Environment, Tau2Evaluator, Tau2User,
23+
    load_tasks, configure_model_ids, ensure_data_exists,
24+
    compute_benchmark_metrics, compute_pass_at_k,
25+
)
26+
27+
# Ensure domain data is downloaded
28+
ensure_data_exists(domain="retail")
29+
30+
# Load tasks and configure model IDs
31+
tasks = load_tasks("retail", split="base", limit=5)
32+
configure_model_ids(
33+
    tasks,
34+
    user_model_id="gpt-4o",
35+
    evaluator_model_id="gpt-4o",
36+
)
37+
38+
# Create your framework-specific benchmark subclass
39+
class MyTau2Benchmark(Tau2Benchmark):
40+
    def setup_agents(self, agent_data, environment, task, user):
41+
        tools = environment.tools
42+
        # Create your agent with these tools
43+
        ...
44+
45+
    def get_model_adapter(self, model_id, **kwargs):
46+
        adapter = MyModelAdapter(model_id)
47+
        if "register_name" in kwargs:
48+
            self.register("models", kwargs["register_name"], adapter)
49+
        return adapter
50+
51+
# Run benchmark
52+
benchmark = MyTau2Benchmark(agent_data={}, n_task_repeats=4)
53+
results = benchmark.run(tasks)
54+
55+
# Compute metrics
56+
metrics = compute_benchmark_metrics(results)
57+
pass_k = compute_pass_at_k(results, k_values=[1, 2, 3, 4])
58+
```
59+
60+
For baseline comparisons, use `DefaultAgentTau2Benchmark` which mirrors the original tau2-bench implementation:
61+
62+
```python
63+
from maseval.benchmark.tau2 import DefaultAgentTau2Benchmark
64+
65+
benchmark = DefaultAgentTau2Benchmark(
66+
    agent_data={"model_id": "gpt-4o"},
67+
    n_task_repeats=4,
68+
)
69+
results = benchmark.run(tasks)
70+
```
71+
72+
::: maseval.benchmark.tau2.Tau2Benchmark
73+
74+
::: maseval.benchmark.tau2.Tau2User
75+
76+
::: maseval.benchmark.tau2.Tau2Environment
77+
78+
::: maseval.benchmark.tau2.Tau2Evaluator
79+
80+
::: maseval.benchmark.tau2.DefaultAgentTau2Benchmark
81+
82+
::: maseval.benchmark.tau2.DefaultTau2Agent
83+
84+
::: maseval.benchmark.tau2.load_tasks
85+
86+
::: maseval.benchmark.tau2.configure_model_ids
87+
88+
::: maseval.benchmark.tau2.ensure_data_exists
89+
90+
::: maseval.benchmark.tau2.compute_benchmark_metrics
91+
92+
::: maseval.benchmark.tau2.compute_pass_at_k
93+
94+
::: maseval.benchmark.tau2.compute_pass_hat_k
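
For readers unfamiliar with the two metric families named in this file, here is a minimal per-task sketch. It assumes the standard combinatorial estimators: pass@k as the chance that at least one of k runs drawn from n succeeds, and pass^k as the chance that all k drawn runs succeed. The function names are hypothetical, and whether `compute_pass_at_k()` and `compute_pass_hat_k()` use exactly these formulas is an assumption, since the diff does not show their bodies.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs sampled from n (c successful) passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the samples made only of failing runs.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k_estimate(n: int, c: int, k: int) -> float:
    """Probability that all k runs sampled from n (c successful) pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the samples made only of successful runs.
    return comb(c, k) / comb(n, k)


# One task, 4 repeats (as with n_task_repeats=4), 2 of them successful.
at_2 = pass_at_k_estimate(n=4, c=2, k=2)
hat_2 = pass_hat_k_estimate(n=4, c=2, k=2)
```

With 2 successes in 4 runs, pass@2 is 5/6 while pass^2 is 1/6, which illustrates why pass^k is the stricter, reliability-oriented measure. A benchmark-level score would average these per-task values.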

docs/reference/simulator.md

Lines changed: 2 additions & 0 deletions
@@ -10,4 +10,6 @@ Simulators in MASEval are used to create reproducible and scalable testing envir
1010

1111
::: maseval.core.simulator.UserLLMSimulator
1212

13+
::: maseval.core.simulator.AgenticUserLLMSimulator
14+
1315
::: maseval.core.simulator.SimulatorCallStatus

docs/reference/user.md

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ The User is initialized with a persona and a scenario, both of which are typical
88

99
::: maseval.core.user.User
1010

11+
::: maseval.core.user.AgenticUser
12+
1113
## Interfaces
1214

1315
Some integrations provide convenience user/tool implementations for specific agent frameworks. For example:

examples/five_a_day_benchmark/five_a_day_benchmark.ipynb

Lines changed: 4 additions & 3 deletions
@@ -178,10 +178,11 @@
178178
" task_id = task_dict[\"metadata\"][\"task_id\"]\n",
179179
" task_dict[\"environment_data\"][\"agent_framework\"] = framework\n",
180180
"\n",
181-
" # Create Task object\n",
181+
" # Create Task object with id from metadata\n",
182182
" tasks_data.append(\n",
183183
" Task(\n",
184184
" query=task_dict[\"query\"],\n",
185+
" id=task_id,\n",
185186
" environment_data=task_dict[\"environment_data\"],\n",
186187
" evaluation_data=task_dict[\"evaluation_data\"],\n",
187188
" metadata=task_dict[\"metadata\"],\n",
@@ -575,7 +576,7 @@
575576
"# Build agents using the build_agents function\n",
576577
"agents_to_run, agents_to_monitor = build_agents(config_0, environment_0)\n",
577578
"\n",
578-
"print(f\"\\nBuilt Agents for Task: {task_0.metadata['task_id']}\")\n",
579+
"print(f\"\\nBuilt Agents for Task: {task_0.id}\")\n",
579580
"print(f\"{'=' * 60}\")\n",
580581
"print(f\"\\nAgents to run: {[agent.name for agent in agents_to_run]}\")\n",
581582
"print(f\"Agents to monitor: {list(agents_to_monitor.keys())}\")\n",
@@ -720,7 +721,7 @@
720721
"\n",
721722
"print(f\"Loaded {len(tasks)} tasks:\")\n",
722723
"for i, task in enumerate(tasks):\n",
723-
" print(f\" {i}. {task.metadata['task_id']}: {task.metadata['description']}\")"
724+
" print(f\" {i}. {task.id}: {task.metadata['description']}\")"
724725
]
725726
},
726727
{

examples/five_a_day_benchmark/five_a_day_benchmark.py

Lines changed: 2 additions & 1 deletion
@@ -873,10 +873,11 @@ def load_benchmark_data(
873873
task_id = task_dict["metadata"]["task_id"]
874874
task_dict["environment_data"]["agent_framework"] = framework
875875

876-
# Create task
876+
# Create task with id from metadata
877877
tasks_data.append(
878878
Task(
879879
query=task_dict["query"],
880+
id=task_id,
880881
environment_data=task_dict["environment_data"],
881882
evaluation_data=task_dict["evaluation_data"],
882883
metadata=task_dict["metadata"],

examples/introduction/tutorial.ipynb

Lines changed: 2 additions & 1 deletion
@@ -390,12 +390,13 @@
390390
"# Create a Task instance\n",
391391
"task = Task(\n",
392392
" query=task_data[\"query\"],\n",
393+
" id=task_data[\"metadata\"][\"task_id\"],\n",
393394
" environment_data=task_data[\"environment_data\"],\n",
394395
" evaluation_data=task_data[\"evaluation_data\"],\n",
395396
" metadata=task_data[\"metadata\"],\n",
396397
")\n",
397398
"\n",
398-
"print(f\"Created task: {task.metadata['task_id']}\")\n",
399+
"print(f\"Created task: {task.id}\")\n",
399400
"print(f\"Complexity: {task.metadata['complexity']}\")\n",
400401
"print(f\"Skills tested: {', '.join(task.metadata['skills_tested'])}\")"
401402
]

examples/tau2_benchmark/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
results/*
