# Tau2: Tool-Agent-User Interaction Benchmark

The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.

## Overview

[Tau2-bench](https://github.com/sierra-research/tau2-bench) (Tool-Agent-User) is designed to evaluate single-agent customer service systems. The benchmark features:

- **Real tool implementations** that modify actual database state
- **Deterministic evaluation** via database state comparison
- **Three domains**: airline (50 tasks), retail (114 tasks), telecom (114 tasks)
- **Pass@k metrics** for robust evaluation with multiple runs

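The pass@k metrics in the feature list can be estimated from repeated runs of the same task. The sketch below uses the standard combinatorial estimators: pass@k (at least one of k sampled trials succeeds) and pass^k (all k sampled trials succeed, the variant used in the tau-bench paper). The function names and the `(n, c)` inputs are illustrative assumptions, not MASEval's API; `compute_pass_at_k` and `compute_pass_hat_k` may be implemented differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from n runs with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of failed runs.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from n runs with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the k-subsets made up entirely of successful runs.
    return comb(c, k) / comb(n, k)


def mean_over_tasks(metric, per_task_counts, k: int) -> float:
    """Average a per-task metric over a list of (n, c) pairs, one pair per task."""
    return sum(metric(n, c, k) for n, c in per_task_counts) / len(per_task_counts)
```

For a task with 4 runs and 2 successes, `pass_at_k(4, 2, 2)` is 5/6 while `pass_hat_k(4, 2, 2)` is 1/6, which shows how much stricter the "all k succeed" variant is.
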
Reference Paper: [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)

See [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) for more information, including licenses.

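To make the first two features in the overview concrete, here is a toy sketch of state-based grading: a tool mutates an in-memory database, and the episode passes only if the final state matches the state produced by replaying the reference actions. Everything in it (`cancel_order`, `state_hash`, the dictionary layout) is hypothetical and far simpler than the real Tau2 environments and evaluator.

```python
import copy
import hashlib
import json


def state_hash(db: dict) -> str:
    """Hash a JSON-serializable database so that equal states give equal digests."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cancel_order(db: dict, order_id: str) -> str:
    """Hypothetical tool: mutates the database in place when the policy allows it."""
    order = db["orders"][order_id]
    if order["status"] != "pending":
        return "error: only pending orders can be cancelled"
    order["status"] = "cancelled"
    return "ok"


initial_db = {"orders": {"W1": {"status": "pending"}, "W2": {"status": "delivered"}}}

# Reference trajectory: replay the expected actions on a fresh copy.
expected_db = copy.deepcopy(initial_db)
cancel_order(expected_db, "W1")

# Agent trajectory: the agent's tool calls run against its own copy.
agent_db = copy.deepcopy(initial_db)
cancel_order(agent_db, "W2")  # rejected by the policy check, state unchanged
cancel_order(agent_db, "W1")

# Deterministic verdict: compare final states, not the wording of the conversation.
passed = state_hash(agent_db) == state_hash(expected_db)
```

Comparing canonical hashes keeps the verdict independent of how the agent phrased its replies; only the resulting database state matters in this sketch.
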
## Quick Start

```python
from maseval.benchmark.tau2 import (
    Tau2Benchmark, Tau2Environment, Tau2Evaluator, Tau2User,
    load_tasks, configure_model_ids, ensure_data_exists,
    compute_benchmark_metrics, compute_pass_at_k,
)

# Ensure domain data is downloaded
ensure_data_exists(domain="retail")

# Load tasks and configure model IDs
tasks = load_tasks("retail", split="base", limit=5)
configure_model_ids(
    tasks,
    user_model_id="gpt-4o",
    evaluator_model_id="gpt-4o",
)

# Create your framework-specific benchmark subclass
class MyTau2Benchmark(Tau2Benchmark):
    def setup_agents(self, agent_data, environment, task, user):
        tools = environment.tools
        # Create your agent with these tools
        ...

    def get_model_adapter(self, model_id, **kwargs):
        # MyModelAdapter is a placeholder for your framework's model adapter class
        adapter = MyModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

# Run benchmark, repeating each task 4 times
benchmark = MyTau2Benchmark(agent_data={}, n_task_repeats=4)
results = benchmark.run(tasks)

# Compute metrics
metrics = compute_benchmark_metrics(results)
pass_k = compute_pass_at_k(results, k_values=[1, 2, 3, 4])
```

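With combinatorial pass@k-style estimators, k cannot exceed the number of runs per task, which is presumably why the snippet above pairs `n_task_repeats=4` with `k_values=[1, 2, 3, 4]`. The sketch below groups per-run outcomes by task to obtain the run and success counts such metrics consume. The record layout (`task_id`, `success`) and both helper names are hypothetical stand-ins; the actual structure of `results` returned by `benchmark.run` is not documented in this section.

```python
from collections import defaultdict


def success_counts(records):
    """Map task_id -> (n_runs, n_successes) from per-run outcome records."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        entry = counts[rec["task_id"]]
        entry[0] += 1
        entry[1] += int(bool(rec["success"]))
    return {task_id: (n, c) for task_id, (n, c) in counts.items()}


def valid_k_values(counts):
    """All k from 1 up to the smallest number of runs any task received."""
    if not counts:
        return []
    max_k = min(n for n, _ in counts.values())
    return list(range(1, max_k + 1))


# Hypothetical outcomes: two tasks, four repeats each.
records = (
    [{"task_id": "0", "success": s} for s in (True, True, False, True)]
    + [{"task_id": "1", "success": s} for s in (False, False, False, False)]
)
counts = success_counts(records)
ks = valid_k_values(counts)
```

If a run crashes and a task ends up with fewer repeats than planned, `valid_k_values` shrinks accordingly, which is one simple way to avoid requesting a k that some task cannot support.
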
For baseline comparisons, use `DefaultAgentTau2Benchmark`, which mirrors the original tau2-bench implementation:

```python
from maseval.benchmark.tau2 import DefaultAgentTau2Benchmark

benchmark = DefaultAgentTau2Benchmark(
    agent_data={"model_id": "gpt-4o"},
    n_task_repeats=4,
)
results = benchmark.run(tasks)
```

::: maseval.benchmark.tau2.Tau2Benchmark

::: maseval.benchmark.tau2.Tau2User

::: maseval.benchmark.tau2.Tau2Environment

::: maseval.benchmark.tau2.Tau2Evaluator

::: maseval.benchmark.tau2.DefaultAgentTau2Benchmark

::: maseval.benchmark.tau2.DefaultTau2Agent

::: maseval.benchmark.tau2.load_tasks

::: maseval.benchmark.tau2.configure_model_ids

::: maseval.benchmark.tau2.ensure_data_exists

::: maseval.benchmark.tau2.compute_benchmark_metrics

::: maseval.benchmark.tau2.compute_pass_at_k

::: maseval.benchmark.tau2.compute_pass_hat_k