parameterlab
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 70 additions & 6 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 33 additions & 0 deletions
diff --git a/docs/benchmark/tau2.md b/docs/benchmark/tau2.md
Lines changed: 94 additions & 0 deletions
diff --git a/docs/reference/simulator.md b/docs/reference/simulator.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/reference/user.md b/docs/reference/user.md
Lines changed: 2 additions & 0 deletions
diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb
Lines changed: 4 additions & 3 deletions
diff --git a/examples/five_a_day_benchmark/five_a_day_benchmark.py b/examples/five_a_day_benchmark/five_a_day_benchmark.py
Lines changed: 2 additions & 1 deletion
diff --git a/examples/introduction/tutorial.ipynb b/examples/introduction/tutorial.ipynb
Lines changed: 2 additions & 1 deletion
diff --git a/examples/tau2_benchmark/.gitignore b/examples/tau2_benchmark/.gitignore
Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 # Custom
 .idea/
 .DS_Store
+.devcontainer/
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
 
@@ -55,6 +55,75 @@ uv run pytest -m smolagents -v
 uv run pytest -m interface -v
 ```
 
+## Coverage
+
+View coverage by feature area (auto-discovers benchmarks/interfaces):
+
+```bash
+uv run python scripts/coverage_by_feature.py
+```
+
+Manual coverage for specific modules:
+
+```bash
+pytest --cov=maseval.core.agent --cov-report=term-missing
+```
+
+## Typing
+
+### Type Checker
+
+The project uses the `ty` type checker.
+
+```bash
+# Check types across the project
+uv run ty check
+
+# View documentation
+uv run ty --help
+```
+
+Documentation: [https://docs.astral.sh/ty/](https://docs.astral.sh/ty/)
+
+### Philosophy & Priorities
+
+**Types exist to help users and catch bugs—not to satisfy theoretical purity.**
+
+This library uses type hinting to:
+
+- Provide better IDE autocomplete and error detection
+- Document expected types clearly
+- Catch real errors before runtime
+
+However, **pragmatism over pedantry**: if a typing pattern improves usability and robustness, use it—even if it technically violates some typing rule.
+
+**Example:** `MACSBenchmark` narrows its environment type to `MACSEnvironment` (instead of generic `Environment`). This violates strict subtyping rules but provides users with:
+
+- Precise autocomplete for MACS-specific methods
+- Clear documentation that MACS requires its own environment
+- Better error messages during development
+- Protection against mixing incompatible components (e.g., `Tau2Environment` cannot be passed to `MACSBenchmark`)
+
+Unless there's a graceful alternative that preserves usability, choose the pattern that helps users most.
+
+**Guiding principle:** Follow the existing patterns in the codebase. Consistency matters more than theoretical correctness.
+
+### Syntax Rules
+
+- **Unions:** Use `A | B` notation (not `Union[A, B]`)
+- **Optional:** Prefer `Optional[X]` over `X | None` for explicitness
+- **Collections:** Use `List[...]`, `Dict[..., ...]`, `Sequence[...]` instead of the lowercase built-ins `list` and `dict`
+
+**Example:**
+
+```python
+from typing import Any, Dict, List, Optional
+
+
+def process_data(
+    items: List[str],
+    config: Optional[Dict[str, Any]] = None
+) -> str | int:
+    ...
+```
+
 ## Dependency Management
 
 Three types of dependencies:
@@ -220,7 +289,7 @@ Example workflow:
 uv sync --all-extras --all-groups
 
 # Before committing
-uv run ruff format . && uv run ruff check . --fix && uv run pytest -v
+uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
 
 # Run example
 uv run python examples/amazon_collab.py
@@ -235,11 +304,6 @@ uv add --optional <extra-name> <package-name>
 uv run pytest tests/test_core/test_agent.py -v
 ```
 
-## Type Hinting
-
-This repository uses proper type hinting. For unions use the `A | B` notation. For optional imports, prefer `Optional[...]` as it is more explicit.
-For lists and dictionaries, use `Dict[...,...]`, `List[...]`, `Sequence[...]` etc. instead of `list`, `dict`.
-
 ## Security and Confidentiality
 
 **IMPORTANT:** This project contains confidential research material.
 
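The type-narrowing pattern described in the AGENTS.md typing guidance above can be illustrated with a minimal sketch. All class names here (`Environment`, `ShopEnvironment`, `Benchmark`, `ShopBenchmark`) are hypothetical stand-ins chosen for illustration; they are not the MASEval classes and make no claim about how `MACSBenchmark` is actually written. The point is only that a subclass may annotate a parameter with a narrower type than its base class, which a strict checker may flag but which gives callers precise autocomplete.

```python
from typing import List, Optional


class Environment:
    """Hypothetical stand-in for a generic environment base class."""

    def tool_names(self) -> List[str]:
        return []


class ShopEnvironment(Environment):
    """Hypothetical domain-specific environment with one extra method."""

    def tool_names(self) -> List[str]:
        return ["lookup_order"]

    def lookup_order(self, order_id: str) -> Optional[str]:
        # Only this subclass offers the method, so callers need the narrow type.
        return "shipped" if order_id == "A1" else None


class Benchmark:
    """Hypothetical base class that accepts any environment."""

    def describe(self, environment: Environment) -> str:
        return f"{len(environment.tool_names())} tool(s)"


class ShopBenchmark(Benchmark):
    """Narrows the parameter type; strict type checkers may report this override."""

    def describe(self, environment: ShopEnvironment) -> str:
        # The narrowed annotation lets an IDE autocomplete lookup_order here.
        status = environment.lookup_order("A1")
        return f"{len(environment.tool_names())} tool(s), order A1: {status}"


summary = ShopBenchmark().describe(ShopEnvironment())
print(summary)
```

The trade-off is the one the guidance names: the override is not substitutable for the base method in the strict sense, but passing the wrong environment is caught during development rather than at runtime.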
@@ -20,10 +20,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Handles Anthropic-specific message format conversion (system messages, tool_use/tool_result blocks) internally while accepting OpenAI-compatible input
 - Added `anthropic` optional dependency: `pip install maseval[anthropic]`
 
+**Benchmarks**
+
+- Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
+- `Tau2Benchmark`, `Tau2Environment`, `Tau2User`, `Tau2Evaluator` components for framework-agnostic evaluation (PR: #16)
+- `DefaultAgentTau2Benchmark` using an agent setup closely resembling the original tau2-bench implementation (PR: #16)
+- Data loading utilities: `load_tasks()`, `ensure_data_exists()`, `configure_model_ids()` (PR: #16)
+- Metrics: `compute_benchmark_metrics()`, `compute_pass_at_k()`, `compute_pass_hat_k()` for tau2-style scoring (PR: #16)
+- Domain implementations with tool kits: `AirlineTools`, `RetailTools`, `TelecomTools` with full database simulation (PR: #16)
+
+**User**
+
+- `AgenticUser` class for users that can use tools during conversations (PR: #16)
+- Multiple stop token support: `User` now accepts `stop_tokens` (list) instead of single `stop_token`, enabling different termination reasons (PR: #16)
+- Stop reason tracking: `User` traces now include `stop_reason`, `max_turns`, `turns_used`, and `stopped_by_user` for detailed termination analysis (PR: #16)
+
+**Simulator**
+
+- `AgenticUserLLMSimulator` for LLM-based user simulation with tool use capabilities (PR: #16)
+
+**Examples**
+
+- Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)
+
 ### Changed
 
+**Benchmark**
+
+- `Benchmark.agent_data` parameter is now optional (defaults to empty dict) (PR: #16)
+
+**Task**
+
+- `Task.id` is now of type `str` instead of `UUID`. Benchmarks can provide human-readable IDs directly (e.g., `Task(id="retail_001", ...)`). A UUID string is auto-generated if no ID is provided. (PR: #16)
+
 ### Fixed
 
+- Task reports now use `task.id` directly instead of `metadata["task_id"]` (PR: #16)
+
 ### Removed
 
 ## [0.2.0] - 2025-12-05
 
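The `Task.id` change listed in the changelog hunk above (a string ID with an auto-generated UUID fallback) can be sketched as follows. `SketchTask` is a hypothetical stand-in written for this note, not the MASEval `Task` class, and the use of a dataclass with `default_factory` is an assumption about one reasonable way to get that behavior.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchTask:
    """Hypothetical stand-in for a task record; not the MASEval Task class."""

    query: str
    # Fall back to a random UUID rendered as a string when no ID is supplied.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: Dict[str, Any] = field(default_factory=dict)


# A benchmark that already has readable IDs passes them through unchanged.
named = SketchTask(query="Cancel my order", id="retail_001")

# Without an explicit ID, each task still gets a unique string identifier.
anonymous = SketchTask(query="Cancel my order")

print(named.id)
print(type(anonymous.id).__name__)
```

Keeping the field a plain `str` means reports and filenames can use the ID directly, without converting from a `UUID` object.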
@@ -0,0 +1,94 @@
+# Tau2: Tool-Agent-User Interaction Benchmark
+
+The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.
+
+## Overview
+
+[Tau2-bench](https://github.com/sierra-research/tau2-bench) (Tool-Agent-User) is designed to evaluate single-agent customer service systems. The benchmark features:
+
+- **Real tool implementations** that modify actual database state
+- **Deterministic evaluation** via database state comparison
+- **Three domains**: airline (50 tasks), retail (114 tasks), telecom (114 tasks)
+- **Pass@k metrics** for robust evaluation with multiple runs
+
+Reference Paper: [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)
+
+See the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information, including licenses.
+
+## Quick Start
+
+```python
+from maseval.benchmark.tau2 import (
+    Tau2Benchmark, Tau2Environment, Tau2Evaluator, Tau2User,
+    load_tasks, configure_model_ids, ensure_data_exists,
+    compute_benchmark_metrics, compute_pass_at_k,
+)
+
+# Ensure domain data is downloaded
+ensure_data_exists(domain="retail")
+
+# Load tasks and configure model IDs
+tasks = load_tasks("retail", split="base", limit=5)
+configure_model_ids(
+    tasks,
+    user_model_id="gpt-4o",
+    evaluator_model_id="gpt-4o",
+)
+
+# Create your framework-specific benchmark subclass
+class MyTau2Benchmark(Tau2Benchmark):
+    def setup_agents(self, agent_data, environment, task, user):
+        tools = environment.tools
+        # Create your agent with these tools
+        ...
+
+    def get_model_adapter(self, model_id, **kwargs):
+        # MyModelAdapter is a placeholder for your own model adapter class
+        adapter = MyModelAdapter(model_id)
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+# Run benchmark
+benchmark = MyTau2Benchmark(agent_data={}, n_task_repeats=4)
+results = benchmark.run(tasks)
+
+# Compute metrics
+metrics = compute_benchmark_metrics(results)
+pass_k = compute_pass_at_k(results, k_values=[1, 2, 3, 4])
+```
+
+For baseline comparisons, use `DefaultAgentTau2Benchmark`, which mirrors the original tau2-bench implementation:
+
+```python
+from maseval.benchmark.tau2 import DefaultAgentTau2Benchmark
+
+benchmark = DefaultAgentTau2Benchmark(
+    agent_data={"model_id": "gpt-4o"},
+    n_task_repeats=4,
+)
+results = benchmark.run(tasks)
+```
+
+::: maseval.benchmark.tau2.Tau2Benchmark
+
+::: maseval.benchmark.tau2.Tau2User
+
+::: maseval.benchmark.tau2.Tau2Environment
+
+::: maseval.benchmark.tau2.Tau2Evaluator
+
+::: maseval.benchmark.tau2.DefaultAgentTau2Benchmark
+
+::: maseval.benchmark.tau2.DefaultTau2Agent
+
+::: maseval.benchmark.tau2.load_tasks
+
+::: maseval.benchmark.tau2.configure_model_ids
+
+::: maseval.benchmark.tau2.ensure_data_exists
+
+::: maseval.benchmark.tau2.compute_benchmark_metrics
+
+::: maseval.benchmark.tau2.compute_pass_at_k
+
+::: maseval.benchmark.tau2.compute_pass_hat_k
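The tau2.md page above refers to pass@k and pass^k metrics over repeated runs. As a rough illustration, the sketch below uses the standard combinatorial estimators: with `n` trials of a task and `c` successes, pass@k is the chance that at least one of `k` sampled trials succeeds, and pass^k is the chance that all `k` succeed. The function names `pass_at_k_estimate`, `pass_hat_k_estimate`, and `mean_over_tasks` are hypothetical, and whether `compute_pass_at_k` and `compute_pass_hat_k` use exactly these formulas is an assumption, not something the diff confirms.

```python
from math import comb
from typing import Callable, Dict, List


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials sampled from n succeeds."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the samples that contain only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k_estimate(n: int, c: int, k: int) -> float:
    """Probability that all k trials sampled from n succeed."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(c, k) counts the samples that contain only successes.
    return comb(c, k) / comb(n, k)


def mean_over_tasks(
    outcomes: Dict[str, List[bool]],
    k: int,
    estimator: Callable[[int, int, int], float],
) -> float:
    """Average a per-task estimator over all tasks."""
    scores = [estimator(len(runs), sum(runs), k) for runs in outcomes.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for two tasks, four repeats each.
outcomes = {
    "retail_001": [True, True, False, False],
    "retail_002": [True, True, True, True],
}

pass_at_1 = mean_over_tasks(outcomes, 1, pass_at_k_estimate)
pass_hat_2 = mean_over_tasks(outcomes, 2, pass_hat_k_estimate)
print(pass_at_1, pass_hat_2)
```

pass^k falls quickly for tasks that succeed only some of the time, which is why it is often read as a consistency measure rather than a best-case one.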
@@ -10,4 +10,6 @@ Simulators in MASEval are used to create reproducible and scalable testing envir
 
 ::: maseval.core.simulator.UserLLMSimulator
 
+::: maseval.core.simulator.AgenticUserLLMSimulator
+
 ::: maseval.core.simulator.SimulatorCallStatus
@@ -8,6 +8,8 @@ The User is initialized with a persona and a scenario, both of which are typical
 
 ::: maseval.core.user.User
 
+::: maseval.core.user.AgenticUser
+
 ## Interfaces
 
 Some integrations provide convenience user/tool implementations for specific agent frameworks. For example:
 
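The changelog in this PR says `User` now accepts several `stop_tokens` and records `stop_reason`, `max_turns`, `turns_used`, and `stopped_by_user` in its traces. The sketch below shows one way such tracking could work. The token strings, the reason labels, and the helper names `detect_stop_reason` and `summarize_conversation` are all hypothetical; only the four trace field names come from the changelog, and their exact semantics in MASEval are assumed here.

```python
from typing import Dict, List, Optional

# Hypothetical token-to-reason mapping; the real tokens are not given in the diff.
STOP_REASONS: Dict[str, str] = {
    "###STOP###": "task_complete",
    "###TRANSFER###": "transferred",
}


def detect_stop_reason(message: str, stop_reasons: Dict[str, str]) -> Optional[str]:
    """Return the reason tied to the earliest stop token in the message, if any."""
    hits = [
        (message.find(token), reason)
        for token, reason in stop_reasons.items()
        if token in message
    ]
    return min(hits)[1] if hits else None


def summarize_conversation(
    user_messages: List[str],
    stop_reasons: Dict[str, str],
    max_turns: int,
) -> Dict[str, object]:
    """Walk the user's messages and build a trace-like summary dict."""
    stop_reason: Optional[str] = None
    turns_used = 0
    for message in user_messages[:max_turns]:
        turns_used += 1
        stop_reason = detect_stop_reason(message, stop_reasons)
        if stop_reason is not None:
            break  # the user ended the conversation on this turn
    return {
        "stop_reason": stop_reason,
        "max_turns": max_turns,
        "turns_used": turns_used,
        "stopped_by_user": stop_reason is not None,
    }


trace = summarize_conversation(
    ["Hi, I need help", "Thanks, that works ###STOP###", "unused"],
    STOP_REASONS,
    max_turns=5,
)
print(trace)
```

Mapping each token to its own reason is what lets later analysis distinguish a finished task from, say, a transfer to a human, instead of seeing a single generic stop.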
@@ -178,10 +178,11 @@
     "        task_id = task_dict[\"metadata\"][\"task_id\"]\n",
     "        task_dict[\"environment_data\"][\"agent_framework\"] = framework\n",
     "\n",
-    "        # Create Task object\n",
+    "        # Create Task object with id from metadata\n",
     "        tasks_data.append(\n",
     "            Task(\n",
     "                query=task_dict[\"query\"],\n",
+    "                id=task_id,\n",
     "                environment_data=task_dict[\"environment_data\"],\n",
     "                evaluation_data=task_dict[\"evaluation_data\"],\n",
     "                metadata=task_dict[\"metadata\"],\n",
@@ -575,7 +576,7 @@
     "# Build agents using the build_agents function\n",
     "agents_to_run, agents_to_monitor = build_agents(config_0, environment_0)\n",
     "\n",
-    "print(f\"\\nBuilt Agents for Task: {task_0.metadata['task_id']}\")\n",
+    "print(f\"\\nBuilt Agents for Task: {task_0.id}\")\n",
     "print(f\"{'=' * 60}\")\n",
     "print(f\"\\nAgents to run: {[agent.name for agent in agents_to_run]}\")\n",
     "print(f\"Agents to monitor: {list(agents_to_monitor.keys())}\")\n",
@@ -720,7 +721,7 @@
     "\n",
     "print(f\"Loaded {len(tasks)} tasks:\")\n",
     "for i, task in enumerate(tasks):\n",
-    "    print(f\"  {i}. {task.metadata['task_id']}: {task.metadata['description']}\")"
+    "    print(f\"  {i}. {task.id}: {task.metadata['description']}\")"
    ]
   },
   {
 
@@ -873,10 +873,11 @@ def load_benchmark_data(
         task_id = task_dict["metadata"]["task_id"]
         task_dict["environment_data"]["agent_framework"] = framework
 
-        # Create task
+        # Create task with id from metadata
         tasks_data.append(
             Task(
                 query=task_dict["query"],
+                id=task_id,
                 environment_data=task_dict["environment_data"],
                 evaluation_data=task_dict["evaluation_data"],
                 metadata=task_dict["metadata"],
 
@@ -390,12 +390,13 @@
     "# Create a Task instance\n",
     "task = Task(\n",
     "    query=task_data[\"query\"],\n",
+    "    id=task_data[\"metadata\"][\"task_id\"],\n",
     "    environment_data=task_data[\"environment_data\"],\n",
     "    evaluation_data=task_data[\"evaluation_data\"],\n",
     "    metadata=task_data[\"metadata\"],\n",
     ")\n",
     "\n",
-    "print(f\"Created task: {task.metadata['task_id']}\")\n",
+    "print(f\"Created task: {task.id}\")\n",
     "print(f\"Complexity: {task.metadata['complexity']}\")\n",
     "print(f\"Skills tested: {', '.join(task.metadata['skills_tested'])}\")"
    ]
 
@@ -0,0 +1 @@
+results/*