- Tau2 Benchmark: Full implementation of the tau2-bench benchmark for evaluating LLM-based agents on customer service tasks across airline, retail, and telecom domains (PR: #16)
- `Tau2Benchmark`, `Tau2Environment`, `Tau2User`, `Tau2Evaluator` components for framework-agnostic evaluation (PR: #16)
- `DefaultAgentTau2Benchmark` using an agent setup that closely resembles the original tau2-bench implementation (PR: #16)
- Data loading utilities: `load_tasks()`, `ensure_data_exists()`, `configure_model_ids()` (PR: #16)
- Metrics: `compute_benchmark_metrics()`, `compute_pass_at_k()`, `compute_pass_hat_k()` for tau2-style scoring (PR: #16)
- Domain implementations with tool kits: `AirlineTools`, `RetailTools`, `TelecomTools` with full database simulation (PR: #16)

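For orientation, the two tau2-style metrics listed above are commonly defined per task from `n` trials of which `c` succeeded: pass@k is the chance that at least one of `k` sampled trials succeeded, and pass^k is the chance that all `k` did. The sketch below assumes those standard combinatorial definitions. The helper names `pass_at_k` and `pass_hat_k` and the sample data are illustrative only, and this is not the signature or implementation of `compute_pass_at_k()` or `compute_pass_hat_k()` from this PR.

```python
from math import comb


def pass_at_k(num_trials: int, num_successes: int, k: int) -> float:
    """Chance that at least one of k trials drawn without replacement succeeded."""
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    failures = num_trials - num_successes
    # comb() returns 0 when there are fewer than k failures, which yields 1.0.
    return 1.0 - comb(failures, k) / comb(num_trials, k)


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Chance that all k trials drawn without replacement succeeded (pass^k)."""
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    return comb(num_successes, k) / comb(num_trials, k)


# Hypothetical results: number of successes out of 4 trials for three tasks.
successes_per_task = {"retail_001": 4, "retail_002": 2, "retail_003": 0}

# Benchmark-level score: average the per-task value over all tasks.
mean_pass_hat_2 = sum(
    pass_hat_k(4, c, 2) for c in successes_per_task.values()
) / len(successes_per_task)
```

pass^k is the stricter of the two because it rewards consistency across repeated trials, whereas pass@k only asks for one success.
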
**User**

- `AgenticUser` class for users that can use tools during conversations (PR: #16)
- Multiple stop token support: `User` now accepts `stop_tokens` (list) instead of single `stop_token`, enabling different termination reasons (PR: #16)
- Stop reason tracking: `User` traces now include `stop_reason`, `max_turns`, `turns_used`, and `stopped_by_user` for detailed termination analysis (PR: #16)

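To illustrate why a list of stop tokens is useful, here is a minimal sketch that maps the token found in a user message to a termination reason. The token strings, the helpers `detect_stop_reason` and `build_trace`, and the `"max_turns"` label are hypothetical. Only the trace field names `stop_reason`, `max_turns`, `turns_used`, and `stopped_by_user` come from the entries above, and the actual `User` class may behave differently.

```python
from typing import Optional, Sequence


def detect_stop_reason(message: str, stop_tokens: Sequence[str]) -> Optional[str]:
    """Return the first configured stop token found in the message, if any."""
    for token in stop_tokens:
        if token in message:
            return token
    return None


def build_trace(
    message: str, stop_tokens: Sequence[str], turns_used: int, max_turns: int
) -> dict:
    """Assemble termination info using the field names from the changelog."""
    reason = detect_stop_reason(message, stop_tokens)
    stopped_by_user = reason is not None
    if reason is None and turns_used >= max_turns:
        reason = "max_turns"  # Assumed label for running out of turns.
    return {
        "stop_reason": reason,
        "max_turns": max_turns,
        "turns_used": turns_used,
        "stopped_by_user": stopped_by_user,
    }


# Hypothetical tokens: one for a resolved request, one for a human handoff.
tokens = ["###STOP###", "###TRANSFER###"]
trace = build_trace("Thanks, that solves it. ###STOP###", tokens, turns_used=3, max_turns=10)
```

With a single stop token, a resolved conversation and a handoff would be indistinguishable in the trace; a list lets each outcome carry its own marker.
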
**Simulator**

- `AgenticUserLLMSimulator` for LLM-based user simulation with tool use capabilities (PR: #16)

**Examples**

- Tau2 benchmark example with default agent implementation and result comparison scripts (PR: #16)

### Changed

**Benchmark**

- `Benchmark.agent_data` parameter is now optional (defaults to empty dict) (PR: #16)

**Task**

- `Task.id` is now of type `str` instead of `UUID`. Benchmarks can provide human-readable IDs directly (e.g., `Task(id="retail_001", ...)`). A UUID string is generated automatically if no ID is provided. (PR: #16)

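The `Task.id` behavior described above can be pictured with a small dataclass. `SketchTask` is a hypothetical stand-in written for illustration, not the MASEval `Task` class, and the real class may generate its default ID differently.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SketchTask:
    """Hypothetical stand-in for a task; not the MASEval `Task` class."""

    query: str
    # Falls back to a random UUID rendered as a string when no id is given.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


# A benchmark can supply a human-readable ID directly.
named = SketchTask(query="Cancel my order", id="retail_001")

# Without an explicit ID, each instance gets its own UUID string.
anonymous = SketchTask(query="Cancel my order")
```

Because the field is always a plain string, reports can use the ID as a key without converting it first.
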
### Fixed

- Task reports now use `task.id` directly instead of `metadata["task_id"]` (PR: #16)
"source": "# ruff: noqa E402\n# Setup: Set working directory to project root for proper imports\n# This must happen FIRST before any other imports\nimport os\nimport sys\nfrom pathlib import Path\nimport json\nfrom typing import Any, Dict, List, Sequence\nfrom rich.console import Console\nfrom rich.panel import Panel\n\n# Determine notebook directory and set working directory to project root\n_notebook_dir = Path(__file__).parent if \"__file__\" in dir() else Path.cwd()\nif _notebook_dir.name == \"five_a_day_benchmark\":\n _project_root = _notebook_dir.parent.parent\n os.chdir(_project_root)\n # Add project root to path so `examples.five_a_day_benchmark.*` imports work\n if str(_project_root) not in sys.path:\n sys.path.insert(0, str(_project_root))\n # Also add the example directory for local imports (utils, tools, evaluators)\n if str(_notebook_dir) not in sys.path:\n sys.path.insert(0, str(_notebook_dir))\n print(f\"Working directory set to: {os.getcwd()}\")\n\n\n# Utility functions from this example\n# - derive_seed(): Creates reproducible seeds from task_id + agent_id\n# - sanitize_name(): Cleans agent names for framework compatibility\nfrom utils import derive_seed, sanitize_name\n\n# Tool collection classes and helpers\n# - EmailToolCollection, BankingToolCollection: Pre-built tool groups\n# - filter_tool_adapters_by_prefix(): Selects tools by name prefix\n# - get_states(): Initializes tool state objects (email inboxes, bank accounts, etc.)\nfrom tools import (\n EmailToolCollection,\n BankingToolCollection,\n CalculatorToolCollection,\n CodeExecutionToolCollection,\n FamilyInfoToolCollection,\n StockPriceToolCollection,\n CalendarToolCollection,\n HotelSearchToolCollection,\n MCPCalendarToolCollection,\n filter_tool_adapters_by_prefix,\n get_states,\n)\n\n# smolagents: Our chosen agent framework\nfrom smolagents import ToolCallingAgent, LiteLLMModel, FinalAnswerTool\n\n# MASEval core components\nfrom maseval import Benchmark, Environment, Task, 
TaskCollection, AgentAdapter, Evaluator, ModelAdapter\nfrom maseval.interface.agents.smolagents import SmolAgentAdapter\n\n# Import evaluators module (dynamically loaded later)\nimport evaluators\n\n\ndef load_benchmark_data(\n config_type: str = \"multi\",\n framework: str = \"smolagents\",\n model_id: str = \"gemini-2.5-flash\",\n temperature: float = 0.7,\n limit: int | None = None,\n seed: int | None = None,\n task_indices: list[int] | None = None,\n) -> tuple[TaskCollection, list[Dict[str, Any]]]:\n \"\"\"Load tasks and agent configurations.\n\n Args:\n config_type: 'single' or 'multi' agent configuration\n framework: Agent framework to use\n model_id: Model identifier\n temperature: Model temperature\n limit: Optional limit on number of tasks (None = all 5)\n seed: Random seed for reproducibility\n task_indices: Optional list of task indices to load (e.g., [0, 2, 4])\n\n Returns:\n Tuple of (TaskCollection, list of agent configs)\n \"\"\"\n data_dir = Path(\"examples/five_a_day_benchmark/data\")\n\n with open(data_dir / \"tasks.json\", \"r\") as f:\n tasks_raw = json.load(f)\n with open(data_dir / f\"{config_type}agent.json\", \"r\") as f:\n configs_raw = json.load(f)\n\n # Apply limit first\n if limit:\n tasks_raw = tasks_raw[:limit]\n configs_raw = configs_raw[:limit]\n\n # Then apply task_indices filter if specified\n if task_indices is not None:\n tasks_raw = [tasks_raw[i] for i in task_indices if i < len(tasks_raw)]\n configs_raw = [configs_raw[i] for i in task_indices if i < len(configs_raw)]\n\n tasks_data = []\n configs_data = []\n\n for task_dict, config in zip(tasks_raw, configs_raw):\n task_id = task_dict[\"metadata\"][\"task_id\"]\n task_dict[\"environment_data\"][\"agent_framework\"] = framework\n\n # Create Task object with id from metadata\n tasks_data.append(\n Task(\n query=task_dict[\"query\"],\n id=task_id,\n environment_data=task_dict[\"environment_data\"],\n evaluation_data=task_dict[\"evaluation_data\"],\n 
metadata=task_dict[\"metadata\"],\n )\n )\n\n # Enrich config with framework and model info\n config[\"framework\"] = framework\n config[\"model_config\"] = {\"model_id\": model_id, \"temperature\": temperature}\n\n # Derive seeds for reproducibility\n if seed is not None:\n for agent_spec in config[\"agents\"]:\n agent_spec[\"seed\"] = derive_seed(seed, task_id, agent_spec[\"agent_id\"])\n\n configs_data.append(config)\n\n return TaskCollection(tasks_data), configs_data"
},
{
"cell_type": "markdown",
@@ -558,41 +429,7 @@
"id": "5fbb228f",
"metadata": {},
"outputs": [],
-"source": [
-"# Build the agents for task 0\n",
-"# Note: model_config is already set by load_benchmark_data()\n",
"source": "# Build the agents for task 0\n# Note: model_config is already set by load_benchmark_data()\n\n# Create environment from task data\nenvironment_0 = FiveADayEnvironment(\n {\n \"environment_data\": task_0.environment_data,\n \"query\": task_0.query,\n \"evaluation_data\": task_0.evaluation_data,\n \"metadata\": task_0.metadata,\n }\n)\n\n# Build agents using the build_agents function\nagents_to_run, agents_to_monitor = build_agents(config_0, environment_0)\n\nprint(f\"\\nBuilt Agents for Task: {task_0.id}\")\nprint(f\"{'=' * 60}\")\nprint(f\"\\nAgents to run: {[agent.name for agent in agents_to_run]}\")\nprint(f\"Agents to monitor: {list(agents_to_monitor.keys())}\")\n\n# Print details for each agent\nfor agent in agents_to_run:\n print(f\"\\n Agent: {agent.name}\")\n # smolagents stores tools as a dict with string keys\n print(f\" Tools: {list(agent.tools.keys())}\")\n if hasattr(agent, \"managed_agents\") and agent.managed_agents:\n # managed_agents is also a dict with string keys\n print(f\" Managed agents: {list(agent.managed_agents.keys())}\")\n for agent_name, managed in agent.managed_agents.items():\n print(f\" - {managed.name}: {list(managed.tools.keys())}\")\n\nprint(\"\\nAll agents built successfully.\")"
"source": "# Reload all 5 tasks for the benchmark\ntasks, agent_configs = load_benchmark_data(\n config_type=\"multi\",\n framework=\"smolagents\",\n model_id=\"gemini-2.5-flash\",\n temperature=0.7,\n seed=42,\n # No task_indices = load all tasks\n)\n\nprint(f\"Loaded {len(tasks)} tasks:\")\nfor i, task in enumerate(tasks):\n print(f\" {i}. {task.id}: {task.metadata['description']}\")"