Commit e93a75c

GAIA2, Tau2 and MultiAgentBench Fixes (#30)
Comprehensive fixes across all three benchmarks to align with their upstream reference implementations (ARE, MARBLE, tau2-bench), discovered through real-data integration testing.

GAIA2 (ARE alignment):
- Rewrote scenario lifecycle: initialize_scenario() → run() with preprocess_scenario() so oracle events, judges, and durations are correctly initialized
- Fixed data loader: removed fabricated task_instruction field (doesn't exist in ARE), removed nonexistent HF configs (agent2agent, noise), fixed default config, pinned HF revision
- Fixed evaluator: judge.evaluate() → judge.validate(env) with per-turn intermediate judge(env) calls for multi-turn scenarios
- Added two-level notification loop (outer turn + inner step) matching ARE's reference agent architecture, with poll_notifications(), get_turn_notifications(), pause()/resume_with_offset() on environment
- Fixed tool filtering: removed 4 extra AUI tools ARE filters out, set wait_for_user_response=False
- Fixed default agent: error propagation (ERROR: vs Observation:), step counter incrementing on errors, boolean normalization (True→true), reasoning model compatibility (stop/temperature), client-side stop-token truncation
- Fixed simulation time: uses get_scenario_duration() with capability-specific limits, passes correct start_time/time_increment to EnvironmentConfig
- Added Gaia2JudgeEngineConfig for configuring judge LLM provider
- Removed duplicate code, now delegates to ARE's parse_json_tool_call and get_offset_from_time_config_mode
- Renamed AREToolWrapper → Gaia2GenericTool

MultiAgentBench (MARBLE alignment):
- Fixed bargaining domain: mapped to WorldSimulationEnvironment (was nonexistent BargainingEnvironment)
- Removed fake domains (web, worldsimulation) not in paper or data
- Added werewolf domain with config-based YAML task loading and WerewolfEnv constructor handling
- Added minecraft evaluation prompt template
- Added result summarization (_summarize_results/_summarize_output) matching MARBLE's engine truncation + LLM summarization before evaluation
- Switched to MARBLE fork (cemde/MARBLE) with upstream bug fixes
- Fixed environment constructors with proper dynamic imports

Tau2 (tau2-bench alignment):
- Fixed telecom domain schema: network_status → network_connection_status, has_sim_card → sim_card_missing, signal_strength to per-technology dict, added missing fields (wifi_calling, line_active, data_usage_exceeded)
- Added agent/user state synchronization (_sync_tools_internal): agent-side changes now update user-side state after every tool call
- Added deterministic simulate_network_search() based on SIM status, network mode, surroundings, airplane mode, APN settings, line status
- Fixed initialization flow: initialization_actions now execute after toolkits are created
- Fixed tool result serialization: model_dump() + json.dumps() matching tau2-bench (was Python str() mangling Pydantic models)

Dependencies:
- Added multiagentbench optional extra with full MARBLE dependency tree
- Added tau2 extra (docstring-parser)
- Added [tool.uv] overrides to relax ARE's overly-strict version pins
- Relaxed pydantic to >=2.10.6 for ARE compatibility

Testing:
- Data integrity tests for GAIA2 and MultiAgentBench
- Real-data integration tests for GAIA2 and MultiAgentBench
- Tau2 initialization action tests
- Updated CI with slow-test workflow and benchmark data caching
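
The Tau2 serialization fix mentioned in the message above can be seen with the standard library alone. The sketch below is an illustration, not the tau2-bench code: a plain dict stands in for the output of a Pydantic model's model_dump(), the helper name serialize_tool_result is hypothetical, and the field values are invented (only the field names come from the commit message).

```python
import json


def serialize_tool_result(result: dict) -> str:
    """Serialize a tool result as JSON (hypothetical helper, for illustration only)."""
    return json.dumps(result)


# Stand-in for model_dump() output; field names come from the commit message,
# the values are invented for this example.
result = {"line_active": True, "sim_card_missing": False, "wifi_calling": None}

as_repr = str(result)                    # Python repr: single quotes, True/False/None
as_json = serialize_tool_result(result)  # JSON: double quotes, true/false/null

print(as_repr)
print(as_json)

# Only the JSON form can be parsed back by a JSON consumer.
assert json.loads(as_json) == result
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
print("repr parses as JSON:", repr_is_valid_json)
```

The same mechanism is what turns Python's True into JSON's true, which is the boolean normalization listed under the GAIA2 default agent fixes.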
1 parent 36602b0 commit e93a75c

57 files changed

Lines changed: 8435 additions & 1684 deletions

.github/workflows/test.yml

Lines changed: 2 additions & 10 deletions
@@ -45,10 +45,10 @@ jobs:
       - name: Install dependencies
         run: |
           pip install uv
-          uv sync --group dev
+          uv sync --all-extras --group dev
       - name: Run benchmark tests
         run: |
-          uv run pytest -m benchmark -v
+          uv run pytest -m "benchmark and not (slow or live)" -v

   test-all:
     name: All Tests (With Optional Deps)
@@ -86,14 +86,6 @@ jobs:
         run: |
           pip install uv
           uv sync --all-extras --group dev
-      - name: Cache benchmark data
-        uses: actions/cache@v4
-        with:
-          path: |
-            maseval/benchmark/tau2/data/
-            maseval/benchmark/macs/data/
-            maseval/benchmark/macs/prompt_templates/
-          key: benchmark-data-${{ hashFiles('maseval/benchmark/tau2/data_loader.py', 'maseval/benchmark/macs/data_loader.py') }}
       - name: Run slow tests
         run: |
           uv run pytest -m "slow and not credentialed" -v

AGENTS.md

Lines changed: 47 additions & 0 deletions
@@ -494,3 +494,50 @@ MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade
 - Focus on getting it right, not keeping it the same

 We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.
+
+## Scientific Integrity
+
+MASEval is a scientific library. Scientific integrity is paramount. **Never introduce defaults that could silently alter benchmark behavior or experimental outcomes.**
+
+### The Boundary
+
+**Guiding principle:** If a researcher would need to report a parameter in a paper's "Experimental Setup" section, **do not invent a default for it.**
+
+**Acceptable (infrastructure/convenience):** `TaskQueue(limit=None)`, `Logger(verbose=False)`, `num_workers=1`, `print_results(color=True)` — these don't affect scientific results.
+
+**Unacceptable (experimental parameters):** Temperature, seed, model version, prompt format, simulation duration, agent limits, dataset splits, scoring functions — these alter what's being measured.
+
+### Reproducing Benchmarks
+
+When integrating external benchmarks, match the source implementation exactly. Never invent fallback values.
+
+```python
+# BAD: Invented defaults
+config = EnvironmentConfig(
+    duration=getattr(scenario, "duration", 86400),  # Made-up fallback!
+)
+start_time = getattr(scenario, "start_time", None)  # Hides missing attributes
+
+# GOOD: Pass through directly, let errors surface
+config = EnvironmentConfig(
+    duration=scenario.duration,  # Trust the source
+)
+start_time = scenario.start_time  # AttributeError if missing
+
+# GOOD: Copy source defaults with documentation
+# Default value copied from original_library/evaluator.py:L45
+EVAL_TEMPERATURE = 0.7
+
+class Evaluator:
+    def run(self, temperature: Optional[float] = None):
+        if temperature is None:
+            temperature = EVAL_TEMPERATURE  # From source:L45
+
+# also good:
+class Evaluator:
+    # default temperature from source:L45
+    def run(self, temperature: Optional[float] = 0.7):
+        ...
+```
+
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
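
One way to make the acceptable/unacceptable boundary from the AGENTS.md addition mechanical is to leave experimental parameters without defaults, so that omitting them fails loudly. The sketch below is only an illustration of that idea: RunConfig, its fields, and the example values are hypothetical and are not part of MASEval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run configuration, used only to illustrate the boundary."""

    # Experimental parameters: no defaults, so every run must state them.
    temperature: float
    seed: int
    model_id: str
    # Infrastructure/convenience parameters: defaults do not change results.
    num_workers: int = 1
    verbose: bool = False


def missing_experimental_params_rejected() -> bool:
    """Return True if omitting the experimental parameters raises TypeError."""
    try:
        RunConfig(num_workers=4)  # type: ignore[call-arg]
    except TypeError:
        return True
    return False


# Placeholder values; a real run would take these from the experiment definition.
config = RunConfig(temperature=0.0, seed=1234, model_id="example-model")
print(config)
print("omission rejected:", missing_experimental_params_rejected())
```

With this shape, a forgotten seed or temperature surfaces as a TypeError at construction time instead of silently falling back to an invented value.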

BENCHMARKS.md

Lines changed: 4 additions & 1 deletion
@@ -33,11 +33,14 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co

 ### Source and License

-- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
+- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (where the original work was done)
+- **Fork Used:** [https://github.com/cemde/MARBLE](https://github.com/cemde/MARBLE) (contains bug fixes for MASEval integration)
 - **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
 - **Code License:** MIT
 - **Data License:** MIT

+> **Note**: MASEval uses a fork with bug fixes. All credit for the original work goes to the MARBLE team (Zhu et al., 2025).
+
 ---

 ## 4. GAIA2

CHANGELOG.md

Lines changed: 9 additions & 4 deletions
@@ -16,13 +16,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
 - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
 - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
-- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
+- Generic tool wrapper (`Gaia2GenericTool`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
 - Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
+- `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (PR: #30)
 - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
-- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
+- Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
 - Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)

-- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
+- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
 - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
 - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
 - `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
@@ -55,7 +56,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
 - Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (PR: #29)
 - Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
-- Data integrity tests for Tau2 and MACS benchmarks validating download pipelines, file structures, and database content (PR: #29)
+- Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
+- Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
 - HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks — no API keys needed (PR: #29)
 - Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
 - CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
@@ -93,6 +95,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Fixed

+- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
+- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
+- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
 - Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)

 ### Removed

docs/benchmark/gaia2.md

Lines changed: 41 additions & 4 deletions
@@ -8,7 +8,7 @@ The **Gaia2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenar

 - **ARE simulation environment** with real-time dynamics and event scheduling
 - **Tool-based time control** via `wait_for_notification()` for temporal reasoning
-- **7 capability dimensions**: execution, search, adaptability, time, ambiguity, agent2agent, noise
+- **5 capability dimensions**: execution, search, adaptability, time, ambiguity
 - **Deterministic evaluation** via GraphPerEventJudge comparing completed vs expected events
 - **12 app tools**: Calendar, Email, Messaging, Contacts, Shopping, Cab, City, FileSystem, Browser, ChatsApp, SystemApp, Timer

@@ -97,8 +97,6 @@ Gaia2 tasks are organized by capability dimension:
 | `adaptability` | Adapting to changing requirements |
 | `time` | Temporal reasoning tasks |
 | `ambiguity` | Handling ambiguous instructions |
-| `agent2agent` | Multi-agent collaboration |
-| `noise` | Handling noisy inputs |

 Load specific capabilities:

@@ -110,6 +108,45 @@ tasks = load_tasks(capability="time", limit=10)
 tasks = load_tasks(limit=50)
 ```

+## Multi-Turn Notification Loop
+
+GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
+
+The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:
+
+1. Agent calls `SystemApp__wait_for_notification(timeout=N)` as a normal tool.
+2. The ARE environment processes scheduled events, advances simulation time, and queues resulting notifications — all synchronously during the tool call.
+3. The tool returns. The agent's loop continues (it does **not** terminate).
+4. Before the next LLM call, the agent polls `environment.poll_notifications()` to retrieve messages that arrived during the wait.
+5. The agent injects those messages into its context and continues reasoning.
+6. Eventually the agent calls `AgentUserInterface__send_message_to_user` — the **only** termination signal.
+
+### What custom agents must implement
+
+The ARE tools handle all environment-side mechanics automatically (event processing, time advancement, notification queuing). No callbacks or hooks required. Custom agents must handle two things:
+
+**1. Do not terminate on `wait_for_notification`.** Treat it as a regular tool call. Only terminate on `AgentUserInterface__send_message_to_user`.
+
+**2. Poll notifications between steps.** After `wait_for_notification` returns, new messages are in the queue. Call `environment.poll_notifications()` to drain them:
+
+```python
+# Between agent steps (e.g., before each LLM call):
+user_msgs, env_notifs, has_stop = environment.poll_notifications()
+
+# Inject into agent context (format matches ARE's convention):
+if user_msgs:
+    content = "\n".join(user_msgs)
+    messages.append({"role": "user", "content": f"User messages updates:\n***\n{content}\n***\n"})
+if env_notifs:
+    content = "\n".join(env_notifs)
+    messages.append({"role": "user", "content": f"Environment notifications updates:\n***\n{content}\n***\n"})
+if has_stop:
+    # Environment signalled simulation end — stop the agent loop
+    break
+```
+
+See `DefaultGaia2Agent` source for the canonical single-loop implementation.
+
 ## Key Differences from Tau2

 | Aspect | Gaia2 | Tau2 |
@@ -132,7 +169,7 @@ tasks = load_tasks(limit=50)

 ::: maseval.benchmark.gaia2.DefaultGaia2Agent

-::: maseval.benchmark.gaia2.AREToolWrapper
+::: maseval.benchmark.gaia2.Gaia2GenericTool

 ::: maseval.benchmark.gaia2.load_tasks

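
The notification loop documented in the docs/benchmark/gaia2.md changes above can be exercised end to end without ARE. The following is a minimal sketch under stated assumptions: FakeEnvironment, format_updates, and run_agent are hypothetical names, a scripted list of tool names stands in for the LLM, and only the `poll_notifications()` return shape, the two tool names, and the message format are taken from the documentation. It is not the `DefaultGaia2Agent` implementation.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Notifications = Tuple[List[str], List[str], bool]
STOP_TOOL = "AgentUserInterface__send_message_to_user"
WAIT_TOOL = "SystemApp__wait_for_notification"


class FakeEnvironment:
    """Hypothetical stand-in: returns pre-scripted (user_msgs, env_notifs, has_stop) batches."""

    def __init__(self, batches: List[Notifications]) -> None:
        self._batches: Deque[Notifications] = deque(batches)

    def poll_notifications(self) -> Notifications:
        return self._batches.popleft() if self._batches else ([], [], False)


def format_updates(user_msgs: List[str], env_notifs: List[str]) -> List[Dict[str, str]]:
    """Wrap drained notifications in the message format quoted in the docs."""
    out: List[Dict[str, str]] = []
    if user_msgs:
        content = "\n".join(user_msgs)
        out.append({"role": "user", "content": f"User messages updates:\n***\n{content}\n***\n"})
    if env_notifs:
        content = "\n".join(env_notifs)
        out.append({"role": "user", "content": f"Environment notifications updates:\n***\n{content}\n***\n"})
    return out


def run_agent(environment: FakeEnvironment, scripted_calls: List[str], max_steps: int) -> List[Dict[str, str]]:
    """Single loop: poll, inject, pick the next tool call, stop on STOP_TOOL or has_stop."""
    messages: List[Dict[str, str]] = []
    for step in range(min(max_steps, len(scripted_calls))):
        user_msgs, env_notifs, has_stop = environment.poll_notifications()
        messages.extend(format_updates(user_msgs, env_notifs))
        if has_stop:
            break  # environment signalled simulation end
        tool_name = scripted_calls[step]  # a real agent would ask the LLM here
        messages.append({"role": "assistant", "content": tool_name})
        if tool_name == STOP_TOOL:
            break  # the only agent-side termination signal
        # WAIT_TOOL falls through: it is treated like any other tool call
    return messages


env = FakeEnvironment([([], [], False), (["Any update?"], ["Calendar event added"], False)])
transcript = run_agent(env, [WAIT_TOOL, STOP_TOOL], max_steps=5)
for message in transcript:
    print(message["role"], "|", message["content"].splitlines()[0])
```

Note that max_steps has no default here: an agent step limit is an experimental parameter in the sense of the Scientific Integrity section added to AGENTS.md, so the caller has to state it.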

docs/benchmark/multiagentbench.md

Lines changed: 2 additions & 2 deletions
@@ -2,9 +2,9 @@

 The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.

-[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features:
+[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework, where the original work was done) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. We use a [bug-fixed fork](https://github.com/cemde/MARBLE) for MASEval integration. The benchmark features:

-- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
+- **6 diverse domains**: research, bargaining, coding, database, werewolf, minecraft (minecraft is untested)
 - **Multiple coordination modes**: cooperative, star, tree, hierarchical
 - **LLM-based evaluation**: Matches MARBLE's evaluation methodology
 - **Framework-agnostic**: Use with any agent framework or MARBLE's native agents

maseval/benchmark/gaia2/PROVENANCE.md

Lines changed: 12 additions & 13 deletions
@@ -44,16 +44,16 @@ MASEval provides:

 | MASEval Method | ARE Method/Component | Notes |
 | -------------------------------- | ------------------------------------- | ------------------------------------ |
-| `Gaia2Environment.setup_state()` | `Environment.initialize_scenario()` | Initializes ARE simulation |
+| `Gaia2Environment.setup_state()` | `Environment.run(scenario, wait_for_end=False)` | Starts ARE simulation in background |
 | `Gaia2Environment.create_tools()`| `App.get_tools()` for all apps | Wraps all app tools with tracing |
 | `Gaia2Environment.cleanup()` | `Environment.stop()` | Ensures proper resource cleanup |
-| `get_simulation_time()` | `TimeManager.current_time` | Exposes simulation time for tracing |
+| `get_simulation_time()` | `Environment.current_time` | Exposes simulation time for tracing |

 ### Evaluator Integration

 | MASEval Method | ARE Component | Notes |
 | ------------------------------- | --------------------------------------- | ------------------------------------ |
-| `Gaia2Evaluator.__call__()` | `GraphPerEventJudge.evaluate()` | Delegates to ARE's deterministic judge |
+| `Gaia2Evaluator.__call__()` | `GraphPerEventJudge.validate(env)` | Delegates to ARE's deterministic judge |
 | `filter_traces()` | N/A | MASEval-specific trace extraction |
 | `compute_gaia2_metrics()` | N/A | MASEval-specific metrics aggregation |

@@ -73,16 +73,15 @@ Scenarios are loaded from HuggingFace:
 https://huggingface.co/datasets/meta-agents-research-environments/gaia2
 ```

-| Config | Description | Split |
-| ----------- | ------------------------------------------ | ---------- |
-| `validation`| Full validation set (all capabilities) | validation |
-| `execution` | Execution capability only | validation |
-| `search` | Search capability only | validation |
-| `adaptability` | Adaptability capability only | validation |
-| `time` | Temporal reasoning only | validation |
-| `ambiguity` | Ambiguity handling only | validation |
-| `agent2agent` | Multi-agent collaboration only | validation |
-| `noise` | Noise handling only | validation |
+Revision: `78ea3bdbdeec2bdcd6afa5420915d8a22f23ed99`
+
+| Config | Description | Split |
+| -------------- | ------------------------------ | ---------- |
+| `execution` | Execution capability only | validation |
+| `search` | Search capability only | validation |
+| `adaptability` | Adaptability capability only | validation |
+| `time` | Temporal reasoning only | validation |
+| `ambiguity` | Ambiguity handling only | validation |

 ## MASEval-Specific Additions


maseval/benchmark/gaia2/__init__.py

Lines changed: 8 additions & 6 deletions
@@ -13,8 +13,6 @@
 - adaptability: Adapting to changing requirements
 - time: Temporal reasoning tasks
 - ambiguity: Handling ambiguous instructions
-- agent2agent: Multi-agent collaboration
-- noise: Handling noisy inputs

 Usage:
 from maseval.benchmark.gaia2 import (
@@ -68,17 +66,19 @@ def get_model_adapter(self, model_id, **kwargs):

 # Tool wrapper
 from maseval.benchmark.gaia2.tool_wrapper import (
-    AREToolWrapper,
+    Gaia2GenericTool,
     wrap_are_tools,
 )

-# Data loading
+# Data loading and configuration
 from maseval.benchmark.gaia2.data_loader import (
     load_tasks,
     configure_model_ids,
+    Gaia2JudgeEngineConfig,
     VALID_CAPABILITIES,
     VALID_SPLITS,
     HF_DATASET_ID,
+    HF_DATASET_REVISION,
 )


@@ -95,12 +95,14 @@ def get_model_adapter(self, model_id, **kwargs):
     "Gaia2Evaluator",
     "compute_gaia2_metrics",
     # Tool wrapper
-    "AREToolWrapper",
+    "Gaia2GenericTool",
     "wrap_are_tools",
-    # Data loading
+    # Data loading and configuration
     "load_tasks",
     "configure_model_ids",
+    "Gaia2JudgeEngineConfig",
     "VALID_CAPABILITIES",
     "VALID_SPLITS",
     "HF_DATASET_ID",
+    "HF_DATASET_REVISION",
 ]
