maseval
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
Lines changed: 2 additions & 10 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 47 additions & 0 deletions
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
Lines changed: 4 additions & 1 deletion
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 9 additions & 4 deletions
diff --git a/docs/benchmark/gaia2.md b/docs/benchmark/gaia2.md
Lines changed: 41 additions & 4 deletions
diff --git a/docs/benchmark/multiagentbench.md b/docs/benchmark/multiagentbench.md
Lines changed: 2 additions & 2 deletions
diff --git a/maseval/benchmark/gaia2/PROVENANCE.md b/maseval/benchmark/gaia2/PROVENANCE.md
Lines changed: 12 additions & 13 deletions
diff --git a/maseval/benchmark/gaia2/__init__.py b/maseval/benchmark/gaia2/__init__.py
Lines changed: 8 additions & 6 deletions
@@ -45,10 +45,10 @@ jobs:
       - name: Install dependencies
         run: |
           pip install uv
-          uv sync --group dev
+          uv sync --all-extras --group dev
       - name: Run benchmark tests
         run: |
-          uv run pytest -m benchmark -v
+          uv run pytest -m "benchmark and not (slow or live)" -v
 
   test-all:
     name: All Tests (With Optional Deps)
@@ -86,14 +86,6 @@ jobs:
         run: |
           pip install uv
           uv sync --all-extras --group dev
-      - name: Cache benchmark data
-        uses: actions/cache@v4
-        with:
-          path: |
-            maseval/benchmark/tau2/data/
-            maseval/benchmark/macs/data/
-            maseval/benchmark/macs/prompt_templates/
-          key: benchmark-data-${{ hashFiles('maseval/benchmark/tau2/data_loader.py', 'maseval/benchmark/macs/data_loader.py') }}
       - name: Run slow tests
         run: |
           uv run pytest -m "slow and not credentialed" -v
 
@@ -494,3 +494,50 @@ MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade
 - Focus on getting it right, not keeping it the same
 
 We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.
+
+## Scientific Integrity
+
+MASEval is a scientific library. Scientific integrity is paramount. **Never introduce defaults that could silently alter benchmark behavior or experimental outcomes.**
+
+### The Boundary
+
+**Guiding principle:** If a researcher would need to report a parameter in a paper's "Experimental Setup" section, **do not invent a default for it.**
+
+**Acceptable (infrastructure/convenience):** `TaskQueue(limit=None)`, `Logger(verbose=False)`, `num_workers=1`, `print_results(color=True)` — these don't affect scientific results.
+
+**Unacceptable (experimental parameters):** Temperature, seed, model version, prompt format, simulation duration, agent limits, dataset splits, scoring functions — these alter what's being measured.
+
+### Reproducing Benchmarks
+
+When integrating external benchmarks, match the source implementation exactly. Never invent fallback values.
+
+```python
+# BAD: Invented defaults
+config = EnvironmentConfig(
+    duration=getattr(scenario, "duration", 86400),  # Made-up fallback!
+)
+start_time = getattr(scenario, "start_time", None)  # Hides missing attributes
+
+# GOOD: Pass through directly, let errors surface
+config = EnvironmentConfig(
+    duration=scenario.duration,  # Trust the source
+)
+start_time = scenario.start_time  # AttributeError if missing
+
+# GOOD: Copy source defaults with documentation
+# Default value copied from original_library/evaluator.py:L45
+EVAL_TEMPERATURE = 0.7
+
+class Evaluator:
+    def run(self, temperature: Optional[float] = None):
+        if temperature is None:
+            temperature = EVAL_TEMPERATURE  # From source:L45
+
+# ALSO GOOD: Inline default, with the source documented
+class Evaluator:
+    # Default temperature copied from original_library/evaluator.py:L45
+    def run(self, temperature: float = 0.7):
+        ...
+```
+
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
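
The difference between the BAD and GOOD patterns above can be seen in a small self-contained sketch. `Scenario`, `build_config_with_fallback`, and `build_config_strict` are hypothetical stand-ins, not MASEval or ARE names; the sketch only shows that `getattr` with a fallback silently substitutes a value, while direct attribute access raises `AttributeError`.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical scenario object that lacks a `duration` field."""

    scenario_id: str


def build_config_with_fallback(scenario) -> dict:
    # BAD: the invented 86400 silently becomes part of the experiment.
    return {"duration": getattr(scenario, "duration", 86400)}


def build_config_strict(scenario) -> dict:
    # GOOD: a missing attribute raises AttributeError immediately.
    return {"duration": scenario.duration}


scenario = Scenario(scenario_id="demo")

# Runs without complaint, but with a made-up value.
silent = build_config_with_fallback(scenario)

try:
    build_config_strict(scenario)
    strict_error = None
except AttributeError as exc:
    # The missing field is surfaced instead of being papered over.
    strict_error = exc
```

In the fallback variant nothing in the results would reveal that the duration was never provided by the source, which is the failure mode the rule is meant to prevent.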
@@ -33,11 +33,14 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co
 
 ### Source and License
 
-- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
+- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (where the original work was done)
+- **Fork Used:** [https://github.com/cemde/MARBLE](https://github.com/cemde/MARBLE) (contains bug fixes for MASEval integration)
 - **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
 - **Code License:** MIT
 - **Data License:** MIT
 
+> **Note**: MASEval uses a fork with bug fixes. All credit for the original work goes to the MARBLE team (Zhu et al., 2025).
+
 ---
 
 ## 4. GAIA2
 
@@ -16,13 +16,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
   - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
   - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
-  - Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
+  - Generic tool wrapper (`Gaia2GenericTool`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
   - Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
+  - `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (PR: #30)
   - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
-  - Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
+  - Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
   - Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
 
-- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
+- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
   - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
   - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
   - `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
@@ -55,7 +56,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
 - Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (PR: #29)
 - Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
-- Data integrity tests for Tau2 and MACS benchmarks validating download pipelines, file structures, and database content (PR: #29)
+- Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
+- Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
 - HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks — no API keys needed (PR: #29)
 - Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
 - CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
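
The marker implication listed above (`credentialed` implies `live`) could be implemented roughly as follows. This is a sketch under assumptions, not the project's actual `conftest.py`: the `IMPLIED_MARKERS` mapping and the `_FakeItem` demo class are hypothetical, and only `iter_markers()` and `add_marker()` are taken from pytest's item API.

```python
# Hypothetical mapping: a marker on the left also applies the markers on the right.
IMPLIED_MARKERS = {"credentialed": ("live",)}


def pytest_collection_modifyitems(config, items):
    """Add implied markers so that `-m "not live"` also excludes credentialed tests."""
    for item in items:
        present = {mark.name for mark in item.iter_markers()}
        for name in sorted(present):  # sorted() copies, so `present` can grow safely
            for implied in IMPLIED_MARKERS.get(name, ()):
                if implied not in present:
                    item.add_marker(implied)
                    present.add(implied)


# --- Demo with stand-in objects (not real pytest items) ---
class _FakeMark:
    def __init__(self, name):
        self.name = name


class _FakeItem:
    def __init__(self, *names):
        self.marks = [_FakeMark(n) for n in names]

    def iter_markers(self):
        return iter(self.marks)

    def add_marker(self, name):
        self.marks.append(_FakeMark(name))


credentialed_test = _FakeItem("credentialed")
slow_test = _FakeItem("slow")
pytest_collection_modifyitems(None, [credentialed_test, slow_test])

credentialed_names = {m.name for m in credentialed_test.marks}
slow_names = {m.name for m in slow_test.marks}
```

If the hook must run before pytest's own `-m` filtering, decorating it with `pytest.hookimpl(tryfirst=True)` is one option worth checking against the pytest version in use.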
@@ -93,6 +95,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
+- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
+- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
 - Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
 
 ### Removed
 
@@ -8,7 +8,7 @@ The **Gaia2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenar
 
 - **ARE simulation environment** with real-time dynamics and event scheduling
 - **Tool-based time control** via `wait_for_notification()` for temporal reasoning
-- **7 capability dimensions**: execution, search, adaptability, time, ambiguity, agent2agent, noise
+- **5 capability dimensions**: execution, search, adaptability, time, ambiguity
 - **Deterministic evaluation** via GraphPerEventJudge comparing completed vs expected events
 - **12 app tools**: Calendar, Email, Messaging, Contacts, Shopping, Cab, City, FileSystem, Browser, ChatsApp, SystemApp, Timer
 
@@ -97,8 +97,6 @@ Gaia2 tasks are organized by capability dimension:
 | `adaptability` | Adapting to changing requirements                |
 | `time`         | Temporal reasoning tasks                         |
 | `ambiguity`    | Handling ambiguous instructions                  |
-| `agent2agent`  | Multi-agent collaboration                        |
-| `noise`        | Handling noisy inputs                            |
 
 Load specific capabilities:
 
@@ -110,6 +108,45 @@ tasks = load_tasks(capability="time", limit=10)
 tasks = load_tasks(limit=50)
 ```
 
+## Multi-Turn Notification Loop
+
+GAIA2 uses an **event-driven** multi-turn architecture rather than user-turn interaction. Unlike Tau2, where a user simulator drives the multi-turn exchange, GAIA2 scenarios contain scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
+
+The benchmark invokes the agent **once**. The agent handles the multi-turn interaction internally via the notification loop:
+
+1. Agent calls `SystemApp__wait_for_notification(timeout=N)` as a normal tool.
+2. The ARE environment processes scheduled events, advances simulation time, and queues resulting notifications — all synchronously during the tool call.
+3. The tool returns. The agent's loop continues (it does **not** terminate).
+4. Before the next LLM call, the agent polls `environment.poll_notifications()` to retrieve messages that arrived during the wait.
+5. The agent injects those messages into its context and continues reasoning.
+6. Eventually the agent calls `AgentUserInterface__send_message_to_user` — the **only** termination signal.
+
+### What custom agents must implement
+
+The ARE tools handle all environment-side mechanics automatically (event processing, time advancement, notification queuing), so no callbacks or hooks are required. Custom agents must handle two things:
+
+**1. Do not terminate on `wait_for_notification`.** Treat it as a regular tool call. Only terminate on `AgentUserInterface__send_message_to_user`.
+
+**2. Poll notifications between steps.** After `wait_for_notification` returns, new messages are in the queue. Call `environment.poll_notifications()` to drain them:
+
+```python
+# Inside the agent loop, between steps (e.g., before each LLM call):
+user_msgs, env_notifs, has_stop = environment.poll_notifications()
+
+# Inject into agent context (format matches ARE's convention):
+if user_msgs:
+    content = "\n".join(user_msgs)
+    messages.append({"role": "user", "content": f"User messages updates:\n***\n{content}\n***\n"})
+if env_notifs:
+    content = "\n".join(env_notifs)
+    messages.append({"role": "user", "content": f"Environment notifications updates:\n***\n{content}\n***\n"})
+if has_stop:
+    # Environment signalled simulation end — stop the agent loop
+    break
+```
+
+See `DefaultGaia2Agent` source for the canonical single-loop implementation.
+
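
The six steps above can be sketched end to end with stand-in objects. Everything except the two tool names and the `poll_notifications()` return shape is an assumption made for illustration: `FakeEnvironment`, `scripted_llm`, and `run_agent` are hypothetical and do not reproduce `DefaultGaia2Agent` or the ARE environment.

```python
from collections import deque

STOP_TOOL = "AgentUserInterface__send_message_to_user"
WAIT_TOOL = "SystemApp__wait_for_notification"


class FakeEnvironment:
    """Hypothetical stand-in; it only mimics waiting and polling."""

    def __init__(self):
        self._user_msgs = deque()
        self._env_notifs = deque()

    def wait_for_notification(self, timeout):
        # The real tool advances simulation time; here one event is queued.
        self._env_notifs.append(f"calendar events added (waited {timeout}s)")
        return "ok"

    def poll_notifications(self):
        # Drain both queues; the third value is the stop flag.
        user, env = list(self._user_msgs), list(self._env_notifs)
        self._user_msgs.clear()
        self._env_notifs.clear()
        return user, env, False


def scripted_llm(messages, step):
    """Hypothetical policy: wait once, then answer the user."""
    if step == 0:
        return WAIT_TOOL, {"timeout": 240}
    return STOP_TOOL, {"content": "done"}


def run_agent(environment, max_steps=10):
    messages = [{"role": "user", "content": "task"}]
    for step in range(max_steps):
        # Poll before each LLM call and inject whatever arrived.
        user_msgs, env_notifs, has_stop = environment.poll_notifications()
        if user_msgs:
            body = "\n".join(user_msgs)
            messages.append({"role": "user", "content": f"User messages updates:\n***\n{body}\n***\n"})
        if env_notifs:
            body = "\n".join(env_notifs)
            messages.append({"role": "user", "content": f"Environment notifications updates:\n***\n{body}\n***\n"})
        if has_stop:
            break  # environment signalled simulation end

        tool, args = scripted_llm(messages, step)
        messages.append({"role": "assistant", "content": f"call {tool}"})
        if tool == STOP_TOOL:
            break  # sending the final message ends the loop
        if tool == WAIT_TOOL:
            # Regular tool call: the loop continues afterwards.
            environment.wait_for_notification(**args)
    return messages


messages = run_agent(FakeEnvironment())
```

The notification queued during the wait only reaches the agent's context on the following iteration, which is why polling has to happen between steps rather than inside the tool call.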
 ## Key Differences from Tau2
 
 | Aspect           | Gaia2                                    | Tau2                              |
@@ -132,7 +169,7 @@ tasks = load_tasks(limit=50)
 
 ::: maseval.benchmark.gaia2.DefaultGaia2Agent
 
-::: maseval.benchmark.gaia2.AREToolWrapper
+::: maseval.benchmark.gaia2.Gaia2GenericTool
 
 ::: maseval.benchmark.gaia2.load_tasks
 
 
@@ -2,9 +2,9 @@
 
 The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
 
-[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features:
+[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework, where the original work was done) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. We use a [bug-fixed fork](https://github.com/cemde/MARBLE) for MASEval integration. The benchmark features:
 
-- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
+- **6 diverse domains**: research, bargaining, coding, database, werewolf, minecraft (minecraft is untested)
 - **Multiple coordination modes**: cooperative, star, tree, hierarchical
 - **LLM-based evaluation**: Matches MARBLE's evaluation methodology
 - **Framework-agnostic**: Use with any agent framework or MARBLE's native agents
 
@@ -44,16 +44,16 @@ MASEval provides:
 
 | MASEval Method                   | ARE Method/Component                  | Notes                                |
 | -------------------------------- | ------------------------------------- | ------------------------------------ |
-| `Gaia2Environment.setup_state()` | `Environment.initialize_scenario()`   | Initializes ARE simulation           |
+| `Gaia2Environment.setup_state()` | `Environment.run(scenario, wait_for_end=False)` | Starts ARE simulation in background  |
 | `Gaia2Environment.create_tools()`| `App.get_tools()` for all apps        | Wraps all app tools with tracing     |
 | `Gaia2Environment.cleanup()`     | `Environment.stop()`                  | Ensures proper resource cleanup      |
-| `get_simulation_time()`          | `TimeManager.current_time`            | Exposes simulation time for tracing  |
+| `get_simulation_time()`          | `Environment.current_time`            | Exposes simulation time for tracing  |
 
 ### Evaluator Integration
 
 | MASEval Method                  | ARE Component                           | Notes                                |
 | ------------------------------- | --------------------------------------- | ------------------------------------ |
-| `Gaia2Evaluator.__call__()`     | `GraphPerEventJudge.evaluate()`         | Delegates to ARE's deterministic judge |
+| `Gaia2Evaluator.__call__()`     | `GraphPerEventJudge.validate(env)`      | Delegates to ARE's deterministic judge |
 | `filter_traces()`               | N/A                                     | MASEval-specific trace extraction    |
 | `compute_gaia2_metrics()`       | N/A                                     | MASEval-specific metrics aggregation |
 
@@ -73,16 +73,15 @@ Scenarios are loaded from HuggingFace:
 https://huggingface.co/datasets/meta-agents-research-environments/gaia2
 ```
 
-| Config      | Description                                | Split      |
-| ----------- | ------------------------------------------ | ---------- |
-| `validation`| Full validation set (all capabilities)     | validation |
-| `execution` | Execution capability only                  | validation |
-| `search`    | Search capability only                     | validation |
-| `adaptability` | Adaptability capability only            | validation |
-| `time`      | Temporal reasoning only                    | validation |
-| `ambiguity` | Ambiguity handling only                    | validation |
-| `agent2agent` | Multi-agent collaboration only           | validation |
-| `noise`     | Noise handling only                        | validation |
+Revision: `78ea3bdbdeec2bdcd6afa5420915d8a22f23ed99`
+
+| Config         | Description                    | Split      |
+| -------------- | ------------------------------ | ---------- |
+| `execution`    | Execution capability only      | validation |
+| `search`       | Search capability only         | validation |
+| `adaptability` | Adaptability capability only   | validation |
+| `time`         | Temporal reasoning only        | validation |
+| `ambiguity`    | Ambiguity handling only        | validation |
 
 ## MASEval-Specific Additions
 
 
@@ -13,8 +13,6 @@
     - adaptability: Adapting to changing requirements
     - time: Temporal reasoning tasks
     - ambiguity: Handling ambiguous instructions
-    - agent2agent: Multi-agent collaboration
-    - noise: Handling noisy inputs
 
 Usage:
     from maseval.benchmark.gaia2 import (
@@ -68,17 +66,19 @@ def get_model_adapter(self, model_id, **kwargs):
 
 # Tool wrapper
 from maseval.benchmark.gaia2.tool_wrapper import (
-    AREToolWrapper,
+    Gaia2GenericTool,
     wrap_are_tools,
 )
 
-# Data loading
+# Data loading and configuration
 from maseval.benchmark.gaia2.data_loader import (
     load_tasks,
     configure_model_ids,
+    Gaia2JudgeEngineConfig,
     VALID_CAPABILITIES,
     VALID_SPLITS,
     HF_DATASET_ID,
+    HF_DATASET_REVISION,
 )
 
 
@@ -95,12 +95,14 @@ def get_model_adapter(self, model_id, **kwargs):
     "Gaia2Evaluator",
     "compute_gaia2_metrics",
     # Tool wrapper
-    "AREToolWrapper",
+    "Gaia2GenericTool",
     "wrap_are_tools",
-    # Data loading
+    # Data loading and configuration
     "load_tasks",
     "configure_model_ids",
+    "Gaia2JudgeEngineConfig",
     "VALID_CAPABILITIES",
     "VALID_SPLITS",
     "HF_DATASET_ID",
+    "HF_DATASET_REVISION",
 ]