AGENTS.md: 47 additions & 0 deletions
@@ -494,3 +494,50 @@ MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade
 - Focus on getting it right, not keeping it the same
 
 We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.
+
+## Scientific Integrity
+
+MASEval is a scientific library. Scientific integrity is paramount. **Never introduce defaults that could silently alter benchmark behavior or experimental outcomes.**
+
+### The Boundary
+
+**Guiding principle:** If a researcher would need to report a parameter in a paper's "Experimental Setup" section, **do not invent a default for it.**
+**Unacceptable (experimental parameters):** Temperature, seed, model version, prompt format, simulation duration, agent limits, dataset splits, scoring functions. These alter what's being measured.
+
+### Reproducing Benchmarks
+
+When integrating external benchmarks, match the source implementation exactly. Never invent fallback values.
+start_time = scenario.start_time  # AttributeError if missing
+
+# GOOD: Copy source defaults with documentation
+# Default value copied from original_library/evaluator.py:L45
+EVAL_TEMPERATURE = 0.7
+
+class Evaluator:
+    def run(self, temperature: Optional[float] = None):
+        if temperature is None:
+            temperature = EVAL_TEMPERATURE  # From source:L45
+
+# also good:
+class Evaluator:
+    # default temperature from source:L45
+    def run(self, temperature: Optional[float] = 0.7):
+        ...
+```
+
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
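One way to apply the guiding principle for parameters that have no source default is to make them required, so a missing value fails loudly instead of being filled in silently. The following is a minimal sketch and not code from this PR: `RunConfig`, `load_run_config`, and the three field names are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical names for illustration; not part of MASEval.
REQUIRED_PARAMS = ("temperature", "seed", "model_version")


@dataclass(frozen=True)
class RunConfig:
    """Experimental parameters. None has a default, so each must be supplied."""

    temperature: float
    seed: int
    model_version: str


def load_run_config(raw: Mapping[str, Any]) -> RunConfig:
    """Build a RunConfig, raising if any experimental parameter is missing."""
    missing = [name for name in REQUIRED_PARAMS if name not in raw]
    if missing:
        # Fail loudly rather than inventing a fallback value.
        raise KeyError(f"Missing experimental parameters: {', '.join(missing)}")
    return RunConfig(
        temperature=float(raw["temperature"]),
        seed=int(raw["seed"]),
        model_version=str(raw["model_version"]),
    )
```

Calling `load_run_config({"temperature": 0.7})` raises a `KeyError` that names `seed` and `model_version`, which forces the caller to state those values explicitly.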
CHANGELOG.md: 9 additions & 4 deletions
@@ -16,13 +16,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
 - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
 - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
-- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
+- Generic tool wrapper (`Gaia2GenericTool`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
 - Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
+- `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (PR: #30)
 - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
-- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
+- Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
-- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
+- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
 - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
 - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
 - `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
@@ -55,7 +56,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
 - Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (PR: #29)
 - Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
-- Data integrity tests for Tau2 and MACS benchmarks validating download pipelines, file structures, and database content (PR: #29)
+- Data integrity tests for Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
+- Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
 - HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks, with no API keys needed (PR: #29)
 - Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
 - CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
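The marker implication mentioned in this hunk could be implemented with a collection hook along the following lines. This is a sketch under assumptions, not the code from PR #29: the `IMPLICATIONS` table and the `missing_implied` helper are hypothetical names, while `pytest_collection_modifyitems`, `iter_markers`, and `add_marker` are standard pytest APIs. It also assumes the hook runs before pytest's own `-m` deselection, which may require `@pytest.hookimpl(tryfirst=True)` depending on plugin ordering.

```python
# conftest.py (sketch)

# Hypothetical table: each key marker implies the listed markers.
IMPLICATIONS = {"credentialed": ("live",)}


def missing_implied(marker_names: set[str]) -> set[str]:
    """Return implied markers that are not already present on a test."""
    implied = {m for name in marker_names for m in IMPLICATIONS.get(name, ())}
    return implied - marker_names


def pytest_collection_modifyitems(config, items):
    """Add implied markers so that -m "not live" also excludes credentialed tests."""
    import pytest  # imported lazily so the helper above is usable without pytest

    for item in items:
        names = {mark.name for mark in item.iter_markers()}
        for name in sorted(missing_implied(names)):
            item.add_marker(getattr(pytest.mark, name))
```

With a hook like this, a test marked only `credentialed` is also treated as `live` during selection, so an offline run does not need to list both markers.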
@@ -93,6 +95,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- GAIA2: Various fixes for faithful reproduction of ARE reference results: scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
+- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
 - Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark; they now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
+
+The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:
+
+1. Agent calls `SystemApp__wait_for_notification(timeout=N)` as a normal tool.
+2. The ARE environment processes scheduled events, advances simulation time, and queues resulting notifications, all synchronously during the tool call.
+3. The tool returns. The agent's loop continues (it does **not** terminate).
+4. Before the next LLM call, the agent polls `environment.poll_notifications()` to retrieve messages that arrived during the wait.
+5. The agent injects those messages into its context and continues reasoning.
+6. Eventually the agent calls `AgentUserInterface__send_message_to_user`, the **only** termination signal.
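The six steps above can be sketched as a small loop. This is an illustration under assumptions, not the MASEval or ARE implementation: `StubEnvironment`, `call_tool`, `run_agent`, and `scripted_policy` are hypothetical stand-ins, and the notification payloads are plain strings for simplicity. Only the two tool names and `poll_notifications()` come from the text above.

```python
from dataclasses import dataclass, field

TERMINATE_TOOL = "AgentUserInterface__send_message_to_user"
WAIT_TOOL = "SystemApp__wait_for_notification"


@dataclass
class StubEnvironment:
    """Hypothetical stand-in for the GAIA2 environment (not the real API)."""

    scheduled: list[str] = field(default_factory=list)  # events fired by the next wait
    _queue: list[str] = field(default_factory=list)  # notifications not yet polled

    def call_tool(self, name: str, **kwargs) -> str:
        if name == WAIT_TOOL:
            # Step 2: events are processed synchronously during the tool call.
            self._queue.extend(self.scheduled)
            self.scheduled.clear()
        return f"{name} ok"

    def poll_notifications(self) -> list[str]:
        drained, self._queue = self._queue, []
        return drained


def run_agent(env, choose_action, max_steps: int) -> list[str]:
    """Run one agent invocation; max_steps has no default on purpose."""
    context: list[str] = []
    for _ in range(max_steps):
        # Steps 4-5: drain notifications before each "LLM" call.
        context.extend(env.poll_notifications())
        tool, kwargs = choose_action(context)
        context.append(env.call_tool(tool, **kwargs))
        # Step 6: only the user-message tool ends the loop; waiting does not.
        if tool == TERMINATE_TOOL:
            break
    return context


def scripted_policy(context: list[str]):
    """Stand-in for the LLM: wait until the friend replies, then answer."""
    if "friend replied" in context:
        return TERMINATE_TOOL, {"content": "done"}
    return WAIT_TOOL, {"timeout": 300}


env = StubEnvironment(scheduled=["friend replied"])
final_context = run_agent(env, scripted_policy, max_steps=5)
```

In this run the agent waits once, picks up the queued notification on the next iteration, and only then sends its final message, so the loop ends after two steps.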
+
+### What custom agents must implement
+
+The ARE tools handle all environment-side mechanics automatically (event processing, time advancement, notification queuing). No callbacks or hooks required. Custom agents must handle two things:
+
+**1. Do not terminate on `wait_for_notification`.** Treat it as a regular tool call. Only terminate on `AgentUserInterface__send_message_to_user`.
+
+**2. Poll notifications between steps.** After `wait_for_notification` returns, new messages are in the queue. Call `environment.poll_notifications()` to drain them:
+
+```python
+# Between agent steps (e.g., before each LLM call):
docs/benchmark/multiagentbench.md: 2 additions & 2 deletions
@@ -2,9 +2,9 @@
 
 The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
 
-[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features:
+[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework, where the original work was done) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. We use a [bug-fixed fork](https://github.com/cemde/MARBLE) for MASEval integration. The benchmark features:
 
-- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
+- **6 diverse domains**: research, bargaining, coding, database, werewolf, minecraft (minecraft is untested)