
Commit c21750d

Fix benchmark implementations based on faithfulness audits (#39)
Core:
- Add UserExhaustedError; User.respond() raises instead of returning empty string
- Add exhausted_response parameter to LLMUser for tool-based integrations
- Replace markdown-fence stripping with robust _extract_json_object() in LLM simulators
- Preserve stop tokens outside JSON objects in UserLLMSimulator/AgenticUserLLMSimulator

Interface:
- GoogleGenAIModelAdapter: merge consecutive tool-response messages into single entry
- SmolAgentAdapter: detect crashed steps with no output fields (AgentGenerationError)
- LlamaIndexAgentAdapter: add max_iterations forwarded to AgentWorkflow.run()

Tau2:
- Fix user tool routing, environment state sync, tool result serialization
- Add initial agent greeting to user simulator message history
- Fix tool call counter resetting per turn, telecom domain models/tools
- Fix evaluator assertion logic, add addict dependency
- Document architectural divergences in PROVENANCE.md

MultiAgentBench:
- Implement full multi-iteration coordination loop (graph/star/chain/tree)
- Fix data loading defaults, import paths, Minecraft registration
- Fix bargaining evaluation (use both buyer/seller prompts)
- Document upstream MARBLE bugs in PROVENANCE.md

MACS:
- Preserve items sub-schema for array-type tool properties
- Simplify user profile extraction

Converse:
- Remove silent gpt-4o default for attacker_model_id; raise ValueError

Packaging & docs:
- Fix setuptools: include subpackages and package data in PyPI installs
- Label Tau2, MultiAgentBench, GAIA2, Converse as Beta
- Expand test suites for Tau2 and MultiAgentBench
1 parent a4ec486 commit c21750d
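
The `User.respond()` change in the commit message swaps a silent empty-string return for an exception, with an opt-in fallback message. The sketch below illustrates that control flow only. `SketchUser` and `run_dialogue` are hypothetical stand-ins, and the local `UserExhaustedError` class merely mirrors the name of the exception the commit adds; none of this is MASEval's actual implementation.

```python
class UserExhaustedError(Exception):
    """Hypothetical stand-in: raised when a simulated user has no turns left."""


class SketchUser:
    """Toy user with a fixed list of replies (assumption, not MASEval's User)."""

    def __init__(self, replies, exhausted_response=None):
        self._replies = list(replies)
        # When set, this message is returned instead of raising.
        self.exhausted_response = exhausted_response

    def respond(self, message):
        if self._replies:
            return self._replies.pop(0)
        if self.exhausted_response is not None:
            return self.exhausted_response
        raise UserExhaustedError("user has no more turns")


def run_dialogue(user, agent_messages):
    """Collect user replies, stopping cleanly once the user is exhausted."""
    transcript = []
    for message in agent_messages:
        try:
            transcript.append(user.respond(message))
        except UserExhaustedError:
            break  # explicit flow control instead of an empty-string reply
    return transcript
```

With an exception, the caller has to decide what exhaustion means, whereas an empty string can be mistaken for a real (empty) reply. The fallback message suits cases where the reply is consumed as a tool result and an exception would be awkward to propagate.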

69 files changed

Lines changed: 8123 additions & 2341 deletions


BENCHMARKS.md

Lines changed: 14 additions & 4 deletions
@@ -15,22 +15,27 @@ This benchmark is designed to test and evaluate the collaborative problem-solvin
1515

1616
---
1717

18-
## 2. $\tau^2$-bench
18+
## 2. $\tau^2$-bench (Beta)
1919

2020
$\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi-turn interactive environments.
2121

22+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
23+
2224
### Source and License
2325

2426
- **Original Repository:** [https://github.com/sierra-research/tau2-bench](https://github.com/sierra-research/tau2-bench)
27+
- **Paper:** [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)
2528
- **Code License:** MIT
2629
- **Data License:** MIT
2730

2831
---
2932

30-
## 3. MultiAgentBench (MARBLE)
33+
## 3. MultiAgentBench (MARBLE) (Beta)
3134

3235
MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It includes diverse scenarios across multiple domains including research collaboration, negotiation, coding tasks, and more.
3336

37+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
38+
3439
### Source and License
3540

3641
- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (where the original work was done)
@@ -43,23 +48,28 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co
4348
4449
---
4550

46-
## 4. GAIA2
51+
## 4. GAIA2 (Beta)
4752

4853
Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.
4954

55+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
56+
5057
### Source and License
5158

5259
- **Original Repository:** [https://github.com/facebookresearch/meta-agents-research-environments](https://github.com/facebookresearch/meta-agents-research-environments)
60+
- **Paper:** [Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments](https://openreview.net/forum?id=9gw03JpKK4) (ICLR 2026)
5361
- **Dataset:** [https://huggingface.co/datasets/meta-agents-research-environments/gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2)
5462
- **Code License:** MIT
5563
- **Data License:** Subject to Meta's data usage terms (see HuggingFace dataset page)
5664

5765
---
5866

59-
## 5. CONVERSE
67+
## 5. CONVERSE (Beta)
6068

6169
CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.
6270

71+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
72+
6373
### Source and License
6474

6575
- **Original Repository:** [https://github.com/amrgomaaelhady/ConVerse](https://github.com/amrgomaaelhady/ConVerse)

CHANGELOG.md

Lines changed: 32 additions & 3 deletions
@@ -43,6 +43,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4343
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
4444
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
4545
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
46+
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
4647

4748
**Interface**
4849

@@ -69,6 +70,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
6970

7071
- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
7172
- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
73+
- `User.respond()` now raises `UserExhaustedError` instead of returning an empty string when the user has no more turns. Set the new `exhausted_response` parameter to return a configurable message instead (e.g. for tool-based integrations where agents call `ask_user`). Affects `LLMUser`, `AgenticLLMUser`, `Tau2User`, and `MACSUser`. (PR: #39)
74+
- `_extract_json_object()` helper in `maseval.core.simulator` replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (`ToolLLMSimulator`, `UserLLMSimulator`, `AgenticUserLLMSimulator`). (PR: #39)
75+
- `UserLLMSimulator` and `AgenticUserLLMSimulator` now preserve stop tokens that appear outside the JSON object in raw LLM output, so `User._check_stop_token` can detect them. (PR: #39)
76+
77+
**Interface**
78+
79+
- `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
80+
- `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
81+
- `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
7282

7383
**Benchmarks**
7484

@@ -89,17 +99,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
8999
- `LangGraphUser` → `LangGraphLLMUser`
90100
- `LlamaIndexUser` → `LlamaIndexLLMUser`
91101

102+
**Documentation**
103+
104+
- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
105+
92106
**Testing**
93107

94108
- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)
95109

96110
### Fixed
97111

112+
**Core**
113+
98114
- `ResultLogger._filter_report()` now includes `status` and `error` fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. Report schema is now consistent across success and failure paths (`error` is always present, `None` on success). (PR: #38)
99-
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
100-
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
115+
- Packaging: Fixed `setuptools` configuration — `packages` now uses `find` with `include = ["maseval*"]` so subpackages and package data (`.json`, `.jsonl`, `.md`, etc.) are included in PyPI installs. (PR: #39)
116+
117+
**Benchmarks**
118+
101119
- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
102-
- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
120+
- Tau2: Added initial agent greeting ("Hi! How can I help you today?") to user simulator's message history, matching the original tau2-bench orchestrator. Fixed tool call counter accumulating across agent turns instead of resetting per turn. Corrected `max_steps` comments (original default is 100, not 200). Documented all known architectural divergences from original tau2-bench in PROVENANCE.md. (PR: #39)
121+
- Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and `addict` dependency for nested dict access. (PR: #39)
122+
- Tau2: Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
123+
- MultiAgentBench: Fixed bargaining evaluation to use both buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports `buyer_score`, `seller_score`, and `mean_score` scaled to 0-100. (PR: #39)
124+
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, added result summarization matching MARBLE's evaluation pipeline (PR: #30)
125+
- MultiAgentBench: `MarbleMultiAgentBenchBenchmark` now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed default `coordinate_mode` from `"star"` to `"graph"` matching 1215/1226 MARBLE configs. Uses per-task `max_iterations` from task config (matching `engine.py:97`), respects per-agent LLM overrides, and initializes memory type from task config. (PR: #39)
126+
- MultiAgentBench: Faithfulness audit fixes for reproduction mode — fixed wrong import path (`marble.utils.utils` → `marble.llms.model_prompting`), added Minecraft agent registration, per-domain defaults for `max_iterations`/`coordinate_mode`/`environment.type`/`memory.type` from MARBLE YAML configs, resolved hardcoded relative paths for `score.json` and `workspace/solution.py` via `_MARBLE_ROOT`, unified `coordinate_mode` defaults, corrected evaluator and agent model defaults to match MARBLE, replaced auto-generated agent IDs with strict validation. (PR: #39)
127+
- MultiAgentBench: Fixed bargaining evaluation crash from `.format()` on single-brace JSON in evaluator prompts. Documented chain communication assertion bug in MARBLE's `engine.py`. (PR: #39)
128+
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
129+
- MACS: `MACSGenericTool._schema_to_inputs()` now preserves `items` sub-schema for array-type properties, fixing tool registration with Gemini and OpenAI providers. (PR: #39)
130+
- MACS: Simplified `MACSUser._extract_user_profile()` — no longer attempts brittle parsing of scenario text; points profile section at the scenario to avoid duplication. (PR: #39)
131+
- Converse: Removed silent `"gpt-4o"` default for `attacker_model_id`; now raises `ValueError` if not provided, preventing accidental benchmark misconfiguration. (PR: #39)
103132
- ConVerse: Various fixes for faithful reproduction of original. (PR: #32)
104133

105134
### Removed

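The changelog entries above describe replacing markdown-fence stripping with outermost-brace extraction, and keeping text outside the JSON object so that stop tokens remain detectable. One possible way to do this is sketched below. The function name `extract_json_object`, its tuple return value, and the `###STOP###` token are assumptions for illustration; the real `_extract_json_object()` in `maseval.core.simulator` may differ in signature and behavior.

```python
import json


def extract_json_object(raw):
    """Return (parsed_object, text_outside_object) for the first balanced {...}.

    Hypothetical sketch: scans from the first '{' to its matching '}',
    ignoring braces that appear inside JSON string literals.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(raw)):
        ch = raw[i]
        if in_string:
            if escaped:
                escaped = False        # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                obj = json.loads(raw[start : i + 1])
                # Keep surrounding text so a caller can still look for stop tokens.
                outside = raw[:start] + raw[i + 1 :]
                return obj, outside
    raise ValueError("unbalanced braces in LLM output")


raw_output = 'Sure:\n{"text": "bye {ok}", "n": {"a": 1}}\n###STOP###'
parsed, outside = extract_json_object(raw_output)
stop_seen = "###STOP###" in outside
```

Scanning for balanced braces does not depend on how the model wraps its answer (code fences, prose, trailing tokens), which is the failure mode that fence stripping is prone to.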
docs/benchmark/converse.md

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,7 @@
1-
# CONVERSE Benchmark
1+
# CONVERSE Benchmark (Beta)
2+
3+
!!! warning "Beta"
4+
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
25

36
CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.
47

docs/benchmark/gaia2.md

Lines changed: 16 additions & 23 deletions
@@ -1,10 +1,13 @@
1-
# Gaia2: Dynamic Multi-Step Scenario Benchmark
1+
# GAIA2: Dynamic Multi-Step Scenario Benchmark (Beta)
22

3-
The **Gaia2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.
3+
!!! warning "Beta"
4+
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
5+
6+
The **GAIA2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.
47

58
## Overview
69

7-
[Gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:
10+
[GAIA2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:
811

912
- **ARE simulation environment** with real-time dynamics and event scheduling
1013
- **Tool-based time control** via `wait_for_notification()` for temporal reasoning
@@ -18,7 +21,7 @@ Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/
1821

1922
## Installation
2023

21-
Gaia2 requires additional dependencies:
24+
GAIA2 requires additional dependencies:
2225

2326
```bash
2427
pip install maseval[gaia2]
@@ -88,15 +91,15 @@ results = benchmark.run(tasks)
8891

8992
## Capabilities
9093

91-
Gaia2 tasks are organized by capability dimension:
94+
GAIA2 tasks are organized by capability dimension:
9295

93-
| Capability | Description |
94-
| -------------- | ------------------------------------------------ |
95-
| `execution` | Basic task execution |
96-
| `search` | Information retrieval tasks |
97-
| `adaptability` | Adapting to changing requirements |
98-
| `time` | Temporal reasoning tasks |
99-
| `ambiguity` | Handling ambiguous instructions |
96+
| Capability | Description |
97+
| -------------- | --------------------------------- |
98+
| `execution` | Basic task execution |
99+
| `search` | Information retrieval tasks |
100+
| `adaptability` | Adapting to changing requirements |
101+
| `time` | Temporal reasoning tasks |
102+
| `ambiguity` | Handling ambiguous instructions |
100103

101104
Load specific capabilities:
102105

@@ -110,7 +113,7 @@ tasks = load_tasks(limit=50)
110113

111114
## Multi-Turn Notification Loop
112115

113-
GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
116+
GAIA2 uses an **event-driven** multi-turn architecture. Scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
114117

115118
The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:
116119

@@ -147,16 +150,6 @@ if has_stop:
147150

148151
See `DefaultGaia2Agent` source for the canonical single-loop implementation.
149152

150-
## Key Differences from Tau2
151-
152-
| Aspect | Gaia2 | Tau2 |
153-
| ---------------- | ---------------------------------------- | --------------------------------- |
154-
| Interaction | Event-driven simulation | Turn-based user simulation |
155-
| Time Control | Agent calls `wait_for_notification()` | Fixed turns |
156-
| Tools | ARE app tools (12 apps) | Domain-specific tools (3 domains) |
157-
| Evaluation | Event DAG comparison | Database state comparison |
158-
| User Simulator | None (events are scheduled) | LLM-based customer simulator |
159-
160153
## API Reference
161154

162155
::: maseval.benchmark.gaia2.Gaia2Benchmark

docs/benchmark/index.md

Lines changed: 5 additions & 0 deletions
@@ -2,6 +2,11 @@
22

33
MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.
44

5+
!!! warning "Beta Benchmarks"
6+
Several benchmarks are currently in **Beta**. They have been implemented carefully, but these are highly complex systems and we have not yet validated the results against the original implementations. Use with caution when comparing with existing results or original paper numbers. Contributions and compute donations welcome!
7+
8+
**MACS** is the only benchmark that has been fully validated.
9+
510
## Adding Custom Benchmarks
611

712
You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.

docs/benchmark/multiagentbench.md

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,7 @@
1-
# MultiAgentBench: Multi-Agent Collaboration Benchmark
1+
# MultiAgentBench: Multi-Agent Collaboration Benchmark (Beta)
2+
3+
!!! warning "Beta"
4+
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
25

36
The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
47

docs/benchmark/tau2.md

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,7 @@
1-
# Tau2: Tool-Agent-User Interaction Benchmark
1+
# Tau2: Tool-Agent-User Interaction Benchmark (Beta)
2+
3+
!!! warning "Beta"
4+
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
25

36
The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.
47
