BENCHMARKS.md
14 additions & 4 deletions
@@ -15,22 +15,27 @@ This benchmark is designed to test and evaluate the collaborative problem-solvin
---
-
## 2. $\tau^2$-bench
+
## 2. $\tau^2$-bench (Beta)
$\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi-turn interactive environments.
+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
- **Paper:** [Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains](https://arxiv.org/abs/2406.12045)
- **Code License:** MIT
- **Data License:** MIT
---
-
## 3. MultiAgentBench (MARBLE)
+
## 3. MultiAgentBench (MARBLE) (Beta)
MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It includes diverse scenarios across multiple domains including research collaboration, negotiation, coding tasks, and more.
+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
### Source and License
- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE) (where the original work was done)
@@ -43,23 +48,28 @@ MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent co
---
-
## 4. GAIA2
+
## 4. GAIA2 (Beta)
Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.
+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
- **Data License:** Subject to Meta's data usage terms (see HuggingFace dataset page)
---
-
## 5. CONVERSE
+
## 5. CONVERSE (Beta)
CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.
+
> **Beta:** This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
CHANGELOG.md
32 additions & 3 deletions
@@ -43,6 +43,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
+
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
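The flow-control role of this exception can be pictured with a minimal sketch. It uses stand-in classes, not the real `maseval` ones: `ToyUser`, `max_turns`, `run_dialogue`, and the locally defined `UserExhaustedError` are all hypothetical. Only the exception name and the idea of an optional `exhausted_response` come from the changelog.

```python
from typing import List, Optional


class UserExhaustedError(Exception):
    """Local stand-in for the exception named in the changelog (not the real class)."""


class ToyUser:
    """Hypothetical user simulator with a fixed turn budget."""

    def __init__(self, max_turns: int, exhausted_response: Optional[str] = None):
        self.max_turns = max_turns
        self.exhausted_response = exhausted_response
        self.turns_used = 0

    def respond(self, message: str) -> str:
        if self.turns_used >= self.max_turns:
            # Assumed behavior: a configured message takes precedence over raising.
            if self.exhausted_response is not None:
                return self.exhausted_response
            raise UserExhaustedError("user has no turns left")
        self.turns_used += 1
        return f"reply {self.turns_used} to: {message}"


def run_dialogue(user: ToyUser, prompts: List[str]) -> List[str]:
    """Collect replies until the user signals exhaustion by raising."""
    replies = []
    for prompt in prompts:
        try:
            replies.append(user.respond(prompt))
        except UserExhaustedError:
            break  # exhaustion ends the conversation instead of yielding ""
    return replies
```

Raising lets the caller tell "the user is out of turns" apart from "the user said nothing", which an empty string cannot express.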
**Interface**
@@ -69,6 +70,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
+
- `User.respond()` now raises `UserExhaustedError` instead of returning an empty string when the user has no more turns. Set the new `exhausted_response` parameter to return a configurable message instead (e.g. for tool-based integrations where agents call `ask_user`). Affects `LLMUser`, `AgenticLLMUser`, `Tau2User`, and `MACSUser`. (PR: #39)
+
- `_extract_json_object()` helper in `maseval.core.simulator` replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (`ToolLLMSimulator`, `UserLLMSimulator`, `AgenticUserLLMSimulator`). (PR: #39)
+
- `UserLLMSimulator` and `AgenticUserLLMSimulator` now preserve stop tokens that appear outside the JSON object in raw LLM output, so `User._check_stop_token` can detect them. (PR: #39)
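A minimal sketch of what outermost-brace extraction could look like, assuming it means "take everything from the first `{` to the last `}`". The function `extract_outermost_json` and the `###STOP###` token are hypothetical; the real `_extract_json_object()` may differ in signature and edge-case handling.

```python
import json
from typing import Any, Dict, Tuple


def extract_outermost_json(raw: str) -> Tuple[Dict[str, Any], str]:
    """Parse the span from the first '{' to the last '}' and return it
    together with whatever text lay outside that span."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    parsed = json.loads(raw[start : end + 1])
    # Text outside the object is kept so a caller can still inspect it,
    # for example to look for a stop token.
    remainder = (raw[:start] + raw[end + 1 :]).strip()
    return parsed, remainder
```

Because the span is located by braces rather than by markdown fences, the sketch tolerates fenced, unfenced, and prefixed output alike, and the returned remainder keeps any trailing marker visible to the caller.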
+
+
**Interface**
+
+
- `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
+
- `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
+
- `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
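The merging of consecutive tool responses can be sketched on plain dictionaries. This does not use the Google SDK: the message shape (`role` plus a `parts` list) and the function name are assumptions made for illustration only.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def merge_consecutive_tool_messages(messages: List[Message]) -> List[Message]:
    """Collapse runs of adjacent 'tool' messages into one entry.

    Each message is assumed to look like {"role": str, "parts": list}.
    """
    merged: List[Message] = []
    for msg in messages:
        previous = merged[-1] if merged else None
        if previous is not None and previous["role"] == "tool" and msg["role"] == "tool":
            # Same run of tool results: append to the entry already emitted.
            previous["parts"].extend(msg["parts"])
        else:
            # Copy the parts list so the caller's history is not mutated.
            merged.append({"role": msg["role"], "parts": list(msg["parts"])})
    return merged
```

Only adjacent tool messages are merged, so tool results separated by a model or user turn stay in their own entries.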
**Benchmarks**
@@ -89,17 +99,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `LangGraphUser` → `LangGraphLLMUser`
- `LlamaIndexUser` → `LlamaIndexLLMUser`
+
**Documentation**
+
+
- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
+
**Testing**
- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)
### Fixed
+
**Core**
+
- `ResultLogger._filter_report()` now includes `status` and `error` fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. Report schema is now consistent across success and failure paths (`error` is always present, `None` on success). (PR: #38)
-
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- Packaging: Fixed `setuptools` configuration — `packages` now uses `find` with `include = ["maseval*"]` so subpackages and package data (`.json`, `.jsonl`, `.md`, etc.) are included in PyPI installs. (PR: #39)
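The consistent report schema described in the `ResultLogger` entry above might look roughly like the sketch below. Only the `status` and `error` field names come from the changelog. The `task_id` and `result` fields, the `"success"`/`"error"` status values, and the `build_report` helper are hypothetical.

```python
from typing import Any, Dict, Optional


def build_report(
    task_id: str,
    result: Optional[Dict[str, Any]] = None,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a record with the same keys on success and failure paths."""
    return {
        "task_id": task_id,
        # Assumed status vocabulary; the real values may differ.
        "status": "error" if error is not None else "success",
        "error": error,  # always present, None on success
        "result": result,
    }
```

Keeping the key set identical means downstream analysis can filter on `status` or `error` without first checking whether the field exists.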
+
+
**Benchmarks**
+
- Tau2: Fixed telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, fixed initialization flow and tool result serialization (PR: #30)
-
- Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` in Tau2 benchmark — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
+
- Tau2: Added initial agent greeting ("Hi! How can I help you today?") to user simulator's message history, matching the original tau2-bench orchestrator. Fixed tool call counter accumulating across agent turns instead of resetting per turn. Corrected `max_steps` comments (original default is 100, not 200). Documented all known architectural divergences from original tau2-bench in PROVENANCE.md. (PR: #39)
+
- Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and `addict` dependency for nested dict access. (PR: #39)
+
- Tau2: Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()` — now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
+
- MultiAgentBench: Fixed bargaining evaluation to use both buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports `buyer_score`, `seller_score`, and `mean_score` scaled to 0-100. (PR: #39)
- MultiAgentBench: `MarbleMultiAgentBenchBenchmark` now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed default `coordinate_mode` from `"star"` to `"graph"` matching 1215/1226 MARBLE configs. Uses per-task `max_iterations` from task config (matching `engine.py:97`), respects per-agent LLM overrides, and initializes memory type from task config. (PR: #39)
+
- MultiAgentBench: Faithfulness audit fixes for reproduction mode — fixed wrong import path (`marble.utils.utils` → `marble.llms.model_prompting`), added Minecraft agent registration, per-domain defaults for `max_iterations`/`coordinate_mode`/`environment.type`/`memory.type` from MARBLE YAML configs, resolved hardcoded relative paths for `score.json` and `workspace/solution.py` via `_MARBLE_ROOT`, unified `coordinate_mode` defaults, corrected evaluator and agent model defaults to match MARBLE, replaced auto-generated agent IDs with strict validation. (PR: #39)
+
- MultiAgentBench: Fixed bargaining evaluation crash from `.format()` on single-brace JSON in evaluator prompts. Documented chain communication assertion bug in MARBLE's `engine.py`. (PR: #39)
+
- GAIA2: Various fixes for faithful reproduction of ARE reference results — scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
+
- MACS: `MACSGenericTool._schema_to_inputs()` now preserves `items` sub-schema for array-type properties, fixing tool registration with Gemini and OpenAI providers. (PR: #39)
+
- MACS: Simplified `MACSUser._extract_user_profile()` — no longer attempts brittle parsing of scenario text; points profile section at the scenario to avoid duplication. (PR: #39)
+
- Converse: Removed silent `"gpt-4o"` default for `attacker_model_id`; now raises `ValueError` if not provided, preventing accidental benchmark misconfiguration. (PR: #39)
- ConVerse: Various fixes for faithful reproduction of original. (PR: #32)
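The `.format()` crash mentioned in the MultiAgentBench bargaining entry above is a general Python pitfall: `str.format` treats every single-brace group as a replacement field, so literal JSON inside a prompt template is misread. The template text and the `fill_prompt` helper below are made up for illustration and are not taken from MARBLE or MASEval.

```python
# A prompt template that embeds a literal JSON example next to one real placeholder.
TEMPLATE = 'Rate the deal. Answer as {"score": <int>}. Transcript: {transcript}'

# The same template with the literal braces doubled, which str.format unescapes.
ESCAPED_TEMPLATE = 'Rate the deal. Answer as {{"score": <int>}}. Transcript: {transcript}'


def fill_prompt(template: str, transcript: str) -> str:
    """Substitute only the known placeholder, leaving other braces untouched."""
    return template.replace("{transcript}", transcript)


def format_fails(template: str) -> bool:
    """Return True if str.format cannot handle the template."""
    try:
        template.format(transcript="...")
    except (KeyError, ValueError, IndexError):
        return True
    return False
```

Either escaping the braces or replacing the single placeholder directly avoids the crash. Direct replacement is the safer choice when the template text is not under your control.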
docs/benchmark/converse.md
4 additions & 1 deletion
@@ -1,4 +1,7 @@
-
# CONVERSE Benchmark
+
# CONVERSE Benchmark (Beta)
+
+
!!! warning "Beta"
+
    This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.
The **Gaia2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.
+
!!! warning "Beta"
+
    This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
+
The **GAIA2 Benchmark** evaluates LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across multiple capability dimensions in a simulated mobile environment.
## Overview
-
[Gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:
+
[GAIA2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2) is designed to evaluate agents in realistic, time-sensitive scenarios. The benchmark features:
- **ARE simulation environment** with real-time dynamics and event scheduling
- **Tool-based time control** via `wait_for_notification()` for temporal reasoning
@@ -18,7 +21,7 @@ Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/
|`adaptability`| Adapting to changing requirements |
+
|`time`| Temporal reasoning tasks |
+
|`ambiguity`| Handling ambiguous instructions |
Load specific capabilities:
@@ -110,7 +113,7 @@ tasks = load_tasks(limit=50)
## Multi-Turn Notification Loop
-
GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
+
GAIA2 uses an **event-driven** multi-turn architecture. Scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:
@@ -147,16 +150,6 @@ if has_stop:
See `DefaultGaia2Agent` source for the canonical single-loop implementation.
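For intuition only, the event-driven pattern can be imitated with a toy simulation that has no dependency on ARE or MASEval. `ToyEnvironment`, `run_agent`, the `timeout` parameter, and the stopping rule are all hypothetical. Only the name `wait_for_notification()` and the example events are taken from the text above.

```python
import heapq
from typing import List, Optional, Tuple


class ToyEnvironment:
    """Hypothetical simulated environment holding scheduled (time, message) events."""

    def __init__(self, events: List[Tuple[float, str]]):
        self._queue = list(events)
        heapq.heapify(self._queue)  # earliest event first
        self.now = 0.0

    def wait_for_notification(self, timeout: float) -> Optional[str]:
        """Jump simulated time to the next event, or forward by `timeout` if none is due."""
        deadline = self.now + timeout
        if self._queue and self._queue[0][0] <= deadline:
            self.now, message = heapq.heappop(self._queue)
            return message
        self.now = deadline
        return None


def run_agent(env: ToyEnvironment, max_idle_waits: int = 1) -> List[str]:
    """Single invocation: the agent loops internally until nothing more arrives."""
    handled: List[str] = []
    idle = 0
    while idle < max_idle_waits:
        note = env.wait_for_notification(timeout=600.0)
        if note is None:
            idle += 1  # assumed stopping rule: give up after an empty wait
            continue
        handled.append(f"t={env.now:.0f}s: {note}")
    return handled
```

The caller invokes `run_agent` once. All turns happen inside the loop, driven by the simulated clock and not by a user simulator.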
docs/benchmark/index.md
5 additions & 0 deletions
@@ -2,6 +2,11 @@
MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.
+
!!! warning "Beta Benchmarks"
+
    Several benchmarks are currently in **Beta**. They have been implemented carefully, but these are highly complex systems and we have not yet validated the results against the original implementations. Use with caution when comparing with existing results or original paper numbers. Contributions and compute donations welcome!
+
+
**MACS** is the only benchmark that has been fully validated.
+
## Adding Custom Benchmarks
You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
The **Tau2 Benchmark** evaluates LLM-based agents on customer service tasks across multiple real-world domains, testing their ability to use tools, follow policies, and interact with users.