
Commit 153a5be

Implement MultiAgentBench (#25)
Add MultiAgentBench (MARBLE) benchmark integration (#25)

Integrate MARBLE's MultiAgentBench benchmark for evaluating multi-agent collaboration and competition across 7 domains (research, bargaining, coding, database, web, worldsimulation, minecraft).

Components:
- MultiAgentBenchBenchmark: Framework-agnostic abstract base class
- MarbleMultiAgentBenchBenchmark: MARBLE reproduction mode
- MultiAgentBenchEnvironment and MultiAgentBenchEvaluator
- MarbleAgentAdapter for tracing MARBLE agents
- Data loading utilities with auto-download support
1 parent 7661984 commit 153a5be

27 files changed

Lines changed: 5495 additions & 12 deletions

BENCHMARKS.md

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,20 @@ $\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi
2727

2828
---
2929

30-
## 3. Gaia2 (ARE)
30+
## 3. MultiAgentBench (MARBLE)
31+
32+
MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It includes diverse scenarios across multiple domains including research collaboration, negotiation, coding tasks, and more.
33+
34+
### Source and License
35+
36+
- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
37+
- **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
38+
- **Code License:** MIT
39+
- **Data License:** MIT
40+
41+
---
42+
43+
## 4. Gaia2 (ARE)
3144

3245
Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.
3346

CHANGELOG.md

Lines changed: 18 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -12,19 +12,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1212
**Benchmarks**
1313

1414
- GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
15-
- `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
16-
- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
17-
- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
18-
- Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
19-
- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
20-
- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
21-
- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
15+
- `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
16+
- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
17+
- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
18+
- Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
19+
- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
20+
- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
21+
- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
22+
23+
- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
24+
- `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
25+
- `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
26+
- `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
27+
- Data loading utilities: `load_tasks()`, `configure_model_ids()`, `get_domain_info()`, `ensure_marble_exists()` (PR: #25)
28+
- MARBLE adapter: `MarbleAgentAdapter` for wrapping MARBLE agents with MASEval tracing (PR: #25)
2229

2330
**Examples**
2431

2532
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
2633

27-
**Seeding System**
34+
**Core**
2835

2936
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
3037
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
@@ -36,9 +43,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
3643
**Interface**
3744

3845
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
39-
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
40-
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
41-
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
46+
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
47+
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
48+
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
4249

4350
### Changed
4451

docs/benchmark/multiagentbench.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
# MultiAgentBench: Multi-Agent Collaboration Benchmark
2+
3+
The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
4+
5+
[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features:
6+
7+
- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
8+
- **Multiple coordination modes**: cooperative, star, tree, hierarchical
9+
- **LLM-based evaluation**: Matches MARBLE's evaluation methodology
10+
- **Framework-agnostic**: Use with any agent framework or MARBLE's native agents
11+
12+
Reference Paper: [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
13+
14+
Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
15+
16+
::: maseval.benchmark.multiagentbench.MultiAgentBenchBenchmark
17+
18+
::: maseval.benchmark.multiagentbench.MarbleMultiAgentBenchBenchmark
19+
20+
::: maseval.benchmark.multiagentbench.MultiAgentBenchEnvironment
21+
22+
::: maseval.benchmark.multiagentbench.MultiAgentBenchEvaluator
23+
24+
::: maseval.benchmark.multiagentbench.MarbleAgentAdapter
25+
26+
::: maseval.benchmark.multiagentbench.load_tasks
27+
28+
::: maseval.benchmark.multiagentbench.configure_model_ids
29+
30+
::: maseval.benchmark.multiagentbench.ensure_marble_exists
31+
32+
::: maseval.benchmark.multiagentbench.download_marble
33+
34+
::: maseval.benchmark.multiagentbench.get_domain_info
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
# Vendored MARBLE source (clone manually)
2+
marble/
3+
4+
# Python cache
5+
__pycache__/
6+
*.pyc
7+
*.pyo
8+
9+
# Test artifacts
10+
.pytest_cache/
Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
# MARBLE Integration Provenance
2+
3+
## Source Information
4+
5+
- **Source Repository**: https://github.com/ulab-uiuc/MARBLE
6+
- **Version**: Not yet pinned (clone latest and test)
7+
- **License**: MIT (Copyright 2024 Haofei Yu)
8+
- **Vendoring**: Permitted by MIT license with attribution
9+
10+
## Reference
11+
12+
**Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
13+
14+
- arXiv: https://arxiv.org/abs/2503.01935
15+
- Authors: Haofei Yu et al.
16+
- Publication Date: 2025
17+
18+
## License Text (MIT)
19+
20+
```
21+
MIT License
22+
23+
Copyright (c) 2024 Haofei Yu
24+
25+
Permission is hereby granted, free of charge, to any person obtaining a copy
26+
of this software and associated documentation files (the "Software"), to deal
27+
in the Software without restriction, including without limitation the rights
28+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
29+
copies of the Software, and to permit persons to whom the Software is
30+
furnished to do so, subject to the following conditions:
31+
32+
The above copyright notice and this permission notice shall be included in all
33+
copies or substantial portions of the Software.
34+
35+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
36+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
37+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
38+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
39+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
40+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
41+
SOFTWARE.
42+
```
43+
44+
## Known Issues in MARBLE
45+
46+
1. **Missing method**: `AgentGraph.get_agent_profiles_linked()` does not exist but is
47+
called in `engine.py:702`. This breaks chain coordination mode.
48+
49+
2. **SharedMemory naming**: Despite the name, `SharedMemory` is instantiated per-agent
50+
in `BaseAgent.__init__()` and is NOT shared between agents. Use `msg_box` for
51+
inter-agent communication.
52+
53+
3. **Environment constructor signature**: Some environments expect different constructor
54+
arguments. Check each environment's `__init__` signature before use.
55+
56+
## Local Patches Applied
57+
58+
None currently. Document any patches here if applied.
59+
60+
## Update Process
61+
62+
To update MARBLE to a newer version:
63+
64+
1. `cd maseval/benchmark/multiagentbench/marble`
65+
2. `git fetch origin`
66+
3. `git log --oneline origin/main` (review changes)
67+
4. `git checkout <new-commit-hash>`
68+
5. Run integration tests
69+
6. Update this file with new version info
Lines changed: 182 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,182 @@
1+
# MultiAgentBench Integration
2+
3+
Framework-agnostic implementation of the MultiAgentBench benchmark suite from MARBLE
4+
(Multi-Agent Coordination Backbone with LLM Engine) for evaluating multi-agent collaboration.
5+
6+
**Original Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
7+
(arXiv:2503.01935)
8+
9+
**Original Repository**: https://github.com/ulab-uiuc/MARBLE
10+
11+
## Setup
12+
13+
This benchmark requires the MARBLE source code. You can set it up automatically or manually.
14+
15+
### Option 1: Automatic Setup (Recommended)
16+
17+
Call `ensure_marble_exists()` to download MARBLE automatically if it is not already present:
18+
19+
```python
20+
from maseval.benchmark.multiagentbench import ensure_marble_exists, load_tasks
21+
22+
# This downloads MARBLE if not present (about 50MB)
23+
ensure_marble_exists()
24+
25+
# Now load tasks
26+
tasks = load_tasks("research", limit=1)
27+
print(f"Loaded {len(tasks)} task(s)")
28+
```
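
For readers curious what an "ensure exists" helper of this kind typically does, the following is a minimal sketch and not MASEval's actual implementation. `ensure_checkout`, `fake_fetch`, and the injected `fetch` callable are hypothetical names introduced for illustration; a real fetch step might shell out to `git clone`.

```python
from pathlib import Path
from typing import Callable


def ensure_checkout(target: Path, fetch: Callable[[Path], None]) -> bool:
    """Make sure `target` exists, calling `fetch(target)` only when it is missing.

    Hypothetical helper for illustration; returns True if a fetch happened.
    """
    if target.is_dir() and any(target.iterdir()):
        # A non-empty directory is treated as an existing checkout.
        return False
    fetch(target)
    if not target.is_dir():
        raise RuntimeError(f"fetch did not create {target}")
    return True


def fake_fetch(target: Path) -> None:
    # Stand-in for a real download step such as `git clone <url> <target>`.
    target.mkdir(parents=True, exist_ok=True)
    (target / "README.md").write_text("placeholder checkout\n")
```

Injecting the fetch step keeps the check idempotent and easy to test without network access: a second call on the same directory does nothing.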
29+
30+
### Option 2: Manual Clone
31+
32+
If you prefer to clone manually:
33+
34+
```bash
35+
cd maseval/benchmark/multiagentbench
36+
git clone https://github.com/ulab-uiuc/MARBLE.git marble
37+
cd marble
38+
# Pin to tested version (recommended)
39+
git checkout <pinned-commit-hash>
40+
```
41+
42+
### Install MARBLE Dependencies
43+
44+
MARBLE requires additional dependencies. Add them to your environment:
45+
46+
```bash
47+
# If using uv (recommended)
48+
uv add litellm ruamel.yaml
49+
50+
# Or with pip
51+
pip install litellm ruamel.yaml
52+
```
53+
54+
### Verify Setup
55+
56+
```python
57+
from maseval.benchmark.multiagentbench import load_tasks
58+
59+
# Should load without error
60+
tasks = load_tasks("research", limit=1)
61+
print(f"Loaded {len(tasks)} task(s)")
62+
```
63+
64+
## Usage
65+
66+
### Basic Usage (Abstract Base)
67+
68+
The abstract `MultiAgentBenchBenchmark` provides task loading, environment setup,
69+
and evaluation infrastructure. You implement `setup_agents()` with your framework:
70+
71+
```python
72+
from maseval.benchmark.multiagentbench import (
73+
    MultiAgentBenchBenchmark,
74+
    MultiAgentBenchEnvironment,
75+
    load_tasks,
76+
    configure_model_ids,
77+
)
78+
79+
class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
80+
    def setup_agents(self, agent_data, environment, task, user, seed_generator=None):
81+
        # Your framework-specific agent creation
82+
        ...
83+
84+
    def get_model_adapter(self, model_id, **kwargs):
85+
        adapter = MyModelAdapter(model_id)  # MyModelAdapter: placeholder for your own adapter class
86+
        if "register_name" in kwargs:
87+
            self.register("models", kwargs["register_name"], adapter)
88+
        return adapter
89+
90+
# Load and configure tasks
91+
tasks = load_tasks("research", limit=5)
92+
configure_model_ids(tasks, agent_model_id="gpt-4o")
93+
94+
# Run
95+
benchmark = MyMultiAgentBenchmark()
96+
results = benchmark.run(tasks)
97+
```
98+
99+
### MARBLE Reproduction
100+
101+
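Use `MarbleMultiAgentBenchBenchmark` to run MARBLE's native agents in reproduction mode. MARBLE's internal LLM calls bypass MASEval seeding, so results may not match the published numbers exactly: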
102+
103+
```python
104+
from maseval.benchmark.multiagentbench import (
105+
    MarbleMultiAgentBenchBenchmark,
106+
    load_tasks,
107+
    configure_model_ids,
108+
)
109+
110+
# Load tasks from a simple domain (no Docker required)
111+
tasks = load_tasks("research", limit=5)
112+
configure_model_ids(tasks, agent_model_id="gpt-4o")
113+
114+
# Create benchmark with model adapter implementation
115+
class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
116+
    def get_model_adapter(self, model_id, **kwargs):
117+
        from maseval.interface.openai import OpenAIModelAdapter
118+
        adapter = OpenAIModelAdapter(model_id)
119+
        if "register_name" in kwargs:
120+
            self.register("models", kwargs["register_name"], adapter)
121+
        return adapter
122+
123+
benchmark = MyMarbleBenchmark()
124+
results = benchmark.run(tasks)
125+
126+
# Print results
127+
for result in results:
128+
    print(f"Task: {result['task_id']}")
129+
    print(f"Status: {result['status']}")
130+
    if result['eval']:
131+
        print(f"Passed: {result['eval'][0]['passed']}")
132+
```
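
To aggregate results like these, a small helper can compute a pass rate. The sketch below assumes only the result shape used in the loop above (`task_id`, `status`, and an `eval` list whose first entry carries a `passed` flag). `summarize`, the sample task IDs, and the status strings are hypothetical; real result dictionaries may carry more fields.

```python
from typing import Any


def summarize(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Count evaluated tasks and compute the fraction that passed."""
    # Tasks with an empty or missing `eval` entry are excluded from the rate.
    evaluated = [r for r in results if r.get("eval")]
    passed = sum(1 for r in evaluated if r["eval"][0].get("passed"))
    rate = passed / len(evaluated) if evaluated else 0.0
    return {
        "total": len(results),
        "evaluated": len(evaluated),
        "passed": passed,
        "pass_rate": rate,
    }


# Hypothetical sample data in the same shape as the results printed above.
sample = [
    {"task_id": "research_1", "status": "success", "eval": [{"passed": True}]},
    {"task_id": "research_2", "status": "success", "eval": [{"passed": False}]},
    {"task_id": "research_3", "status": "error", "eval": None},
]
summary = summarize(sample)
```

Excluding unevaluated tasks from the denominator is one possible design choice; counting them as failures instead would give a stricter rate.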
133+
134+
## Domains
135+
136+
MultiAgentBench includes 7 domains with different requirements:
137+
138+
| Domain | External Dependencies | Initial Support |
139+
| --------------- | --------------------- | --------------- |
140+
| Research | None | Yes |
141+
| Bargaining | None | Yes |
142+
| Coding | Filesystem access | Yes |
143+
| Web | Network access | Yes |
144+
| WorldSimulation | None | Yes |
145+
| Database | Docker + PostgreSQL | Optional |
146+
| Minecraft | External game server | Deferred |
147+
148+
### Domain-Specific Notes
149+
150+
- **Research/Bargaining**: Recommended for initial testing - no infrastructure required
151+
- **Coding**: Creates files in a workspace directory
152+
- **Database**: Requires Docker with PostgreSQL image
153+
- **Minecraft**: Not currently supported (requires external game server)
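
The requirements above can be encoded for a quick pre-flight check before choosing a domain. This is a sketch under stated assumptions: `DOMAIN_REQUIREMENTS`, `domains_without_infrastructure`, and `docker_available` are hypothetical names that are not part of MASEval, and the mapping is transcribed by hand from the table.

```python
import shutil
from typing import Optional

# External requirement per domain, transcribed from the table above.
# None means no external dependency is listed.
DOMAIN_REQUIREMENTS: dict[str, Optional[str]] = {
    "research": None,
    "bargaining": None,
    "coding": "filesystem",
    "web": "network",
    "worldsimulation": None,
    "database": "docker",
    "minecraft": "game_server",
}


def domains_without_infrastructure() -> list[str]:
    """Domains that list no external dependency, in alphabetical order."""
    return sorted(d for d, req in DOMAIN_REQUIREMENTS.items() if req is None)


def docker_available() -> bool:
    """True if a `docker` executable is on PATH.

    This does not check that the daemon is running or that a PostgreSQL
    image is available.
    """
    return shutil.which("docker") is not None
```

A caller might skip the database domain when `docker_available()` returns `False` rather than letting task setup fail later.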
154+
155+
## Known Limitations
156+
157+
1. **Chain coordination mode bug**: MARBLE's `engine.py` references `get_agent_profiles_linked()`
158+
which doesn't exist in `AgentGraph`. Tasks using chain coordination may fail.
159+
160+
2. **SharedMemory is per-agent**: Despite the name, each MARBLE agent creates its own
161+
`SharedMemory` instance. Use `msg_box` for inter-agent communication.
162+
163+
3. **MARBLE is not bundled**: The MARBLE source must be present in the `marble/`
164+
subdirectory (gitignored by default), either downloaded via `ensure_marble_exists()` or cloned manually.
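
For the first limitation, one defensive option is to check for the missing method before attempting a chain-coordination run. The sketch below uses a stand-in class, not MARBLE's real `AgentGraph`. `supports_chain_mode` and `get_agent_profiles` are hypothetical names, and how you obtain the graph object from a MARBLE engine is left open.

```python
class StandInAgentGraph:
    """Stand-in for illustration only; this is not MARBLE's AgentGraph."""

    def get_agent_profiles(self) -> list[str]:
        # Hypothetical method, present only so the class is not empty.
        return ["agent1", "agent2"]


def supports_chain_mode(graph: object) -> bool:
    """Check for the method that chain coordination is reported to call."""
    return callable(getattr(graph, "get_agent_profiles_linked", None))


graph = StandInAgentGraph()
chain_ok = supports_chain_mode(graph)
```

If the check fails, skipping or relabeling chain-coordination tasks up front gives a clearer signal than an `AttributeError` raised mid-run.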
165+
166+
## File Structure
167+
168+
```
169+
multiagentbench/
170+
├── __init__.py              # Public API exports
171+
├── README.md                # This file
172+
├── PROVENANCE.md            # MARBLE version and license info
173+
├── .gitignore               # Ignores marble/ directory
174+
├── multiagentbench.py       # Benchmark classes
175+
├── environment.py           # MultiAgentBenchEnvironment
176+
├── data_loader.py           # Task loading utilities
177+
├── adapters/
178+
│   ├── __init__.py
179+
│   └── marble_adapter.py    # MarbleAgentAdapter
180+
└── marble/                  # ← Vendored MARBLE (gitignored)
181+
    └── ...
182+
```
