Commit 0667cc0

committed
added to docs
1 parent 9ce4840 commit 0667cc0

6 files changed

Lines changed: 300 additions & 13 deletions

BENCHMARKS.md

Lines changed: 15 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -27,9 +27,22 @@ $\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi
2727

2828
---
2929

30-
## 3. [Name of Next Benchmark]
30+
## 3. MultiAgentBench (MARBLE)
3131

32-
(Description for the third benchmark...)
32+
MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It covers diverse scenarios across domains such as research collaboration, negotiation, and coding.
33+
34+
### Source and License
35+
36+
- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
37+
- **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
38+
- **Code License:** MIT
39+
- **Data License:** MIT
40+
41+
---
42+
43+
## 4. [Name of Next Benchmark]
44+
45+
(Description for the next benchmark...)
3346

3447
### Source and License
3548

docs/benchmark/multiagentbench.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
# MultiAgentBench: Multi-Agent Collaboration Benchmark
2+
3+
The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
4+
5+
## Overview
6+
7+
[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features:
8+
9+
- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
10+
- **Multiple coordination modes**: cooperative, star, tree, hierarchical
11+
- **LLM-based evaluation**: Matches MARBLE's evaluation methodology
12+
- **Framework-agnostic**: Use with any agent framework or MARBLE's native agents
13+
14+
Reference Paper: [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
15+
16+
See the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for source and license information.
17+
18+
## Quick Start
19+
20+
```python
from maseval.benchmark.multiagentbench import (
    MultiAgentBenchBenchmark,
    MultiAgentBenchEnvironment,
    MultiAgentBenchEvaluator,
    load_tasks,
    configure_model_ids,
    ensure_marble_exists,
)

# Ensure MARBLE is installed (auto-downloads if needed)
ensure_marble_exists()

# Load and configure tasks
tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")

# Create your framework-specific benchmark subclass
class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
    def setup_agents(self, agent_data, environment, task, user):
        # Your framework-specific agent creation
        agent_configs = task.environment_data.get("agents", [])
        # Create agents based on configs...
        ...

    def get_model_adapter(self, model_id, **kwargs):
        adapter = MyModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

# Run benchmark
benchmark = MyMultiAgentBenchmark()
results = benchmark.run(tasks, agent_data={})
```
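The `get_model_adapter` override in the Quick Start registers the adapter only when a `register_name` keyword is supplied. A minimal, self-contained sketch of that convention is shown here. `ToyBenchmark`, `EchoAdapter`, and the dictionary-backed `register` are hypothetical stand-ins, not MASEval classes, and the real `register` method may behave differently.

```python
from collections import defaultdict


class EchoAdapter:
    """Hypothetical stand-in for a model adapter; it only stores the model ID."""

    def __init__(self, model_id):
        self.model_id = model_id


class ToyBenchmark:
    """Hypothetical stand-in for a benchmark base class with a component registry."""

    def __init__(self):
        # category -> {name -> component}; an assumption about how a registry could look
        self.registry = defaultdict(dict)

    def register(self, category, name, component):
        self.registry[category][name] = component

    def get_model_adapter(self, model_id, **kwargs):
        adapter = EchoAdapter(model_id)
        # Register only when the caller asked for a named entry.
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter


bench = ToyBenchmark()
anonymous = bench.get_model_adapter("gpt-4o")
named = bench.get_model_adapter("gpt-4o", register_name="evaluator")
```

In this sketch only the second call leaves an entry in the registry, so adapters created without a name are returned to the caller but not tracked.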
55+
56+
## MARBLE Reproduction Mode
57+
58+
For exact reproduction of MARBLE's published results, use `MarbleMultiAgentBenchBenchmark`, which wraps MARBLE's native agents:
59+
60+
```python
from maseval.benchmark.multiagentbench import (
    MarbleMultiAgentBenchBenchmark,
    load_tasks,
    configure_model_ids,
    ensure_marble_exists,
)

# Ensure MARBLE is installed
ensure_marble_exists()

# Load tasks
tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")

# Create benchmark with model adapter
class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
    def get_model_adapter(self, model_id, **kwargs):
        from maseval.interface.openai import OpenAIModelAdapter
        adapter = OpenAIModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

benchmark = MyMarbleBenchmark()
results = benchmark.run(tasks, agent_data={})
```
87+
88+
## Available Domains
89+
90+
| Domain | Description | Infrastructure |
|--------|-------------|----------------|
| `research` | Research idea generation and collaboration | None |
| `bargaining` | Negotiation scenarios (buyer/seller) | None |
| `coding` | Software development collaboration | Filesystem |
| `database` | Database manipulation and querying | Docker + PostgreSQL |
| `web` | Web-based task completion | Network |
| `worldsimulation` | World simulation and interaction | None |
| `minecraft` | Collaborative building | External server |
100+
## API Reference
101+
102+
::: maseval.benchmark.multiagentbench.MultiAgentBenchBenchmark
103+
104+
::: maseval.benchmark.multiagentbench.MarbleMultiAgentBenchBenchmark
105+
106+
::: maseval.benchmark.multiagentbench.MultiAgentBenchEnvironment
107+
108+
::: maseval.benchmark.multiagentbench.MultiAgentBenchEvaluator
109+
110+
::: maseval.benchmark.multiagentbench.MarbleAgentAdapter
111+
112+
::: maseval.benchmark.multiagentbench.load_tasks
113+
114+
::: maseval.benchmark.multiagentbench.configure_model_ids
115+
116+
::: maseval.benchmark.multiagentbench.ensure_marble_exists
117+
118+
::: maseval.benchmark.multiagentbench.download_marble
119+
120+
::: maseval.benchmark.multiagentbench.get_domain_info

maseval/benchmark/multiagentbench/README.md

Lines changed: 21 additions & 5 deletions
@@ -10,10 +10,26 @@ Framework-agnostic implementation of the MultiAgentBench benchmark suite from MA
1010

1111
## Setup
1212

13-
This benchmark requires the MARBLE source code to be cloned locally. The benchmark
14-
does NOT install MARBLE as a dependency - instead, it imports directly from a local copy.
13+
This benchmark requires the MARBLE source code. You can set it up automatically or manually.
1514

16-
### 1. Clone MARBLE
15+
### Option 1: Automatic Setup (Recommended)
16+
17+
MARBLE is downloaded automatically the first time it is needed. Calling `ensure_marble_exists()` triggers the download explicitly:
18+
19+
```python
from maseval.benchmark.multiagentbench import ensure_marble_exists, load_tasks

# This downloads MARBLE if not present (about 50 MB)
ensure_marble_exists()

# Now load tasks
tasks = load_tasks("research", limit=1)
print(f"Loaded {len(tasks)} task(s)")
```
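
The "download if not present" behavior described for `ensure_marble_exists()` is an idempotent check-then-fetch pattern. The sketch below illustrates that pattern in isolation. It is not MASEval's implementation: `ensure_checkout` and `fake_download` are hypothetical names, and the downloader only creates a placeholder directory instead of fetching MARBLE.

```python
import tempfile
from pathlib import Path


def ensure_checkout(target_dir, download):
    """Call download(target_dir) only if target_dir is missing; return True if a download ran."""
    target = Path(target_dir)
    if target.is_dir():
        return False  # already present, nothing to do
    download(target)
    return True


def fake_download(target):
    """Hypothetical downloader used for illustration; a real one might run `git clone`."""
    target.mkdir(parents=True)
    (target / "README.md").write_text("placeholder checkout\n")


with tempfile.TemporaryDirectory() as tmp:
    marble_dir = Path(tmp) / "marble"
    first_call = ensure_checkout(marble_dir, fake_download)   # directory missing, download runs
    second_call = ensure_checkout(marble_dir, fake_download)  # directory exists, download skipped
```

Because the second call is a no-op, a function written this way is safe to call at the top of every script, which matches how the examples in this commit use `ensure_marble_exists()`.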
29+
30+
### Option 2: Manual Clone
31+
32+
If you prefer to clone manually:
1733

1834
```bash
1935
cd maseval/benchmark/multiagentbench
@@ -23,7 +39,7 @@ cd marble
2339
git checkout <pinned-commit-hash>
2440
```
2541

26-
### 2. Install MARBLE Dependencies
42+
### Install MARBLE Dependencies
2743

2844
MARBLE requires additional dependencies. Add them to your environment:
2945

@@ -35,7 +51,7 @@ uv add litellm ruamel.yaml
3551
pip install litellm ruamel.yaml
3652
```
3753

38-
### 3. Verify Setup
54+
### Verify Setup
3955

4056
```python
4157
from maseval.benchmark.multiagentbench import load_tasks

maseval/benchmark/multiagentbench/__init__.py

Lines changed: 20 additions & 6 deletions
@@ -17,11 +17,17 @@
1717
- worldsimulation: World simulation and interaction
1818
1919
Setup:
20-
This benchmark requires MARBLE source code to be cloned locally:
20+
This benchmark requires MARBLE source code. It will be automatically
21+
downloaded when you first use `load_tasks()`, or you can set it up manually:
2122
22-
```bash
23-
cd maseval/benchmark/multiagentbench
24-
git clone https://github.com/ulab-uiuc/MARBLE.git marble
23+
```python
24+
# Option 1: Automatic download (recommended)
25+
from maseval.benchmark.multiagentbench import ensure_marble_exists
26+
ensure_marble_exists() # Downloads MARBLE if not present
27+
28+
# Option 2: Manual clone
29+
# cd maseval/benchmark/multiagentbench
30+
# git clone https://github.com/ulab-uiuc/MARBLE.git marble
2531
```
2632
2733
See README.md in this directory for detailed setup instructions.
@@ -35,10 +41,14 @@
3541
MarbleAgentAdapter,
3642
load_tasks,
3743
configure_model_ids,
44+
ensure_marble_exists,
3845
get_domain_info,
3946
VALID_DOMAINS,
4047
)
4148
49+
# Ensure MARBLE is installed (auto-downloads if needed)
50+
ensure_marble_exists()
51+
4252
# Load and configure tasks
4353
tasks = load_tasks("research", limit=5)
4454
configure_model_ids(tasks, agent_model_id="gpt-4o")
@@ -103,11 +113,13 @@ def get_model_adapter(self, model_id, **kwargs):
103113
create_marble_agents,
104114
)
105115

106-
# Data loading
116+
# Data loading and setup
107117
from maseval.benchmark.multiagentbench.data_loader import (
108118
load_tasks,
109119
configure_model_ids,
110120
get_domain_info,
121+
ensure_marble_exists,
122+
download_marble,
111123
VALID_DOMAINS,
112124
INFRASTRUCTURE_DOMAINS as INFRASTRUCTURE_REQUIRED_DOMAINS,
113125
)
@@ -126,10 +138,12 @@ def get_model_adapter(self, model_id, **kwargs):
126138
# Agent adapters
127139
"MarbleAgentAdapter",
128140
"create_marble_agents",
129-
# Data loading
141+
# Data loading and setup
130142
"load_tasks",
131143
"configure_model_ids",
132144
"get_domain_info",
145+
"ensure_marble_exists",
146+
"download_marble",
133147
"VALID_DOMAINS",
134148
"INFRASTRUCTURE_REQUIRED_DOMAINS",
135149
]
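
This hunk adds `ensure_marble_exists` and `download_marble` to both the import list and `__all__`, and the two lists have to stay in sync. One possible sanity check is sketched below. `missing_exports` is a hypothetical helper, and the toy module stands in for the real package so the example runs without MASEval installed.

```python
import types


def missing_exports(module):
    """Return names listed in module.__all__ that the module does not define."""
    return [name for name in getattr(module, "__all__", []) if not hasattr(module, name)]


# Hypothetical toy module standing in for a package __init__.
toy = types.ModuleType("toy_package")
toy.load_tasks = lambda domain: []
toy.ensure_marble_exists = lambda: None
toy.__all__ = ["load_tasks", "ensure_marble_exists", "download_marble"]

print(missing_exports(toy))  # ['download_marble']
```

Passing an imported package to the same helper should return an empty list when every exported name is defined.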
