Commit 9ce4840

initial implementation
1 parent 9330f55 commit 9ce4840

16 files changed

Lines changed: 3660 additions & 0 deletions

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Vendored MARBLE source (clone manually)
marble/

# Python cache
__pycache__/
*.pyc
*.pyo

# Test artifacts
.pytest_cache/
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# MARBLE Integration Provenance

## Source Information

- **Source Repository**: https://github.com/ulab-uiuc/MARBLE
- **Version**: Not yet pinned (clone latest and test)
- **License**: MIT (Copyright 2024 Haofei Yu)
- **Vendoring**: Permitted by MIT license with attribution

## Reference

**Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
- arXiv: https://arxiv.org/abs/2503.01935
- Authors: Haofei Yu et al.
- Publication Date: 2025

## License Text (MIT)

```
MIT License

Copyright (c) 2024 Haofei Yu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

## Known Issues in MARBLE

1. **Missing method**: `AgentGraph.get_agent_profiles_linked()` does not exist but is
   called in `engine.py:702`. This breaks chain coordination mode.

2. **SharedMemory naming**: Despite the name, `SharedMemory` is instantiated per-agent
   in `BaseAgent.__init__()` and is NOT shared between agents. Use `msg_box` for
   inter-agent communication.

3. **Environment constructor signature**: Some environments expect different constructor
   arguments. Check each environment's `__init__` signature before use.
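
Issue 1 can in principle be detected before a run instead of at call time. The following is a minimal sketch, assuming only that the class under inspection is importable. `missing_methods` and the stand-in `FakeAgentGraph` (including its `add_agent` method) are hypothetical names for illustration and are not part of MARBLE:

```python
def missing_methods(cls, required):
    """Return the names in `required` that `cls` does not expose as callables."""
    return [name for name in required if not callable(getattr(cls, name, None))]


class FakeAgentGraph:
    """Hypothetical stand-in for MARBLE's AgentGraph, used only in this sketch."""

    def add_agent(self, agent):  # invented placeholder method
        pass


# `get_agent_profiles_linked` is the method named in issue 1 above.
gaps = missing_methods(FakeAgentGraph, ["add_agent", "get_agent_profiles_linked"])
print(gaps)
```

Running the same check against the real `AgentGraph` before selecting chain coordination would let a harness skip or flag the affected tasks.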

## Local Patches Applied

None currently. Document any patches here if applied.

## Update Process

To update MARBLE to a newer version:

1. `cd maseval/benchmark/multiagentbench/marble`
2. `git fetch origin`
3. `git log --oneline origin/main` (review changes)
4. `git checkout <new-commit-hash>`
5. Run integration tests
6. Update this file with new version info
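
The last step could be scripted. A minimal sketch, assuming the `- **Version**: ...` bullet keeps the format used in this file; `pin_version` is a hypothetical helper and `abc1234` is a made-up commit hash:

```python
import re


def pin_version(provenance_text, commit_hash):
    """Replace the single '- **Version**: ...' bullet with the given commit hash."""
    pattern = re.compile(r"^- \*\*Version\*\*: .*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(f"- **Version**: {commit_hash}", provenance_text)
    if count != 1:
        raise ValueError(f"expected exactly one Version line, found {count}")
    return new_text


sample = (
    "- **Source Repository**: https://github.com/ulab-uiuc/MARBLE\n"
    "- **Version**: Not yet pinned (clone latest and test)\n"
)
updated = pin_version(sample, "abc1234")  # made-up hash for the demo
print(updated)
```

Raising when the bullet is missing or duplicated keeps the script from silently leaving the file unpinned.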

## Last Updated

- **Date**: 2026-01-19
- **Updated By**: Claude Code
- **Version Tested**: Initial integration (not yet pinned)
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# MultiAgentBench Integration

Framework-agnostic implementation of the MultiAgentBench benchmark suite from MARBLE
(Multi-Agent Coordination Backbone with LLM Engine) for evaluating multi-agent collaboration.

**Original Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
(arXiv:2503.01935)

**Original Repository**: https://github.com/ulab-uiuc/MARBLE

## Setup

This benchmark requires the MARBLE source code to be cloned locally. The benchmark
does NOT install MARBLE as a dependency; instead, it imports directly from a local copy.

### 1. Clone MARBLE

```bash
cd maseval/benchmark/multiagentbench
git clone https://github.com/ulab-uiuc/MARBLE.git marble
cd marble
# Pin to tested version (recommended)
git checkout <pinned-commit-hash>
```

### 2. Install MARBLE Dependencies

MARBLE requires additional dependencies. Add them to your environment:

```bash
# If using uv (recommended)
uv add litellm ruamel.yaml

# Or with pip
pip install litellm ruamel.yaml
```

### 3. Verify Setup

```python
from maseval.benchmark.multiagentbench import load_tasks

# Should load without error
tasks = load_tasks("research", limit=1)
print(f"Loaded {len(tasks)} task(s)")
```
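
Because the clone is a manual step, it can help to confirm the directory exists before importing anything. The following is a minimal sketch using only the standard library; `marble_clone_present` is a hypothetical helper, not part of this package, and the demo runs against a temporary directory, not the real checkout:

```python
import tempfile
from pathlib import Path


def marble_clone_present(benchmark_dir):
    """Return True if `<benchmark_dir>/marble` exists and contains at least one entry."""
    marble_dir = Path(benchmark_dir) / "marble"
    return marble_dir.is_dir() and any(marble_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    before = marble_clone_present(tmp)  # no marble/ directory yet
    (Path(tmp) / "marble").mkdir()
    (Path(tmp) / "marble" / "placeholder.txt").touch()  # stands in for cloned files
    after = marble_clone_present(tmp)

print(before, after)
```

In a real checkout the argument would be the `maseval/benchmark/multiagentbench` directory used in the clone step.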

## Usage

### Basic Usage (Abstract Base)

The abstract `MultiAgentBenchBenchmark` provides task loading, environment setup,
and evaluation infrastructure. You implement `setup_agents()` with your framework:

```python
from maseval.benchmark.multiagentbench import (
    MultiAgentBenchBenchmark,
    MultiAgentBenchEnvironment,
    load_tasks,
    configure_model_ids,
)

class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
    def setup_agents(self, agent_data, environment, task, user):
        # Your framework-specific agent creation
        ...

    def get_model_adapter(self, model_id, **kwargs):
        # MyModelAdapter is a placeholder for your own model adapter class
        adapter = MyModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

# Load and configure tasks
tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")

# Run
benchmark = MyMultiAgentBenchmark()
results = benchmark.run(tasks)
```

### MARBLE Reproduction

Use `MarbleMultiAgentBenchBenchmark` for exact reproduction of MARBLE's published results:

```python
from maseval.benchmark.multiagentbench import (
    MarbleMultiAgentBenchBenchmark,
    load_tasks,
    configure_model_ids,
)

# Load tasks from a simple domain (no Docker required)
tasks = load_tasks("research", limit=5)
configure_model_ids(tasks, agent_model_id="gpt-4o")

# Create benchmark with model adapter implementation
class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
    def get_model_adapter(self, model_id, **kwargs):
        from maseval.interface.openai import OpenAIModelAdapter
        adapter = OpenAIModelAdapter(model_id)
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter

benchmark = MyMarbleBenchmark()
results = benchmark.run(tasks)

# Print results
for result in results:
    print(f"Task: {result['task_id']}")
    print(f"Status: {result['status']}")
    if result['eval']:
        print(f"Passed: {result['eval'][0]['passed']}")
```
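
The per-task fields printed above can also be rolled up into a single number. A minimal sketch, assuming each result is a dict shaped like the one in the loop above; `pass_rate` is a hypothetical helper, and the sample data (including the `status` values) is made up:

```python
def pass_rate(results):
    """Fraction of evaluated results whose first evaluation entry has passed=True."""
    evaluated = [r for r in results if r.get("eval")]
    if not evaluated:
        return 0.0
    passed = sum(1 for r in evaluated if r["eval"][0].get("passed"))
    return passed / len(evaluated)


# Made-up results carrying only the fields used in the loop above.
sample_results = [
    {"task_id": "t1", "status": "success", "eval": [{"passed": True}]},
    {"task_id": "t2", "status": "success", "eval": [{"passed": False}]},
    {"task_id": "t3", "status": "error", "eval": None},
]
print(pass_rate(sample_results))
```

Tasks without an evaluation entry are left out of the denominator here; counting them as failures is an equally defensible choice.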

## Domains

MultiAgentBench includes 7 domains with different requirements:

| Domain | External Dependencies | Initial Support |
|--------|----------------------|-----------------|
| Research | None | Yes |
| Bargaining | None | Yes |
| Coding | Filesystem access | Yes |
| Web | Network access | Yes |
| WorldSimulation | None | Yes |
| Database | Docker + PostgreSQL | Optional |
| Minecraft | External game server | Deferred |

### Domain-Specific Notes

- **Research/Bargaining**: Recommended for initial testing; no infrastructure required
- **Coding**: Creates files in a workspace directory
- **Database**: Requires Docker with PostgreSQL image
- **Minecraft**: Not currently supported (requires external game server)
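
The table can be turned into a simple filter for picking domains that run without extra infrastructure. A minimal sketch that copies the dependency column by hand; `DOMAIN_DEPENDENCIES` and `domains_without_dependencies` are hypothetical names, and the package's own `VALID_DOMAINS` and `INFRASTRUCTURE_DOMAINS` constants are not used because their contents are not shown here:

```python
# External dependencies per domain, copied from the table above.
DOMAIN_DEPENDENCIES = {
    "research": None,
    "bargaining": None,
    "coding": "Filesystem access",
    "web": "Network access",
    "worldsimulation": None,
    "database": "Docker + PostgreSQL",
    "minecraft": "External game server",
}


def domains_without_dependencies(dependencies):
    """Return the domains that list no external dependency, in table order."""
    return [name for name, dep in dependencies.items() if dep is None]


print(domains_without_dependencies(DOMAIN_DEPENDENCIES))
```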

## Known Limitations

1. **Chain coordination mode bug**: MARBLE's `engine.py` references `get_agent_profiles_linked()`
   which doesn't exist in `AgentGraph`. Tasks using chain coordination may fail.

2. **SharedMemory is per-agent**: Despite the name, each MARBLE agent creates its own
   `SharedMemory` instance. Use `msg_box` for inter-agent communication.

3. **Requires manual MARBLE clone**: MARBLE must be cloned manually into the `marble/`
   subdirectory (gitignored by default).
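
The practical effect of limitation 2 can be shown with a toy model. This sketch does not use MARBLE's classes: `ToyAgent`, its `memory` dict, and its `inbox` list are hypothetical stand-ins for a per-agent store and a message box, and MARBLE's actual `msg_box` interface is not shown here:

```python
class ToyAgent:
    """Toy agent: private memory per instance, plus an inbox other agents can write to."""

    def __init__(self, name):
        self.name = name
        self.memory = {}  # created per agent, so other agents never see it
        self.inbox = []   # stands in for a message box

    def send(self, other, text):
        """Deliver a message by appending it to the recipient's inbox."""
        other.inbox.append((self.name, text))


alice, bob = ToyAgent("alice"), ToyAgent("bob")
alice.memory["plan"] = "draft outline"  # stays private to alice
alice.send(bob, "draft outline")        # explicit message reaches bob

print("plan" in bob.memory)
print(bob.inbox)
```

Anything one agent needs another to know has to travel as an explicit message; writing it to the per-agent store alone does not propagate it.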

## File Structure

```
multiagentbench/
├── __init__.py           # Public API exports
├── README.md             # This file
├── PROVENANCE.md         # MARBLE version and license info
├── .gitignore            # Ignores marble/ directory
├── multiagentbench.py    # Benchmark classes
├── environment.py        # MultiAgentBenchEnvironment
├── evaluator.py          # MultiAgentBenchEvaluator
├── data_loader.py        # Task loading utilities
├── adapters/
│   ├── __init__.py
│   └── marble_adapter.py # MarbleAgentAdapter
└── marble/               # ← Vendored MARBLE (gitignored)
    └── ...
```
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
"""MultiAgentBench - Multi-Agent Coordination Benchmark from MARBLE.

Framework-agnostic implementation of the MultiAgentBench benchmark suite for
evaluating multi-agent collaboration and competition in LLM-based systems.

Original Repository: https://github.com/ulab-uiuc/MARBLE
Paper: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
    (arXiv:2503.01935)

Domains:
    - research: Research idea generation and collaboration
    - bargaining: Negotiation and bargaining scenarios
    - coding: Software development collaboration
    - database: Database manipulation and querying (requires Docker)
    - minecraft: Collaborative building (requires external server)
    - web: Web-based task completion
    - worldsimulation: World simulation and interaction

Setup:
    This benchmark requires MARBLE source code to be cloned locally:

    ```bash
    cd maseval/benchmark/multiagentbench
    git clone https://github.com/ulab-uiuc/MARBLE.git marble
    ```

    See README.md in this directory for detailed setup instructions.

Usage:
    from maseval.benchmark.multiagentbench import (
        MultiAgentBenchBenchmark,
        MarbleMultiAgentBenchBenchmark,
        MultiAgentBenchEnvironment,
        MultiAgentBenchEvaluator,
        MarbleAgentAdapter,
        load_tasks,
        configure_model_ids,
        get_domain_info,
        VALID_DOMAINS,
    )

    # Load and configure tasks
    tasks = load_tasks("research", limit=5)
    configure_model_ids(tasks, agent_model_id="gpt-4o")

    # Create your framework-specific benchmark subclass
    class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
        def setup_agents(self, agent_data, environment, task, user):
            # Create your agents
            ...

        def get_model_adapter(self, model_id, **kwargs):
            adapter = MyModelAdapter(model_id)
            if "register_name" in kwargs:
                self.register("models", kwargs["register_name"], adapter)
            return adapter

    # Run benchmark
    benchmark = MyMultiAgentBenchmark()
    results = benchmark.run(tasks, agent_data={})

MARBLE Reproduction Mode:
    For exact reproduction of MARBLE's published results, use
    MarbleMultiAgentBenchBenchmark which wraps MARBLE's native agents:

    ```python
    class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
        def get_model_adapter(self, model_id, **kwargs):
            from maseval.interface.openai import OpenAIModelAdapter
            adapter = OpenAIModelAdapter(model_id)
            if "register_name" in kwargs:
                self.register("models", kwargs["register_name"], adapter)
            return adapter

    benchmark = MyMarbleBenchmark()
    results = benchmark.run(tasks, agent_data={})
    ```
"""

# Core benchmark classes
from maseval.benchmark.multiagentbench.multiagentbench import (
    MultiAgentBenchBenchmark,
    MarbleMultiAgentBenchBenchmark,
)

# Environment
from maseval.benchmark.multiagentbench.environment import (
    MultiAgentBenchEnvironment,
    INFRASTRUCTURE_DOMAINS,
)

# Evaluator
from maseval.benchmark.multiagentbench.evaluator import (
    MultiAgentBenchEvaluator,
    MultiAgentBenchMetrics,
)

# Agent adapters
from maseval.benchmark.multiagentbench.adapters import (
    MarbleAgentAdapter,
)
from maseval.benchmark.multiagentbench.adapters.marble_adapter import (
    create_marble_agents,
)

# Data loading
from maseval.benchmark.multiagentbench.data_loader import (
    load_tasks,
    configure_model_ids,
    get_domain_info,
    VALID_DOMAINS,
    INFRASTRUCTURE_DOMAINS as INFRASTRUCTURE_REQUIRED_DOMAINS,
)


__all__ = [
    # Core benchmark classes
    "MultiAgentBenchBenchmark",
    "MarbleMultiAgentBenchBenchmark",
    # Environment
    "MultiAgentBenchEnvironment",
    "INFRASTRUCTURE_DOMAINS",
    # Evaluator
    "MultiAgentBenchEvaluator",
    "MultiAgentBenchMetrics",
    # Agent adapters
    "MarbleAgentAdapter",
    "create_marble_agents",
    # Data loading
    "load_tasks",
    "configure_model_ids",
    "get_domain_info",
    "VALID_DOMAINS",
    "INFRASTRUCTURE_REQUIRED_DOMAINS",
]
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
"""Agent adapters for MultiAgentBench."""

from maseval.benchmark.multiagentbench.adapters.marble_adapter import (
    MarbleAgentAdapter,
)

__all__ = ["MarbleAgentAdapter"]
