maseval
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
Lines changed: 14 additions & 1 deletion
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 18 additions & 11 deletions
diff --git a/docs/benchmark/multiagentbench.md b/docs/benchmark/multiagentbench.md
Lines changed: 34 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/.gitignore b/maseval/benchmark/multiagentbench/.gitignore
Lines changed: 10 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/PROVENANCE.md b/maseval/benchmark/multiagentbench/PROVENANCE.md
Lines changed: 69 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/README.md b/maseval/benchmark/multiagentbench/README.md
Lines changed: 182 additions & 0 deletions
@@ -27,7 +27,20 @@ $\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi
 
 ---
 
-## 3. Gaia2 (ARE)
+## 3. MultiAgentBench (MARBLE)
+
+MultiAgentBench is a benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. Its scenarios span several domains, including research collaboration, negotiation, and coding.
+
+### Source and License
+
+- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
+- **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
+- **Code License:** MIT
+- **Data License:** MIT
+
+---
+
+## 4. Gaia2 (ARE)
 
 Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.
 
 
@@ -12,19 +12,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Benchmarks**
 
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
-- `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
-- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
-- Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
-- Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
-- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
-- Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
-- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
+  - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
+  - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
+  - Tool wrapper (`AREToolWrapper`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26)
+  - Data loading utilities: `load_tasks()`, `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
+  - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
+  - Support for 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, noise (PR: #26)
+  - Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
+
+- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across research, bargaining, coding, and database domains (PR: #25)
+  - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
+  - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
+  - `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
+  - Data loading utilities: `load_tasks()`, `configure_model_ids()`, `get_domain_info()`, `ensure_marble_exists()` (PR: #25)
+  - MARBLE adapter: `MarbleAgentAdapter` for wrapping MARBLE agents with MASEval tracing (PR: #25)
 
 **Examples**
 
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
-**Seeding System**
+**Core**
 
 - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
 - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
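
Note on the seeding entries above: the changelog describes SHA-256-based seed derivation but does not show it. The block below is a minimal sketch of the general idea, not MASEval's actual `DefaultSeedGenerator`. The function name `derive_seed`, the path labels, and the choice of a 32-bit output are all assumptions made for illustration.

```python
import hashlib


def derive_seed(root_seed: int, *path: str) -> int:
    """Derive a deterministic 32-bit child seed from a root seed and labels.

    Hypothetical helper for illustration; not part of the MASEval API.
    """
    # Join the root seed and the labels into a single string key.
    key = "/".join([str(root_seed), *path]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 4 bytes of the digest as an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


# The same inputs always give the same seed; different labels give
# different hash inputs, so components do not share a random stream.
agent_seed = derive_seed(42, "task-0", "agent")
evaluator_seed = derive_seed(42, "task-0", "evaluator")
```

Hashing a path of labels, rather than incrementing a counter, means a component's seed does not depend on the order in which other components request theirs.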
@@ -36,9 +43,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Interface**
 
 - CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
-- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
-- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
-- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
+  - Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
+  - Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
+  - Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
 
 ### Changed
 
 
@@ -0,0 +1,34 @@
+# MultiAgentBench: Multi-Agent Collaboration Benchmark
+
+The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across diverse scenarios including research, negotiation, coding, and more.
+
+[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) comes from the MARBLE framework. The benchmark features:
+
+- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
+- **Multiple coordination modes**: star, tree, graph, chain
+- **LLM-based evaluation**: Matches MARBLE's evaluation methodology
+- **Framework-agnostic**: Use with any agent framework or MARBLE's native agents
+
+Reference Paper: [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
+
+Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
+
+::: maseval.benchmark.multiagentbench.MultiAgentBenchBenchmark
+
+::: maseval.benchmark.multiagentbench.MarbleMultiAgentBenchBenchmark
+
+::: maseval.benchmark.multiagentbench.MultiAgentBenchEnvironment
+
+::: maseval.benchmark.multiagentbench.MultiAgentBenchEvaluator
+
+::: maseval.benchmark.multiagentbench.MarbleAgentAdapter
+
+::: maseval.benchmark.multiagentbench.load_tasks
+
+::: maseval.benchmark.multiagentbench.configure_model_ids
+
+::: maseval.benchmark.multiagentbench.ensure_marble_exists
+
+::: maseval.benchmark.multiagentbench.download_marble
+
+::: maseval.benchmark.multiagentbench.get_domain_info
@@ -0,0 +1,10 @@
+# Vendored MARBLE source (downloaded automatically or cloned manually)
+marble/
+
+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+
+# Test artifacts
+.pytest_cache/
@@ -0,0 +1,69 @@
+# MARBLE Integration Provenance
+
+## Source Information
+
+- **Source Repository**: https://github.com/ulab-uiuc/MARBLE
+- **Version**: Not yet pinned (clone latest and test)
+- **License**: MIT (Copyright 2024 Haofei Yu)
+- **Vendoring**: Permitted by MIT license with attribution
+
+## Reference
+
+**Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
+
+- arXiv: https://arxiv.org/abs/2503.01935
+- Authors: Haofei Yu et al.
+- Publication Date: 2025
+
+## License Text (MIT)
+
+```
+MIT License
+
+Copyright (c) 2024 Haofei Yu
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+```
+
+## Known Issues in MARBLE
+
+1. **Missing method**: `AgentGraph.get_agent_profiles_linked()` does not exist but is
+   called in `engine.py:702`. This breaks chain coordination mode.
+
+2. **SharedMemory naming**: Despite the name, `SharedMemory` is instantiated per-agent
+   in `BaseAgent.__init__()` and is NOT shared between agents. Use `msg_box` for
+   inter-agent communication.
+
+3. **Environment constructor signature**: Some environments expect different constructor
+   arguments. Check each environment's `__init__` signature before use.
+
+## Local Patches Applied
+
+None currently. Document any patches here if applied.
+
+## Update Process
+
+To update MARBLE to a newer version:
+
+1. `cd maseval/benchmark/multiagentbench/marble`
+2. `git fetch origin`
+3. `git log --oneline origin/main` (review changes)
+4. `git checkout <new-commit-hash>`
+5. Run integration tests
+6. Update this file with new version info
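
Note on known issue 1 above: until the missing method is patched, a caller can check for it before running a chain-mode task. The block below is a minimal sketch under stated assumptions. `supports_chain_mode` is a hypothetical helper, and the two graph classes are stand-ins so the example runs without MARBLE installed. The real `AgentGraph` and the eventual signature of `get_agent_profiles_linked()` may differ.

```python
def supports_chain_mode(graph: object) -> bool:
    """Return True if the graph exposes the method that chain coordination calls.

    Hypothetical helper; it only inspects the object and never calls MARBLE.
    """
    return callable(getattr(graph, "get_agent_profiles_linked", None))


class GraphWithoutMethod:
    """Stand-in for an AgentGraph that lacks the method (the reported bug)."""


class GraphWithMethod:
    """Stand-in for a locally patched AgentGraph."""

    def get_agent_profiles_linked(self):
        # Placeholder body; the real return value is not specified here.
        return []


missing = supports_chain_mode(GraphWithoutMethod())
patched = supports_chain_mode(GraphWithMethod())
```

A check like this lets a runner skip or flag chain-mode tasks up front instead of failing partway through a run with an `AttributeError`.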
@@ -0,0 +1,182 @@
+# MultiAgentBench Integration
+
+Framework-agnostic implementation of the MultiAgentBench benchmark suite from MARBLE
+(Multi-Agent Coordination Backbone with LLM Engine) for evaluating multi-agent collaboration.
+
+**Original Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
+(arXiv:2503.01935)
+
+**Original Repository**: https://github.com/ulab-uiuc/MARBLE
+
+## Setup
+
+This benchmark requires the MARBLE source code. You can set it up automatically or manually.
+
+### Option 1: Automatic Setup (Recommended)
+
+Call `ensure_marble_exists()` to download MARBLE if it is not already present:
+
+```python
+from maseval.benchmark.multiagentbench import ensure_marble_exists, load_tasks
+
+# This downloads MARBLE if not present (about 50MB)
+ensure_marble_exists()
+
+# Now load tasks
+tasks = load_tasks("research", limit=1)
+print(f"Loaded {len(tasks)} task(s)")
+```
+
+### Option 2: Manual Clone
+
+If you prefer to clone manually:
+
+```bash
+cd maseval/benchmark/multiagentbench
+git clone https://github.com/ulab-uiuc/MARBLE.git marble
+cd marble
+# Pin to tested version (recommended)
+git checkout <pinned-commit-hash>
+```
+
+### Install MARBLE Dependencies
+
+MARBLE requires additional dependencies. Add them to your environment:
+
+```bash
+# If using uv (recommended)
+uv add litellm ruamel.yaml
+
+# Or with pip
+pip install litellm ruamel.yaml
+```
+
+### Verify Setup
+
+```python
+from maseval.benchmark.multiagentbench import load_tasks
+
+# Should load without error
+tasks = load_tasks("research", limit=1)
+print(f"Loaded {len(tasks)} task(s)")
+```
+
+## Usage
+
+### Basic Usage (Abstract Base)
+
+The abstract `MultiAgentBenchBenchmark` provides task loading, environment setup,
+and evaluation infrastructure. You implement `setup_agents()` with your framework:
+
+```python
+from maseval.benchmark.multiagentbench import (
+    MultiAgentBenchBenchmark,
+    MultiAgentBenchEnvironment,
+    load_tasks,
+    configure_model_ids,
+)
+
+class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
+    def setup_agents(self, agent_data, environment, task, user, seed_generator=None):
+        # Your framework-specific agent creation
+        ...
+
+    def get_model_adapter(self, model_id, **kwargs):
+        adapter = MyModelAdapter(model_id)  # placeholder: your own model adapter class
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+# Load and configure tasks
+tasks = load_tasks("research", limit=5)
+configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+# Run
+benchmark = MyMultiAgentBenchmark()
+results = benchmark.run(tasks)
+```
+
+### MARBLE Reproduction
+
+Use `MarbleMultiAgentBenchBenchmark` to reproduce MARBLE's setup with its native agents (MARBLE's internal LLM calls bypass MASEval seeding):
+
+```python
+from maseval.benchmark.multiagentbench import (
+    MarbleMultiAgentBenchBenchmark,
+    load_tasks,
+    configure_model_ids,
+)
+
+# Load tasks from a simple domain (no Docker required)
+tasks = load_tasks("research", limit=5)
+configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+# Create benchmark with model adapter implementation
+class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
+    def get_model_adapter(self, model_id, **kwargs):
+        from maseval.interface.openai import OpenAIModelAdapter
+        adapter = OpenAIModelAdapter(model_id)
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+benchmark = MyMarbleBenchmark()
+results = benchmark.run(tasks)
+
+# Print results
+for result in results:
+    print(f"Task: {result['task_id']}")
+    print(f"Status: {result['status']}")
+    if result['eval']:
+        print(f"Passed: {result['eval'][0]['passed']}")
+```
+
+## Domains
+
+MultiAgentBench includes 7 domains with different requirements:
+
+| Domain          | External Dependencies | Initial Support |
+| --------------- | --------------------- | --------------- |
+| Research        | None                  | Yes             |
+| Bargaining      | None                  | Yes             |
+| Coding          | Filesystem access     | Yes             |
+| Web             | Network access        | Yes             |
+| WorldSimulation | None                  | Yes             |
+| Database        | Docker + PostgreSQL   | Optional        |
+| Minecraft       | External game server  | Deferred        |
+
+### Domain-Specific Notes
+
+- **Research/Bargaining**: Recommended for initial testing - no infrastructure required
+- **Coding**: Creates files in a workspace directory
+- **Database**: Requires Docker with PostgreSQL image
+- **Minecraft**: Not currently supported (requires external game server)
+
+## Known Limitations
+
+1. **Chain coordination mode bug**: MARBLE's `engine.py` references `get_agent_profiles_linked()`
+   which doesn't exist in `AgentGraph`. Tasks using chain coordination may fail.
+
+2. **SharedMemory is per-agent**: Despite the name, each MARBLE agent creates its own
+   `SharedMemory` instance. Use `msg_box` for inter-agent communication.
+
+3. **MARBLE is not bundled**: The source must be in the `marble/` subdirectory (gitignored
+   by default), downloaded with `ensure_marble_exists()` or cloned manually.
+
+## File Structure
+
+```
+multiagentbench/
+├── __init__.py              # Public API exports
+├── README.md                # This file
+├── PROVENANCE.md            # MARBLE version and license info
+├── .gitignore               # Ignores marble/ directory
+├── multiagentbench.py       # Benchmark classes
+├── environment.py           # MultiAgentBenchEnvironment
+├── data_loader.py           # Task loading utilities
+├── adapters/
+│   ├── __init__.py
+│   └── marble_adapter.py    # MarbleAgentAdapter
+└── marble/                  # ← Vendored MARBLE (gitignored)
+    └── ...
+```
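
Note on the MARBLE Reproduction example in this README: it prints each result individually. The block below is a minimal sketch of aggregating those results into a pass rate. It assumes only the keys that example reads (`task_id`, `status`, `eval`, and `passed` on the first eval entry). The `summarize` helper, the sample task IDs, and the status strings are hypothetical.

```python
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count evaluated and passed tasks in a list of result dictionaries.

    Hypothetical helper; assumes each result may carry an 'eval' list whose
    first entry has a boolean 'passed' field, as in the README example.
    """
    # Results without evaluation output (None or empty list) are excluded
    # from the pass rate but still counted in the total.
    evaluated = [r for r in results if r.get("eval")]
    passed = sum(1 for r in evaluated if r["eval"][0].get("passed"))
    return {
        "total": len(results),
        "evaluated": len(evaluated),
        "passed": passed,
        "pass_rate": passed / len(evaluated) if evaluated else 0.0,
    }


# Made-up sample data in the shape the README example iterates over.
sample_results = [
    {"task_id": "research_1", "status": "success", "eval": [{"passed": True}]},
    {"task_id": "research_2", "status": "success", "eval": [{"passed": False}]},
    {"task_id": "research_3", "status": "error", "eval": None},
]
summary = summarize(sample_results)
```

Keeping unevaluated tasks out of the denominator separates task failures from infrastructure errors. Whether that is the right convention depends on how the results are reported.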