Add MultiAgentBench (MARBLE) to docs

cemde · cemde · commit 0667cc07ede3 · 2026-01-19T23:56:42.000+01:00
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
@@ -27,9 +27,22 @@ $\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi
 
 ---
 
-## 3. [Name of Next Benchmark]
+## 3. MultiAgentBench (MARBLE)
 
-(Description for the third benchmark...)
+MultiAgentBench is a benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. Its scenarios span seven domains, including research collaboration, negotiation (bargaining), and coding.
+
+### Source and License
+
+- **Original Repository:** [https://github.com/ulab-uiuc/MARBLE](https://github.com/ulab-uiuc/MARBLE)
+- **Paper:** [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
+- **Code License:** MIT
+- **Data License:** MIT
+
+---
+
+## 4. [Name of Next Benchmark]
+
+(Description for the next benchmark...)
 
 ### Source and License
 
diff --git a/docs/benchmark/multiagentbench.md b/docs/benchmark/multiagentbench.md
@@ -0,0 +1,120 @@
+# MultiAgentBench: Multi-Agent Collaboration Benchmark
+
+The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across seven domains, including research, negotiation, and coding.
+
+## Overview
+
+[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features:
+
+- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft
+- **Multiple coordination modes**: cooperative, star, tree, hierarchical
+- **LLM-based evaluation**: Matches MARBLE's evaluation methodology
+- **Framework-agnostic**: Use with any agent framework or MARBLE's native agents
+
+Reference Paper: [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935)
+
+See [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) for source and license information.
+
+## Quick Start
+
+```python
+from maseval.benchmark.multiagentbench import (
+    MultiAgentBenchBenchmark,
+    MultiAgentBenchEnvironment,
+    MultiAgentBenchEvaluator,
+    load_tasks,
+    configure_model_ids,
+    ensure_marble_exists,
+)
+
+# Ensure MARBLE is installed (auto-downloads if needed)
+ensure_marble_exists()
+
+# Load and configure tasks
+tasks = load_tasks("research", limit=5)
+configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+# Create your framework-specific benchmark subclass
+class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
+    def setup_agents(self, agent_data, environment, task, user):
+        # Your framework-specific agent creation
+        agent_configs = task.environment_data.get("agents", [])
+        # Create agents based on configs...
+        ...
+
+    def get_model_adapter(self, model_id, **kwargs):
+        adapter = MyModelAdapter(model_id)  # placeholder: substitute your own model adapter class
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+# Run benchmark
+benchmark = MyMultiAgentBenchmark()
+results = benchmark.run(tasks, agent_data={})
+```
+
+## MARBLE Reproduction Mode
+
+To reproduce MARBLE's published results exactly, use `MarbleMultiAgentBenchBenchmark`, which wraps MARBLE's native agents:
+
+```python
+from maseval.benchmark.multiagentbench import (
+    MarbleMultiAgentBenchBenchmark,
+    load_tasks,
+    configure_model_ids,
+    ensure_marble_exists,
+)
+
+# Ensure MARBLE is installed
+ensure_marble_exists()
+
+# Load tasks
+tasks = load_tasks("research", limit=5)
+configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+# Create benchmark with model adapter
+class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
+    def get_model_adapter(self, model_id, **kwargs):
+        from maseval.interface.openai import OpenAIModelAdapter
+        adapter = OpenAIModelAdapter(model_id)
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+benchmark = MyMarbleBenchmark()
+results = benchmark.run(tasks, agent_data={})
+```
+
+## Available Domains
+
+| Domain | Description | Infrastructure |
+|--------|-------------|----------------|
+| `research` | Research idea generation and collaboration | None |
+| `bargaining` | Negotiation scenarios (buyer/seller) | None |
+| `coding` | Software development collaboration | Filesystem |
+| `database` | Database manipulation and querying | Docker + PostgreSQL |
+| `web` | Web-based task completion | Network |
+| `worldsimulation` | World simulation and interaction | None |
+| `minecraft` | Collaborative building | External server |
+
+## API Reference
+
+::: maseval.benchmark.multiagentbench.MultiAgentBenchBenchmark
+
+::: maseval.benchmark.multiagentbench.MarbleMultiAgentBenchBenchmark
+
+::: maseval.benchmark.multiagentbench.MultiAgentBenchEnvironment
+
+::: maseval.benchmark.multiagentbench.MultiAgentBenchEvaluator
+
+::: maseval.benchmark.multiagentbench.MarbleAgentAdapter
+
+::: maseval.benchmark.multiagentbench.load_tasks
+
+::: maseval.benchmark.multiagentbench.configure_model_ids
+
+::: maseval.benchmark.multiagentbench.ensure_marble_exists
+
+::: maseval.benchmark.multiagentbench.download_marble
+
+::: maseval.benchmark.multiagentbench.get_domain_info
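
Note on `docs/benchmark/multiagentbench.md`: the "Available Domains" table can be restated as a small lookup, which makes it easy to pick the domains that run without extra infrastructure. This is only a sketch built from the table's values. `DOMAIN_INFRASTRUCTURE` and `domains_without_infrastructure` are hypothetical names used for illustration and are not part of `maseval`.

```python
# Infrastructure requirements copied from the "Available Domains" table.
# DOMAIN_INFRASTRUCTURE and domains_without_infrastructure are hypothetical
# names for illustration; they are not part of the maseval package.
DOMAIN_INFRASTRUCTURE = {
    "research": None,
    "bargaining": None,
    "coding": "Filesystem",
    "database": "Docker + PostgreSQL",
    "web": "Network",
    "worldsimulation": None,
    "minecraft": "External server",
}


def domains_without_infrastructure(table=DOMAIN_INFRASTRUCTURE):
    """Return the domains that the table lists with no infrastructure."""
    return [name for name, infra in table.items() if infra is None]


if __name__ == "__main__":
    # Domains that should be runnable without Docker, a network, or a server.
    print(domains_without_infrastructure())
```

Starting with one of these domains (the examples in the docs use `research`) keeps a first run free of external setup.
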
diff --git a/maseval/benchmark/multiagentbench/README.md b/maseval/benchmark/multiagentbench/README.md
@@ -10,10 +10,26 @@ Framework-agnostic implementation of the MultiAgentBench benchmark suite from MA
 
 ## Setup
 
-This benchmark requires the MARBLE source code to be cloned locally. The benchmark
-does NOT install MARBLE as a dependency - instead, it imports directly from a local copy.
+This benchmark requires the MARBLE source code. You can set it up automatically or manually.
 
-### 1. Clone MARBLE
+### Option 1: Automatic Setup (Recommended)
+
+MARBLE is downloaded automatically on first use. To trigger the download explicitly, call `ensure_marble_exists()`:
+
+```python
+from maseval.benchmark.multiagentbench import ensure_marble_exists, load_tasks
+
+# This downloads MARBLE if not present (about 50 MB)
+ensure_marble_exists()
+
+# Now load tasks
+tasks = load_tasks("research", limit=1)
+print(f"Loaded {len(tasks)} task(s)")
+```
+
+### Option 2: Manual Clone
+
+If you prefer to clone manually:
 
 ```bash
 cd maseval/benchmark/multiagentbench
@@ -23,7 +39,7 @@ cd marble
 git checkout <pinned-commit-hash>
 ```
 
-### 2. Install MARBLE Dependencies
+### Install MARBLE Dependencies
 
 MARBLE requires additional dependencies. Add them to your environment:
 
@@ -35,7 +51,7 @@ uv add litellm ruamel.yaml
 pip install litellm ruamel.yaml
 ```
 
-### 3. Verify Setup
+### Verify Setup
 
 ```python
 from maseval.benchmark.multiagentbench import load_tasks
diff --git a/maseval/benchmark/multiagentbench/__init__.py b/maseval/benchmark/multiagentbench/__init__.py
@@ -17,11 +17,17 @@
     - worldsimulation: World simulation and interaction
 
 Setup:
-    This benchmark requires MARBLE source code to be cloned locally:
+    This benchmark requires MARBLE source code. It is downloaded automatically
+    the first time you call `load_tasks()`, or you can set it up manually:
 
-    ```bash
-    cd maseval/benchmark/multiagentbench
-    git clone https://github.com/ulab-uiuc/MARBLE.git marble
+    ```python
+    # Option 1: Automatic download (recommended)
+    from maseval.benchmark.multiagentbench import ensure_marble_exists
+    ensure_marble_exists()  # Downloads MARBLE if not present
+
+    # Option 2: Manual clone
+    # cd maseval/benchmark/multiagentbench
+    # git clone https://github.com/ulab-uiuc/MARBLE.git marble
     ```
 
     See README.md in this directory for detailed setup instructions.
@@ -35,10 +41,14 @@
         MarbleAgentAdapter,
         load_tasks,
         configure_model_ids,
+        ensure_marble_exists,
         get_domain_info,
         VALID_DOMAINS,
     )
 
+    # Ensure MARBLE is installed (auto-downloads if needed)
+    ensure_marble_exists()
+
     # Load and configure tasks
     tasks = load_tasks("research", limit=5)
     configure_model_ids(tasks, agent_model_id="gpt-4o")
@@ -103,11 +113,13 @@ def get_model_adapter(self, model_id, **kwargs):
     create_marble_agents,
 )
 
-# Data loading
+# Data loading and setup
 from maseval.benchmark.multiagentbench.data_loader import (
     load_tasks,
     configure_model_ids,
     get_domain_info,
+    ensure_marble_exists,
+    download_marble,
     VALID_DOMAINS,
     INFRASTRUCTURE_DOMAINS as INFRASTRUCTURE_REQUIRED_DOMAINS,
 )
@@ -126,10 +138,12 @@ def get_model_adapter(self, model_id, **kwargs):
     # Agent adapters
     "MarbleAgentAdapter",
     "create_marble_agents",
-    # Data loading
+    # Data loading and setup
     "load_tasks",
     "configure_model_ids",
     "get_domain_info",
+    "ensure_marble_exists",
+    "download_marble",
     "VALID_DOMAINS",
     "INFRASTRUCTURE_REQUIRED_DOMAINS",
 ]
diff --git a/maseval/benchmark/multiagentbench/data_loader.py b/maseval/benchmark/multiagentbench/data_loader.py
diff --git a/mkdocs.yml b/mkdocs.yml