| 1 | +# MultiAgentBench: Multi-Agent Collaboration Benchmark |
| 2 | + |
| 3 | +The **MultiAgentBench** benchmark evaluates multi-agent collaboration and competition in LLM-based systems across seven domains, including research, negotiation, and coding. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +[MultiAgentBench](https://github.com/ulab-uiuc/MARBLE) (from the MARBLE framework) is designed to evaluate how multiple LLM-based agents collaborate and compete to solve complex tasks. The benchmark features: |
| 8 | + |
| 9 | +- **7 diverse domains**: research, bargaining, coding, database, web, worldsimulation, minecraft |
| 10 | +- **Multiple coordination modes**: cooperative, star, tree, hierarchical |
| 11 | +- **LLM-based evaluation**: Matches MARBLE's evaluation methodology |
| 12 | +- **Framework-agnostic**: Use with any agent framework or MARBLE's native agents |
| 13 | + |
| 14 | +Reference Paper: [MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https://arxiv.org/abs/2503.01935) |
| 15 | + |
| 16 | +See [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) for more information, including licenses. |
| 17 | + |
| 18 | +## Quick Start |
| 19 | + |
| 20 | +```python |
| 21 | +from maseval.benchmark.multiagentbench import ( |
| 22 | +    MultiAgentBenchBenchmark, |
| 23 | +    MultiAgentBenchEnvironment, |
| 24 | +    MultiAgentBenchEvaluator, |
| 25 | +    load_tasks, |
| 26 | +    configure_model_ids, |
| 27 | +    ensure_marble_exists, |
| 28 | +) |
| 29 | + |
| 30 | +# Ensure MARBLE is installed (auto-downloads if needed) |
| 31 | +ensure_marble_exists() |
| 32 | + |
| 33 | +# Load and configure tasks |
| 34 | +tasks = load_tasks("research", limit=5) |
| 35 | +configure_model_ids(tasks, agent_model_id="gpt-4o") |
| 36 | + |
| 37 | +# Create your framework-specific benchmark subclass |
| 38 | +class MyMultiAgentBenchmark(MultiAgentBenchBenchmark): |
| 39 | +    def setup_agents(self, agent_data, environment, task, user): |
| 40 | +        # Your framework-specific agent creation |
| 41 | +        agent_configs = task.environment_data.get("agents", []) |
| 42 | +        # Create agents based on configs... |
| 43 | +        ... |
| 44 | + |
| 45 | +    def get_model_adapter(self, model_id, **kwargs): |
| 46 | +        adapter = MyModelAdapter(model_id)  # placeholder: substitute your own model adapter class |
| 47 | +        if "register_name" in kwargs: |
| 48 | +            self.register("models", kwargs["register_name"], adapter) |
| 49 | +        return adapter |
| 50 | + |
| 51 | +# Run benchmark |
| 52 | +benchmark = MyMultiAgentBenchmark() |
| 53 | +results = benchmark.run(tasks, agent_data={}) |
| 54 | +``` |
| 55 | + |
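The `register_name` handling in `get_model_adapter` above can be tried out without MASEval installed. The following is only a minimal sketch: `ToyAdapter`, `ToyBenchmark`, and the nested-dict `registry` are hypothetical stand-ins invented for illustration, and how MASEval's real `register` method stores components is an assumption not confirmed by this document.

```python
class ToyAdapter:
    """Hypothetical stand-in for a model adapter; not a MASEval class."""

    def __init__(self, model_id):
        self.model_id = model_id


class ToyBenchmark:
    """Hypothetical stand-in that mimics only the register() call used above."""

    def __init__(self):
        # Assumed storage layout: {category: {name: component}}.
        self.registry = {}

    def register(self, category, name, component):
        self.registry.setdefault(category, {})[name] = component

    def get_model_adapter(self, model_id, **kwargs):
        adapter = ToyAdapter(model_id)
        # Register the adapter only when the caller supplies a name for it.
        if "register_name" in kwargs:
            self.register("models", kwargs["register_name"], adapter)
        return adapter


bench = ToyBenchmark()
anonymous = bench.get_model_adapter("gpt-4o")
named = bench.get_model_adapter("gpt-4o", register_name="evaluator")
```

In this sketch only `named` ends up in the registry; `anonymous` is returned to the caller but never recorded.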
| 56 | +## MARBLE Reproduction Mode |
| 57 | + |
| 58 | +For exact reproduction of MARBLE's published results, use `MarbleMultiAgentBenchBenchmark`, which wraps MARBLE's native agents: |
| 59 | + |
| 60 | +```python |
| 61 | +from maseval.benchmark.multiagentbench import ( |
| 62 | +    MarbleMultiAgentBenchBenchmark, |
| 63 | +    load_tasks, |
| 64 | +    configure_model_ids, |
| 65 | +    ensure_marble_exists, |
| 66 | +) |
| 67 | + |
| 68 | +# Ensure MARBLE is installed |
| 69 | +ensure_marble_exists() |
| 70 | + |
| 71 | +# Load tasks |
| 72 | +tasks = load_tasks("research", limit=5) |
| 73 | +configure_model_ids(tasks, agent_model_id="gpt-4o") |
| 74 | + |
| 75 | +# Create benchmark with model adapter |
| 76 | +class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark): |
| 77 | +    def get_model_adapter(self, model_id, **kwargs): |
| 78 | +        from maseval.interface.openai import OpenAIModelAdapter |
| 79 | +        adapter = OpenAIModelAdapter(model_id) |
| 80 | +        if "register_name" in kwargs: |
| 81 | +            self.register("models", kwargs["register_name"], adapter) |
| 82 | +        return adapter |
| 83 | + |
| 84 | +benchmark = MyMarbleBenchmark() |
| 85 | +results = benchmark.run(tasks, agent_data={}) |
| 86 | +``` |
| 87 | + |
| 88 | +## Available Domains |
| 89 | + |
| 90 | +| Domain | Description | Infrastructure | |
| 91 | +|--------|-------------|----------------| |
| 92 | +| `research` | Research idea generation and collaboration | None | |
| 93 | +| `bargaining` | Negotiation scenarios (buyer/seller) | None | |
| 94 | +| `coding` | Software development collaboration | Filesystem | |
| 95 | +| `database` | Database manipulation and querying | Docker + PostgreSQL | |
| 96 | +| `web` | Web-based task completion | Network | |
| 97 | +| `worldsimulation` | World simulation and interaction | None | |
| 98 | +| `minecraft` | Collaborative building | External server | |
| 99 | + |
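When choosing which domains to run first, it can help to filter on the Infrastructure column. The sketch below transcribes the table by hand into a plain dictionary; `DOMAIN_INFRASTRUCTURE` and `domains_without_infrastructure` are illustrative names that are not part of the `maseval` package. The package does list a `get_domain_info` function, but its return shape is not shown here, so this example does not rely on it.

```python
# Infrastructure column of the table above, copied by hand.
# Illustrative only: this dict is not provided by maseval.
DOMAIN_INFRASTRUCTURE = {
    "research": None,
    "bargaining": None,
    "coding": "Filesystem",
    "database": "Docker + PostgreSQL",
    "web": "Network",
    "worldsimulation": None,
    "minecraft": "External server",
}


def domains_without_infrastructure(table=DOMAIN_INFRASTRUCTURE):
    """Return the domains whose infrastructure entry is None, in table order."""
    return [name for name, infra in table.items() if infra is None]


if __name__ == "__main__":
    for name in domains_without_infrastructure():
        print(name)
```

Each returned name could then be passed to `load_tasks` as in the Quick Start, for example `load_tasks("research", limit=5)`.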
| 100 | +## API Reference |
| 101 | + |
| 102 | +::: maseval.benchmark.multiagentbench.MultiAgentBenchBenchmark |
| 103 | + |
| 104 | +::: maseval.benchmark.multiagentbench.MarbleMultiAgentBenchBenchmark |
| 105 | + |
| 106 | +::: maseval.benchmark.multiagentbench.MultiAgentBenchEnvironment |
| 107 | + |
| 108 | +::: maseval.benchmark.multiagentbench.MultiAgentBenchEvaluator |
| 109 | + |
| 110 | +::: maseval.benchmark.multiagentbench.MarbleAgentAdapter |
| 111 | + |
| 112 | +::: maseval.benchmark.multiagentbench.load_tasks |
| 113 | + |
| 114 | +::: maseval.benchmark.multiagentbench.configure_model_ids |
| 115 | + |
| 116 | +::: maseval.benchmark.multiagentbench.ensure_marble_exists |
| 117 | + |
| 118 | +::: maseval.benchmark.multiagentbench.download_marble |
| 119 | + |
| 120 | +::: maseval.benchmark.multiagentbench.get_domain_info |