maseval
diff --git a/maseval/benchmark/multiagentbench/.gitignore b/maseval/benchmark/multiagentbench/.gitignore
Lines changed: 10 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/PROVENANCE.md b/maseval/benchmark/multiagentbench/PROVENANCE.md
Lines changed: 74 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/README.md b/maseval/benchmark/multiagentbench/README.md
Lines changed: 166 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/__init__.py b/maseval/benchmark/multiagentbench/__init__.py
Lines changed: 135 additions & 0 deletions
diff --git a/maseval/benchmark/multiagentbench/adapters/__init__.py b/maseval/benchmark/multiagentbench/adapters/__init__.py
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Vendored MARBLE source (clone manually)
+marble/
+
+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+
+# Test artifacts
+.pytest_cache/
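Because `marble/` is gitignored, a fresh checkout of this package will not contain it. A minimal pre-flight check might look like the sketch below. The helper name `marble_clone_present` is hypothetical and not part of this PR, and the check only confirms that the directory exists and is non-empty, not that it holds a valid MARBLE checkout.

```python
from pathlib import Path


def marble_clone_present(benchmark_dir: Path) -> bool:
    """Return True if benchmark_dir contains a non-empty marble/ directory.

    Hypothetical helper: it does not validate the contents of the clone.
    """
    marble_dir = benchmark_dir / "marble"
    # An empty directory (e.g. from a bare `mkdir marble`) is not a clone,
    # so require at least one entry inside it.
    return marble_dir.is_dir() and any(marble_dir.iterdir())
```

Calling it with the path to `maseval/benchmark/multiagentbench` before importing the benchmark would let a caller print a pointer to the README setup steps instead of surfacing an import error.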
@@ -0,0 +1,74 @@
+# MARBLE Integration Provenance
+
+## Source Information
+
+- **Source Repository**: https://github.com/ulab-uiuc/MARBLE
+- **Version**: Not yet pinned (clone latest and test)
+- **License**: MIT (Copyright 2024 Haofei Yu)
+- **Vendoring**: Permitted by MIT license with attribution
+
+## Reference
+
+**Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
+- arXiv: https://arxiv.org/abs/2503.01935
+- Authors: Haofei Yu et al.
+- Publication Date: 2025
+
+## License Text (MIT)
+
+```
+MIT License
+
+Copyright (c) 2024 Haofei Yu
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+```
+
+## Known Issues in MARBLE
+
+1. **Missing method**: `AgentGraph.get_agent_profiles_linked()` does not exist but is
+   called in `engine.py:702`. This breaks chain coordination mode.
+
+2. **SharedMemory naming**: Despite the name, `SharedMemory` is instantiated per-agent
+   in `BaseAgent.__init__()` and is NOT shared between agents. Use `msg_box` for
+   inter-agent communication.
+
+3. **Environment constructor signature**: Some environments expect different constructor
+   arguments. Check each environment's `__init__` signature before use.
+
+## Local Patches Applied
+
+None currently. Document any patches here if applied.
+
+## Update Process
+
+To update MARBLE to a newer version:
+
+1. `cd maseval/benchmark/multiagentbench/marble`
+2. `git fetch origin`
+3. `git log --oneline origin/main` (review changes)
+4. `git checkout <new-commit-hash>`
+5. Run integration tests
+6. Update this file with new version info
+
+## Last Updated
+
+- **Date**: 2026-01-19
+- **Updated By**: Claude Code
+- **Version Tested**: Initial integration (not yet pinned)
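Known issue 1 above (a method that is called but not defined) can be detected before running chain-coordination tasks. The sketch below is one way to do that, under stated assumptions: `missing_methods` is a hypothetical helper, and `FakeAgentGraph` with its `add_agent` method is a stand-in used only to keep the example runnable without a MARBLE clone. With MARBLE available, the real `AgentGraph` class would be passed instead.

```python
from typing import Iterable, List


def missing_methods(cls: type, names: Iterable[str]) -> List[str]:
    """Return the names in `names` that are not callable attributes of `cls`."""
    return [name for name in names if not callable(getattr(cls, name, None))]


class FakeAgentGraph:
    """Stand-in for MARBLE's AgentGraph (illustration only)."""

    def add_agent(self, agent: object) -> None:  # hypothetical method name
        pass


# Only the method that the stand-in lacks is reported.
absent = missing_methods(FakeAgentGraph, ["add_agent", "get_agent_profiles_linked"])
print(absent)
```

A non-empty result for `get_agent_profiles_linked` on the real class would indicate that the pinned MARBLE commit still has the chain-coordination bug.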
@@ -0,0 +1,166 @@
+# MultiAgentBench Integration
+
+Framework-agnostic implementation of the MultiAgentBench benchmark suite from MARBLE
+(Multi-Agent Coordination Backbone with LLM Engine) for evaluating multi-agent collaboration.
+
+**Original Paper**: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
+(arXiv:2503.01935)
+
+**Original Repository**: https://github.com/ulab-uiuc/MARBLE
+
+## Setup
+
+This benchmark requires a local clone of the MARBLE source code. It does not install
+MARBLE as a dependency; it imports directly from the local copy.
+
+### 1. Clone MARBLE
+
+```bash
+cd maseval/benchmark/multiagentbench
+git clone https://github.com/ulab-uiuc/MARBLE.git marble
+cd marble
+# Pin to a tested commit once one is recorded in PROVENANCE.md (none is pinned yet)
+git checkout <pinned-commit-hash>
+```
+
+### 2. Install MARBLE Dependencies
+
+MARBLE requires additional dependencies. Add them to your environment:
+
+```bash
+# If using uv (recommended)
+uv add litellm ruamel.yaml
+
+# Or with pip
+pip install litellm ruamel.yaml
+```
+
+### 3. Verify Setup
+
+```python
+from maseval.benchmark.multiagentbench import load_tasks
+
+# Should load without error
+tasks = load_tasks("research", limit=1)
+print(f"Loaded {len(tasks)} task(s)")
+```
+
+## Usage
+
+### Basic Usage (Abstract Base)
+
+The abstract `MultiAgentBenchBenchmark` provides task loading, environment setup,
+and evaluation infrastructure. You implement `setup_agents()` with your framework:
+
+```python
+from maseval.benchmark.multiagentbench import (
+    MultiAgentBenchBenchmark,
+    MultiAgentBenchEnvironment,
+    load_tasks,
+    configure_model_ids,
+)
+
+class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
+    def setup_agents(self, agent_data, environment, task, user):
+        # Your framework-specific agent creation
+        ...
+
+    def get_model_adapter(self, model_id, **kwargs):
+        adapter = MyModelAdapter(model_id)
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+# Load and configure tasks
+tasks = load_tasks("research", limit=5)
+configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+# Run
+benchmark = MyMultiAgentBenchmark()
+results = benchmark.run(tasks)
+```
+
+### MARBLE Reproduction
+
+Use `MarbleMultiAgentBenchBenchmark` for exact reproduction of MARBLE's published results:
+
+```python
+from maseval.benchmark.multiagentbench import (
+    MarbleMultiAgentBenchBenchmark,
+    load_tasks,
+    configure_model_ids,
+)
+
+# Load tasks from a simple domain (no Docker required)
+tasks = load_tasks("research", limit=5)
+configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+# Create benchmark with model adapter implementation
+class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
+    def get_model_adapter(self, model_id, **kwargs):
+        from maseval.interface.openai import OpenAIModelAdapter
+        adapter = OpenAIModelAdapter(model_id)
+        if "register_name" in kwargs:
+            self.register("models", kwargs["register_name"], adapter)
+        return adapter
+
+benchmark = MyMarbleBenchmark()
+results = benchmark.run(tasks)
+
+# Print results
+for result in results:
+    print(f"Task: {result['task_id']}")
+    print(f"Status: {result['status']}")
+    if result['eval']:
+        print(f"Passed: {result['eval'][0]['passed']}")
+```
+
+## Domains
+
+MultiAgentBench includes 7 domains with different requirements:
+
+| Domain | External Dependencies | Initial Support |
+|--------|----------------------|-----------------|
+| Research | None | Yes |
+| Bargaining | None | Yes |
+| Coding | Filesystem access | Yes |
+| Web | Network access | Yes |
+| WorldSimulation | None | Yes |
+| Database | Docker + PostgreSQL | Optional |
+| Minecraft | External game server | Deferred |
+
+### Domain-Specific Notes
+
+- **Research/Bargaining**: Recommended for initial testing; no infrastructure required
+- **Coding**: Creates files in a workspace directory
+- **Database**: Requires Docker with PostgreSQL image
+- **Minecraft**: Not currently supported (requires external game server)
+
+## Known Limitations
+
+1. **Chain coordination mode bug**: MARBLE's `engine.py` references `get_agent_profiles_linked()`
+   which doesn't exist in `AgentGraph`. Tasks using chain coordination may fail.
+
+2. **SharedMemory is per-agent**: Despite the name, each MARBLE agent creates its own
+   `SharedMemory` instance. Use `msg_box` for inter-agent communication.
+
+3. **Requires manual MARBLE clone**: MARBLE must be cloned manually into the `marble/`
+   subdirectory (gitignored by default).
+
+## File Structure
+
+```
+multiagentbench/
+├── __init__.py              # Public API exports
+├── README.md                # This file
+├── PROVENANCE.md            # MARBLE version and license info
+├── .gitignore               # Ignores marble/ directory
+├── multiagentbench.py       # Benchmark classes
+├── environment.py           # MultiAgentBenchEnvironment
+├── evaluator.py             # MultiAgentBenchEvaluator, MultiAgentBenchMetrics
+├── data_loader.py           # Task loading utilities
+├── adapters/
+│   ├── __init__.py
+│   └── marble_adapter.py    # MarbleAgentAdapter
+└── marble/                  # ← Vendored MARBLE (gitignored)
+    └── ...
+```
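The README's result-printing loop reads `result['eval'][0]['passed']` per task. A small aggregate over the same fields could be sketched as follows. The helper name `pass_rate` is hypothetical, and the result shape (a list of dicts with an `eval` list whose first entry carries `passed`) is assumed from that loop rather than from a documented schema.

```python
from typing import Any, Dict, List, Optional


def pass_rate(results: List[Dict[str, Any]]) -> Optional[float]:
    """Fraction of evaluated tasks whose first eval entry has a truthy 'passed'.

    Tasks with a missing or empty 'eval' are skipped. Returns None when no
    task was evaluated, so callers can tell "no data" apart from 0.0.
    """
    evaluated = [r for r in results if r.get("eval")]
    if not evaluated:
        return None
    passed = sum(1 for r in evaluated if r["eval"][0].get("passed"))
    return passed / len(evaluated)
```

Skipping unevaluated tasks rather than counting them as failures is a design choice of this sketch; a stricter report might divide by `len(results)` instead.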
@@ -0,0 +1,135 @@
+"""MultiAgentBench - Multi-Agent Coordination Benchmark from MARBLE.
+
+Framework-agnostic implementation of the MultiAgentBench benchmark suite for
+evaluating multi-agent collaboration and competition in LLM-based systems.
+
+Original Repository: https://github.com/ulab-uiuc/MARBLE
+Paper: "MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents"
+       (arXiv:2503.01935)
+
+Domains:
+    - research: Research idea generation and collaboration
+    - bargaining: Negotiation and bargaining scenarios
+    - coding: Software development collaboration
+    - database: Database manipulation and querying (requires Docker)
+    - minecraft: Collaborative building (requires external server)
+    - web: Web-based task completion
+    - worldsimulation: World simulation and interaction
+
+Setup:
+    This benchmark requires MARBLE source code to be cloned locally:
+
+    ```bash
+    cd maseval/benchmark/multiagentbench
+    git clone https://github.com/ulab-uiuc/MARBLE.git marble
+    ```
+
+    See README.md in this directory for detailed setup instructions.
+
+Usage:
+    from maseval.benchmark.multiagentbench import (
+        MultiAgentBenchBenchmark,
+        MarbleMultiAgentBenchBenchmark,
+        MultiAgentBenchEnvironment,
+        MultiAgentBenchEvaluator,
+        MarbleAgentAdapter,
+        load_tasks,
+        configure_model_ids,
+        get_domain_info,
+        VALID_DOMAINS,
+    )
+
+    # Load and configure tasks
+    tasks = load_tasks("research", limit=5)
+    configure_model_ids(tasks, agent_model_id="gpt-4o")
+
+    # Create your framework-specific benchmark subclass
+    class MyMultiAgentBenchmark(MultiAgentBenchBenchmark):
+        def setup_agents(self, agent_data, environment, task, user):
+            # Create your agents
+            ...
+
+        def get_model_adapter(self, model_id, **kwargs):
+            adapter = MyModelAdapter(model_id)
+            if "register_name" in kwargs:
+                self.register("models", kwargs["register_name"], adapter)
+            return adapter
+
+    # Run benchmark
+    benchmark = MyMultiAgentBenchmark()
+    results = benchmark.run(tasks, agent_data={})
+
+MARBLE Reproduction Mode:
+    For exact reproduction of MARBLE's published results, use
+    MarbleMultiAgentBenchBenchmark which wraps MARBLE's native agents:
+
+    ```python
+    class MyMarbleBenchmark(MarbleMultiAgentBenchBenchmark):
+        def get_model_adapter(self, model_id, **kwargs):
+            from maseval.interface.openai import OpenAIModelAdapter
+            adapter = OpenAIModelAdapter(model_id)
+            if "register_name" in kwargs:
+                self.register("models", kwargs["register_name"], adapter)
+            return adapter
+
+    benchmark = MyMarbleBenchmark()
+    results = benchmark.run(tasks, agent_data={})
+    ```
+"""
+
+# Core benchmark classes
+from maseval.benchmark.multiagentbench.multiagentbench import (
+    MultiAgentBenchBenchmark,
+    MarbleMultiAgentBenchBenchmark,
+)
+
+# Environment
+from maseval.benchmark.multiagentbench.environment import (
+    MultiAgentBenchEnvironment,
+    INFRASTRUCTURE_DOMAINS,
+)
+
+# Evaluator
+from maseval.benchmark.multiagentbench.evaluator import (
+    MultiAgentBenchEvaluator,
+    MultiAgentBenchMetrics,
+)
+
+# Agent adapters
+from maseval.benchmark.multiagentbench.adapters import (
+    MarbleAgentAdapter,
+)
+from maseval.benchmark.multiagentbench.adapters.marble_adapter import (
+    create_marble_agents,
+)
+
+# Data loading (INFRASTRUCTURE_DOMAINS is aliased to avoid clashing with the environment import)
+from maseval.benchmark.multiagentbench.data_loader import (
+    load_tasks,
+    configure_model_ids,
+    get_domain_info,
+    VALID_DOMAINS,
+    INFRASTRUCTURE_DOMAINS as INFRASTRUCTURE_REQUIRED_DOMAINS,
+)
+
+
+__all__ = [
+    # Core benchmark classes
+    "MultiAgentBenchBenchmark",
+    "MarbleMultiAgentBenchBenchmark",
+    # Environment
+    "MultiAgentBenchEnvironment",
+    "INFRASTRUCTURE_DOMAINS",
+    # Evaluator
+    "MultiAgentBenchEvaluator",
+    "MultiAgentBenchMetrics",
+    # Agent adapters
+    "MarbleAgentAdapter",
+    "create_marble_agents",
+    # Data loading
+    "load_tasks",
+    "configure_model_ids",
+    "get_domain_info",
+    "VALID_DOMAINS",
+    "INFRASTRUCTURE_REQUIRED_DOMAINS",
+]
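The domain list in the package docstring and the README table suggest a simple way to choose domains for a first run. The sketch below transcribes the README's "External Dependencies" column into a dict. `DOMAIN_REQUIREMENTS` and `domains_without_requirements` are hypothetical names; they are not the package's `VALID_DOMAINS`, `INFRASTRUCTURE_DOMAINS`, or `get_domain_info`, whose contents are not shown in this diff.

```python
from typing import Dict, List

# Transcribed from the README domain table; an empty string means "None".
DOMAIN_REQUIREMENTS: Dict[str, str] = {
    "research": "",
    "bargaining": "",
    "coding": "filesystem access",
    "web": "network access",
    "worldsimulation": "",
    "database": "Docker + PostgreSQL",
    "minecraft": "external game server",
}


def domains_without_requirements() -> List[str]:
    """Return domains that list no external dependency, in table order."""
    return [name for name, req in DOMAIN_REQUIREMENTS.items() if not req]
```

Each returned name could then be passed to `load_tasks(domain, limit=1)` as a smoke test, assuming `load_tasks` accepts these lowercase domain names as it does for `"research"`.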
@@ -0,0 +1,7 @@
+"""Agent adapters for MultiAgentBench."""
+
+from maseval.benchmark.multiagentbench.adapters.marble_adapter import (
+    MarbleAgentAdapter,
+)
+
+__all__ = ["MarbleAgentAdapter"]