Available Benchmarks

This document describes each benchmark included in this library and lists its source and license.

1. Multi-Agent Collaboration Scenario Benchmark (MACS Benchmark)

This benchmark is designed to test and evaluate the collaborative problem-solving capabilities of multi-agent systems. The implementation in this library provides the necessary code to set up and run these scenarios.

Source and License

Original Repository: https://github.com/aws-samples/multiagent-collab-scenario-benchmark
Data License: The dataset containing the scenarios is made available under the Creative Commons Attribution 4.0 International License (CC-BY-4.0).

2. $\tau^2$-bench (Beta)

$\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi-turn interactive environments.

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License

Original Repository: https://github.com/sierra-research/tau2-bench
Paper: Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Code License: MIT
Data License: MIT

3. MultiAgentBench (MARBLE) (Beta)

MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It includes diverse scenarios across multiple domains including research collaboration, negotiation, coding tasks, and more.

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License

Original Repository: https://github.com/ulab-uiuc/MARBLE (where the original work was done)
Fork Used: https://github.com/cemde/MARBLE (contains bug fixes for MASEval integration)
Paper: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Code License: MIT
Data License: MIT

Note: MASEval uses a fork with bug fixes. All credit for the original work goes to the MARBLE team (Zhu et al., 2025).

4. GAIA2 (Beta)

Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.
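
As a rough illustration of how results over these seven dimensions might be summarized, the sketch below computes a pass rate per capability and the unweighted mean across capabilities. The `(capability, passed)` record format and the function names are assumptions made for this example. They are not the output schema of MASEval or ARE; only the capability names come from the description above.

```python
from collections import defaultdict

# Capability names as listed in the Gaia2 description above.
GAIA2_CAPABILITIES = (
    "execution", "search", "adaptability", "time",
    "ambiguity", "agent2agent", "noise",
)


def per_capability_pass_rate(results):
    """Return {capability: pass rate} for an iterable of (capability, passed) pairs.

    The pair format is hypothetical; adapt it to whatever your run produces.
    """
    passed_count = defaultdict(int)
    total_count = defaultdict(int)
    for capability, passed in results:
        if capability not in GAIA2_CAPABILITIES:
            raise ValueError(f"unknown capability: {capability!r}")
        total_count[capability] += 1
        passed_count[capability] += int(bool(passed))
    # Only capabilities that actually appear in the results are reported.
    return {c: passed_count[c] / total_count[c] for c in total_count}


def macro_average(rates):
    """Unweighted mean of the per-capability pass rates."""
    if not rates:
        raise ValueError("no results to average")
    return sum(rates.values()) / len(rates)
```

Reporting the per-capability rates next to the mean keeps a weak dimension from being hidden by strong ones.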

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License

Original Repository: https://github.com/facebookresearch/meta-agents-research-environments
Paper: Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments (ICLR 2026)
Dataset: https://huggingface.co/datasets/meta-agents-research-environments/gaia2
Code License: MIT
Data License: Subject to Meta's data usage terms (see HuggingFace dataset page)

5. CONVERSE (Beta)

CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License

Original Repository: https://github.com/amrgomaaelhady/ConVerse
Paper: ConVerse: Contextual Safety in Agent-to-Agent Conversations
Code License: MIT (as provided by the upstream repository)
Data License: Refer to the upstream repository's dataset and license terms

6. MMLU (Massive Multitask Language Understanding) (Beta)

MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.
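
The idea of estimating full-benchmark performance from a subset can be illustrated with a deliberately simple stand-in: a weighted average over anchor questions, where each anchor's weight is the number of full-benchmark questions it is assumed to represent. This is neither DISCO's predictor nor MASEval's implementation. `AnchorPoint`, `estimate_full_accuracy`, and the weights are hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorPoint:
    task_id: str   # hypothetical identifier of an anchor question
    weight: float  # assumed: how many full-benchmark questions this anchor represents


def estimate_full_accuracy(anchors, is_correct):
    """Estimate full-benchmark accuracy from anchor results only.

    anchors:    list of AnchorPoint
    is_correct: dict mapping task_id -> bool (the model's result on that anchor)
    """
    total_weight = sum(a.weight for a in anchors)
    if total_weight <= 0:
        raise ValueError("anchor weights must sum to a positive value")
    # Each correct anchor contributes its whole weight to the estimate.
    correct_weight = sum(a.weight for a in anchors if is_correct[a.task_id])
    return correct_weight / total_weight
```

With three anchors weighted 50, 30, and 20, a model that answers the first and third correctly gets an estimate of (50 + 20) / 100 = 0.7. The estimate is only as good as the assumption that each anchor behaves like the questions it stands for.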

Beta: This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

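Implemented: A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]` (in shells that treat square brackets as glob patterns, such as zsh, quote the argument: `pip install "maseval[mmlu]"`). See the MMLU documentation for usage details.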

Source and License

Original Paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2021)
DISCO Paper: DISCO: Diversifying Sample Condensation for Efficient Model Evaluation (Rubinstein et al., ICLR 2026)
Dataset: arubique/flattened-MMLU

7. [Name of Next Benchmark]

(Description for the next benchmark...)

Source and License

Original Repository: Link
Data License: Data License.
