Available Benchmarks

This document provides detailed information, sources, and licensing for all benchmarks included in this library.


1. Multi-Agent Collaboration Scenario Benchmark (MACS Benchmark)

This benchmark evaluates the collaborative problem-solving capabilities of multi-agent systems. This library provides the code needed to set up and run its scenarios.

Source and License


2. $\tau^2$-bench (Beta)

$\tau^2$-bench is a benchmark for evaluating agentic systems in realistic, multi-turn interactive environments.

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License


3. MultiAgentBench (MARBLE) (Beta)

MultiAgentBench is a comprehensive benchmark suite for evaluating multi-agent collaboration and competition in LLM-based systems. It covers diverse scenarios across domains such as research collaboration, negotiation, and coding tasks.

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License

Note: MASEval uses a fork with bug fixes. All credit for the original work goes to the MARBLE team (Zhu et al., 2025).


4. GAIA2 (Beta)

Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scenarios using Meta's ARE (Agent Research Environments) platform. It tests agents across 7 capability dimensions: execution, search, adaptability, time, ambiguity, agent2agent, and noise.
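Because results are reported along several capability dimensions, it can help to aggregate scores per dimension rather than as one overall number. The following is a minimal sketch of that idea only. The function name `mean_score_by_capability` and the `(capability, score)` input format are hypothetical and are not part of MASEval or ARE; only the seven dimension names come from the description above.

```python
from collections import defaultdict
from statistics import mean

# The seven capability dimensions named in the benchmark description.
CAPABILITIES = (
    "execution", "search", "adaptability", "time",
    "ambiguity", "agent2agent", "noise",
)


def mean_score_by_capability(results):
    """Average scores per capability.

    `results` is an iterable of (capability, score) pairs. This input
    format is an assumption made for illustration, not a MASEval type.
    Capabilities without any results are omitted from the output.
    """
    buckets = defaultdict(list)
    for capability, score in results:
        if capability not in CAPABILITIES:
            raise ValueError(f"unknown capability: {capability!r}")
        buckets[capability].append(score)
    return {c: mean(buckets[c]) for c in CAPABILITIES if c in buckets}


# Hypothetical scores, for illustration only.
example = [("search", 1.0), ("search", 0.0), ("noise", 1.0)]
summary = mean_score_by_capability(example)
print(summary)
```

Reporting per-dimension means makes it easier to see whether an agent is weak on, say, ambiguity while being strong on execution, which a single aggregate score would hide.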

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License


5. CONVERSE (Beta)

CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.

Beta: This benchmark has been implemented carefully, but it is highly complex and we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Source and License


6. MMLU (Massive Multitask Language Understanding) (Beta)

MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.

Beta: This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

Implemented: A ready-to-use implementation is available via DefaultMMLUBenchmark with HuggingFace model support. Install with pip install maseval[mmlu]. See the MMLU documentation for usage details.
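To illustrate the general idea behind anchor-point evaluation mentioned above, here is a minimal sketch that estimates full-benchmark accuracy from a small set of anchor questions, each weighted by the share of the benchmark it is assumed to represent. This is a generic weighted-average illustration, not the DISCO predictor and not the code behind DefaultMMLUBenchmark; the function name and the weights are hypothetical.

```python
from typing import Sequence


def estimate_accuracy_from_anchors(
    anchor_correct: Sequence[bool],
    anchor_weights: Sequence[float],
) -> float:
    """Estimate full-benchmark accuracy from anchor questions.

    Each anchor stands in for a group of similar questions; its weight is
    the (assumed) size of that group. The estimate is the weighted share
    of anchors answered correctly.
    """
    if len(anchor_correct) != len(anchor_weights):
        raise ValueError("anchor_correct and anchor_weights must have equal length")
    total = sum(anchor_weights)
    if total <= 0:
        raise ValueError("anchor weights must sum to a positive value")
    correct_weight = sum(w for ok, w in zip(anchor_correct, anchor_weights) if ok)
    return correct_weight / total


# Hypothetical example: three anchors representing groups of 50, 30 and 20 questions.
estimate = estimate_accuracy_from_anchors([True, False, True], [50, 30, 20])
print(estimate)  # 0.7
```

The point of the sketch is the cost trade-off: only the anchor questions are run through the model, and the quality of the estimate depends on how well the anchors and their weights represent the full question set.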

Source and License


7. [Name of Next Benchmark]

(Description for the next benchmark...)

Source and License

  • Original Repository: Link
  • Data License: Data License.