Commit 270c7e2

[skip ci] updated readme (#43)

* updated comparison table in readme and docs
* added ArXiv Citation
1 parent 95b15a2 commit 270c7e2

3 files changed

Lines changed: 122 additions & 66 deletions

CITATION.cff

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+cff-version: 1.2.0
+title: "MASEval: Extending Multi-Agent Evaluation from Models to Systems"
+message: "If you use this software, please cite it as below."
+type: software
+authors:
+  - family-names: Emde
+    given-names: Cornelius
+  - family-names: Rubinstein
+    given-names: Alexander
+    equal-contrib: true
+  - family-names: Goel
+    given-names: Anmol
+    equal-contrib: true
+  - family-names: Heakl
+    given-names: Ahmed
+    equal-contrib: true
+  - family-names: Yun
+    given-names: Sangdoo
+  - family-names: Oh
+    given-names: Seong Joon
+  - family-names: Gubri
+    given-names: Martin
+year: 2026
+identifiers:
+  - type: url
+    value: "https://arxiv.org/abs/2603.08835"
+  - type: other
+    value: "2603.08835"
+    description: arXiv eprint
+preferred-citation:
+  type: article
+  title: "MASEval: Extending Multi-Agent Evaluation from Models to Systems"
+  authors:
+    - family-names: Emde
+      given-names: Cornelius
+    - family-names: Rubinstein
+      given-names: Alexander
+      equal-contrib: true
+    - family-names: Goel
+      given-names: Anmol
+      equal-contrib: true
+    - family-names: Heakl
+      given-names: Ahmed
+      equal-contrib: true
+    - family-names: Yun
+      given-names: Sangdoo
+    - family-names: Oh
+      given-names: Seong Joon
+    - family-names: Gubri
+      given-names: Martin
+  year: 2026
+  url: "https://arxiv.org/abs/2603.08835"
+  journal: "arXiv preprint arXiv:2603.08835"

README.md

Lines changed: 43 additions & 33 deletions
@@ -24,43 +24,36 @@ Analogous to pytest for testing or MLflow for ML experimentation, MASEval focuse
 
 Compare multi-agent evaluation frameworks across key capabilities.
 
-| Library | Multi-Agent | System Evaluation | Agent-Agnostic | Benchmarks | Multi-turn User | No Lock-In | BYO | State-Action Eval | Error Attr | Lightweight | Project Maturity | Sandboxed Environment |
-| ----------------- | :---------: | :---------------: | :------------: | :--------: | :-------------: | :--------: | :-: | :---------------: | :--------: | :---------: | :--------------: | :-------------------: |
-| **MASEval** ||||||| 🟢 ||||| 🟢 |
-| **HAL Harness** | 🟡 |||| 🟡 || 🟡 | 🟡 ||| 🟡 ||
-| **AnyAgent** | 🟡 |||| 🟡 || 🟢 | 🟡 |||||
-| **Inspect-AI** | 🟡 || 🟡 || 🟡 || 🟡 | 🟡 || 🟡 |||
-| **MLflow GenAI** | 🟡 | 🟡 | 🟢 || 🟡 || 🟢 ||| 🟡 || 🟡 |
-| **LangSmith** | 🟡 | 🟡 | 🟡 |||| 🟡 ||||||
-| **OpenCompass** || 🟡 ||| 🟡 || 🟡 | 🟡 |||| 🟡 |
-| **AgentGym** ||||| 🟡 || 🟢 | 🟡 ||| 🟡 | 🟡 |
-| **Arize Phoenix** | 🟡 || 🟡 ||| 🟡 | 🟢 ||| 🟡 |||
-| **MARBLE** |||||||| 🟡 | ? | 🟡 | 🟡 | 🟡 |
-| **TruLens** | 🟡 || 🟡 |||| 🟡 | 🟢 || 🟡 |||
-| **AgentBeats** | 🟡 || 🟡 ||| 🟡 | 🟡 | 🟡 | ? || 🟡 | 🟡 |
-| **DeepEval** | 🟡 || 🟡 || 🟡 | 🟡 | 🟡 | 🟡 || 🟡 |||
-| **MCPEval** ||||||| 🟡 | 🟡 || 🟡 | 🟡 ||
-| **Galileo** | 🟡 || 🟡 |||| 🟡 | 🟡 || 🟡 |||
-
-**** Full/Native · **🟢** Flexible for BYO · **🟡** Partial/Limited · **** Not possible
+| Library | Multi-Agent | System Eval | Agent-Agnostic | Benchmarks | Flexible Interaction | BYO | Trace-First | Mature |
+| ----------------- | :---------: | :---------: | :------------: | :--------: | :------------------: | :-: | :---------: | :----: |
+| **MASEval** |||||||||
+| **AnyAgent** | 🟡 |||| 🟡 || 🟡 ||
+| **MLflow GenAI** | 🟡 | 🟡 ||| 🟡 ||||
+| **HAL Harness** | 🟡 |||| 🟡 | 🟡 | 🟡 | 🟡 |
+| **Inspect-AI** | 🟡 || 🟡 || 🟡 | 🟡 | 🟡 ||
+| **OpenCompass** || 🟡 ||| 🟡 | 🟡 | 🟡 ||
+| **AgentGym** ||||| 🟡 || 🟡 | 🟡 |
+| **Arize Phoenix** | 🟡 || 🟡 ||| 🟡 |||
+| **TruLens** | 🟡 || 🟡 ||| 🟡 |||
+| **MARBLE** ||||||| 🟡 | 🟡 |
+| **DeepEval** | 🟡 || 🟡 || 🟡 | 🟡 | 🟡 ||
+| **MCPEval** |||||| 🟡 | 🟡 | 🟡 |
+
+**** Full/Native · **🟡** Partial/Limited · **** Not supported
 
 <details>
 <summary>Expand for Column Explanation</summary>
 
-| Column | Feature | One-Liner |
-| --------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------ |
-| **Multi-Agent** | Multi-Agent Native | Native orchestration with per-agent tracing, independent message histories, and explicit coordination patterns. |
-| **System Evaluation** | System-Level Comparison | Compare different framework implementations on the same benchmark (not just swapping LLMs). |
-| **Agent Agnostic** | Agent Framework Agnostic | Evaluate agents from any framework via thin adapters without requiring protocol adoption or code recreation. |
-| **Benchmarks** | Pre-Implemented Benchmarks | Ships complete, ready-to-run benchmarks with environments, tools, and evaluators (not just templates). |
-| **Multi-turn User** | User-Agent Multi-turn | First-class user simulation with personas, stop tokens, and tool access for realistic multi-turn conversations. |
-| **No Lock-In** | No Vendor Lock-In | Fully open-source, works offline, permissive license (MIT/Apache), no mandatory cloud services or telemetry. |
-| **BYO** | BYO Philosophy | Bring your own logging, agents, environments, and tools — flexibility over opinionated defaults. |
-| **State-Action Eval** | Trace-First Evaluation | Evaluate intermediate steps and tool usage patterns via trace filtering, not just final output scoring. |
-| **Error Attr** | Structured Error Attribution | Structured exceptions distinguish between different failure for fair scoring (`AgentError` vs `EnvironmentError`). |
-| **Lightweight** | Lightweight | Minimal dependencies, small codebase (~20k LOC), quick time to first evaluation (~5-15 min). |
-| **Project Maturity** | Professional Tooling | Published on PyPI, CI/CD, good test coverage, structured logging, active maintenance, excellent docs. |
-| **Sandbox** | Sandboxed Execution | Built-in Docker/K8s/VM isolation for safe code execution (or BYO sandbox via abstract Environment). |
+| Column | Feature | One-Liner |
+| ------------------------ | ------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
+| **Multi-Agent** | Multi-Agent Native | Native orchestration with per-agent tracing, independent message histories, and explicit coordination patterns. |
+| **System Eval** | System-Level Comparison | Compare different framework implementations on the same benchmark (not just swapping LLMs). |
+| **Agent-Agnostic** | Agent Framework Agnostic | Evaluate agents from any framework via thin adapters without requiring protocol adoption or code recreation. |
+| **Benchmarks** | Pre-Implemented Benchmarks | Ships complete, ready-to-run benchmarks with environments, tools, and evaluators (not just templates). |
+| **Flexible Interaction** | Flexible Agent-Environment-User | First-class user simulation with personas and tool access for realistic multi-turn conversations. |
+| **BYO** | BYO Philosophy | Bring your own logging, agents, environments, and tools. Open-source, works offline, no mandatory cloud services. |
+| **Trace-First** | Trace-First Evaluation | Evaluate intermediate steps across environment and agents via first-class traces, not post-hoc fixes. |
+| **Mature** | Professional Tooling | Published on PyPI, CI/CD, good test coverage, active maintenance. |
 
 </details>
 

@@ -122,3 +115,20 @@ We welcome any contributions. Please read the [CONTRIBUTING.md](CONTRIBUTING.md)
 This library includes implementations for several benchmarks to evaluate a variety of multi-agent scenarios. Each benchmark is designed to test specific collaboration and problem-solving skills.
 
 ➡️ **[See here for a full list and description of all available benchmarks including licenses.](./BENCHMARKS.md)**
+
+## Citation
+
+Please consider citing the MASEval library.
+
+```
+@misc{emde2026maseval,
+  title={MASEval: Extending Multi-Agent Evaluation from Models to Systems},
+  author={Cornelius Emde and Alexander Rubinstein and Anmol Goel and Ahmed Heakl and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
+  year={2026},
+  eprint={2603.08835},
+  archivePrefix={arXiv},
+  primaryClass={cs.AI},
+  url={https://arxiv.org/abs/2603.08835},
+  note={Alexander Rubinstein, Anmol Goel, and Ahmed Heakl contributed equally.},
+}
+```
