-|**Multi-Agent**| Multi-Agent Native | Native orchestration with per-agent tracing, independent message histories, and explicit coordination patterns. |
-|**System Evaluation**| System-Level Comparison | Compare different framework implementations on the same benchmark (not just swapping LLMs). |
-|**Agent Agnostic**| Agent Framework Agnostic | Evaluate agents from any framework via thin adapters without requiring protocol adoption or code recreation. |
-|**Benchmarks**| Pre-Implemented Benchmarks | Ships complete, ready-to-run benchmarks with environments, tools, and evaluators (not just templates). |
-|**Multi-turn User**| User-Agent Multi-turn | First-class user simulation with personas, stop tokens, and tool access for realistic multi-turn conversations. |
-|**No Lock-In**| No Vendor Lock-In | Fully open-source, works offline, permissive license (MIT/Apache), no mandatory cloud services or telemetry. |
-|**BYO**| BYO Philosophy | Bring your own logging, agents, environments, and tools — flexibility over opinionated defaults. |
-|**State-Action Eval**| Trace-First Evaluation | Evaluate intermediate steps and tool usage patterns via trace filtering, not just final output scoring. |
-|**Error Attr**| Structured Error Attribution | Structured exceptions distinguish between different failure for fair scoring (`AgentError` vs `EnvironmentError`). |
-|**Lightweight**| Lightweight | Minimal dependencies, small codebase (~20k LOC), quick time to first evaluation (~5-15 min). |
-|**Project Maturity**| Professional Tooling | Published on PyPI, CI/CD, good test coverage, structured logging, active maintenance, excellent docs. |
-|**Sandbox**| Sandboxed Execution | Built-in Docker/K8s/VM isolation for safe code execution (or BYO sandbox via abstract Environment). |
+|**Multi-Agent**| Multi-Agent Native | Native orchestration with per-agent tracing, independent message histories, and explicit coordination patterns. |
+|**System Eval**| System-Level Comparison | Compare different framework implementations on the same benchmark (not just swapping LLMs). |
+|**Agent-Agnostic**| Agent Framework Agnostic | Evaluate agents from any framework via thin adapters without requiring protocol adoption or code recreation. |
+|**Benchmarks**| Pre-Implemented Benchmarks | Ships complete, ready-to-run benchmarks with environments, tools, and evaluators (not just templates). |
+|**Flexible Interaction**| Flexible Agent-Environment-User | First-class user simulation with personas and tool access for realistic multi-turn conversations. |
+|**BYO**| BYO Philosophy | Bring your own logging, agents, environments, and tools. Open-source, works offline, no mandatory cloud services. |
+|**Trace-First**| Trace-First Evaluation | Evaluate intermediate steps across environment and agents via first-class traces, not post-hoc fixes. |
+|**Mature**| Professional Tooling | Published on PyPI, CI/CD, good test coverage, active maintenance. |

 </details>

@@ -122,3 +115,20 @@ We welcome any contributions. Please read the [CONTRIBUTING.md](CONTRIBUTING.md)
 This library includes implementations for several benchmarks to evaluate a variety of multi-agent scenarios. Each benchmark is designed to test specific collaboration and problem-solving skills.

 ➡️ **[See here for a full list and description of all available benchmarks including licenses.](./BENCHMARKS.md)**
+
+## Citation
+
+Please consider citing the MASEval library.
+
+```
+@misc{emde2026maseval,
+title={MASEval: Extending Multi-Agent Evaluation from Models to Systems},
+author={Cornelius Emde and Alexander Rubinstein and Anmol Goel and Ahmed Heakl and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
+year={2026},
+eprint={2603.08835},
+archivePrefix={arXiv},
+primaryClass={cs.AI},
+url={https://arxiv.org/abs/2603.08835},
+note={Alexander Rubinstein, Anmol Goel, and Ahmed Heakl contributed equally.},