[skip ci] updated readme (#43)

cemde · web-flow · commit 270c7e2ffb07 · 2026-03-11T10:01:47.000Z
* updated comparison table in README and docs
* added arXiv citation
diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,53 @@
+cff-version: 1.2.0
+title: "MASEval: Extending Multi-Agent Evaluation from Models to Systems"
+message: "If you use this software, please cite it as below."
+type: software
+authors:
+  - family-names: Emde
+    given-names: Cornelius
+  - family-names: Rubinstein
+    given-names: Alexander
+    equal-contrib: true
+  - family-names: Goel
+    given-names: Anmol
+    equal-contrib: true
+  - family-names: Heakl
+    given-names: Ahmed
+    equal-contrib: true
+  - family-names: Yun
+    given-names: Sangdoo
+  - family-names: Oh
+    given-names: Seong Joon
+  - family-names: Gubri
+    given-names: Martin
+year: 2026
+identifiers:
+  - type: url
+    value: "https://arxiv.org/abs/2603.08835"
+  - type: other
+    value: "2603.08835"
+    description: arXiv eprint
+preferred-citation:
+  type: article
+  title: "MASEval: Extending Multi-Agent Evaluation from Models to Systems"
+  authors:
+    - family-names: Emde
+      given-names: Cornelius
+    - family-names: Rubinstein
+      given-names: Alexander
+      equal-contrib: true
+    - family-names: Goel
+      given-names: Anmol
+      equal-contrib: true
+    - family-names: Heakl
+      given-names: Ahmed
+      equal-contrib: true
+    - family-names: Yun
+      given-names: Sangdoo
+    - family-names: Oh
+      given-names: Seong Joon
+    - family-names: Gubri
+      given-names: Martin
+  year: 2026
+  url: "https://arxiv.org/abs/2603.08835"
+  journal: "arXiv preprint arXiv:2603.08835"
diff --git a/README.md b/README.md
@@ -24,43 +24,36 @@ Analogous to pytest for testing or MLflow for ML experimentation, MASEval focuse
 
 Compare multi-agent evaluation frameworks across key capabilities.
 
-| Library           | Multi-Agent | System Evaluation | Agent-Agnostic | Benchmarks | Multi-turn User | No Lock-In | BYO | State-Action Eval | Error Attr | Lightweight | Project Maturity | Sandboxed Environment |
-| ----------------- | :---------: | :---------------: | :------------: | :--------: | :-------------: | :--------: | :-: | :---------------: | :--------: | :---------: | :--------------: | :-------------------: |
-| **MASEval**       |     ✅      |        ✅         |       ✅       |     ✅     |       ✅        |     ✅     | 🟢  |        ✅         |     ✅     |     ✅      |        ✅        |          🟢           |
-| **HAL Harness**   |     🟡      |        ✅         |       ✅       |     ✅     |       🟡        |     ✅     | 🟡  |        🟡         |     ❌     |     ✅      |        🟡        |          ✅           |
-| **AnyAgent**      |     🟡      |        ✅         |       ✅       |     ❌     |       🟡        |     ✅     | 🟢  |        🟡         |     ❌     |     ✅      |        ✅        |          ❌           |
-| **Inspect-AI**    |     🟡      |        ✅         |       🟡       |     ✅     |       🟡        |     ✅     | 🟡  |        🟡         |     ❌     |     🟡      |        ✅        |          ✅           |
-| **MLflow GenAI**  |     🟡      |        🟡         |       🟢       |     ❌     |       🟡        |     ✅     | 🟢  |        ✅         |     ❌     |     🟡      |        ✅        |          🟡           |
-| **LangSmith**     |     🟡      |        🟡         |       🟡       |     ❌     |       ✅        |     ❌     | 🟡  |        ✅         |     ❌     |     ✅      |        ✅        |          ❌           |
-| **OpenCompass**   |     ❌      |        🟡         |       ❌       |     ✅     |       🟡        |     ✅     | 🟡  |        🟡         |     ❌     |     ❌      |        ✅        |          🟡           |
-| **AgentGym**      |     ❌      |        ❌         |       ❌       |     ✅     |       🟡        |     ✅     | 🟢  |        🟡         |     ❌     |     ❌      |        🟡        |          🟡           |
-| **Arize Phoenix** |     🟡      |        ❌         |       🟡       |     ❌     |       ❌        |     🟡     | 🟢  |        ✅         |     ❌     |     🟡      |        ✅        |          ❌           |
-| **MARBLE**        |     ✅      |        ❌         |       ❌       |     ✅     |       ❌        |     ✅     | ❌  |        🟡         |     ?      |     🟡      |        🟡        |          🟡           |
-| **TruLens**       |     🟡      |        ❌         |       🟡       |     ❌     |       ❌        |     ✅     | 🟡  |        🟢         |     ❌     |     🟡      |        ✅        |          ❌           |
-| **AgentBeats**    |     🟡      |        ❌         |       🟡       |     ❌     |       ❌        |     🟡     | 🟡  |        🟡         |     ?      |     ✅      |        🟡        |          🟡           |
-| **DeepEval**      |     🟡      |        ❌         |       🟡       |     ❌     |       🟡        |     🟡     | 🟡  |        🟡         |     ❌     |     🟡      |        ✅        |          ❌           |
-| **MCPEval**       |     ❌      |        ❌         |       ❌       |     ✅     |       ❌        |     ✅     | 🟡  |        🟡         |     ❌     |     🟡      |        🟡        |          ❌           |
-| **Galileo**       |     🟡      |        ❌         |       🟡       |     ❌     |       ❌        |     ❌     | 🟡  |        🟡         |     ❌     |     🟡      |        ✅        |          ❌           |
-
-**✅** Full/Native · **🟢** Flexible for BYO · **🟡** Partial/Limited · **❌** Not possible
+| Library           | Multi-Agent | System Eval | Agent-Agnostic | Benchmarks | Flexible Interaction | BYO | Trace-First | Mature |
+| ----------------- | :---------: | :---------: | :------------: | :--------: | :------------------: | :-: | :---------: | :----: |
+| **MASEval**       |     ✅      |     ✅      |       ✅       |     ✅     |          ✅          | ✅  |     ✅      |   ✅   |
+| **AnyAgent**      |     🟡      |     ✅      |       ✅       |     ❌     |          🟡          | ✅  |     🟡      |   ✅   |
+| **MLflow GenAI**  |     🟡      |     🟡      |       ✅       |     ❌     |          🟡          | ✅  |     ✅      |   ✅   |
+| **HAL Harness**   |     🟡      |     ✅      |       ✅       |     ✅     |          🟡          | 🟡  |     🟡      |   🟡   |
+| **Inspect-AI**    |     🟡      |     ✅      |       🟡       |     ✅     |          🟡          | 🟡  |     🟡      |   ✅   |
+| **OpenCompass**   |     ❌      |     🟡      |       ❌       |     ✅     |          🟡          | 🟡  |     🟡      |   ✅   |
+| **AgentGym**      |     ❌      |     ❌      |       ❌       |     ✅     |          🟡          | ✅  |     🟡      |   🟡   |
+| **Arize Phoenix** |     🟡      |     ❌      |       🟡       |     ❌     |          ❌          | 🟡  |     ✅      |   ✅   |
+| **TruLens**       |     🟡      |     ❌      |       🟡       |     ❌     |          ❌          | 🟡  |     ✅      |   ✅   |
+| **MARBLE**        |     ✅      |     ❌      |       ❌       |     ✅     |          ❌          | ❌  |     🟡      |   🟡   |
+| **DeepEval**      |     🟡      |     ❌      |       🟡       |     ❌     |          🟡          | 🟡  |     🟡      |   ✅   |
+| **MCPEval**       |     ❌      |     ❌      |       ❌       |     ✅     |          ❌          | 🟡  |     🟡      |   🟡   |
+
+**✅** Full/Native · **🟡** Partial/Limited · **❌** Not supported
 
 <details>
 <summary>Expand for Column Explanation</summary>
 
-| Column                | Feature                      | One-Liner                                                                                                          |
-| --------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------ |
-| **Multi-Agent**       | Multi-Agent Native           | Native orchestration with per-agent tracing, independent message histories, and explicit coordination patterns.    |
-| **System Evaluation** | System-Level Comparison      | Compare different framework implementations on the same benchmark (not just swapping LLMs).                        |
-| **Agent Agnostic**    | Agent Framework Agnostic     | Evaluate agents from any framework via thin adapters without requiring protocol adoption or code recreation.       |
-| **Benchmarks**        | Pre-Implemented Benchmarks   | Ships complete, ready-to-run benchmarks with environments, tools, and evaluators (not just templates).             |
-| **Multi-turn User**   | User-Agent Multi-turn        | First-class user simulation with personas, stop tokens, and tool access for realistic multi-turn conversations.    |
-| **No Lock-In**        | No Vendor Lock-In            | Fully open-source, works offline, permissive license (MIT/Apache), no mandatory cloud services or telemetry.       |
-| **BYO**               | BYO Philosophy               | Bring your own logging, agents, environments, and tools — flexibility over opinionated defaults.                   |
-| **State-Action Eval** | Trace-First Evaluation       | Evaluate intermediate steps and tool usage patterns via trace filtering, not just final output scoring.            |
-| **Error Attr**        | Structured Error Attribution | Structured exceptions distinguish between different failure for fair scoring (`AgentError` vs `EnvironmentError`). |
-| **Lightweight**       | Lightweight                  | Minimal dependencies, small codebase (~20k LOC), quick time to first evaluation (~5-15 min).                       |
-| **Project Maturity**  | Professional Tooling         | Published on PyPI, CI/CD, good test coverage, structured logging, active maintenance, excellent docs.              |
-| **Sandbox**           | Sandboxed Execution          | Built-in Docker/K8s/VM isolation for safe code execution (or BYO sandbox via abstract Environment).                |
+| Column                   | Feature                         | One-Liner                                                                                                         |
+| ------------------------ | ------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
+| **Multi-Agent**          | Multi-Agent Native              | Native orchestration with per-agent tracing, independent message histories, and explicit coordination patterns.   |
+| **System Eval**          | System-Level Comparison         | Compare different framework implementations on the same benchmark (not just swapping LLMs).                       |
+| **Agent-Agnostic**       | Agent Framework Agnostic        | Evaluate agents from any framework via thin adapters without requiring protocol adoption or code recreation.      |
+| **Benchmarks**           | Pre-Implemented Benchmarks      | Ships complete, ready-to-run benchmarks with environments, tools, and evaluators (not just templates).            |
+| **Flexible Interaction** | Flexible Agent-Environment-User | First-class user simulation with personas and tool access for realistic multi-turn conversations.                 |
+| **BYO**                  | BYO Philosophy                  | Bring your own logging, agents, environments, and tools. Open-source, works offline, no mandatory cloud services. |
+| **Trace-First**          | Trace-First Evaluation          | Evaluate intermediate steps across environment and agents via first-class traces, not post-hoc fixes.             |
+| **Mature**               | Professional Tooling            | Published on PyPI, CI/CD, good test coverage, active maintenance.                                                 |
 
 </details>
 
@@ -122,3 +115,20 @@ We welcome any contributions. Please read the [CONTRIBUTING.md](CONTRIBUTING.md)
 This library includes implementations for several benchmarks to evaluate a variety of multi-agent scenarios. Each benchmark is designed to test specific collaboration and problem-solving skills.
 
 ➡️ **[See here for a full list and description of all available benchmarks including licenses.](./BENCHMARKS.md)**
+
+## Citation
+
+If you use MASEval in your work, please consider citing it:
+
+```bibtex
+@misc{emde2026maseval,
+      title={MASEval: Extending Multi-Agent Evaluation from Models to Systems},
+      author={Cornelius Emde and Alexander Rubinstein and Anmol Goel and Ahmed Heakl and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
+      year={2026},
+      eprint={2603.08835},
+      archivePrefix={arXiv},
+      primaryClass={cs.AI},
+      url={https://arxiv.org/abs/2603.08835},
+      note={Alexander Rubinstein, Anmol Goel, and Ahmed Heakl contributed equally.},
+}
+```
diff --git a/docs/index.md b/docs/index.md