18 changes: 12 additions & 6 deletions _events/2026-acl-workshop.md
@@ -1,13 +1,13 @@
---
layout: event
title: 2026 ACL Workshop on Evaluating Evaluations (EvalEval)
title: ACL 2026 Workshop on Evaluating Evaluations (EvalEval)
subtitle: Examining Best Practices for Utilizing and Developing Generative Model Evaluations
team: Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat
status: active
order: 1
category: Organization
event_date: 2026-07-04
location: Room Harbor A
location: San Diego (USA), Room Harbor A
host: EvalEval
description: |
This workshop focuses on AI evaluation in practice, centering the tensions and collaborations between model developers and evaluation researchers, and aims to surface practical insights from across the evaluation ecosystem.
@@ -73,7 +73,7 @@ This panel brings together model developers and evaluation researchers to examin
### 🔬 4:25 PM – 5:15 PM | Shared Task *(50 mins)*
**Moderator:** Jan Batzner, Weizenbaum Institute, Technical University Munich

AI evaluation results are scattered across leaderboards, papers, blog posts, and harness logs in incompatible formats, with different frameworks producing divergent scores and inconsistent metadata that hinder comparison, reuse, and cost reduction. **Every Eval Ever** is the first shared schema and community-crowdsourced repository for AI evaluation results — source-agnostic by design and at unprecedented scale: **22,235 models, 2,273 unique benchmarks, and 31 evaluation formats — and growing**. At ACL, we present the [Every Eval Ever Shared Task](https://github.com/evaleval/every_eval_ever) and the community case studies it has enabled on this data. 🖊️
AI evaluation results are scattered across leaderboards, papers, blog posts, and harness logs in incompatible formats, with different frameworks producing divergent scores and inconsistent metadata that hinder comparison, reuse, and [cost](https://evalevalai.com/research/2026/04/29/eval-costs-bottleneck/) reduction. [**Every Eval Ever**](https://evalevalai.com/infrastructure/2026/02/17/everyevalever-launch/) is the first unifying schema and community-crowdsourced repository for all AI evaluation results — source-agnostic by design and at unprecedented scale: over 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats — and growing! At ACL, we present the [ACL Every Eval Ever Shared Task](https://evalevalai.com/events/shared-task-every-eval-ever/) and the fantastic community case studies it has already enabled on this [database](https://huggingface.co/datasets/evaleval/EEE_datastore).
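As a minimal sketch of how one might browse this datastore, assuming the Hugging Face `datasets` library and that the repository linked above loads with its default configuration (the config, split, and field layout below are assumptions for illustration, not the documented interface):

```python
# Minimal sketch: browsing the Every Eval Ever datastore with the Hugging Face
# `datasets` library. The repository id comes from the link above; the default
# config and split names are assumptions and may need adjusting to match the
# dataset card.
from datasets import load_dataset

eee = load_dataset("evaleval/EEE_datastore")   # loads all available splits
print(eee)                                     # show splits and row counts
first_split = next(iter(eee))                  # pick whichever split is listed first
print(eee[first_split].column_names)           # inspect the shared schema's fields
print(eee[first_split][0])                     # look at a single evaluation record
```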

---

@@ -173,15 +173,21 @@ To support a fair, high-quality, and sustainable review process, we adopt a reci


## ❓ FAQ
**Can I attend this workshop online?**
The workshop is in-person at ACL 2026 in San Diego. At least one author of each accepted paper must present and register on-site.

**How do I change from non-archival to archival?**
Please indicate in the form sent to all accepted authors whether you want your submission to be considered archival or non-archival.

**Do I need to upload a Camera Ready if I selected non-archival?**
We encourage all authors to deanonymize their papers and incorporate the reviewers' feedback in their revision. Note, however, that the Camera Ready instructions are tailored to archival submissions.

**I'm waiting for my ARR decision — can I still submit to EvalEval?**
Yes! If your paper is later accepted at ACL, you can simply choose our non-archival option.

**Can I also submit in the ICML format?**
No, please use the [ARR formatting](https://github.com/acl-org/acl-style-files).

**Can I attend this workshop online?**
The workshop is in-person at ACL 2026 in San Diego. At least one author of each accepted paper must present on-site.

**My position paper is 6 pages. Does that work?**
Yes, all submission types (research and positions/provocations) are welcome at any of the three length tiers.

44 changes: 20 additions & 24 deletions _events/2026-facct-tutorial.md
@@ -1,52 +1,48 @@
---
layout: event
title: "🎓 Every Eval Ever: Building Community-Governed AI Evaluation Infrastructure"
subtitle: ACM FAccT 2026 Tutorial
team: Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Avijit Ghosh, Angelie Kraft, Wm. Matthew Kennedy, Leon Staufer, David Hartmann, Usman Gohar, Michelle Lin, Yanan Long, Jennifer Mickel, Leshem Choshen, Irene Solaiman
title: "FAccT 2026 Tutorial on Every Eval Ever"
subtitle: Building Community-Governed AI Evaluation Infrastructure
team: Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Avijit Ghosh, Angelie Kraft, Usman Gohar, Michelle Lin, Yanan Long, Jennifer Mickel, Wm. Matthew Kennedy, Leon Staufer, David Hartmann, Leshem Choshen*, Irene Solaiman*
status: active
order: 1
category: Organization
event_date: 2026-06-26
location: ACM FAccT, Montreal
location: Montreal (Canada)
host: EvalEval
description: |
A FAccT 2026 tutorial walking through Every Eval Ever — a community-governed open source infrastructure that unifies evaluation results under a shared metadata schema — and Evaluation Cards, an interpretive layer for evaluation reporting.
A FAccT 2026 tutorial walking through Every Eval Ever — a community-governed open source infrastructure unifying evaluation results under a shared metadata schema — and Evaluation Cards, an interpretive layer for evaluation reporting.
---

## 📖 About

## 🪧 About
Existing model evaluation results are scattered across leaderboards, papers, and technical reports in incompatible formats. This fragmentation obscures transparency, hinders progress, and disadvantages researchers, civil society, policymakers, and industry alike, especially those who can't afford to run evaluations from scratch. Built once and shared, eval infrastructure serves us all.

In this tutorial, we walk through [Every Eval Ever](https://github.com/evaleval/every_eval_ever), a community-governed open source infrastructure that unifies all evaluation results under a shared metadata schema. We then present **Evaluation Cards**, an interface and interpretive layer for evaluation reporting designed around practitioner needs from stakeholder interviews, and show how participants can find, compare, and contribute evaluations themselves.

All technical experience levels are welcome. **If you can, please bring a laptop or tablet!** 💻
In this tutorial, we walk through [**Every Eval Ever**](https://evalevalai.com/infrastructure/2026/02/17/everyevalever-launch/), a community-governed open source infrastructure that unifies all evaluation results under a shared metadata schema. We then present [**Evaluation Cards**](https://evalcards.evalevalai.com), an interface and interpretive layer for evaluation reporting designed around practitioner needs from stakeholder interviews, and show how participants can find, compare, and contribute evaluations themselves.

## 📅 Date & Location
All technical experience levels are welcome. If you can, please bring a laptop or tablet! 💻

- **When:** Friday, June 26, 2026 · 3:00 – 4:00 PM
## 📅 In-Person FAccT Tutorial
- **When:** Friday, June 26, 2026 · 3:00 – 4:00 PM (Canada)
- **Where:** ACM FAccT 2026, Montreal (in person)

## 🎤 Presenters
## 🌐 Online FAccT Tutorial
- **When:** Friday, June 26, 2026 · TBD
- **Where:** Zoom Video Conference

## 🏛️ Tutorial Program Committee
- Jan Batzner, Weizenbaum Institute, Technical University Munich
- Sree Harsha Nelaturu, Zuse Institute
- Anastassia Kornilova, Trustible
- Avijit Ghosh, Hugging Face
- Angelie Kraft, Weizenbaum Institute
- Wm. Matthew Kennedy, Oxford
- Leon Staufer, Cambridge
- David Hartmann, Weizenbaum Institute
- Usman Gohar, Iowa State University
- Michelle Lin, Mila, Quebec AI Institute
- Yanan Long, StickFluxLabs
- Jennifer Mickel, EleutherAI
- Leshem Choshen, MIT, IBM Research
- Irene Solaiman, Hugging Face

## 🏛️ Organizers

[EvalEval Coalition](/about/)
- Wm. Matthew Kennedy, University of Oxford
- Leon Staufer, University of Cambridge
- David Hartmann, Weizenbaum Institute
- Leshem Choshen*, MIT, IBM Research, MIT-IBM Watson AI Lab
- Irene Solaiman*, Hugging Face

## 📬 Contact

- [evalevalpc@googlegroups.com](mailto:evalevalpc@googlegroups.com)
We look forward to meeting you! For any questions, please reach out to the [EvalEval Organizing Team](mailto:evalevalpc@googlegroups.com).
3 changes: 2 additions & 1 deletion _events/shared-task-every-eval-ever.md
@@ -1,13 +1,14 @@
---
layout: event
title: "Shared Task: Every Eval Ever"
title: "ACL Shared Task on Every Eval Ever"
subtitle: Building a Unifying, Standardized Database of LLM Evaluations
status: active
order: 2
category: Infrastructure
event_date: 2026-05-01
location: 🌐 Online
host: EvalEval
team: Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Damian Stachura, Stella Biderman, Irene Solaiman, Avijit Ghosh, Leshem Choshen
description: |
Help us build the first unifying, open database of LLM evaluation results! Convert evaluation data from leaderboards, papers, or your own runs into a shared format — and join as co-author on the resulting paper.
---