Commit 304e54c

[Move DISCO queue to core]:

- AnchorPointsTaskQueue moved to core (maseval/core/task.py)
- MMLUBenchmark no longer implements agents
- Remove silent .get() fallbacks for required fields
- Add mmlu = [] extra to pyproject.toml
- Add MMLU entry to BENCHMARKS.md
- Update documentation with MMLU
- Update CHANGELOG.md
1 parent d166e87 commit 304e54c
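To illustrate why the commit message calls the `.get()` fallbacks "silent", here is a minimal sketch in plain Python. It is not MASEval code: the `task` dictionaries and the helpers `read_gold_lenient` and `read_gold_strict` are hypothetical, and the fallback value `0` is made up. Only the field name `gold` comes from the changelog entry in this commit.

```python
# Hypothetical illustration only; these helpers are not part of MASEval.


def read_gold_lenient(task: dict) -> int:
    # Silent fallback: a missing "gold" label is replaced by a default,
    # so a malformed task is scored as if its answer were choice 0.
    return task.get("gold", 0)


def read_gold_strict(task: dict) -> int:
    # Direct indexing: a missing "gold" label raises KeyError right away.
    return task["gold"]


malformed_task = {"query": "2 + 2 = ?"}  # required "gold" field is absent

print(read_gold_lenient(malformed_task))  # prints 0; the problem goes unnoticed

try:
    read_gold_strict(malformed_task)
except KeyError as missing:
    print(f"missing required field: {missing}")
```

With the strict variant, bad input stops the run at the point where the data is read, rather than showing up later as a slightly wrong accuracy number.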

9 files changed

Lines changed: 280 additions & 203 deletions

BENCHMARKS.md

Lines changed: 15 additions & 1 deletion
@@ -79,7 +79,21 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses

 ---

-## 6. [Name of Next Benchmark]
+## 6. MMLU (Massive Multitask Language Understanding) (Beta)
+
+MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.
+
+> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
+### Source and License
+
+- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021)
+- **DISCO Paper:** [DISCO: DISCOvering key features for accurate prediction of LLM abilities on benchmarks](https://arxiv.org/abs/2407.12890) (Rubinstein et al., 2025)
+- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU)
+
+---
+
+## 7. [Name of Next Benchmark]

 (Description for the next benchmark...)

CHANGELOG.md

Lines changed: 4 additions & 1 deletion
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 **Benchmarks**

-- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
+- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)

 - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

@@ -35,11 +35,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Examples**

 - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
+- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34)
 - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

 **Core**

+- Added `AnchorPointsTaskQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import AnchorPointsTaskQueue`. (PR: #34)
 - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
 - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
 - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
@@ -86,6 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 **Benchmarks**

+- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `AnchorPointsTaskQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
 - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
 - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
 - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
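The changelog hunk above notes that `AnchorPointsTaskQueue` now extends `SequentialTaskQueue`. As a rough mental model only, the sketch below uses hypothetical stand-in classes (`SketchSequentialQueue`, `SketchAnchorQueue`). It does not reproduce MASEval's actual class signatures, which this diff does not show, and it assumes anchor points are given as integer indices into the full task list.

```python
from typing import Iterator, List, Sequence


class SketchSequentialQueue:
    """Hypothetical stand-in: yields tasks in their stored order."""

    def __init__(self, tasks: Sequence[str]) -> None:
        self._tasks: List[str] = list(tasks)

    def __iter__(self) -> Iterator[str]:
        return iter(self._tasks)

    def __len__(self) -> int:
        return len(self._tasks)


class SketchAnchorQueue(SketchSequentialQueue):
    """Hypothetical stand-in: keeps only the tasks at the anchor indices."""

    def __init__(self, tasks: Sequence[str], anchor_indices: Sequence[int]) -> None:
        # Select the subset once, up front; iteration is then plain sequential.
        super().__init__([tasks[i] for i in anchor_indices])


all_tasks = [f"task_{i}" for i in range(10)]
queue = SketchAnchorQueue(all_tasks, anchor_indices=[7, 2, 5])

print(len(queue))   # 3
print(list(queue))  # ['task_7', 'task_2', 'task_5']
```

Because the subset is fixed before the run starts, nothing in this sketch has to adapt to intermediate results, which is one plausible reading of why a sequential base class is sufficient.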

docs/benchmark/mmlu.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
+# MMLU: Massive Multitask Language Understanding (Beta)
+
+!!! warning "Beta"
+    This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
+The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2407.12890) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.
+
+## Overview
+
+[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:
+
+- **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology
+- **Anchor-point task selection** via `AnchorPointsTaskQueue` for DISCO-style subset evaluation
+- **HuggingFace integration** with batched log-probability computation
+- **lm-eval compatibility** mode for exact numerical reproduction
+
+Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
+
+## Installation
+
+MMLU has an optional dependency extra (currently empty, as core MMLU requires no additional packages):
+
+```bash
+pip install maseval[mmlu]
+```
+
+For the HuggingFace implementation, also install transformers:
+
+```bash
+pip install maseval[mmlu,transformers]
+```
+
+For DISCO prediction support:
+
+```bash
+pip install maseval[disco]
+```
+
+For exact lm-evaluation-harness reproduction:
+
+```bash
+pip install maseval[lm-eval]
+```
+
+## Quick Start
+
+```python
+from maseval.benchmark.mmlu import (
+    HuggingFaceMMLUBenchmark,
+    load_tasks,
+    compute_benchmark_metrics,
+)
+
+# Load tasks (downloads from HuggingFace automatically)
+tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json")
+
+# Create benchmark with HuggingFace model
+benchmark = HuggingFaceMMLUBenchmark(
+    model_id="meta-llama/Llama-2-7b-hf",
+    device="cuda:0",
+)
+
+# Run evaluation
+results = benchmark.run(
+    tasks=tasks,
+    agent_data={"model_id": "meta-llama/Llama-2-7b-hf"},
+)
+
+# Compute metrics
+metrics = compute_benchmark_metrics(results)
+print(f"Accuracy: {metrics['acc']:.4f}")
+```
+
+### With Anchor Points (DISCO)
+
+```python
+from maseval.benchmark.mmlu import load_tasks
+
+# Load tasks filtered to anchor points
+tasks = load_tasks(
+    data_path="/path/to/mmlu_prompts_examples.json",
+    anchor_points_path="/path/to/anchor_points.json",
+)
+
+# tasks is an AnchorPointsTaskQueue — only anchor tasks are evaluated
+print(f"Evaluating {len(tasks)} anchor tasks")
+```
+
+## Custom Benchmark Subclass
+
+`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`:
+
+```python
+from maseval.benchmark.mmlu import MMLUBenchmark, MMLUModelAgent, MMLUAgentAdapter
+
+class MyMMLUBenchmark(MMLUBenchmark):
+    def setup_agents(self, agent_data, environment, task, user, seed_generator):
+        model = self.get_model_adapter(agent_data["model_id"])
+        agent = MMLUModelAgent(model, name="mmlu_agent")
+        adapter = MMLUAgentAdapter(agent, "mmlu_agent")
+        return [adapter], {"mmlu_agent": adapter}
+
+    def get_model_adapter(self, model_id, **kwargs):
+        adapter = MyModelAdapter(model_id)
+        register_name = kwargs.get("register_name")
+        if register_name:
+            self.register("models", register_name, adapter)
+        return adapter
+```
+
+## API Reference
+
+::: maseval.benchmark.mmlu.MMLUBenchmark
+
+::: maseval.benchmark.mmlu.HuggingFaceMMLUBenchmark
+
+::: maseval.benchmark.mmlu.MMLUEnvironment
+
+::: maseval.benchmark.mmlu.MMLUEvaluator
+
+::: maseval.benchmark.mmlu.MMLUModelAgent
+
+::: maseval.benchmark.mmlu.MMLUAgentAdapter
+
+::: maseval.benchmark.mmlu.load_tasks
+
+::: maseval.benchmark.mmlu.compute_benchmark_metrics
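The overview in the new docs page mentions log-likelihood multiple-choice evaluation. Below is a minimal sketch of that general idea, assuming a model has already produced one log-likelihood per answer choice: predict the highest-scoring choice and compare it with the gold index. The function names and the numbers are made up for illustration; this is not MASEval's `MMLUEvaluator` or `compute_benchmark_metrics`, whose internals this diff does not show.

```python
from typing import List, Sequence


def predict_choice(choice_logprobs: Sequence[float]) -> int:
    """Return the index of the answer choice with the highest log-likelihood."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def accuracy(all_logprobs: List[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions whose top-scoring choice matches the gold index."""
    correct = sum(predict_choice(lp) == g for lp, g in zip(all_logprobs, gold))
    return correct / len(gold)


# Made-up log-likelihoods for three questions with choices A-D.
logprobs = [
    [-1.2, -0.3, -2.5, -3.0],  # highest is index 1 (B)
    [-0.9, -1.1, -0.2, -4.0],  # highest is index 2 (C)
    [-0.5, -0.6, -0.7, -0.8],  # highest is index 0 (A)
]
gold = [1, 2, 3]

print(accuracy(logprobs, gold))  # 2 of 3 predictions match the gold labels
```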

maseval/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,7 @@
     BaseTaskQueue,
     TaskQueue,
     SequentialTaskQueue,
+    AnchorPointsTaskQueue,
     PriorityTaskQueue,
     AdaptiveTaskQueue,
 )
@@ -93,6 +94,7 @@
     "BaseTaskQueue",
     "TaskQueue",
     "SequentialTaskQueue",
+    "AnchorPointsTaskQueue",
     "PriorityTaskQueue",
     "AdaptiveTaskQueue",
     # Model adapters

maseval/benchmark/mmlu/__init__.py

Lines changed: 7 additions & 10 deletions
@@ -4,31 +4,30 @@

 Usage:
     from maseval.benchmark.mmlu import (
-        MMLUBenchmark,
-        MMLUEnvironment,
-        MMLUEvaluator,
+        HuggingFaceMMLUBenchmark,
         load_tasks,
-        AnchorPointsTaskQueue,
     )
+    from maseval import AnchorPointsTaskQueue

     # Load tasks and anchor points
     tasks = load_tasks(
         data_path="path/to/mmlu_prompts_examples.json",
         anchor_points_path="path/to/anchor_points.pkl", # Optional
     )

-    # Create benchmark
-    benchmark = MMLUBenchmark()
-    results = benchmark.run(tasks=tasks, agent_data={"model_id": "gpt-4"})
+    # Run benchmark
+    benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf")
+    results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"})
 """

+from maseval import AnchorPointsTaskQueue
+
 from .mmlu import (
     DEFAULT_AGENT_NAME,
     DEFAULT_BATCH_SIZE,
     DEFAULT_CHOICES,
     DEFAULT_DEVICE,
     DEFAULT_MODEL_REGISTER_NAME,
-    FALLBACK_MODEL_ID,
     MMLU_TASK_NAME,
     STATUS_SUCCESS,
     TARGET_DELIMITER,
@@ -39,7 +38,6 @@
     MMLUEvaluator,
     MMLUModelAgent,
     MMLUAgentAdapter,
-    AnchorPointsTaskQueue,
     load_tasks,
     compute_benchmark_metrics,
 )
@@ -50,7 +48,6 @@
     "DEFAULT_CHOICES",
     "DEFAULT_DEVICE",
     "DEFAULT_MODEL_REGISTER_NAME",
-    "FALLBACK_MODEL_ID",
     "MMLU_TASK_NAME",
     "STATUS_SUCCESS",
     "TARGET_DELIMITER",
