Commit 3779e2e

[Move DISCO queue to core]:
- Update links in mmlu.md
1 parent dd46f1a commit 3779e2e

1 file changed

Lines changed: 2 additions & 2 deletions

docs/benchmark/mmlu.md

@@ -3,13 +3,13 @@
 !!! warning "Beta"
 This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
 
-The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2407.12890) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.
+The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2510.07959) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.
 
 ## Overview
 
 [MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:
 
-- **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology
+- **Log-likelihood MCQ evaluation** matching [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) methodology
 - **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation
 - **HuggingFace integration** with batched log-probability computation
 - **lm-eval compatibility** mode for exact numerical reproduction
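
For readers unfamiliar with the log-likelihood MCQ evaluation named in the feature list, the general idea can be sketched as follows. This is a minimal illustration, not MASEval's or lm-evaluation-harness's actual API: the function names `pick_choice` and `accuracy` are hypothetical, and the per-token log-probabilities are assumed to have already been computed by the model for each answer continuation.

```python
from typing import Sequence


def pick_choice(token_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the answer choice with the highest total log-likelihood.

    token_logprobs[i] holds the per-token log-probabilities the model assigned
    to the continuation for choice i, conditioned on the question prompt.
    (Hypothetical helper for illustration only.)
    """
    # The log-likelihood of a continuation is the sum of its token log-probs.
    totals = [sum(lps) for lps in token_logprobs]
    # Argmax over choices; ties resolve to the lowest index.
    return max(range(len(totals)), key=totals.__getitem__)


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the predicted choice matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# One four-way question where each choice is a single token (e.g. "A".."D").
question = [[-2.3], [-0.4], [-1.9], [-3.0]]
prediction = pick_choice(question)
```

Scoring by log-likelihood avoids parsing free-form generations: the model is never asked to produce text, only to rank the fixed set of answer continuations.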
