# MMLU: Massive Multitask Language Understanding (Beta)

!!! warning "Beta"
    This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2407.12890) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.

## Overview

[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:

- **Log-likelihood MCQ evaluation** matching the lm-evaluation-harness methodology
- **Anchor-point task selection** via `AnchorPointsTaskQueue` for DISCO-style subset evaluation
- **HuggingFace integration** with batched log-probability computation
- **lm-eval compatibility** mode for exact numerical reproduction

See [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) for more information, including licenses.

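To make the log-likelihood MCQ evaluation listed above concrete, here is a minimal, self-contained sketch of the general technique: each answer choice gets a log-likelihood under the model, and the highest-scoring choice is taken as the prediction. The names `ScoredQuestion`, `predict`, and `accuracy` are hypothetical and used only for illustration; they are not part of the MASEval API, and the scores are made-up numbers rather than real model outputs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredQuestion:
    """Hypothetical container: one multiple-choice question with per-choice scores."""

    choices: Sequence[str]            # answer continuations, e.g. ["A", "B", "C", "D"]
    loglikelihoods: Sequence[float]   # log p(choice | prompt), one value per choice
    gold: int                         # index of the correct choice


def predict(q: ScoredQuestion) -> int:
    """Return the index of the choice with the highest log-likelihood."""
    return max(range(len(q.choices)), key=lambda i: q.loglikelihoods[i])


def accuracy(questions: Sequence[ScoredQuestion]) -> float:
    """Fraction of questions whose top-scoring choice is the gold answer."""
    if not questions:
        return 0.0
    correct = sum(predict(q) == q.gold for q in questions)
    return correct / len(questions)


# Illustrative scores only (not produced by a real model).
questions = [
    ScoredQuestion(["A", "B", "C", "D"], [-2.3, -0.4, -3.1, -1.9], gold=1),
    ScoredQuestion(["A", "B", "C", "D"], [-0.9, -1.2, -0.7, -2.5], gold=0),
]
print(accuracy(questions))  # 0.5
```

In this sketch the first question is answered correctly (choice index 1 has the highest score) and the second is not (index 2 wins, but the gold answer is index 0), so the accuracy is 0.5.
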
## Installation

MMLU has an optional dependency extra (currently empty, as core MMLU requires no additional packages):

```bash
pip install maseval[mmlu]
```

For the HuggingFace implementation, also install transformers:

```bash
pip install maseval[mmlu,transformers]
```

For DISCO prediction support:

```bash
pip install maseval[disco]
```

For exact lm-evaluation-harness reproduction:

```bash
pip install maseval[lm-eval]
```

## Quick Start

```python
from maseval.benchmark.mmlu import (
    HuggingFaceMMLUBenchmark,
    load_tasks,
    compute_benchmark_metrics,
)

# Load tasks (downloads from HuggingFace automatically)
tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json")

# Create benchmark with HuggingFace model
benchmark = HuggingFaceMMLUBenchmark(
    model_id="meta-llama/Llama-2-7b-hf",
    device="cuda:0",
)

# Run evaluation
results = benchmark.run(
    tasks=tasks,
    agent_data={"model_id": "meta-llama/Llama-2-7b-hf"},
)

# Compute metrics
metrics = compute_benchmark_metrics(results)
print(f"Accuracy: {metrics['acc']:.4f}")
```

### With Anchor Points (DISCO)

```python
from maseval.benchmark.mmlu import load_tasks

# Load tasks filtered to anchor points
tasks = load_tasks(
    data_path="/path/to/mmlu_prompts_examples.json",
    anchor_points_path="/path/to/anchor_points.json",
)

# tasks is an AnchorPointsTaskQueue; only anchor tasks are evaluated
print(f"Evaluating {len(tasks)} anchor tasks")
```

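As a rough illustration of what anchor-point filtering amounts to, the sketch below keeps only the tasks at a given set of indices. It assumes the anchor points are a JSON list of integer task indices, which is an assumption made for this example and not a documented file format; `filter_to_anchor_points` is a hypothetical helper, not the `AnchorPointsTaskQueue` implementation.

```python
import json
from typing import Any


def filter_to_anchor_points(tasks: list[Any], anchor_indices: list[int]) -> list[Any]:
    """Keep only the tasks at the listed positions, in the order the indices are given."""
    n = len(tasks)
    out_of_range = [i for i in anchor_indices if not 0 <= i < n]
    if out_of_range:
        raise IndexError(f"anchor indices out of range: {out_of_range}")
    return [tasks[i] for i in anchor_indices]


# Assumed anchor file content: a JSON list of task indices (hypothetical format).
anchor_indices = json.loads("[0, 3, 4]")
all_tasks = [f"question-{i}" for i in range(6)]

anchor_tasks = filter_to_anchor_points(all_tasks, anchor_indices)
print(anchor_tasks)  # ['question-0', 'question-3', 'question-4']
```

Evaluating only the selected subset is what makes the approach cheaper than a full run; estimating full-benchmark performance from those results is the job of the DISCO predictor and is not shown here.
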
## Custom Benchmark Subclass

`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`:

```python
from maseval.benchmark.mmlu import MMLUBenchmark, MMLUModelAgent, MMLUAgentAdapter

class MyMMLUBenchmark(MMLUBenchmark):
    def setup_agents(self, agent_data, environment, task, user, seed_generator):
        # Build the model adapter for the requested model and wrap it in an MMLU agent
        model = self.get_model_adapter(agent_data["model_id"])
        agent = MMLUModelAgent(model, name="mmlu_agent")
        adapter = MMLUAgentAdapter(agent, "mmlu_agent")
        return [adapter], {"mmlu_agent": adapter}

    def get_model_adapter(self, model_id, **kwargs):
        # MyModelAdapter is a placeholder for your own model adapter class (not defined here)
        adapter = MyModelAdapter(model_id)
        # Register the adapter only when a register_name is supplied
        register_name = kwargs.get("register_name")
        if register_name:
            self.register("models", register_name, adapter)
        return adapter
```

## API Reference

::: maseval.benchmark.mmlu.MMLUBenchmark

::: maseval.benchmark.mmlu.HuggingFaceMMLUBenchmark

::: maseval.benchmark.mmlu.MMLUEnvironment

::: maseval.benchmark.mmlu.MMLUEvaluator

::: maseval.benchmark.mmlu.MMLUModelAgent

::: maseval.benchmark.mmlu.MMLUAgentAdapter

::: maseval.benchmark.mmlu.load_tasks

::: maseval.benchmark.mmlu.compute_benchmark_metrics