|
1 | 1 | # MMLU Benchmark Example |
2 | 2 |
|
3 | | -This example demonstrates how to evaluate HuggingFace language models on MMLU (Massive Multitask Language Understanding) using MASEval, with optional anchor point-based task selection for DISCO prediction. |
| 3 | +Evaluate language models on MMLU (Massive Multitask Language Understanding) with optional efficient evaluation via DISCO. |
4 | 4 |
|
5 | | -## Overview |
| 5 | +## Run without DISCO (full evaluation) |
6 | 6 |
|
7 | | -The MMLU benchmark evaluates language models on multiple choice questions across 57 subjects including STEM, humanities, social sciences, and more. This implementation is compatible with the [disco-public](https://github.com/parameterlab/disco-public) evaluation methodology. |
8 | | - |
9 | | -### Key Features |
10 | | - |
11 | | -- **Anchor Point-Based Evaluation**: Evaluate only on selected anchor tasks for efficient DISCO-based performance prediction |
12 | | -- **Full Prompt Support**: Use few-shot examples from `full_prompt` field (like lm-evaluation-harness) |
13 | | -- **HuggingFace Integration**: Works with any HuggingFace transformers model |
14 | | -- **DISCO-Compatible Output**: Saves predictions in format compatible with DISCO predictor |
15 | | - |
16 | | -## Installation |
17 | | - |
18 | | -```bash |
19 | | -# Install MASEval with all extras (includes transformers) |
20 | | -pip install "maseval[all]" |
21 | | - |
22 | | -# Or install with specific extras |
23 | | -pip install "maseval[transformers]" |
24 | | -``` |
25 | | - |
26 | | -## Data Format |
27 | | - |
28 | | -The benchmark expects a JSON file in the `mmlu_prompts_examples.json` format: |
29 | | - |
30 | | -```json |
31 | | -[ |
32 | | - { |
33 | | - "query": "Question text with answer choices...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:", |
34 | | - "full_prompt": "The following are multiple choice questions... [few-shot examples] ... [question]", |
35 | | - "choices": ["A", "B", "C", "D"], |
36 | | - "gold": 1 |
37 | | - }, |
38 | | - ... |
39 | | -] |
40 | | -``` |
41 | | - |
42 | | -## Usage |
43 | | - |
44 | | -### Basic Evaluation |
45 | | - |
46 | | -Evaluate a model on all MMLU tasks (uses `arubique/flattened-MMLU` by default): |
47 | | - |
48 | | -```bash |
49 | | -python mmlu_benchmark.py --model_id "meta-llama/Llama-2-7b-hf" |
50 | | -``` |
51 | | - |
52 | | -To use a local JSON file or another Hugging Face dataset: |
53 | | - |
54 | | -```bash |
55 | | -python mmlu_benchmark.py \ |
56 | | - --model_id "meta-llama/Llama-2-7b-hf" \ |
57 | | - --data_path /path/to/mmlu_prompts_examples.json |
58 | | -``` |
59 | | - |
60 | | -### Anchor Points Evaluation (for DISCO) |
61 | | - |
62 | | -Evaluate only on anchor tasks for DISCO prediction: |
63 | | - |
64 | | -```bash |
65 | | -python mmlu_benchmark.py \ |
66 | | - --model_id "alignment-handbook/zephyr-7b-sft-full" \ |
67 | | - --data_path /path/to/mmlu_prompts_examples.json \ |
68 | | - --anchor_points_path /path/to/anchor_points_disagreement.pkl \ |
69 | | - --predictions_path ./output/predictions.pkl |
70 | | -``` |
71 | | - |
72 | | -### DISCO Prediction (Predict Full Benchmark from Anchor Points) |
73 | | - |
74 | | -Evaluate on anchor points and predict full benchmark performance using DISCO: |
75 | | - |
76 | | -```bash |
77 | | -python mmlu_benchmark.py \ |
78 | | - --model_id "alignment-handbook/zephyr-7b-sft-full" \ |
79 | | - --data_path /path/to/mmlu_prompts_examples.json \ |
80 | | - --anchor_points_path /path/to/anchor_points_disagreement.pkl \ |
81 | | - --disco_model_path /path/to/fitted_weights.pkl \ |
82 | | - --disco_transform_path /path/to/transform.pkl \ |
83 | | - --pca 256 |
84 | | -``` |
85 | | - |
86 | | -This is equivalent to running: |
87 | | -1. The evaluation on anchor points (like `scripts/run_lm_eval.py --skip_non_anchor_points`) |
88 | | -2. The prediction step (like `scripts/predict_model_performance.py`) |
89 | | - |
90 | | -#### Avoiding pickle / sklearn version warnings |
91 | | - |
92 | | -If you see `InconsistentVersionWarning` when loading the DISCO pickles (e.g. fitted_weights.pkl was saved with sklearn 1.7.2 but you use 1.8.0), you can export the model to NumPy-only `.npz` files and use those instead: |
93 | | - |
94 | | -```bash |
95 | | -# One-time extraction (run with the env that created the pickles, or ignore the warning) |
96 | | -python extract_disco_weights.py \ |
97 | | - --model_path /path/to/fitted_weights.pkl \ |
98 | | - --transform_path /path/to/transform.pkl \ |
99 | | - --output_dir /path/to/disco_npz |
100 | | -``` |
101 | | - |
102 | | -Then point the benchmark at the extracted files: |
| 7 | +From the project root: |
103 | 8 |
|
104 | 9 | ```bash |
105 | | -python mmlu_benchmark.py ... \ |
106 | | - --disco_model_path /path/to/disco_npz/disco_model.npz \ |
107 | | - --disco_transform_path /path/to/disco_npz/disco_transform.npz \ |
108 | | - --pca 256 |
| 10 | +uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full |
109 | 11 | ``` |
110 | 12 |
|
111 | | -No pickle or sklearn objects are loaded at runtime when using `.npz` paths. |
112 | | - |
113 | | -#### Loading DISCO from Hugging Face |
114 | | - |
115 | | -You can upload the extracted weights as a custom Hugging Face model and load it with `AutoModel.from_pretrained(..., trust_remote_code=True)`. See [huggingface_disco/README.md](huggingface_disco/README.md) for: |
| 13 | +Full evaluation results look like: |
116 | 14 |
|
117 | | -1. Extracting weights (`extract_disco_weights.py`) |
118 | | -2. Building the HF repo (`huggingface_disco/build_repo.py`) |
119 | | -3. Uploading to the Hub |
120 | | - |
121 | | -Then run the benchmark with a repo id instead of local paths: |
122 | | - |
123 | | -```bash |
124 | | -python mmlu_benchmark.py ... \ |
125 | | - --disco_model_path <USERNAME>/my-disco-mmlu |
126 | 15 | ``` |
127 | | - |
128 | | -(No `--disco_transform_path` needed; the Hub model contains everything.) |
129 | | - |
130 | | -### Quick Test |
131 | | - |
132 | | -Run on a small subset for testing: |
133 | | - |
134 | | -```bash |
135 | | -python mmlu_benchmark.py \ |
136 | | - --model_id "meta-llama/Llama-2-7b-hf" \ |
137 | | - --data_path /path/to/mmlu_prompts_examples.json \ |
138 | | - --limit 10 |
| 16 | +================================================================================ |
| 17 | +Results Summary (Evaluated Tasks) |
| 18 | +================================================================================ |
| 19 | +Total tasks: 100 |
| 20 | +Correct: 35 |
| 21 | +Accuracy (on anchor points): 0.3500 |
| 22 | +Accuracy norm (on anchor points): 0.3500 |
| 23 | +Built predictions tensor with shape: (1, 100, 31) |
139 | 24 | ``` |
140 | 25 |
|
141 | | -## Command Line Arguments |
142 | | - |
143 | | -| Argument | Description | Default | |
144 | | -|----------|-------------|---------| |
145 | | -| `--model_id` | HuggingFace model identifier (required) | - | |
146 | | -| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` | |
147 | | -| `--anchor_points_path` | Path to anchor points pickle file | None | |
148 | | -| `--output_dir` | Directory to save results | `./results` | |
149 | | -| `--predictions_path` | Path to save predictions pickle (for DISCO) | None | |
150 | | -| `--limit` | Limit number of tasks to evaluate | None | |
151 | | -| `--device` | Device to run model on | `cuda:0` | |
152 | | -| `--num_workers` | Number of parallel workers | 1 | |
153 | | -| `--disco_model_path` | If set, run DISCO prediction (.pkl, .npz, or HF repo id) | None | |
154 | | -| `--disco_transform_path` | Path to DISCO PCA transform pickle | None | |
155 | | -| `--pca` | PCA dimension for DISCO embeddings | 256 | |
| 26 | +## Run with DISCO (predicted full-benchmark score) |
156 | 27 |
|
157 | | -## Equivalent disco-public Command |
158 | | - |
159 | | -This MASEval benchmark provides equivalent functionality to: |
| 28 | +From the project root: |
160 | 29 |
|
161 | 30 | ```bash |
162 | | -python scripts/run_lm_eval.py \ |
163 | | - --anchor_points_path=/path/to/anchor_points_disagreement.pkl \ |
164 | | - --batch_size=8 --device=cuda:0 \ |
165 | | - --gen_kwargs=max_gen_toks=128,output_scores=True,return_dict_in_generate=True \ |
166 | | - --metric=acc_norm --model=hf \ |
167 | | - --model_args=pretrained=alignment-handbook/zephyr-7b-sft-full,trust_remote_code=True \ |
168 | | - --num_fewshot=0 \ |
169 | | - --output_path=/path/to/output \ |
170 | | - --tasks=mmlu_prompts --log_samples --force_recompute \ |
171 | | - --use_full_prompt --skip_non_anchor_points |
| 31 | +uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full --disco_model_path arubique/DISCO-MMLU |
172 | 32 | ``` |
173 | 33 |
|
174 | | -## Programmatic Usage |
175 | | - |
176 | | -```python |
177 | | -from maseval.benchmark.mmlu import ( |
178 | | - MMLUBenchmark, |
179 | | - load_tasks, |
180 | | - compute_benchmark_metrics, |
181 | | -) |
182 | | -from maseval.interface.inference import HuggingFaceModelAdapter |
183 | | - |
184 | | - |
185 | | -class MyMMLUBenchmark(MMLUBenchmark): |
186 | | - def __init__(self, model_id: str, **kwargs): |
187 | | - super().__init__(**kwargs) |
188 | | - self._model_id = model_id |
189 | | - self._pipeline = None |
| 34 | +Predicted score output: |
190 | 35 |
|
191 | | - def get_model_adapter(self, model_id: str, **kwargs): |
192 | | - from transformers import pipeline |
193 | | - |
194 | | - if self._pipeline is None: |
195 | | - self._pipeline = pipeline( |
196 | | - "text-generation", |
197 | | - model=self._model_id, |
198 | | - device="cuda:0", |
199 | | - ) |
200 | | - |
201 | | - return HuggingFaceModelAdapter( |
202 | | - model=self._pipeline, |
203 | | - model_id=self._model_id, |
204 | | - ) |
205 | | - |
206 | | - |
207 | | -# Load tasks with anchor points filtering |
208 | | -tasks = load_tasks( |
209 | | - data_path="/path/to/mmlu_prompts_examples.json", |
210 | | - anchor_points_path="/path/to/anchor_points.pkl", |
211 | | -) |
212 | | - |
213 | | -# Run benchmark |
214 | | -benchmark = MyMMLUBenchmark( |
215 | | - model_id="meta-llama/Llama-2-7b-hf", |
216 | | - use_full_prompt=True, |
217 | | -) |
218 | | -results = benchmark.run(tasks=tasks, agent_data={"model_id": "llama-7b"}) |
219 | | - |
220 | | -# Compute metrics |
221 | | -metrics = compute_benchmark_metrics(results) |
222 | | -print(f"Accuracy: {metrics['acc']:.4f}") |
223 | 36 | ``` |
224 | | - |
225 | | -## Output Format |
226 | | - |
227 | | -### Results (JSONL) |
228 | | - |
229 | | -Each task result is saved in JSONL format: |
230 | | - |
231 | | -```json |
232 | | -{ |
233 | | - "task_id": "mmlu_42", |
234 | | - "status": "success", |
235 | | - "eval": [ |
236 | | - { |
237 | | - "acc": 1.0, |
238 | | - "acc_norm": 1.0, |
239 | | - "predicted": 1, |
240 | | - "gold": 1, |
241 | | - "correct": true, |
242 | | - "doc_id": 42 |
243 | | - } |
244 | | - ] |
245 | | -} |
| 37 | +---------------------------------------- |
| 38 | +DISCO Predicted Full Benchmark Accuracy: |
| 39 | +---------------------------------------- |
| 40 | + Model 0: 0.606739 |
246 | 41 | ``` |
247 | | - |
248 | | -### Predictions (Pickle) |
249 | | - |
250 | | -When `--predictions_path` is specified, saves a numpy array of shape `(1, n_questions, n_choices)` for use with DISCO predictor. |
251 | | - |
252 | | -## Architecture |
253 | | - |
254 | | -The MMLU benchmark implementation follows MASEval patterns: |
255 | | - |
256 | | -- `MMLUBenchmark`: Main benchmark class (abstract, requires `get_model_adapter`) |
257 | | -- `MMLUEnvironment`: Simple environment holding task context (no tools needed) |
258 | | -- `MMLUEvaluator`: Evaluates model predictions against gold answers |
259 | | -- `AnchorPointsTaskQueue`: AdaptiveTaskQueue that iterates through anchor tasks |
260 | | -- `MMLUModelAgent`: Simple agent wrapper that forwards prompts to model |
261 | | - |
262 | | -## References |
263 | | - |
264 | | -- [MMLU Dataset](https://huggingface.co/datasets/cais/mmlu) |
265 | | -- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
266 | | -- [disco-public](https://github.com/parameterlab/disco-public) |
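
Review note: the removed "Output Format" section was the only place that documented the per-task JSONL records. As a rough illustration of how those records could be aggregated, the sketch below computes mean accuracy from lines in that shape. The helper name `accuracy_from_jsonl`, the sample records, and the `"error"` status value are hypothetical; only the field names `task_id`, `status`, `eval`, `acc`, `predicted`, and `gold` come from the README text in this diff.

```python
import json
from typing import Iterable


def accuracy_from_jsonl(lines: Iterable[str]) -> float:
    """Mean `acc` over all successful task records (hypothetical helper)."""
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("status") != "success":
            continue  # ignore tasks that did not complete
        # Each record carries a list of evaluation dicts under "eval".
        scores.extend(entry["acc"] for entry in record.get("eval", []))
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample records in the documented shape; "error" is an assumed status.
sample = [
    '{"task_id": "mmlu_1", "status": "success", "eval": [{"acc": 1.0, "predicted": 1, "gold": 1}]}',
    '{"task_id": "mmlu_2", "status": "success", "eval": [{"acc": 0.0, "predicted": 0, "gold": 2}]}',
    '{"task_id": "mmlu_3", "status": "error", "eval": []}',
]
print(f"Accuracy: {accuracy_from_jsonl(sample):.4f}")
```

In practice the lines would come from an open results file rather than an in-memory list; the file name and location depend on `--output_dir` and are not specified in this diff.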