| 1 | +# MMLU Benchmark Example |
| 2 | + |
| 3 | +Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959). |
| 4 | + |
| 5 | +## Run without DISCO (full evaluation) |
| 6 | + |
| 7 | +From the project root: |
| 8 | + |
| 9 | +```bash |
| 10 | +uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full |
| 11 | +``` |
| 12 | + |
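| 13 | +Full evaluation results look like this (the summary labels both accuracies "on anchor points" even though this run evaluates all 14042 tasks): |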
| 14 | + |
| 15 | +``` |
| 16 | +================================================================================ |
| 17 | +Results Summary (Evaluated Tasks) |
| 18 | +================================================================================ |
| 19 | +Total tasks: 14042 |
| 20 | +Correct: 8291 |
| 21 | +Accuracy (on anchor points): 0.5904 |
| 22 | +Accuracy norm (on anchor points): 0.5904 |
| 23 | +``` |
| 24 | + |
| 25 | +## Run with DISCO (predicted full-benchmark score) |
| 26 | + |
| 27 | +From the project root: |
| 28 | + |
| 29 | +```bash |
| 30 | +uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full --disco_model_path arubique/DISCO-MMLU |
| 31 | +``` |
| 32 | + |
| 33 | +Predicted score output: |
| 34 | + |
| 35 | +``` |
| 36 | +---------------------------------------- |
| 37 | +DISCO Predicted Full Benchmark Accuracy: |
| 38 | +---------------------------------------- |
| 39 | + Model 0: 0.606739 |
| 40 | +``` |
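As a quick sanity check on the two sample outputs above, the full-evaluation accuracy can be recomputed from the raw counts and compared with the DISCO prediction. The sketch below uses only the numbers printed in those outputs; the variable names are illustrative and are not part of the example script.

```python
# Numbers copied from the sample outputs above (not produced by this snippet).
total_tasks = 14042
correct = 8291
disco_predicted = 0.606739

# Full-evaluation accuracy is the fraction of tasks answered correctly.
full_accuracy = correct / total_tasks

# Absolute gap between the DISCO estimate and the measured accuracy.
abs_gap = abs(disco_predicted - full_accuracy)

print(f"Full accuracy:   {full_accuracy:.4f}")
print(f"DISCO predicted: {disco_predicted:.4f}")
print(f"Absolute gap:    {abs_gap:.4f}")
```

For this model the DISCO estimate sits about 1.6 percentage points above the measured accuracy.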
| 41 | + |
| 42 | +## Arguments |
| 43 | + |
| 44 | +| Argument | Description | Default | |
| 45 | +|----------|-------------|---------| |
| 46 | +| `--model_id` | Hugging Face model identifier (e.g. `meta-llama/Llama-2-7b-hf`) | *(required)* |
| 47 | +| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` | |
| 48 | +| `--anchor_points_path` | Path to anchor points pickle file; if set, only anchor tasks are evaluated | — | |
| 49 | +| `--output_dir` | Directory to save results | `./results` | |
| 50 | +| `--predictions_path` | Path to save predictions tensor as pickle (for DISCO) | — | |
| 51 | +| `--limit` | Limit number of tasks to evaluate (for testing) | — | |
| 52 | +| `--batch_size` | Batch size for evaluation (reserved for future use) | `1` | |
| 53 | +| `--device` | Device to run model on (e.g. `cuda:0`, `cpu`) | `cuda:0` | |
| 54 | +| `--num_workers` | Number of parallel workers for task execution | `1` | |
| 55 | +| `--disco_model_path` | If set, run DISCO prediction; path to `.pkl`, `.npz`, or Hugging Face repo id | — | |
| 56 | +| `--disco_transform_path` | Path to DISCO PCA transform `.pkl` or `.npz` (for local DISCO model when using `--pca`) | — | |
| 57 | +| `--pca` | PCA dimension for DISCO embeddings | — | |
| 58 | +| `--pad_to_size` | Pad predictions to this size with -inf | — | |
| 59 | +| `--use_lmeval_batching` | Use [lm-evaluation-harness-style](https://github.com/EleutherAI/lm-evaluation-harness) batching for exact numerical match with DISCO repo | off | |
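When scripting several runs, it can help to assemble the command line from the flags in the table above rather than typing it by hand. The following is a minimal sketch, not part of the example itself: `build_command` is a hypothetical helper, the option values (`limit=100`, `device="cpu"`) are illustrative, and only value-taking flags from the table are used. It prints the commands instead of executing them.

```python
import shlex

# Script path as used in the commands above.
SCRIPT = "examples/mmlu_benchmark/mmlu_benchmark.py"


def build_command(model_id, **options):
    """Return the argv list for one run of the example script.

    `options` maps documented flag names (without the leading dashes)
    to their values; entries set to None are skipped.
    """
    cmd = ["uv", "run", "python", SCRIPT, "--model_id", model_id]
    for name, value in options.items():
        if value is None:
            continue
        cmd.extend([f"--{name}", str(value)])
    return cmd


model = "alignment-handbook/zephyr-7b-sft-full"

# A small smoke test on CPU before committing to a full run.
smoke = build_command(model, limit=100, device="cpu", output_dir="./results")

# The DISCO run shown earlier, rebuilt from its flags.
disco = build_command(model, disco_model_path="arubique/DISCO-MMLU")

for cmd in (smoke, disco):
    print(shlex.join(cmd))
```

To execute a command, the printed line can be pasted into a shell from the project root, or the list can be passed to `subprocess.run(cmd, check=True)`.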