Commit 22e4b96

[Create PR]:
- Shorten example's readme
1 parent 43cabb7 commit 22e4b96

1 file changed

Lines changed: 21 additions & 246 deletions

examples/mmlu_benchmark/README.md
@@ -1,266 +1,41 @@
 # MMLU Benchmark Example
 
-This example demonstrates how to evaluate HuggingFace language models on MMLU (Massive Multitask Language Understanding) using MASEval, with optional anchor point-based task selection for DISCO prediction.
+Evaluate language models on MMLU (Massive Multitask Language Understanding) with optional efficient evaluation via DISCO.
 
-## Overview
+## Run without DISCO (full evaluation)
 
-The MMLU benchmark evaluates language models on multiple choice questions across 57 subjects including STEM, humanities, social sciences, and more. This implementation is compatible with the [disco-public](https://github.com/parameterlab/disco-public) evaluation methodology.
-
-### Key Features
-
-- **Anchor Point-Based Evaluation**: Evaluate only on selected anchor tasks for efficient DISCO-based performance prediction
-- **Full Prompt Support**: Use few-shot examples from `full_prompt` field (like lm-evaluation-harness)
-- **HuggingFace Integration**: Works with any HuggingFace transformers model
-- **DISCO-Compatible Output**: Saves predictions in format compatible with DISCO predictor
-
-## Installation
-
-```bash
-# Install MASEval with all extras (includes transformers)
-pip install "maseval[all]"
-
-# Or install with specific extras
-pip install "maseval[transformers]"
-```
-
-## Data Format
-
-The benchmark expects a JSON file in the `mmlu_prompts_examples.json` format:
-
-```json
-[
-  {
-    "query": "Question text with answer choices...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:",
-    "full_prompt": "The following are multiple choice questions... [few-shot examples] ... [question]",
-    "choices": ["A", "B", "C", "D"],
-    "gold": 1
-  },
-  ...
-]
-```
-
-## Usage
-
-### Basic Evaluation
-
-Evaluate a model on all MMLU tasks (uses `arubique/flattened-MMLU` by default):
-
-```bash
-python mmlu_benchmark.py --model_id "meta-llama/Llama-2-7b-hf"
-```
-
-To use a local JSON file or another Hugging Face dataset:
-
-```bash
-python mmlu_benchmark.py \
-    --model_id "meta-llama/Llama-2-7b-hf" \
-    --data_path /path/to/mmlu_prompts_examples.json
-```
-
-### Anchor Points Evaluation (for DISCO)
-
-Evaluate only on anchor tasks for DISCO prediction:
-
-```bash
-python mmlu_benchmark.py \
-    --model_id "alignment-handbook/zephyr-7b-sft-full" \
-    --data_path /path/to/mmlu_prompts_examples.json \
-    --anchor_points_path /path/to/anchor_points_disagreement.pkl \
-    --predictions_path ./output/predictions.pkl
-```
-
-### DISCO Prediction (Predict Full Benchmark from Anchor Points)
-
-Evaluate on anchor points and predict full benchmark performance using DISCO:
-
-```bash
-python mmlu_benchmark.py \
-    --model_id "alignment-handbook/zephyr-7b-sft-full" \
-    --data_path /path/to/mmlu_prompts_examples.json \
-    --anchor_points_path /path/to/anchor_points_disagreement.pkl \
-    --disco_model_path /path/to/fitted_weights.pkl \
-    --disco_transform_path /path/to/transform.pkl \
-    --pca 256
-```
-
-This is equivalent to running:
-1. The evaluation on anchor points (like `scripts/run_lm_eval.py --skip_non_anchor_points`)
-2. The prediction step (like `scripts/predict_model_performance.py`)
-
-#### Avoiding pickle / sklearn version warnings
-
-If you see `InconsistentVersionWarning` when loading the DISCO pickles (e.g. fitted_weights.pkl was saved with sklearn 1.7.2 but you use 1.8.0), you can export the model to NumPy-only `.npz` files and use those instead:
-
-```bash
-# One-time extraction (run with the env that created the pickles, or ignore the warning)
-python extract_disco_weights.py \
-    --model_path /path/to/fitted_weights.pkl \
-    --transform_path /path/to/transform.pkl \
-    --output_dir /path/to/disco_npz
-```
-
-Then point the benchmark at the extracted files:
+From the project root:
 
 ```bash
-python mmlu_benchmark.py ... \
-    --disco_model_path /path/to/disco_npz/disco_model.npz \
-    --disco_transform_path /path/to/disco_npz/disco_transform.npz \
-    --pca 256
+uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full
 ```
 
-No pickle or sklearn objects are loaded at runtime when using `.npz` paths.
-
-#### Loading DISCO from Hugging Face
-
-You can upload the extracted weights as a custom Hugging Face model and load it with `AutoModel.from_pretrained(..., trust_remote_code=True)`. See [huggingface_disco/README.md](huggingface_disco/README.md) for:
+Full evaluation results look like:
 
-1. Extracting weights (`extract_disco_weights.py`)
-2. Building the HF repo (`huggingface_disco/build_repo.py`)
-3. Uploading to the Hub
-
-Then run the benchmark with a repo id instead of local paths:
-
-```bash
-python mmlu_benchmark.py ... \
-    --disco_model_path <USERNAME>/my-disco-mmlu
 ```
-
-(No `--disco_transform_path` needed; the Hub model contains everything.)
-
-### Quick Test
-
-Run on a small subset for testing:
-
-```bash
-python mmlu_benchmark.py \
-    --model_id "meta-llama/Llama-2-7b-hf" \
-    --data_path /path/to/mmlu_prompts_examples.json \
-    --limit 10
+================================================================================
+Results Summary (Evaluated Tasks)
+================================================================================
+Total tasks: 100
+Correct: 35
+Accuracy (on anchor points): 0.3500
+Accuracy norm (on anchor points): 0.3500
+Built predictions tensor with shape: (1, 100, 31)
 ```
 
-## Command Line Arguments
-
-| Argument | Description | Default |
-|----------|-------------|---------|
-| `--model_id` | HuggingFace model identifier (required) | - |
-| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` |
-| `--anchor_points_path` | Path to anchor points pickle file | None |
-| `--output_dir` | Directory to save results | `./results` |
-| `--predictions_path` | Path to save predictions pickle (for DISCO) | None |
-| `--limit` | Limit number of tasks to evaluate | None |
-| `--device` | Device to run model on | `cuda:0` |
-| `--num_workers` | Number of parallel workers | 1 |
-| `--disco_model_path` | If set, run DISCO prediction (.pkl, .npz, or HF repo id) | None |
-| `--disco_transform_path` | Path to DISCO PCA transform pickle | None |
-| `--pca` | PCA dimension for DISCO embeddings | 256 |
+## Run with DISCO (predicted full-benchmark score)
 
-## Equivalent disco-public Command
-
-This MASEval benchmark provides equivalent functionality to:
+From the project root:
 
 ```bash
-python scripts/run_lm_eval.py \
-    --anchor_points_path=/path/to/anchor_points_disagreement.pkl \
-    --batch_size=8 --device=cuda:0 \
-    --gen_kwargs=max_gen_toks=128,output_scores=True,return_dict_in_generate=True \
-    --metric=acc_norm --model=hf \
-    --model_args=pretrained=alignment-handbook/zephyr-7b-sft-full,trust_remote_code=True \
-    --num_fewshot=0 \
-    --output_path=/path/to/output \
-    --tasks=mmlu_prompts --log_samples --force_recompute \
-    --use_full_prompt --skip_non_anchor_points
+uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full --disco_model_path arubique/DISCO-MMLU
 ```
 
-## Programmatic Usage
-
-```python
-from maseval.benchmark.mmlu import (
-    MMLUBenchmark,
-    load_tasks,
-    compute_benchmark_metrics,
-)
-from maseval.interface.inference import HuggingFaceModelAdapter
-
-
-class MyMMLUBenchmark(MMLUBenchmark):
-    def __init__(self, model_id: str, **kwargs):
-        super().__init__(**kwargs)
-        self._model_id = model_id
-        self._pipeline = None
+Predicted score output:
 
-    def get_model_adapter(self, model_id: str, **kwargs):
-        from transformers import pipeline
-
-        if self._pipeline is None:
-            self._pipeline = pipeline(
-                "text-generation",
-                model=self._model_id,
-                device="cuda:0",
-            )
-
-        return HuggingFaceModelAdapter(
-            model=self._pipeline,
-            model_id=self._model_id,
-        )
-
-
-# Load tasks with anchor points filtering
-tasks = load_tasks(
-    data_path="/path/to/mmlu_prompts_examples.json",
-    anchor_points_path="/path/to/anchor_points.pkl",
-)
-
-# Run benchmark
-benchmark = MyMMLUBenchmark(
-    model_id="meta-llama/Llama-2-7b-hf",
-    use_full_prompt=True,
-)
-results = benchmark.run(tasks=tasks, agent_data={"model_id": "llama-7b"})
-
-# Compute metrics
-metrics = compute_benchmark_metrics(results)
-print(f"Accuracy: {metrics['acc']:.4f}")
 ```
-
-## Output Format
-
-### Results (JSONL)
-
-Each task result is saved in JSONL format:
-
-```json
-{
-  "task_id": "mmlu_42",
-  "status": "success",
-  "eval": [
-    {
-      "acc": 1.0,
-      "acc_norm": 1.0,
-      "predicted": 1,
-      "gold": 1,
-      "correct": true,
-      "doc_id": 42
-    }
-  ]
-}
+----------------------------------------
+DISCO Predicted Full Benchmark Accuracy:
+----------------------------------------
+Model 0: 0.606739
 ```
-
-### Predictions (Pickle)
-
-When `--predictions_path` is specified, saves a numpy array of shape `(1, n_questions, n_choices)` for use with DISCO predictor.
-
-## Architecture
-
-The MMLU benchmark implementation follows MASEval patterns:
-
-- `MMLUBenchmark`: Main benchmark class (abstract, requires `get_model_adapter`)
-- `MMLUEnvironment`: Simple environment holding task context (no tools needed)
-- `MMLUEvaluator`: Evaluates model predictions against gold answers
-- `AnchorPointsTaskQueue`: AdaptiveTaskQueue that iterates through anchor tasks
-- `MMLUModelAgent`: Simple agent wrapper that forwards prompts to model
-
-## References
-
-- [MMLU Dataset](https://huggingface.co/datasets/cais/mmlu)
-- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
-- [disco-public](https://github.com/parameterlab/disco-public)
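
Reviewer note: this commit removes the "Results (JSONL)" documentation, so the sketch below shows how records of that shape could be aggregated into the "Total tasks / Correct / Accuracy" numbers printed in the new README. It is a minimal illustration, not MASEval code: `summarize_results` is a hypothetical helper, and it assumes each JSONL line carries `status` and an `eval` list whose entries have a boolean `correct` field, as in the removed example.

```python
import json


def summarize_results(jsonl_lines):
    """Aggregate per-task JSONL records into total, correct, and accuracy.

    Hypothetical helper; the field names ("status", "eval", "correct")
    are taken from the JSONL example removed by this commit.
    """
    total = 0
    correct = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumption: only records marked "success" contain usable evals.
        if record.get("status") != "success":
            continue
        for entry in record.get("eval", []):
            total += 1
            correct += int(bool(entry.get("correct")))
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct, "acc": accuracy}


if __name__ == "__main__":
    sample = [
        '{"task_id": "mmlu_42", "status": "success", '
        '"eval": [{"acc": 1.0, "predicted": 1, "gold": 1, "correct": true}]}',
        '{"task_id": "mmlu_43", "status": "success", '
        '"eval": [{"acc": 0.0, "predicted": 2, "gold": 0, "correct": false}]}',
    ]
    summary = summarize_results(sample)
    print(f"Total tasks: {summary['total']}")
    print(f"Correct: {summary['correct']}")
    print(f"Accuracy: {summary['acc']:.4f}")
```

Skipping records that are not marked `success` is a design choice of this sketch; whether failed tasks should count as incorrect depends on how the benchmark itself reports them.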
