Commit d166e87

Add MMLU benchmark with DISCO support (#34)

* Adding MMLU Dataset
* Adding AnchorPointsTaskQueue

1 parent c21750d commit d166e87

13 files changed

Lines changed: 2246 additions & 1 deletion

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@
 .idea/
 .DS_Store
 .devcontainer/
+results/
+DISCO-MMLU/
+flattened-MMLU/
+tmp/
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Benchmarks**
 
+- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
+
 - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
 
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
@@ -32,6 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Examples**
 
+- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
 - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
@@ -63,6 +66,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
 - CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
 - Added `respx` dev dependency for HTTP-level mocking (PR: #29)
+- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34)
 
 ### Changed
 
examples/mmlu_benchmark/.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
# Results
results/
*.jsonl

# Predictions
predictions/
*.pkl

examples/mmlu_benchmark/README.md

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
# MMLU Benchmark Example
2+
3+
Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959).
4+
5+
## Run without DISCO (full evaluation)
6+
7+
From the project root:
8+
9+
```bash
10+
uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full
11+
```
12+
13+
Full evaluation results look like:
14+
15+
```
16+
================================================================================
17+
Results Summary (Evaluated Tasks)
18+
================================================================================
19+
Total tasks: 14042
20+
Correct: 8291
21+
Accuracy (on anchor points): 0.5904
22+
Accuracy norm (on anchor points): 0.5904
23+
```
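
The reported accuracy follows directly from the two counts in the summary. A minimal sketch of that arithmetic; the helper name `accuracy` is illustrative and is not taken from the example script:

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of evaluated tasks answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Counts copied from the sample results summary.
print(f"{accuracy(8291, 14042):.4f}")  # prints 0.5904
```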

## Run with DISCO (predicted full-benchmark score)

From the project root:

```bash
uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-handbook/zephyr-7b-sft-full --disco_model_path arubique/DISCO-MMLU
```

Predicted score output:

```
----------------------------------------
DISCO Predicted Full Benchmark Accuracy:
----------------------------------------
Model 0: 0.606739
```
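
To get a rough sense of how close the DISCO prediction is to the measured score, the two sample outputs in this README can be compared. This is only a sketch: it assumes both outputs come from the same model (`alignment-handbook/zephyr-7b-sft-full`) and comparable settings, and the variable names are illustrative.

```python
# Values copied from the two sample outputs in this README.
measured = 8291 / 14042   # full evaluation: correct / total tasks
predicted = 0.606739      # DISCO predicted full-benchmark accuracy

# Absolute gap between the predicted and the measured accuracy.
abs_error = abs(predicted - measured)
print(f"measured={measured:.4f} predicted={predicted:.4f} abs_error={abs_error:.4f}")
```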

## Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--model_id` | HuggingFace model identifier (e.g. `meta-llama/Llama-2-7b-hf`) | *(required)* |
| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` |
| `--anchor_points_path` | Path to anchor points pickle file; if set, only anchor tasks are evaluated | |
| `--output_dir` | Directory to save results | `./results` |
| `--predictions_path` | Path to save predictions tensor as pickle (for DISCO) | |
| `--limit` | Limit number of tasks to evaluate (for testing) | |
| `--batch_size` | Batch size for evaluation (reserved for future use) | `1` |
| `--device` | Device to run model on (e.g. `cuda:0`, `cpu`) | `cuda:0` |
| `--num_workers` | Number of parallel workers for task execution | `1` |
| `--disco_model_path` | If set, run DISCO prediction; path to `.pkl`, `.npz`, or Hugging Face repo id | |
| `--disco_transform_path` | Path to DISCO PCA transform `.pkl` or `.npz` (for local DISCO model when using `--pca`) | |
| `--pca` | PCA dimension for DISCO embeddings | |
| `--pad_to_size` | Pad predictions to this size with -inf | |
| `--use_lmeval_batching` | Use [lm-evaluation-harness-style](https://github.com/EleutherAI/lm-evaluation-harness) batching for exact numerical match with DISCO repo | off |
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
"""MMLU Benchmark Example.

This example demonstrates how to evaluate HuggingFace models on MMLU
using anchor point-based task selection for DISCO prediction.
"""
