[Create PR]:

arubique · arubique · commit d945cb2fd525 · 2026-02-16T21:47:46.000+01:00
- Add args description to readme
diff --git a/examples/mmlu_benchmark/README.md b/examples/mmlu_benchmark/README.md
@@ -16,11 +16,10 @@ Full evaluation results look like:
 ================================================================================
 Results Summary (Evaluated Tasks)
 ================================================================================
-Total tasks: 100
-Correct: 35
-Accuracy (on anchor points): 0.3500
-Accuracy norm (on anchor points): 0.3500
-Built predictions tensor with shape: (1, 100, 31)
+Total tasks: 14042
+Correct: 8291
+Accuracy (on anchor points): 0.5904
+Accuracy norm (on anchor points): 0.5904
 ```
 
 ## Run with DISCO (predicted full-benchmark score)
@@ -39,3 +38,22 @@ DISCO Predicted Full Benchmark Accuracy:
 ----------------------------------------
   Model 0: 0.606739
 ```
+
+## Arguments
+
+| Argument | Description | Default |
+|----------|-------------|---------|
+| `--model_id` | Hugging Face model identifier (e.g. `meta-llama/Llama-2-7b-hf`) | *(required)* |
+| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` |
+| `--anchor_points_path` | Path to anchor points pickle file; if set, only anchor tasks are evaluated | — |
+| `--output_dir` | Directory to save results | `./results` |
+| `--predictions_path` | Path to save predictions tensor as pickle (for DISCO) | — |
+| `--limit` | Limit number of tasks to evaluate (for testing) | — |
+| `--batch_size` | Batch size for evaluation (reserved for future use) | `1` |
+| `--device` | Device to run model on (e.g. `cuda:0`, `cpu`) | `cuda:0` |
+| `--num_workers` | Number of parallel workers for task execution | `1` |
+| `--disco_model_path` | If set, run DISCO prediction; path to `.pkl`, `.npz`, or Hugging Face repo id | — |
+| `--disco_transform_path` | Path to DISCO PCA transform `.pkl` or `.npz` (for local DISCO model when using `--pca`) | — |
+| `--pca` | PCA dimension for DISCO embeddings | — |
+| `--pad_to_size` | Pad predictions to this size with `-inf` | — |
+| `--use_lmeval_batching` | Use lm-evaluation-harness-style batching for exact numerical match with DISCO repo | off |
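
Review note: the flags and defaults in the new Arguments table can be sanity-checked against a parser like the one below. This is only a minimal sketch, assuming the script reads its flags with `argparse`. `build_parser` is a hypothetical helper, and the `int` types for `--limit`, `--batch_size`, `--num_workers`, `--pca`, and `--pad_to_size` are guesses; only the flag names and default values are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the README table (not the script's actual code)."""
    p = argparse.ArgumentParser(description="MMLU benchmark arguments (illustrative)")
    p.add_argument("--model_id", required=True)
    p.add_argument("--data_path", default="arubique/flattened-MMLU")
    p.add_argument("--anchor_points_path", default=None)
    p.add_argument("--output_dir", default="./results")
    p.add_argument("--predictions_path", default=None)
    # Assumed integer-valued options; the table does not state their types.
    p.add_argument("--limit", type=int, default=None)
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument("--device", default="cuda:0")
    p.add_argument("--num_workers", type=int, default=1)
    p.add_argument("--disco_model_path", default=None)
    p.add_argument("--disco_transform_path", default=None)
    p.add_argument("--pca", type=int, default=None)
    p.add_argument("--pad_to_size", type=int, default=None)
    # Boolean switch: off unless the flag is passed.
    p.add_argument("--use_lmeval_batching", action="store_true")
    return p


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--model_id", "meta-llama/Llama-2-7b-hf", "--limit", "10"]
)
print(args.model_id, args.data_path, args.device, args.limit)
```

Flags marked with a dash in the table are modeled here as `None` when omitted, which is one common convention and may differ from what the script does.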