@@ -16,11 +16,10 @@ Full evaluation results look like:
 ================================================================================
 Results Summary (Evaluated Tasks)
 ================================================================================
-Total tasks: 100
-Correct: 35
-Accuracy (on anchor points): 0.3500
-Accuracy norm (on anchor points): 0.3500
-Built predictions tensor with shape: (1, 100, 31)
+Total tasks: 14042
+Correct: 8291
+Accuracy (on anchor points): 0.5904
+Accuracy norm (on anchor points): 0.5904
 ```
 
 ## Run with DISCO (predicted full-benchmark score)
@@ -39,3 +38,22 @@ DISCO Predicted Full Benchmark Accuracy:
 ----------------------------------------
 Model 0: 0.606739
 ```
+
+## Arguments
+
+| Argument | Description | Default |
+|----------|-------------|---------|
+| `--model_id` | HuggingFace model identifier (e.g. `meta-llama/Llama-2-7b-hf`) | *(required)* |
+| `--data_path` | Path to MMLU prompts JSON file or Hugging Face dataset repo id | `arubique/flattened-MMLU` |
+| `--anchor_points_path` | Path to anchor points pickle file; if set, only anchor tasks are evaluated | — |
+| `--output_dir` | Directory to save results | `./results` |
+| `--predictions_path` | Path to save predictions tensor as pickle (for DISCO) | — |
+| `--limit` | Limit number of tasks to evaluate (for testing) | — |
+| `--batch_size` | Batch size for evaluation (reserved for future use) | `1` |
+| `--device` | Device to run model on (e.g. `cuda:0`, `cpu`) | `cuda:0` |
+| `--num_workers` | Number of parallel workers for task execution | `1` |
+| `--disco_model_path` | If set, run DISCO prediction; path to `.pkl`, `.npz`, or Hugging Face repo id | — |
+| `--disco_transform_path` | Path to DISCO PCA transform `.pkl` or `.npz` (for local DISCO model when using `--pca`) | — |
+| `--pca` | PCA dimension for DISCO embeddings | — |
+| `--pad_to_size` | Pad predictions to this size with -inf | — |
+| `--use_lmeval_batching` | Use lm-evaluation-harness-style batching for exact numerical match with DISCO repo | off |
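
For reviewers, the argument table added in this hunk can be read as the parser sketched below. This is a minimal illustration and not the script's actual code: the function name `build_parser` is hypothetical, and the argument types (`int` for `--limit`, `--batch_size`, `--num_workers`, `--pca`, `--pad_to_size`) and the use of `None` for the defaults shown as a dash are assumptions inferred from the descriptions. Only the flag names and the listed defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Arguments table (assumed types)."""
    p = argparse.ArgumentParser(description="Illustrative sketch of the documented CLI")
    # Required model identifier; all other flags are optional.
    p.add_argument("--model_id", required=True)
    p.add_argument("--data_path", default="arubique/flattened-MMLU")
    p.add_argument("--anchor_points_path", default=None)
    p.add_argument("--output_dir", default="./results")
    p.add_argument("--predictions_path", default=None)
    # Integer types below are assumptions; the table does not state them.
    p.add_argument("--limit", type=int, default=None)
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument("--device", default="cuda:0")
    p.add_argument("--num_workers", type=int, default=1)
    p.add_argument("--disco_model_path", default=None)
    p.add_argument("--disco_transform_path", default=None)
    p.add_argument("--pca", type=int, default=None)
    p.add_argument("--pad_to_size", type=int, default=None)
    # Documented default is "off", modeled here as a boolean switch.
    p.add_argument("--use_lmeval_batching", action="store_true")
    return p


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--model_id", "meta-llama/Llama-2-7b-hf", "--limit", "10"]
)
print(args.model_id, args.data_path, args.limit, args.use_lmeval_batching)
```

Parsing only `--model_id` and `--limit` leaves every other option at its documented default, which makes it easy to check the table against the real script's `--help` output.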