maseval
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
Lines changed: 15 additions & 1 deletion
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 4 additions & 1 deletion
diff --git a/docs/benchmark/mmlu.md b/docs/benchmark/mmlu.md
Lines changed: 127 additions & 0 deletions
diff --git a/maseval/__init__.py b/maseval/__init__.py
Lines changed: 2 additions & 0 deletions
diff --git a/maseval/benchmark/mmlu/__init__.py b/maseval/benchmark/mmlu/__init__.py
Lines changed: 7 additions & 10 deletions
@@ -79,7 +79,21 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses
 
 ---
 
-## 6. [Name of Next Benchmark]
+## 6. MMLU (Massive Multitask Language Understanding) (Beta)
+
+MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.
+
+> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
+### Source and License
+
+- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021)
+- **DISCO Paper:** [DISCO: DISCOvering key features for accurate prediction of LLM abilities on benchmarks](https://arxiv.org/abs/2407.12890) (Rubinstein et al., 2025)
+- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU)
+
+---
+
+## 7. [Name of Next Benchmark]
 
 (Description for the next benchmark...)
 
 
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Benchmarks**
 
-- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
+- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
 
 - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
 
@@ -35,11 +35,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Examples**
 
 - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
+- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34)
 - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
 **Core**
 
+- Added `AnchorPointsTaskQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import AnchorPointsTaskQueue`. (PR: #34)
 - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
 - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
 - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
@@ -86,6 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Benchmarks**
 
+- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `AnchorPointsTaskQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
 - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
   - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
   - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
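
The MMLU changelog entry in this hunk mentions removing silent `.get()` fallbacks for required fields so that missing data raises immediately. A minimal sketch of that fail-fast pattern follows, assuming the fields live in a plain dict. The helper `require` is hypothetical and is not part of MASEval.

```python
from typing import Any, Dict, Mapping


def require(record: Mapping[str, Any], *keys: str) -> Dict[str, Any]:
    """Return the requested fields, raising KeyError if any of them is absent.

    Hypothetical helper: with record.get(key, default), a missing field would
    silently fall back to the default and could surface as a wrong score later.
    """
    missing = [key for key in keys if key not in record]
    if missing:
        raise KeyError(f"missing required field(s): {', '.join(missing)}")
    return {key: record[key] for key in keys}
```

Calling `require(task, "gold", "query")` on a record without `gold` raises a `KeyError` that names the missing field, rather than continuing with a default value.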
 
@@ -0,0 +1,127 @@
+# MMLU: Massive Multitask Language Understanding (Beta)
+
+!!! warning "Beta"
+    This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
+The **MMLU Benchmark** evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration supports anchor-point-based evaluation for [DISCO](https://arxiv.org/abs/2407.12890) prediction, enabling efficient estimation of full benchmark performance from a subset of tasks.
+
+## Overview
+
+[MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:
+
+- **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology
+- **Anchor-point task selection** via `AnchorPointsTaskQueue` for DISCO-style subset evaluation
+- **HuggingFace integration** with batched log-probability computation
+- **lm-eval compatibility** mode for exact numerical reproduction
+
+Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
+
+## Installation
+
+MMLU has an optional dependency extra (currently empty, as core MMLU requires no additional packages):
+
+```bash
+pip install maseval[mmlu]
+```
+
+For the HuggingFace implementation, also install transformers:
+
+```bash
+pip install maseval[mmlu,transformers]
+```
+
+For DISCO prediction support:
+
+```bash
+pip install maseval[disco]
+```
+
+For exact lm-evaluation-harness reproduction:
+
+```bash
+pip install maseval[lm-eval]
+```
+
+## Quick Start
+
+```python
+from maseval.benchmark.mmlu import (
+    HuggingFaceMMLUBenchmark,
+    load_tasks,
+    compute_benchmark_metrics,
+)
+
+# Load tasks from a local JSON file of MMLU prompts (the path is a placeholder)
+tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json")
+
+# Create benchmark with HuggingFace model
+benchmark = HuggingFaceMMLUBenchmark(
+    model_id="meta-llama/Llama-2-7b-hf",
+    device="cuda:0",
+)
+
+# Run evaluation
+results = benchmark.run(
+    tasks=tasks,
+    agent_data={"model_id": "meta-llama/Llama-2-7b-hf"},
+)
+
+# Compute metrics
+metrics = compute_benchmark_metrics(results)
+print(f"Accuracy: {metrics['acc']:.4f}")
+```
+
+### With Anchor Points (DISCO)
+
+```python
+from maseval.benchmark.mmlu import load_tasks
+
+# Load tasks filtered to anchor points
+tasks = load_tasks(
+    data_path="/path/to/mmlu_prompts_examples.json",
+    anchor_points_path="/path/to/anchor_points.json",
+)
+
+# tasks is an AnchorPointsTaskQueue — only anchor tasks are evaluated
+print(f"Evaluating {len(tasks)} anchor tasks")
+```
+
+## Custom Benchmark Subclass
+
+`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`:
+
+```python
+from maseval.benchmark.mmlu import MMLUBenchmark, MMLUModelAgent, MMLUAgentAdapter
+
+class MyMMLUBenchmark(MMLUBenchmark):
+    def setup_agents(self, agent_data, environment, task, user, seed_generator):
+        model = self.get_model_adapter(agent_data["model_id"])
+        agent = MMLUModelAgent(model, name="mmlu_agent")
+        adapter = MMLUAgentAdapter(agent, "mmlu_agent")
+        return [adapter], {"mmlu_agent": adapter}
+
+    def get_model_adapter(self, model_id, **kwargs):
+        # MyModelAdapter is a placeholder for your own model adapter class (not defined here)
+        adapter = MyModelAdapter(model_id)
+        register_name = kwargs.get("register_name")
+        if register_name:
+            self.register("models", register_name, adapter)
+        return adapter
+```
+
+## API Reference
+
+::: maseval.benchmark.mmlu.MMLUBenchmark
+
+::: maseval.benchmark.mmlu.HuggingFaceMMLUBenchmark
+
+::: maseval.benchmark.mmlu.MMLUEnvironment
+
+::: maseval.benchmark.mmlu.MMLUEvaluator
+
+::: maseval.benchmark.mmlu.MMLUModelAgent
+
+::: maseval.benchmark.mmlu.MMLUAgentAdapter
+
+::: maseval.benchmark.mmlu.load_tasks
+
+::: maseval.benchmark.mmlu.compute_benchmark_metrics
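
The docs added above describe log-likelihood multiple-choice evaluation and an accuracy metric. As a rough illustration of the idea only, the sketch below scores each question by picking the answer choice with the highest log-likelihood and comparing it to the gold index. The function names and input layout are assumptions for illustration; this is not the `MMLUEvaluator` or `compute_benchmark_metrics` implementation.

```python
from typing import Sequence


def pick_choice(logprobs: Sequence[float]) -> int:
    """Index of the answer choice with the highest log-likelihood (first one on ties)."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])


def accuracy(all_logprobs: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions whose highest-scoring choice equals the gold index."""
    if len(all_logprobs) != len(gold):
        raise ValueError("need exactly one gold label per question")
    if not gold:
        raise ValueError("no questions to score")
    correct = sum(pick_choice(lp) == g for lp, g in zip(all_logprobs, gold))
    return correct / len(gold)


# Two questions with four choices each; gold answers are choices 1 and 2.
scores = [[-1.2, -0.3, -2.0, -3.1], [-0.5, -0.9, -1.5, -2.2]]
print(accuracy(scores, [1, 2]))  # 0.5: the first question is right, the second is not
```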
@@ -16,6 +16,7 @@
     BaseTaskQueue,
     TaskQueue,
     SequentialTaskQueue,
+    AnchorPointsTaskQueue,
     PriorityTaskQueue,
     AdaptiveTaskQueue,
 )
@@ -93,6 +94,7 @@
     "BaseTaskQueue",
     "TaskQueue",
     "SequentialTaskQueue",
+    "AnchorPointsTaskQueue",
     "PriorityTaskQueue",
     "AdaptiveTaskQueue",
     # Model adapters
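
This hunk exports `AnchorPointsTaskQueue`, which the changelog describes as supporting subset-based evaluation. The sketch below shows the general idea of restricting a task list to anchor indices. The class `AnchorSubset` is hypothetical and is not the MASEval queue; in particular, sorting and de-duplicating the indices is an assumption made here, since the source does not say how the real queue orders anchor tasks.

```python
from typing import Generic, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


class AnchorSubset(Generic[T]):
    """Iterate, in ascending index order, over the tasks at the given anchor positions."""

    def __init__(self, tasks: Sequence[T], anchor_indices: Iterable[int]) -> None:
        unique = sorted(set(anchor_indices))
        out_of_range = [i for i in unique if not 0 <= i < len(tasks)]
        if out_of_range:
            raise IndexError(f"anchor indices out of range: {out_of_range}")
        # Keep only the anchor tasks; all other tasks are never evaluated.
        self._tasks: List[T] = [tasks[i] for i in unique]

    def __len__(self) -> int:
        return len(self._tasks)

    def __iter__(self) -> Iterator[T]:
        return iter(self._tasks)
```

With four tasks and anchor indices `[3, 1, 1]`, iterating yields the tasks at positions 1 and 3, and `len()` reports 2.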
 
@@ -4,31 +4,30 @@
 
 Usage:
     from maseval.benchmark.mmlu import (
-        MMLUBenchmark,
-        MMLUEnvironment,
-        MMLUEvaluator,
+        HuggingFaceMMLUBenchmark,
         load_tasks,
-        AnchorPointsTaskQueue,
     )
+    from maseval import AnchorPointsTaskQueue
 
     # Load tasks and anchor points
     tasks = load_tasks(
         data_path="path/to/mmlu_prompts_examples.json",
         anchor_points_path="path/to/anchor_points.pkl",  # Optional
     )
 
-    # Create benchmark
-    benchmark = MMLUBenchmark()
-    results = benchmark.run(tasks=tasks, agent_data={"model_id": "gpt-4"})
+    # Run benchmark
+    benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf")
+    results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"})
 """
 
+from maseval import AnchorPointsTaskQueue
+
 from .mmlu import (
     DEFAULT_AGENT_NAME,
     DEFAULT_BATCH_SIZE,
     DEFAULT_CHOICES,
     DEFAULT_DEVICE,
     DEFAULT_MODEL_REGISTER_NAME,
-    FALLBACK_MODEL_ID,
     MMLU_TASK_NAME,
     STATUS_SUCCESS,
     TARGET_DELIMITER,
@@ -39,7 +38,6 @@
     MMLUEvaluator,
     MMLUModelAgent,
     MMLUAgentAdapter,
-    AnchorPointsTaskQueue,
     load_tasks,
     compute_benchmark_metrics,
 )
@@ -50,7 +48,6 @@
     "DEFAULT_CHOICES",
     "DEFAULT_DEVICE",
     "DEFAULT_MODEL_REGISTER_NAME",
-    "FALLBACK_MODEL_ID",
     "MMLU_TASK_NAME",
     "STATUS_SUCCESS",
     "TARGET_DELIMITER",