Commit b498ce7

[Move DISCO queue to core]:
- Rename HuggingFaceMMLUBenchmark to DefaultMMLUBenchmark for consistency with other benchmarks
1 parent 6ad80a8 commit b498ce7
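
Because this commit renames a public class, downstream code that imports `HuggingFaceMMLUBenchmark` has to be updated. The diff shows no compatibility alias, so a caller that must work against releases on both sides of the rename could resolve the name at import time. The following is a minimal sketch under that assumption: `resolve_renamed` is a hypothetical helper, not part of maseval, and the two `fake_mmlu_*` modules are stand-ins so the example runs without maseval installed.

```python
import importlib
import sys
import types


def resolve_renamed(module_name: str, new_name: str, old_name: str):
    """Return module.new_name if it exists, otherwise fall back to module.old_name."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, new_name)
    except AttributeError:
        return getattr(module, old_name)


# Stand-in modules (hypothetical) that mimic a release before and after the rename.
old_release = types.ModuleType("fake_mmlu_old")
old_release.HuggingFaceMMLUBenchmark = type("HuggingFaceMMLUBenchmark", (), {})
new_release = types.ModuleType("fake_mmlu_new")
new_release.DefaultMMLUBenchmark = type("DefaultMMLUBenchmark", (), {})
sys.modules["fake_mmlu_old"] = old_release
sys.modules["fake_mmlu_new"] = new_release

# The same call works against either stand-in.
old_cls = resolve_renamed("fake_mmlu_old", "DefaultMMLUBenchmark", "HuggingFaceMMLUBenchmark")
new_cls = resolve_renamed("fake_mmlu_new", "DefaultMMLUBenchmark", "HuggingFaceMMLUBenchmark")
```

Against the real package the call would presumably be `resolve_renamed("maseval.benchmark.mmlu", "DefaultMMLUBenchmark", "HuggingFaceMMLUBenchmark")`, using the module path shown in the diff below.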

5 files changed

Lines changed: 15 additions & 15 deletions

CHANGELOG.md

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 **Benchmarks**

-- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
+- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)

 - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

@@ -88,7 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 **Benchmarks**

-- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
+- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
 - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
 - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
 - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
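
The second CHANGELOG.md hunk notes that silent `.get()` fallbacks for the required fields `gold`, `query`, and `model_id` were removed so that missing data fails immediately. The behavioural difference can be illustrated roughly as follows. Only the three field names come from the changelog; the record layout and both function names are hypothetical and are not maseval's code.

```python
REQUIRED_FIELDS = ("gold", "query", "model_id")


def read_lenient(record: dict) -> dict:
    # Fallback style: a missing field silently becomes None and only
    # causes trouble later, far from the place where the data was read.
    return {name: record.get(name) for name in REQUIRED_FIELDS}


def read_strict(record: dict) -> dict:
    # Strict style: direct indexing raises KeyError at the point of use.
    return {name: record[name] for name in REQUIRED_FIELDS}


# Hypothetical task record that is missing the "gold" answer.
incomplete = {"query": "2 + 2 = ?", "model_id": "demo-model"}

lenient = read_lenient(incomplete)

try:
    read_strict(incomplete)
    strict_error = None
except KeyError as exc:
    strict_error = exc.args[0]  # name of the missing field
```

With the lenient reader, the missing label only shows up as a `None` that some later step has to notice; with the strict reader, the error names the missing field directly.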

docs/benchmark/mmlu.md

Lines changed: 3 additions & 3 deletions
@@ -52,7 +52,7 @@ pip install maseval[lm-eval]

 ```python
 from maseval.benchmark.mmlu import (
-    HuggingFaceMMLUBenchmark,
+    DefaultMMLUBenchmark,
     load_tasks,
     compute_benchmark_metrics,
 )
@@ -61,7 +61,7 @@ from maseval.benchmark.mmlu import (
 tasks = load_tasks(data_path="/path/to/mmlu_prompts_examples.json")

 # Create benchmark with HuggingFace model
-benchmark = HuggingFaceMMLUBenchmark(
+benchmark = DefaultMMLUBenchmark(
     model_id="meta-llama/Llama-2-7b-hf",
     device="cuda:0",
 )
@@ -118,7 +118,7 @@ class MyMMLUBenchmark(MMLUBenchmark):

 ::: maseval.benchmark.mmlu.MMLUBenchmark

-::: maseval.benchmark.mmlu.HuggingFaceMMLUBenchmark
+::: maseval.benchmark.mmlu.DefaultMMLUBenchmark

 ::: maseval.benchmark.mmlu.MMLUEnvironment


examples/mmlu_benchmark/mmlu_benchmark.py

Lines changed: 2 additions & 2 deletions
@@ -52,7 +52,7 @@
 # MMLU benchmark imports
 from maseval.benchmark.mmlu import (
     DEFAULT_DEVICE,
-    HuggingFaceMMLUBenchmark,
+    DefaultMMLUBenchmark,
     load_tasks,
     compute_benchmark_metrics,
 )
@@ -691,7 +691,7 @@ def main():
     )

     # Create benchmark
-    benchmark = HuggingFaceMMLUBenchmark(
+    benchmark = DefaultMMLUBenchmark(
         model_id=args.model_id,
         device=args.device,
         trust_remote_code=True,

maseval/benchmark/mmlu/__init__.py

Lines changed: 4 additions & 4 deletions
@@ -4,7 +4,7 @@

 Usage:
     from maseval.benchmark.mmlu import (
-        HuggingFaceMMLUBenchmark,
+        DefaultMMLUBenchmark,
         load_tasks,
     )
     from maseval import DISCOQueue, InformativeSubsetQueue
@@ -16,7 +16,7 @@
     )

     # Run benchmark
-    benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf")
+    benchmark = DefaultMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf")
     results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"})
 """

@@ -33,7 +33,7 @@
     TARGET_DELIMITER,
     TASK_TYPE_MMLU,
     MMLUBenchmark,
-    HuggingFaceMMLUBenchmark,
+    DefaultMMLUBenchmark,
     MMLUEnvironment,
     MMLUEvaluator,
     MMLUModelAgent,
@@ -53,7 +53,7 @@
     "TARGET_DELIMITER",
     "TASK_TYPE_MMLU",
     "MMLUBenchmark",
-    "HuggingFaceMMLUBenchmark",
+    "DefaultMMLUBenchmark",
     "MMLUEnvironment",
     "MMLUEvaluator",
     "MMLUModelAgent",

maseval/benchmark/mmlu/mmlu.py

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@

 Usage:
     from maseval.benchmark.mmlu import (
-        HuggingFaceMMLUBenchmark, load_tasks,
+        DefaultMMLUBenchmark, load_tasks,
     )
     from maseval import DISCOQueue

@@ -19,7 +19,7 @@
     )

     # Run with the HuggingFace concrete implementation
-    benchmark = HuggingFaceMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf")
+    benchmark = DefaultMMLUBenchmark(model_id="meta-llama/Llama-2-7b-hf")
     results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"})
 """

@@ -342,7 +342,7 @@ class MMLUBenchmark(Benchmark):
     - ``setup_agents()`` - create agents for MCQ evaluation
     - ``get_model_adapter()`` - provide model adapters

-    For a ready-to-use implementation, see ``HuggingFaceMMLUBenchmark``.
+    For a ready-to-use implementation, see ``DefaultMMLUBenchmark``.
     """

     def __init__(
@@ -431,7 +431,7 @@ def evaluate(
         return results


-class HuggingFaceMMLUBenchmark(MMLUBenchmark):
+class DefaultMMLUBenchmark(MMLUBenchmark):
     """MMLU Benchmark using HuggingFace transformers models.

     This concrete implementation uses log-likelihood based MCQ evaluation
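
The docstring above mentions log-likelihood based MCQ evaluation, but the diff is cut off before any implementation is shown. As a general illustration of that technique (not necessarily how `DefaultMMLUBenchmark` does it), each answer option can be scored by the summed log-probability the model assigns to its tokens, and the highest-scoring option is taken as the prediction. The function names and the log-probability values below are made up for the sketch.

```python
from typing import Dict, List


def score_choice(token_logprobs: List[float]) -> float:
    # Log-likelihood of a continuation is the sum of its token log-probs.
    return sum(token_logprobs)


def predict(choice_logprobs: Dict[str, List[float]]) -> str:
    # Pick the answer option whose continuation is most likely under the model.
    return max(choice_logprobs, key=lambda c: score_choice(choice_logprobs[c]))


# Hypothetical per-token log-probs for the four options of one question.
example = {
    "A": [-2.3],
    "B": [-0.4],
    "C": [-1.9],
    "D": [-0.1, -5.0],  # multi-token option: scores are summed
}

prediction = predict(example)
is_correct = prediction == "B"  # assuming the gold label is "B"
```

Summing raw log-probabilities favours shorter options when the options differ in token count; some evaluation setups divide by length to compensate, which this sketch deliberately leaves out.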
