Commit 6ad80a8

[Move DISCO queue to core]:
- Add InformativeSubsetQueue
- Rename AnchorPointsTaskQueue to DISCOQueue
- Make DISCOQueue a subclass of InformativeSubsetQueue
1 parent c0f81b9 commit 6ad80a8

6 files changed

Lines changed: 74 additions & 30 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Core**
 
-- Added `AnchorPointsTaskQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import AnchorPointsTaskQueue`. (PR: #34)
+- Added `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import DISCOQueue`. (PR: #34)
 - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
 - Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
 - Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
@@ -88,7 +88,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Benchmarks**
 
-- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `AnchorPointsTaskQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
+- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `HuggingFaceMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
 - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
 - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
 - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`

docs/benchmark/mmlu.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ The **MMLU Benchmark** evaluates language models on multiple-choice questions sp
 [MMLU](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021) is a widely used benchmark for measuring knowledge and reasoning across diverse domains. The MASEval implementation features:
 
 - **Log-likelihood MCQ evaluation** matching lm-evaluation-harness methodology
-- **Anchor-point task selection** via `AnchorPointsTaskQueue` for DISCO-style subset evaluation
+- **Anchor-point task selection** via `DISCOQueue` for DISCO-style subset evaluation
 - **HuggingFace integration** with batched log-probability computation
 - **lm-eval compatibility** mode for exact numerical reproduction
 
@@ -88,7 +88,7 @@ tasks = load_tasks(
     anchor_points_path="/path/to/anchor_points.json",
 )
 
-# tasks is an AnchorPointsTaskQueue — only anchor tasks are evaluated
+# tasks is a DISCOQueue — only anchor tasks are evaluated
 print(f"Evaluating {len(tasks)} anchor tasks")
 ```

maseval/__init__.py

Lines changed: 4 additions & 2 deletions
@@ -16,7 +16,8 @@
     BaseTaskQueue,
     TaskQueue,
     SequentialTaskQueue,
-    AnchorPointsTaskQueue,
+    InformativeSubsetQueue,
+    DISCOQueue,
     PriorityTaskQueue,
     AdaptiveTaskQueue,
 )
@@ -94,7 +95,8 @@
     "BaseTaskQueue",
     "TaskQueue",
     "SequentialTaskQueue",
-    "AnchorPointsTaskQueue",
+    "InformativeSubsetQueue",
+    "DISCOQueue",
     "PriorityTaskQueue",
     "AdaptiveTaskQueue",
     # Model adapters

maseval/benchmark/mmlu/__init__.py

Lines changed: 4 additions & 3 deletions
@@ -7,7 +7,7 @@
     HuggingFaceMMLUBenchmark,
     load_tasks,
 )
-from maseval import AnchorPointsTaskQueue
+from maseval import DISCOQueue, InformativeSubsetQueue
 
 # Load tasks and anchor points
 tasks = load_tasks(
@@ -20,7 +20,7 @@
 results = benchmark.run(tasks=tasks, agent_data={"model_id": "meta-llama/Llama-2-7b-hf"})
 """
 
-from maseval import AnchorPointsTaskQueue
+from maseval import DISCOQueue, InformativeSubsetQueue
 
 from .mmlu import (
     DEFAULT_AGENT_NAME,
@@ -58,7 +58,8 @@
     "MMLUEvaluator",
     "MMLUModelAgent",
     "MMLUAgentAdapter",
-    "AnchorPointsTaskQueue",
+    "InformativeSubsetQueue",
+    "DISCOQueue",
     "load_tasks",
     "compute_benchmark_metrics",
 ]

maseval/benchmark/mmlu/mmlu.py

Lines changed: 5 additions & 5 deletions
@@ -10,7 +10,7 @@
 from maseval.benchmark.mmlu import (
     HuggingFaceMMLUBenchmark, load_tasks,
 )
-from maseval import AnchorPointsTaskQueue
+from maseval import DISCOQueue
 
 # Load tasks (optionally filtered to anchor points)
 tasks = load_tasks(
@@ -39,7 +39,7 @@
 
 from maseval import (
     AgentAdapter,
-    AnchorPointsTaskQueue,
+    DISCOQueue,
     Benchmark,
     Environment,
     Evaluator,
@@ -963,13 +963,13 @@ def load_tasks(
     data_path: Union[str, Path],
     anchor_points_path: Optional[Union[str, Path]] = None,
     limit: Optional[int] = None,
-) -> Union[AnchorPointsTaskQueue, SequentialTaskQueue]:
+) -> Union[DISCOQueue, SequentialTaskQueue]:
     """Load MMLU tasks from JSON file.
 
     Args:
         data_path: Path to MMLU prompts JSON file (mmlu_prompts_examples.json format).
         anchor_points_path: Optional path to anchor points pickle file.
-            If provided, returns an AnchorPointsTaskQueue that evaluates
+            If provided, returns a DISCOQueue that evaluates
             only the anchor tasks in order.
         limit: Optional limit on number of tasks to load.
 
@@ -1024,7 +1024,7 @@ def load_tasks(
 
     # Create appropriate queue
     if anchor_points is not None:
-        return AnchorPointsTaskQueue(tasks, anchor_points)
+        return DISCOQueue(tasks, anchor_points)
     else:
         return SequentialTaskQueue(tasks)

maseval/core/task.py

Lines changed: 57 additions & 16 deletions
@@ -273,51 +273,92 @@ def __iter__(self) -> Iterator[Task]:
         return iter(self._tasks)
 
 
-class AnchorPointsTaskQueue(SequentialTaskQueue):
-    """Task queue that evaluates a specified subset of tasks in a given order.
+class InformativeSubsetQueue(SequentialTaskQueue):
+    """Evaluates an informative subset of tasks in a specified order.
 
-    Used for anchor-point-based evaluation where performance on a full dataset
-    is predicted from results on a carefully selected subset. Anchor points are
-    integer indices into the original task list. Only tasks at those indices are
-    yielded, in the order specified by ``anchor_points``.
+    Used for efficient evaluation where a carefully selected subset of tasks
+    can predict performance on the full dataset. The subset is defined by
+    ``indices`` — integer positions into the original task list. Only tasks
+    at those positions are yielded, in the order given by ``indices``.
 
-    When ``anchor_points`` is ``None``, all tasks are yielded in their original order
-    (equivalent to ``SequentialTaskQueue``).
+    The informativeness criterion (how the indices were chosen) is determined
+    by the caller or by a subclass. This base class is criterion-agnostic.
+
+    When ``indices`` is ``None``, all tasks are yielded in their original
+    order (equivalent to ``SequentialTaskQueue``).
 
     Attributes:
         _all_tasks: The complete, unfiltered task list.
-        _anchor_points: The anchor-point indices, or ``None``.
+        _indices: The subset indices, or ``None``.
 
     Example:
         ```python
         # Evaluate only tasks at indices 0, 5, 12
-        queue = AnchorPointsTaskQueue(tasks, anchor_points=[0, 5, 12])
+        queue = InformativeSubsetQueue(tasks, indices=[0, 5, 12])
 
         for task in queue:
             result = execute(task)  # Only 3 tasks
         ```
     """
 
-    def __init__(self, tasks: Iterable[Task], anchor_points: Optional[List[int]] = None) -> None:
-        """Initialize anchor-points task queue.
+    def __init__(self, tasks: Iterable[Task], indices: Optional[List[int]] = None) -> None:
+        """Initialize informative-subset task queue.
 
         Args:
             tasks: Full list of tasks (ordered by index).
-            anchor_points: Indices into ``tasks`` selecting which tasks to evaluate
+            indices: Positions into ``tasks`` selecting which tasks to evaluate
                 and in what order. If ``None``, evaluates all tasks in order.
         """
         all_tasks = list(tasks)
         self._all_tasks: List[Task] = all_tasks
-        self._anchor_points: Optional[List[int]] = anchor_points
+        self._indices: Optional[List[int]] = indices
 
-        if anchor_points is not None:
+        if indices is not None:
             task_by_index: Dict[int, Task] = {i: task for i, task in enumerate(all_tasks)}
-            filtered = [task_by_index[idx] for idx in anchor_points if idx in task_by_index]
+            filtered = [task_by_index[idx] for idx in indices if idx in task_by_index]
             super().__init__(filtered)
         else:
             super().__init__(all_tasks)
 
 
+class DISCOQueue(InformativeSubsetQueue):
+    """Diversity-based informative subset using DISCO anchor points.
+
+    Selects a diverse subset of tasks (anchor points) for evaluation. Full
+    benchmark performance is then predicted from results on this subset using
+    DISCO (DISCOvering key features for accurate prediction of LLM abilities
+    on benchmarks).
+
+    The informativeness criterion is **diversity**: anchor points are chosen
+    to maximise disagreement across models, so that a small evaluation set
+    captures the discriminative structure of the full benchmark.
+
+    Reference: `DISCO: DISCOvering key features for accurate prediction of
+    LLM abilities on benchmarks <https://arxiv.org/abs/2407.12890>`_
+
+    Example:
+        ```python
+        queue = DISCOQueue(tasks, anchor_points=[0, 5, 12])
+
+        for task in queue:
+            result = execute(task)  # Only 3 tasks
+        ```
+    """
+
+    def __init__(self, tasks: Iterable[Task], anchor_points: Optional[List[int]] = None) -> None:
+        """Initialize DISCO task queue.
+
+        Args:
+            tasks: Full list of tasks (ordered by index).
+            anchor_points: Diversity-selected indices into ``tasks``.
+                Typically loaded from a DISCO anchor-points file or
+                downloaded from a HuggingFace DISCO model repo.
+                If ``None``, evaluates all tasks in order.
+        """
+        self._anchor_points: Optional[List[int]] = anchor_points
+        super().__init__(tasks, indices=anchor_points)
+
+
 class PriorityTaskQueue(BaseTaskQueue):
     """Execute tasks ordered by priority.
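
The selection behaviour introduced in `maseval/core/task.py` can be sketched in isolation. What follows is a minimal, self-contained sketch and not the library code. It rests on three assumptions: `SequentialTaskQueue` here is a simplified stand-in, tasks are plain strings rather than `Task` objects, and a bounds check replaces the `task_by_index` dictionary lookup from the diff (the two are equivalent for integer indices). It illustrates that iteration order follows `anchor_points` and that indices outside the task list are dropped without an error.

```python
from typing import Iterable, Iterator, List, Optional


class SequentialTaskQueue:
    """Simplified stand-in (assumption): yields tasks in insertion order."""

    def __init__(self, tasks: Iterable[str]) -> None:
        self._tasks: List[str] = list(tasks)

    def __iter__(self) -> Iterator[str]:
        return iter(self._tasks)

    def __len__(self) -> int:
        return len(self._tasks)


class InformativeSubsetQueue(SequentialTaskQueue):
    """Criterion-agnostic subset queue, mirroring the logic shown in the diff."""

    def __init__(self, tasks: Iterable[str], indices: Optional[List[int]] = None) -> None:
        all_tasks = list(tasks)
        self._all_tasks = all_tasks
        self._indices = indices
        if indices is not None:
            # Keep the order given by `indices`; skip positions that do not
            # exist (including negative ones), as the dict lookup in the diff does.
            filtered = [all_tasks[i] for i in indices if 0 <= i < len(all_tasks)]
            super().__init__(filtered)
        else:
            super().__init__(all_tasks)


class DISCOQueue(InformativeSubsetQueue):
    """Subset queue whose indices are DISCO anchor points."""

    def __init__(self, tasks: Iterable[str], anchor_points: Optional[List[int]] = None) -> None:
        self._anchor_points = anchor_points
        super().__init__(tasks, indices=anchor_points)


# Hypothetical task names, for illustration only.
tasks = [f"task-{i}" for i in range(6)]
queue = DISCOQueue(tasks, anchor_points=[4, 0, 99, 2])  # 99 is out of range
selected = list(queue)
print(selected)
```

Because unknown indices are skipped silently, a stale anchor-points file shortens the evaluation instead of raising. A caller who needs strictness could compare `len(queue)` with `len(anchor_points)` before running.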
