Skip to content

Commit 14bcb3f

Browse files
committed
[Move DISCO queue to core] Add ModelScorer, ModelAgentAdapter, and rename HuggingFaceModelAdapter
Introduce two new core abstractions and refactor the HuggingFace inference layer: - ModelScorer (maseval.core.scorer): ABC for log-likelihood scoring, parallel to ModelAdapter for generation. Methods: loglikelihood(), loglikelihood_batch(), loglikelihood_choices(). - ModelAgentAdapter (maseval.core.agent): generic adapter wrapping any ModelAdapter as an AgentAdapter, replacing benchmark-specific wrappers like MMLUModelAgent/MMLUAgentAdapter. - HuggingFaceModelAdapter renamed to HuggingFacePipelineModelAdapter (old name kept as backwards-compatible alias). - HuggingFaceModelScorer (maseval.interface.inference): concrete ModelScorer backed by AutoModelForCausalLM, with single-token optimisation for MCQ evaluation. Extracted from DefaultMMLUBenchmark. - DefaultMMLUBenchmark refactored to delegate scoring to HuggingFaceModelScorer and use ModelAgentAdapter.
1 parent b498ce7 commit 14bcb3f

11 files changed

Lines changed: 722 additions & 450 deletions

File tree

CHANGELOG.md

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1111

1212
**Benchmarks**
1313

14-
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
14+
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
1515

1616
- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
1717

@@ -42,16 +42,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4242
**Core**
4343

4444
- Added `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). Available via `from maseval import DISCOQueue`. (PR: #34)
45+
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #PR_NUMBER_PLACEHOLDER)
46+
- Added `ModelAgentAdapter` in `maseval.core.agent` — a generic adapter that wraps any `ModelAdapter` as an `AgentAdapter` for direct model evaluation (replaces benchmark-specific agent wrappers). (PR: #PR_NUMBER_PLACEHOLDER)
4547
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
4648
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
4749
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
4850
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
4951
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
50-
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
52+
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
5153
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
5254

5355
**Interface**
5456

57+
- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #PR_NUMBER_PLACEHOLDER)
58+
- Renamed `HuggingFaceModelAdapter``HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. The old name remains as a backwards-compatible alias. (PR: #PR_NUMBER_PLACEHOLDER)
59+
5560
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
5661
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
5762
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
@@ -88,7 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
8893

8994
**Benchmarks**
9095

91-
- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). (PR: #34)
96+
- `MMLUBenchmark` no longer implements `setup_agents()` — consistent with other benchmarks, agent creation is left to concrete subclasses (e.g., `DefaultMMLUBenchmark`). Removed silent `.get()` fallbacks for required fields (`gold`, `query`, `model_id`) so missing data surfaces errors immediately instead of failing silently. `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue` instead of `AdaptiveTaskQueue`. Added `mmlu` optional extra (`pip install maseval[mmlu]`). `DefaultMMLUBenchmark` now delegates log-likelihood computation to `HuggingFaceModelScorer` and uses `ModelAgentAdapter` instead of the MMLU-specific `MMLUModelAgent`/`MMLUAgentAdapter` (removed). (PR: #34)
9297
- `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
9398
- `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
9499
- `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`

docs/benchmark/mmlu.md

Lines changed: 3 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -97,13 +97,13 @@ print(f"Evaluating {len(tasks)} anchor tasks")
9797
`MMLUBenchmark` is a framework-agnostic base class. To use a different model backend, subclass it and implement `setup_agents()` and `get_model_adapter()`:
9898

9999
```python
100-
from maseval.benchmark.mmlu import MMLUBenchmark, MMLUModelAgent, MMLUAgentAdapter
100+
from maseval import ModelAgentAdapter
101+
from maseval.benchmark.mmlu import MMLUBenchmark
101102

102103
class MyMMLUBenchmark(MMLUBenchmark):
103104
def setup_agents(self, agent_data, environment, task, user, seed_generator):
104105
model = self.get_model_adapter(agent_data["model_id"])
105-
agent = MMLUModelAgent(model, name="mmlu_agent")
106-
adapter = MMLUAgentAdapter(agent, "mmlu_agent")
106+
adapter = ModelAgentAdapter(model, name="mmlu_agent")
107107
return [adapter], {"mmlu_agent": adapter}
108108

109109
def get_model_adapter(self, model_id, **kwargs):
@@ -124,10 +124,6 @@ class MyMMLUBenchmark(MMLUBenchmark):
124124

125125
::: maseval.benchmark.mmlu.MMLUEvaluator
126126

127-
::: maseval.benchmark.mmlu.MMLUModelAgent
128-
129-
::: maseval.benchmark.mmlu.MMLUAgentAdapter
130-
131127
::: maseval.benchmark.mmlu.load_tasks
132128

133129
::: maseval.benchmark.mmlu.compute_benchmark_metrics

maseval/__init__.py

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@
2222
AdaptiveTaskQueue,
2323
)
2424
from .core.environment import Environment
25-
from .core.agent import AgentAdapter
25+
from .core.agent import AgentAdapter, ModelAgentAdapter
2626
from .core.benchmark import Benchmark, TaskExecutionStatus
2727
from .core.callback_handler import CallbackHandler
2828
from .core.callback import BenchmarkCallback, EnvironmentCallback, AgentCallback
@@ -35,6 +35,7 @@
3535
UserSimulatorError,
3636
)
3737
from .core.model import ModelAdapter, ChatResponse
38+
from .core.scorer import ModelScorer
3839
from .core.user import User, LLMUser, AgenticLLMUser, TerminationReason
3940
from .core.evaluator import Evaluator
4041
from .core.history import MessageHistory, ToolInvocationHistory
@@ -63,6 +64,7 @@
6364
# Core abstractions
6465
"Environment",
6566
"AgentAdapter",
67+
"ModelAgentAdapter",
6668
"Benchmark",
6769
"TaskExecutionStatus",
6870
# Callbacks
@@ -99,9 +101,10 @@
99101
"DISCOQueue",
100102
"PriorityTaskQueue",
101103
"AdaptiveTaskQueue",
102-
# Model adapters
104+
# Model adapters and scorers
103105
"ModelAdapter",
104106
"ChatResponse",
107+
"ModelScorer",
105108
# Exceptions and validation
106109
"MASEvalError",
107110
"AgentError",

maseval/benchmark/mmlu/__init__.py

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,8 +36,6 @@
3636
DefaultMMLUBenchmark,
3737
MMLUEnvironment,
3838
MMLUEvaluator,
39-
MMLUModelAgent,
40-
MMLUAgentAdapter,
4139
load_tasks,
4240
compute_benchmark_metrics,
4341
)
@@ -56,8 +54,6 @@
5654
"DefaultMMLUBenchmark",
5755
"MMLUEnvironment",
5856
"MMLUEvaluator",
59-
"MMLUModelAgent",
60-
"MMLUAgentAdapter",
6157
"InformativeSubsetQueue",
6258
"DISCOQueue",
6359
"load_tasks",

0 commit comments

Comments
 (0)