Skip to content

Commit 9b2696a

Browse files
committed
fixed changelog and docs
1 parent 524fee5 commit 9b2696a

2 files changed

Lines changed: 25 additions & 43 deletions

File tree

CHANGELOG.md

Lines changed: 25 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -25,16 +25,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2525
- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
2626
- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
2727
- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
28-
2928
- `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
3029
- `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
30+
- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34 and #41)
31+
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34 and #41)
32+
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
33+
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
34+
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
35+
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
36+
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
37+
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
3138

32-
**Benchmarks**
39+
**Interface**
40+
41+
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
42+
- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34 and #41)
43+
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
44+
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
45+
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
46+
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
3347

34-
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `DefaultMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
48+
**Benchmarks**
3549

50+
- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. `MMLUBenchmark` is a framework-agnostic base class (`setup_agents()` and `get_model_adapter()` must be implemented by subclasses); `DefaultMMLUBenchmark` provides a ready-made HuggingFace implementation. Also includes `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34 and #41)
3651
- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
37-
3852
- GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
3953
- `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
4054
- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
@@ -44,7 +58,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4458
- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
4559
- Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
4660
- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
47-
4861
- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
4962
- `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
5063
- `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
@@ -55,38 +68,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
5568
**Examples**
5669

5770
- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
58-
59-
- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
60-
- MMLU benchmark documentation at `docs/benchmark/mmlu.md` with installation, quick start, and API reference. (PR: #34)
71+
- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34 and #41)
6172
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
6273
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
6374

6475
**Documentation**
6576

6677
- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
6778

68-
**Core**
69-
70-
- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34)
71-
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34)
72-
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
73-
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
74-
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
75-
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
76-
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
77-
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
78-
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
79-
80-
**Interface**
81-
82-
- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34)
83-
- Renamed `HuggingFaceModelAdapter``HuggingFacePipelineModelAdapter` to distinguish it from the new scorer. (PR: #34)
84-
85-
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
86-
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
87-
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
88-
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
89-
9079
**Testing**
9180

9281
- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
@@ -98,7 +87,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
9887
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
9988
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
10089
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
101-
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34)
90+
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)
10291

10392
### Changed
10493

@@ -115,30 +104,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
115104
- `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
116105
- `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
117106
- `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
107+
- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
108+
- `SmolAgentUser``SmolAgentLLMUser`
109+
- `LangGraphUser``LangGraphLLMUser`
110+
- `LlamaIndexUser``LlamaIndexLLMUser`
118111

119112
**Benchmarks**
120113

121-
- `MMLUBenchmark` is now a framework-agnostic base class — `setup_agents()` and `get_model_adapter()` must be implemented by subclasses. Use `DefaultMMLUBenchmark` for HuggingFace models. Missing required fields now raise immediately instead of falling back to silent defaults. Install with `pip install maseval[mmlu]`. (PR: #34)
122-
- `DISCOQueue` moved from `maseval.benchmark.mmlu` to `maseval.core.task` and now extends `SequentialTaskQueue`. Removed `MMLUModelAgent`, `MMLUAgentAdapter`, and `AnchorPointsTaskQueue`. (PR: #34)
123114
- `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
124115
- `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
125116
- `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
126117
- `Tau2Benchmark`: Seeds `simulators/user`, `agents/default_agent`
118+
- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
127119

128120
**User**
129121

130122
- Refactored `User` class into abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
131123
- Renamed `AgenticUser``AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
132124

133-
**Interface**
134-
135-
- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
136-
- `SmolAgentUser``SmolAgentLLMUser`
137-
- `LangGraphUser``LangGraphLLMUser`
138-
- `LlamaIndexUser``LlamaIndexLLMUser`
139-
140-
- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
141-
142125
**Testing**
143126

144127
- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)

mkdocs.yml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -131,7 +131,6 @@ nav:
131131
- LiteLLM: interface/inference/litellm.md
132132
- OpenAI: interface/inference/openai.md
133133
- Benchmarks:
134-
- Overview: benchmark/index.md
135134
- ConVerse: benchmark/converse.md
136135
- GAIA2: benchmark/gaia2.md
137136
- MACS: benchmark/macs.md

0 commit comments

Comments
 (0)