### Added
**Exceptions and Error Classification**
- Added `AgentError`, `EnvironmentError`, `UserError` exception hierarchy in `maseval.core.exceptions` for classifying execution failures by responsibility
- Added `TaskExecutionStatus.AGENT_ERROR`, `ENVIRONMENT_ERROR`, `USER_ERROR`, `UNKNOWN_EXECUTION_ERROR` for fine-grained error classification enabling fair scoring
- Added validation helpers: `validate_argument_type()`, `validate_required_arguments()`, `validate_no_extra_arguments()`, `validate_arguments_from_schema()` for tool implementers
- Added `ToolSimulatorError` and `UserSimulatorError` exception subclasses for simulator-specific context while inheriting proper classification
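
The classification described in these entries can be sketched as follows. This is a minimal illustration, not the code in `maseval.core.exceptions`: only the class and status names come from the entries above, while the class bodies, the enum string values, and the `classify()` helper are assumptions made for this example.

```python
from enum import Enum


class TaskExecutionStatus(Enum):
    # Member names come from the changelog; the string values are assumed.
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"


# Stand-in classes; the real ones live in maseval.core.exceptions.
class AgentError(Exception):
    """Failure attributed to the agent."""


class EnvironmentError(Exception):  # note: shadows Python's builtin alias of OSError
    """Failure attributed to a tool or the infrastructure."""


class UserError(Exception):
    """Failure attributed to the user simulator."""


class ToolSimulatorError(EnvironmentError):
    """Simulator-specific subclass that inherits the environment classification."""


class UserSimulatorError(UserError):
    """Simulator-specific subclass that inherits the user classification."""


def classify(exc: BaseException) -> TaskExecutionStatus:
    """Map an exception to a status via the hierarchy (hypothetical helper)."""
    if isinstance(exc, AgentError):
        return TaskExecutionStatus.AGENT_ERROR
    if isinstance(exc, EnvironmentError):
        return TaskExecutionStatus.ENVIRONMENT_ERROR
    if isinstance(exc, UserError):
        return TaskExecutionStatus.USER_ERROR
    return TaskExecutionStatus.UNKNOWN_EXECUTION_ERROR


print(classify(ToolSimulatorError("tool backend timed out")).name)
print(classify(ValueError("unexpected")).name)
```

Because the simulator errors subclass the responsibility classes, a single `isinstance` check per category is enough to classify them without special cases.
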
**Documentation**
- Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks
**Benchmarks**
- MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark
**Benchmark**
- Added `execution_loop()` method to `Benchmark` base class enabling iterative agent-user interaction
- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility)
- Added abstract `get_model_adapter(model_id, **kwargs)` method to `Benchmark` base class as a universal model factory for use across benchmarks
**User**
- Added `max_turns` and `stop_token` parameters to `User` base class for multi-turn support with early stopping. The same parameters were added to `UserLLMSimulator`.
- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to `User` base class
- Added `get_initial_query()` method to `User` base class for LLM-generated initial messages
- Added `initial_query` parameter to `User` base class to trigger the agentic system
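
How `max_invocations`, `max_turns`, and `stop_token` might interact is sketched below. This is an assumption-laden illustration, not MASEval's implementation: `SketchUser`, its scripted replies, the `respond()` method, and the free function `execution_loop()` are hypothetical stand-ins that borrow only the parameter and method names listed above.

```python
from typing import Callable, List, Optional


class SketchUser:
    """Hypothetical scripted user with a turn limit and an optional stop token."""

    def __init__(self, replies: List[str], max_turns: int = 3,
                 stop_token: Optional[str] = None) -> None:
        self._replies = list(replies)
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self._stopped = False

    def increment_turn(self) -> None:
        self.turns += 1

    def _check_stop_token(self, message: str) -> bool:
        return self.stop_token is not None and self.stop_token in message

    def is_done(self) -> bool:
        # Done when the stop token was seen or the turn budget is used up.
        return self._stopped or self.turns >= self.max_turns

    def respond(self, agent_answer: str) -> str:
        # A real simulator would read agent_answer; this one replays a script.
        reply = self._replies.pop(0) if self._replies else ""
        self.increment_turn()
        if self._check_stop_token(reply):
            self._stopped = True
        return reply


def execution_loop(agent: Callable[[str], str], user: SketchUser,
                   initial_query: str, max_invocations: int = 1) -> List[str]:
    """Alternate agent and user until the user is done or the budget runs out."""
    answers: List[str] = []
    query = initial_query
    for _ in range(max_invocations):
        answers.append(agent(query))
        query = user.respond(answers[-1])
        if user.is_done():
            break
    return answers


user = SketchUser(["more detail please", "thanks </stop>", "unused"],
                  max_turns=5, stop_token="</stop>")
answers = execution_loop(lambda q: f"answer to: {q}", user, "hi", max_invocations=5)
print(len(answers), user.turns)
```

With the default `max_invocations=1` the loop runs the agent once, which matches the single-invocation behaviour that the default is said to preserve.
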
**Environment**
- Added `Environment.get_tool(name)` method for single-tool lookup
### Changed
**Exception Handling**
- Benchmark now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of generic `TASK_EXECUTION_FAILED` (PR: #13)
- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure
- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure
**Environment**
- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list`
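
Returning a dictionary makes name-based lookup direct. The sketch below shows the idea with a stand-in class; `SketchEnvironment`, its two toy tools, and the choice to raise `KeyError` for an unknown name are assumptions for illustration, since the entries above do not say how the real `Environment.get_tool(name)` handles a missing tool.

```python
from typing import Any, Dict


class SketchEnvironment:
    """Hypothetical environment whose tools are keyed by name."""

    def __init__(self) -> None:
        self.tools: Dict[str, Any] = self.create_tools()

    def create_tools(self) -> Dict[str, Any]:
        # A dict (rather than a list) lets callers address tools by name.
        return {
            "add": lambda a, b: a + b,
            "echo": lambda text: text,
        }

    def get_tool(self, name: str) -> Any:
        # Assumed behaviour: fail loudly when the tool does not exist.
        try:
            return self.tools[name]
        except KeyError:
            raise KeyError(f"unknown tool: {name!r}") from None


env = SketchEnvironment()
print(env.get_tool("add")(2, 3))
```

Code that previously iterated over the returned list would iterate over `.values()` of the dictionary instead.
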
**Benchmark**
- `Benchmark.run_agents()` signature changed: added `query: str` parameter
- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles
**Simulator**
- The `LLMSimulator` now raises an exception when JSON cannot be decoded instead of returning the error message as text to the agent.
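
The difference between returning an error string and raising can be sketched with the standard `json` module. The function `parse_simulator_output()` and the local `ToolSimulatorError` class are hypothetical stand-ins; the actual `LLMSimulator` code path may differ.

```python
import json
from typing import Any, Dict


class ToolSimulatorError(Exception):
    """Stand-in for the simulator error named in the changelog."""


def parse_simulator_output(raw: str) -> Dict[str, Any]:
    """Decode simulator output, raising instead of returning error text."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Raising keeps the failure out of the agent's conversation, so it can
        # be classified as a simulator problem rather than an agent mistake.
        raise ToolSimulatorError(f"simulator returned invalid JSON: {exc.msg}") from exc
    if not isinstance(parsed, dict):
        raise ToolSimulatorError("simulator output must be a JSON object")
    return parsed


print(parse_simulator_output('{"status": "ok"}'))
```
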
@@ -239,4 +239,87 @@ For lists and dictionaries, use `Dict[...,...]`, `List[...]`, `Sequence[...]` et
 
 ## Changelog
 
-When the task is completed, add your changes to the Changelog.
+When you complete a task, document your changes in the Changelog. Multiple tasks contribute to a single PR, and PRs are compiled into release changelogs.
+
+### User-Facing Documentation
+
+Write changelog entries from the **user's perspective** - describe what the change means for someone using the library, not what you did internally. Focus on features, fixes, and improvements they'll notice or benefit from.
+
+### Task-Level Documentation
+
+Add an entry for your completed task under the `## Unreleased` section.
+
+### Important Rules
+
+- If you modified something already listed under "Added" in `Unreleased`, **update that existing entry** instead of adding a new one under "Changed"
+- Keep entries focused on user impact, not implementation details
+- Multiple task entries will be grouped together under the same PR
+- PR changelogs are then compiled into release notes between versions
+
+### Format
+
+Brief description of the user-facing change (PR: #PR_NUMBER_PLACEHOLDER)
+
+### Example (User-Facing)
+
+**Good:**
+
+- Added support for custom retry strategies in API client with argument `retry` for `Client.__init__`. (PR: #13)
+- Fixed timeout errors when processing large datasets in `func` (PR: #4)
+
+**Bad (not user-focused):**
+
+- Refactored retry logic into separate module
+- Updated error handling in data_processor.py
+
+## Docstrings
+
+Write docstrings for **users**, not about your implementation process.
+
+### Rules
+
+- Describe what the code does and how to use it
+- Explain parameters, return values, and behavior
+- Never write narratives: "I did...", "First we...", "Then I..."
+- Never include quality claims: "rigorously tested", "well-optimized"
+- Omit implementation details users don't need
+
+### Bad (narrative, claims, implementation details)
+
+```
+def calculate_average(numbers: list) -> float:
+    """
+    I implemented this to calculate averages. First I sum the numbers,
+    then divide by count. Rigorously tested and optimized.
+    """
+```
+
+### Good (clear, user-focused)
+
+```
+def calculate_average(numbers: list) -> float:
+    """
+    Calculate the arithmetic mean of numbers.
+
+    Args:
+        numbers: List of numeric values
+
+    Returns:
+        Average as float
+
+    Raises:
+        ValueError: If list is empty
+    """
+```
+
+## Early-Release Status
+
+**This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
+
+- Break APIs if it improves design
+- Refactor poor implementations
+- Remove technical debt as soon as you identify it
+- Don't preserve bad patterns for compatibility reasons
+- Focus on getting it right, not keeping it the same
+
+We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.
This benchmark is designed to test and evaluate the collaborative problem-solving capabilities of multi-agent systems. The implementation in this library provides the necessary code to set up and run these scenarios.
+- Added `execution_loop()` method to `Benchmark` base class enabling iterative agent-user interaction (PR: #13)
+- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility) (PR: #13)
+- Added abstract `get_model_adapter(model_id, **kwargs)` method to `Benchmark` base class as universal model factory to be used throughout the benchmarks. (PR: #13)
+
+**User**
+
+- Added `max_turns` and `stop_token` parameters to `User` base class for multi-turn support with early stopping. Same applied to `UserLLMSimulator`. (PR: #13)
+- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to `User` base class (PR: #13)
+- Added `get_initial_query()` method to `User` base class for LLM-generated initial messages (PR: #13)
+- Added `initial_query` parameter in `User` base class to trigger the agentic system. (PR: #13)
+
+**Environment**
+
+- Added `Environment.get_tool(name)` method for single-tool lookup (PR: #13)
+
+**Interface**
+
 - [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
-- Supports async workflow execution with proper event loop handling
-- Added a new example: The `5_a_day_benchmark` (PR: #10)
 - The `logs` property inside `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly filled. (PR: #3)
 
+**Examples**
+
+- Added a new example: The `5_a_day_benchmark` (PR: #10)
+
 ### Changed
 
-- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
-- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
+**Exception Handling**
+
+- Benchmark now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of generic `TASK_EXECUTION_FAILED` (PR: #13)
+- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure (PR: #13)
+- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure (PR: #13)
+
+**Environment**
+
+- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list` (PR: #13)
+- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles (PR: #13)
 - `Benchmark` class now has a `fail_on_setup_error` flag that raises errors observed during setup of task (PR: #10)
+
+**Callback**
+
+- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
+
+**Evaluator**
+
 - The `Evaluator` class now has a `filter_traces` base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).
+
+**Simulator**
+
+- The `LLMSimulator` now throws an exception when json cannot be decoded instead of returning the error message as text to the agent (PR: #13).
+
+**Other**
+
+- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
 - Improved Quick Start Guide in `docs/getting-started/quickstart.md`. (PR: #10)
 - `maseval.interface.agents` structure changed. Tools requiring framework imports (beyond just typing) now in `<framework>_optional.py` and imported dynamically from `<framework>.py`. (PR: #12)
 - Various formatting improvements in the documentation (PR: #12)
 - Added documentation for View Source Code pattern in `CONTRIBUTING.md` and `_optional.py` pattern in interface README (PR: #12)
 
 ### Fixed
 
+**Interface**
+
 - `LlamaIndexAgentAdapter` now supports multiple LlamaIndex agent types including `ReActAgent` (workflow-based), `FunctionAgent`, and legacy agents by checking for `.chat()`, `.query()`, and `.run()` methods in priority order (PR: #10)
+
+**Other**
+
 - Consistent naming of agent `adapter` over `wrapper` (PR: #3)
-- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shwon in documentation properly (#PR: 12)
+- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shown in documentation properly (#PR: 12)
MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.
+
+## Adding Custom Benchmarks
+
+You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.
+
+## Licensing
+
+For detailed source and licensing information for each benchmark's data, see [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md).
The **Multi-Agent Collaboration Scenarios (MACS)** benchmark evaluates how well multi-agent systems collaborate to solve complex enterprise tasks across multiple domains.
+
+## Overview
+
+[Multi-Agent Collaboration Scenarios (MACS)](https://arxiv.org/abs/2412.05449) is designed to test collaborative problem-solving in realistic enterprise scenarios. The benchmark includes tasks spanning multiple domains such as travel planning, retail, and more. Each task involves multiple agents that must coordinate their actions to achieve user goals.
+
+Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
|[Tutorial](tutorial.ipynb)| Introduction to MASEval's core concepts and basic usage |
+| [Five-a-Day Benchmark](five_a_day_benchmark.ipynb) | Building a custom benchmark from scratch |
+| [Multi-Agent Collaboration Scenario Benchmark (MACS)](https://github.com/parameterlab/MASEval/blob/main/examples/macs_benchmark/macs_benchmark.py) | An adaptation of the `maseval.benchmark.MACSBenchmark`. |