
Commit 8d8f626

Adding MACS Benchmark (#13)
### Added

**Exceptions and Error Classification**

- Added `AgentError`, `EnvironmentError`, `UserError` exception hierarchy in `maseval.core.exceptions` for classifying execution failures by responsibility
- Added `TaskExecutionStatus.AGENT_ERROR`, `ENVIRONMENT_ERROR`, `USER_ERROR`, `UNKNOWN_EXECUTION_ERROR` for fine-grained error classification enabling fair scoring
- Added validation helpers: `validate_argument_type()`, `validate_required_arguments()`, `validate_no_extra_arguments()`, `validate_arguments_from_schema()` for tool implementers
- Added `ToolSimulatorError` and `UserSimulatorError` exception subclasses for simulator-specific context while inheriting proper classification

**Documentation**

- Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks

**Benchmarks**

- MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark

**Benchmark**

- Added `execution_loop()` method to `Benchmark` base class enabling iterative agent-user interaction
- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility)
- Added abstract `get_model_adapter(model_id, **kwargs)` method to `Benchmark` base class as universal model factory to be used throughout the benchmarks.

**User**

- Added `max_turns` and `stop_token` parameters to `User` base class for multi-turn support with early stopping. Same applied to `UserLLMSimulator`.
- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to `User` base class
- Added `get_initial_query()` method to `User` base class for LLM-generated initial messages
- Added `initial_query` parameter in `User` base class to trigger the agentic system.

**Environment**

- Added `Environment.get_tool(name)` method for single-tool lookup

### Changed

**Exception Handling**

- Benchmark now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of generic `TASK_EXECUTION_FAILED` (PR: #13)
- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure
- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure

**Environment**

- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list`

**Benchmark**

- `Benchmark.run_agents()` signature changed: added `query: str` parameter
- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles

**Simulator**

- The `LLMSimulator` now throws an exception when json cannot be decoded instead of returning the error message as text to the agent.
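
The error classification described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from `maseval.core.exceptions`: the class and status names are taken from the commit message, but the class bodies, the enum values, and the `classify()` helper are assumptions made for illustration only.

```python
from enum import Enum


class TaskExecutionStatus(Enum):
    # Member names come from the commit message; the string values are assumed.
    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"


class AgentError(Exception):
    """The agent caused the failure, e.g. it called a tool with bad arguments."""


class EnvironmentError(Exception):  # note: shadows the builtin alias of OSError here
    """A tool or the surrounding infrastructure failed."""


class UserError(Exception):
    """The (simulated) user failed."""


class ToolSimulatorError(EnvironmentError):
    """Simulator-specific context; still classified as an environment failure."""


class UserSimulatorError(UserError):
    """Simulator-specific context; still classified as a user failure."""


def classify(exc: BaseException) -> TaskExecutionStatus:
    """Hypothetical helper: map an exception to a status by responsibility."""
    if isinstance(exc, AgentError):
        return TaskExecutionStatus.AGENT_ERROR
    if isinstance(exc, EnvironmentError):
        return TaskExecutionStatus.ENVIRONMENT_ERROR
    if isinstance(exc, UserError):
        return TaskExecutionStatus.USER_ERROR
    # Anything outside the hierarchy stays unclassified.
    return TaskExecutionStatus.UNKNOWN_EXECUTION_ERROR
```

Because the simulator errors subclass the responsibility-level exceptions, a single `isinstance` check per category is enough; a scorer could then decide separately how each status should count against the agent.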
1 parent 214e40e commit 8d8f626
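
The multi-turn behaviour named in the commit message (`execution_loop()`, `max_invocations`, `max_turns`, `stop_token`, `is_done()`, `increment_turn()`, `initial_query`) might look roughly like the sketch below. Only those names come from the commit message; `SketchUser`, its `respond()` method, the default stop token, and the exact loop semantics are assumptions and do not reproduce the `Benchmark` or `User` classes.

```python
from typing import Callable, List, Sequence


class SketchUser:
    """Hypothetical scripted user; stands in for a real user simulator."""

    def __init__(self, initial_query: str, replies: Sequence[str],
                 max_turns: int = 3, stop_token: str = "</stop>") -> None:
        self.initial_query = initial_query
        self._replies = list(replies)
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self._stopped = False

    def increment_turn(self) -> None:
        self.turns += 1

    def is_done(self) -> bool:
        # Done once the stop token was emitted or the turn budget is used up.
        return self._stopped or self.turns >= self.max_turns

    def respond(self, agent_answer: str) -> str:
        # Scripted replies; fall back to the stop token when the script ends.
        reply = self._replies.pop(0) if self._replies else self.stop_token
        self.increment_turn()
        if self.stop_token in reply:
            self._stopped = True
        return reply


def execution_loop(run_agents: Callable[[str], str], user: SketchUser,
                   max_invocations: int = 1) -> List[str]:
    """Alternate agent invocations and user replies until either side stops."""
    answers: List[str] = []
    query = user.initial_query
    for _ in range(max_invocations):
        answers.append(run_agents(query))
        if user.is_done():
            break
        query = user.respond(answers[-1])
        if user.is_done():
            break
    return answers


# Usage: a toy agent that echoes the query it received.
user = SketchUser("Book a flight", ["To Berlin, please", "Thanks </stop>"], max_turns=5)
answers = execution_loop(lambda q: f"agent saw: {q}", user, max_invocations=5)
```

With `max_invocations=1` the loop reduces to a single agent call, which matches the commit's note that the default of 1 is kept for backwards compatibility.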

56 files changed

Lines changed: 10075 additions & 1047 deletions


.github/workflows/test.yml

Lines changed: 22 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -29,9 +29,30 @@ jobs:
2929
run: |
3030
uv run pytest -m core -v
3131
32+
test-benchmark:
33+
name: Benchmark Tests
34+
runs-on: ubuntu-latest
35+
strategy:
36+
matrix:
37+
python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
38+
39+
steps:
40+
- uses: actions/checkout@v3
41+
- name: Set up Python ${{ matrix.python-version }}
42+
uses: actions/setup-python@v4
43+
with:
44+
python-version: ${{ matrix.python-version }}
45+
- name: Install dependencies
46+
run: |
47+
pip install uv
48+
uv sync --group dev
49+
- name: Run benchmark tests
50+
run: |
51+
uv run pytest -m benchmark -v
52+
3253
test-all:
3354
name: All Tests (With Optional Deps)
34-
needs: test-core
55+
needs: [test-core, test-benchmark]
3556
runs-on: ubuntu-latest
3657
strategy:
3758
matrix:

AGENTS.md

Lines changed: 92 additions & 9 deletions
@@ -30,10 +30,10 @@ uv run pytest tests/
3030

3131
```bash
3232
# Format code
33-
ruff format .
33+
uv run ruff format .
3434

3535
# Lint and auto-fix issues
36-
ruff check . --fix
36+
uv run ruff check . --fix
3737
```
3838

3939
## Testing Instructions
@@ -45,14 +45,14 @@ ruff check . --fix
4545

4646
```bash
4747
# Run all tests
48-
pytest -v
48+
uv run pytest -v
4949

5050
# Core tests only (minimal dependencies)
51-
pytest -m core -v
51+
uv run pytest -m core -v
5252

5353
# Specific integration tests
54-
pytest -m smolagents -v
55-
pytest -m interface -v
54+
uv run pytest -m smolagents -v
55+
uv run pytest -m interface -v
5656
```
5757

5858
## Dependency Management
@@ -209,7 +209,7 @@ Example workflow:
209209
uv sync --all-extras --all-groups
210210

211211
# Before committing
212-
ruff format . && ruff check . --fix && pytest -v
212+
uv run ruff format . && uv run ruff check . --fix && uv run pytest -v
213213

214214
# Run example
215215
uv run python examples/amazon_collab.py
@@ -221,7 +221,7 @@ uv sync --all-extras --all-groups
221221
uv add --optional <extra-name> <package-name>
222222

223223
# Check specific test file
224-
pytest tests/test_core/test_agent.py -v
224+
uv run pytest tests/test_core/test_agent.py -v
225225
```
226226

227227
## Type Hinting
@@ -239,4 +239,87 @@ For lists and dictionaries, use `Dict[...,...]`, `List[...]`, `Sequence[...]` et
239239

240240
## Changelog
241241

242-
When the task is completed, add your changes to the Changelog.
242+
When you complete a task, document your changes in the Changelog. Multiple tasks contribute to a single PR, and PRs are compiled into release changelogs.
243+
244+
### User-Facing Documentation
245+
246+
Write changelog entries from the **user's perspective** - describe what the change means for someone using the library, not what you did internally. Focus on features, fixes, and improvements they'll notice or benefit from.
247+
248+
### Task-Level Documentation
249+
250+
Add an entry for your completed task under the `## Unreleased` section.
251+
252+
### Important Rules
253+
254+
- If you modified something already listed under "Added" in `Unreleased`, **update that existing entry** instead of adding a new one under "Changed"
255+
- Keep entries focused on user impact, not implementation details
256+
- Multiple task entries will be grouped together under the same PR
257+
- PR changelogs are then compiled into release notes between versions
258+
259+
### Format
260+
261+
Brief description of the user-facing change (PR: #PR_NUMBER_PLACEHOLDER)
262+
263+
### Example (User-Facing)
264+
265+
**Good:**
266+
267+
- Added support for custom retry strategies in API client with argument `retry` for `Client.__init__`. (PR: #13)
268+
- Fixed timeout errors when processing large datasets in `func` (PR: #4)
269+
270+
**Bad (not user-focused):**
271+
272+
- Refactored retry logic into separate module
273+
- Updated error handling in data_processor.py
274+
275+
## Docstrings
276+
277+
Write docstrings for **users**, not about your implementation process.
278+
279+
### Rules
280+
281+
- Describe what the code does and how to use it
282+
- Explain parameters, return values, and behavior
283+
- Never write narratives: "I did...", "First we...", "Then I..."
284+
- Never include quality claims: "rigorously tested", "well-optimized"
285+
- Omit implementation details users don't need
286+
287+
### Bad (narrative, claims, implementation details)
288+
289+
```
290+
def calculate_average(numbers: list) -> float:
291+
"""
292+
I implemented this to calculate averages. First I sum the numbers,
293+
then divide by count. Rigorously tested and optimized.
294+
"""
295+
```
296+
297+
### Good (clear, user-focused)
298+
299+
```
300+
def calculate_average(numbers: list) -> float:
301+
"""
302+
Calculate the arithmetic mean of numbers.
303+
304+
Args:
305+
numbers: List of numeric values
306+
307+
Returns:
308+
Average as float
309+
310+
Raises:
311+
ValueError: If list is empty
312+
"""
313+
```
314+
315+
## Early-Release Status
316+
317+
**This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
318+
319+
- Break APIs if it improves design
320+
- Refactor poor implementations
321+
- Remove technical debt as soon as you identify it
322+
- Don't preserve bad patterns for compatibility reasons
323+
- Focus on getting it right, not keeping it the same
324+
325+
We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.

BENCHMARKS.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ This document provides detailed information, sources, and licensing for all benc
44

55
---
66

7-
## 1. AWS Multi-Agent Collaboration Scenario
7+
## 1. Multi-Agent Collaboration Scenario Benchmark (MACS Benchmark)
88

99
This benchmark is designed to test and evaluate the collaborative problem-solving capabilities of multi-agent systems. The implementation in this library provides the necessary code to set up and run these scenarios.
1010

CHANGELOG.md

Lines changed: 73 additions & 5 deletions
@@ -9,27 +9,95 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
### Added
1111

12+
**Exceptions and Error Classification**
13+
14+
- Added `AgentError`, `EnvironmentError`, `UserError` exception hierarchy in `maseval.core.exceptions` for classifying execution failures by responsibility (PR: #13)
15+
- Added `TaskExecutionStatus.AGENT_ERROR`, `ENVIRONMENT_ERROR`, `USER_ERROR`, `UNKNOWN_EXECUTION_ERROR` for fine-grained error classification enabling fair scoring (PR: #13)
16+
- Added validation helpers: `validate_argument_type()`, `validate_required_arguments()`, `validate_no_extra_arguments()`, `validate_arguments_from_schema()` for tool implementers (PR: #13)
17+
- Added `ToolSimulatorError` and `UserSimulatorError` exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
18+
19+
**Documentation**
20+
21+
- Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)
22+
23+
**Benchmarks**
24+
25+
- MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)
26+
27+
**Benchmark**
28+
29+
- Added `execution_loop()` method to `Benchmark` base class enabling iterative agent-user interaction (PR: #13)
30+
- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility) (PR: #13)
31+
- Added abstract `get_model_adapter(model_id, **kwargs)` method to `Benchmark` base class as universal model factory to be used throughout the benchmarks. (PR: #13)
32+
33+
**User**
34+
35+
- Added `max_turns` and `stop_token` parameters to `User` base class for multi-turn support with early stopping. Same applied to `UserLLMSimulator`. (PR: #13)
36+
- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to `User` base class (PR: #13)
37+
- Added `get_initial_query()` method to `User` base class for LLM-generated initial messages (PR: #13)
38+
- Added `initial_query` parameter in `User` base class to trigger the agentic system. (PR: #13)
39+
40+
**Environment**
41+
42+
- Added `Environment.get_tool(name)` method for single-tool lookup (PR: #13)
43+
44+
**Interface**
45+
1246
- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
13-
- Supports async workflow execution with proper event loop handling
14-
- Added a new example: The `5_a_day_benchmark` (PR: #10)
1547
- The `logs` property inside `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly filled. (PR: #3)
1648

49+
**Examples**
50+
51+
- Added a new example: The `5_a_day_benchmark` (PR: #10)
52+
1753
### Changed
1854

19-
- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
20-
- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
55+
**Exception Handling**
56+
57+
- Benchmark now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of generic `TASK_EXECUTION_FAILED` (PR: #13)
58+
- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure (PR: #13)
59+
- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure (PR: #13)
60+
61+
**Environment**
62+
63+
- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list` (PR: #13)
64+
65+
**Benchmark**
66+
67+
- `Benchmark.run_agents()` signature changed: added `query: str` parameter (PR: #13)
68+
- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles (PR: #13)
2169
- `Benchmark` class now has a `fail_on_setup_error` flag that raises errors observed during setup of task (PR: #10)
70+
71+
**Callback**
72+
73+
- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
74+
75+
**Evaluator**
76+
2277
- The `Evaluator` class now has a `filter_traces` base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).
78+
79+
**Simulator**
80+
81+
- The `LLMSimulator` now throws an exception when json cannot be decoded instead of returning the error message as text to the agent (PR: #13).
82+
83+
**Other**
84+
85+
- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
2386
- Improved Quick Start Guide in `docs/getting-started/quickstart.md`. (PR: #10)
2487
- `maseval.interface.agents` structure changed. Tools requiring framework imports (beyond just typing) now in `<framework>_optional.py` and imported dynamically from `<framework>.py`. (PR: #12)
2588
- Various formatting improvements in the documentation (PR: #12)
2689
- Added documentation for View Source Code pattern in `CONTRIBUTING.md` and `_optional.py` pattern in interface README (PR: #12)
2790

2891
### Fixed
2992

93+
**Interface**
94+
3095
- `LlamaIndexAgentAdapter` now supports multiple LlamaIndex agent types including `ReActAgent` (workflow-based), `FunctionAgent`, and legacy agents by checking for `.chat()`, `.query()`, and `.run()` methods in priority order (PR: #10)
96+
97+
**Other**
98+
3199
- Consistent naming of agent `adapter` over `wrapper` (PR: #3)
32-
- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shwon in documentation properly (#PR: 12)
100+
- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shown in documentation properly (#PR: 12)
33101

34102
### Removed
35103

docs/benchmark/index.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
# Benchmarks
2+
3+
MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.
4+
5+
## Adding Custom Benchmarks
6+
7+
You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.
8+
9+
## Licensing
10+
11+
For detailed source and licensing information for each benchmark's data, see [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md).

docs/benchmark/macs.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
# MACS: Multi-Agent Collaboration Scenarios
2+
3+
The **Multi-Agent Collaboration Scenarios (MACS)** benchmark evaluates how well multi-agent systems collaborate to solve complex enterprise tasks across multiple domains.
4+
5+
## Overview
6+
7+
[Multi-Agent Collaboration Scenarios (MACS)](https://arxiv.org/abs/2412.05449) is designed to test collaborative problem-solving in realistic enterprise scenarios. The benchmark includes tasks spanning multiple domains such as travel planning, retail, and more. Each task involves multiple agents that must coordinate their actions to achieve user goals.
8+
9+
Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
10+
11+
## Quick Start
12+
13+
```python
14+
from maseval.benchmark.macs import (
15+
MACSBenchmark, MACSEnvironment, MACSEvaluator, MACSGenericTool,
16+
load_tasks, load_agent_config,
17+
)
18+
19+
# Load data
20+
tasks = load_tasks("travel", limit=5)
21+
agent_config = load_agent_config("travel")
22+
23+
# Create your framework-specific benchmark subclass
24+
class MyMACSBenchmark(MACSBenchmark):
25+
def setup_agents(self, agent_data, environment, task, user):
26+
# Your framework-specific agent creation
27+
...
28+
29+
# Run
30+
benchmark = MyMACSBenchmark(agent_data=agent_config, model=my_model)
31+
results = benchmark.run(tasks)
32+
```
33+
34+
::: maseval.benchmark.macs.MACSBenchmark
35+
36+
::: maseval.benchmark.macs.MACSUser
37+
38+
::: maseval.benchmark.macs.MACSEnvironment
39+
40+
::: maseval.benchmark.macs.MACSEvaluator
41+
42+
::: maseval.benchmark.macs.MACSGenericTool

docs/examples/index.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
# Examples
2+
3+
Learn MASEval through hands-on examples covering common use cases and benchmarks.
4+
5+
| Example | Description |
6+
| -------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
7+
| [Tutorial](tutorial.ipynb) | Introduction to MASEval's core concepts and basic usage |
8+
| [Five-a-Day Benchmark](five_a_day_benchmark.ipynb) | Building a custom benchmark from scratch |
9+
| [Multi-Agent Collaboration Scenario Benchmark (MACS)](https://github.com/parameterlab/MASEval/blob/main/examples/macs_benchmark/macs_benchmark.py) | An adaptation of the `maseval.benchmark.MACSBenchmark`. |
