Merged
34 commits
4d533da
added dataloading infrastructure
cemde Dec 3, 2025
389119a
intiial implementation of benchmark.
cemde Dec 3, 2025
ae806fa
updated benchmark
cemde Dec 3, 2025
a9aef7c
updated benchmark implementation
cemde Dec 3, 2025
6aa222c
updated docs with new benchmark
cemde Dec 3, 2025
da47e96
refactored environment tool storage
cemde Dec 3, 2025
644fb5b
initial tests for macs
cemde Dec 3, 2025
1f4b41a
consolidated tests
cemde Dec 3, 2025
0408910
fixed evaluation
cemde Dec 4, 2025
5ef7014
added execution loop to Benchmark and updated user accordingly
cemde Dec 4, 2025
e878ac1
updated user benchmark interaction
cemde Dec 4, 2025
cb918cf
moved macs example
cemde Dec 4, 2025
f6e2d1e
added model factory abstract method to benchmark.
cemde Dec 4, 2025
c1f8b2f
fixed llama index documentation
cemde Dec 4, 2025
aceacde
updated benchmark model adapter factory
cemde Dec 4, 2025
055e9b8
added model factory pattern to macs benchmark
cemde Dec 4, 2025
b08bf55
fixed typing issues
cemde Dec 4, 2025
75925b2
fixed tests for GHA
cemde Dec 4, 2025
b192afb
fixed gitignore bug for macs
cemde Dec 4, 2025
7b1131f
[skip ci] fixed formatting
cemde Dec 4, 2025
1efd5c0
[skip ci] formatting fixes
cemde Dec 4, 2025
5663ca9
small clean up to user turn counting
cemde Dec 4, 2025
2ff59e8
[skip ci] user termination reason recorded
cemde Dec 5, 2025
d1b896a
added better early stopping support to base User class
cemde Dec 5, 2025
7ee0b62
fixed test typing error
cemde Dec 5, 2025
9ea0182
updated examples
cemde Dec 5, 2025
aa29258
updated agent instructions
cemde Dec 5, 2025
bf6c100
fixed examples
cemde Dec 5, 2025
1d32b31
fixed small issues in macs example
cemde Dec 5, 2025
3353a32
changed llm simulator to raise error
cemde Dec 5, 2025
fce3d1f
refined exception handling
cemde Dec 5, 2025
45fcad2
added better documentation of exceptions
cemde Dec 5, 2025
07229c9
fixed small bug for macs with langgraph
cemde Dec 5, 2025
7bb1faa
removed debugging file
cemde Dec 5, 2025
23 changes: 22 additions & 1 deletion .github/workflows/test.yml
@@ -29,9 +29,30 @@ jobs:
         run: |
           uv run pytest -m core -v

+  test-benchmark:
+    name: Benchmark Tests
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          pip install uv
+          uv sync --group dev
+      - name: Run benchmark tests
+        run: |
+          uv run pytest -m benchmark -v
+
   test-all:
     name: All Tests (With Optional Deps)
-    needs: test-core
+    needs: [test-core, test-benchmark]
     runs-on: ubuntu-latest
     strategy:
       matrix:
101 changes: 92 additions & 9 deletions AGENTS.md
@@ -30,10 +30,10 @@ uv run pytest tests/

```bash
# Format code
-ruff format .
+uv run ruff format .

# Lint and auto-fix issues
-ruff check . --fix
+uv run ruff check . --fix
```

## Testing Instructions
@@ -45,14 +45,14 @@ ruff check . --fix

```bash
# Run all tests
-pytest -v
+uv run pytest -v

# Core tests only (minimal dependencies)
-pytest -m core -v
+uv run pytest -m core -v

# Specific integration tests
-pytest -m smolagents -v
-pytest -m interface -v
+uv run pytest -m smolagents -v
+uv run pytest -m interface -v
```

## Dependency Management
@@ -209,7 +209,7 @@ Example workflow:
uv sync --all-extras --all-groups

# Before committing
-ruff format . && ruff check . --fix && pytest -v
+uv run ruff format . && uv run ruff check . --fix && uv run pytest -v

# Run example
uv run python examples/amazon_collab.py
@@ -221,7 +221,7 @@ uv sync --all-extras --all-groups
uv add --optional <extra-name> <package-name>

# Check specific test file
-pytest tests/test_core/test_agent.py -v
+uv run pytest tests/test_core/test_agent.py -v
```

## Type Hinting
@@ -239,4 +239,87 @@ For lists and dictionaries, use `Dict[...,...]`, `List[...]`, `Sequence[...]` et

## Changelog

-When the task is completed, add your changes to the Changelog.
+When you complete a task, document your changes in the Changelog. Multiple tasks contribute to a single PR, and PRs are compiled into release changelogs.

### User-Facing Documentation

Write changelog entries from the **user's perspective** - describe what the change means for someone using the library, not what you did internally. Focus on features, fixes, and improvements they'll notice or benefit from.

### Task-Level Documentation

Add an entry for your completed task under the `## Unreleased` section.

### Important Rules

- If you modified something already listed under "Added" in `Unreleased`, **update that existing entry** instead of adding a new one under "Changed"
- Keep entries focused on user impact, not implementation details
- Multiple task entries will be grouped together under the same PR
- PR changelogs are then compiled into release notes between versions

### Format

Brief description of the user-facing change (PR: #PR_NUMBER_PLACEHOLDER)

### Example (User-Facing)

**Good:**

- Added support for custom retry strategies in API client with argument `retry` for `Client.__init__`. (PR: #13)
- Fixed timeout errors when processing large datasets in `func` (PR: #4)

**Bad (not user-focused):**

- Refactored retry logic into separate module
- Updated error handling in data_processor.py

## Docstrings

Write docstrings for **users**, not about your implementation process.

### Rules

- Describe what the code does and how to use it
- Explain parameters, return values, and behavior
- Never write narratives: "I did...", "First we...", "Then I..."
- Never include quality claims: "rigorously tested", "well-optimized"
- Omit implementation details users don't need

### Bad (narrative, claims, implementation details)

```
def calculate_average(numbers: list) -> float:
    """
    I implemented this to calculate averages. First I sum the numbers,
    then divide by count. Rigorously tested and optimized.
    """
```

### Good (clear, user-focused)

```
def calculate_average(numbers: list) -> float:
    """
    Calculate the arithmetic mean of numbers.

    Args:
        numbers: List of numeric values

    Returns:
        Average as float

    Raises:
        ValueError: If list is empty
    """
```
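
The "Good" example shows only a signature and docstring. As a minimal sketch of a body that would honor that docstring (the implementation and the `Sequence[float]` annotation are assumptions, not part of the original file):

```python
from typing import Sequence


def calculate_average(numbers: Sequence[float]) -> float:
    """
    Calculate the arithmetic mean of numbers.

    Args:
        numbers: Sequence of numeric values

    Returns:
        Average as float

    Raises:
        ValueError: If the sequence is empty
    """
    # Guard first so the documented ValueError is raised instead of ZeroDivisionError.
    if not numbers:
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)
```

The docstring stays user-facing: it states the contract, and the body is short enough that no implementation notes are needed.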

## Early-Release Status

**This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**

- Break APIs if it improves design
- Refactor poor implementations
- Remove technical debt as soon as you identify it
- Don't preserve bad patterns for compatibility reasons
- Focus on getting it right, not keeping it the same

We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.
2 changes: 1 addition & 1 deletion BENCHMARKS.md
@@ -4,7 +4,7 @@ This document provides detailed information, sources, and licensing for all benc

---

-## 1. AWS Multi-Agent Collaboration Scenario
+## 1. Multi-Agent Collaboration Scenario Benchmark (MACS Benchmark)

This benchmark is designed to test and evaluate the collaborative problem-solving capabilities of multi-agent systems. The implementation in this library provides the necessary code to set up and run these scenarios.

78 changes: 73 additions & 5 deletions CHANGELOG.md
@@ -9,27 +9,95 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

**Exceptions and Error Classification**

- Added `AgentError`, `EnvironmentError`, `UserError` exception hierarchy in `maseval.core.exceptions` for classifying execution failures by responsibility (PR: #13)
- Added `TaskExecutionStatus.AGENT_ERROR`, `ENVIRONMENT_ERROR`, `USER_ERROR`, `UNKNOWN_EXECUTION_ERROR` for fine-grained error classification enabling fair scoring (PR: #13)
- Added validation helpers: `validate_argument_type()`, `validate_required_arguments()`, `validate_no_extra_arguments()`, `validate_arguments_from_schema()` for tool implementers (PR: #13)
- Added `ToolSimulatorError` and `UserSimulatorError` exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
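
To illustrate the idea of classifying failures by responsibility, here is a minimal, self-contained sketch. It does not import `maseval`; the classes are stand-ins that borrow names from the entries above, `EnvError` is deliberately renamed so the sketch does not shadow Python's built-in `EnvironmentError`, and `Status`, its string values, and `classify` are assumptions made for illustration only.

```python
from enum import Enum


class Status(Enum):
    """Stand-in for the four execution statuses named in the changelog."""

    AGENT_ERROR = "agent_error"
    ENVIRONMENT_ERROR = "environment_error"
    USER_ERROR = "user_error"
    UNKNOWN_EXECUTION_ERROR = "unknown_execution_error"


class AgentError(Exception):
    """The agent is responsible, e.g. it called a tool with bad arguments."""


class EnvError(Exception):
    """A tool or piece of infrastructure failed (hypothetical name)."""


class UserError(Exception):
    """The user simulator failed."""


class ToolSimulatorError(EnvError):
    """Simulator-specific subclass that inherits the environment classification."""


def classify(exc: BaseException) -> Status:
    """Map an exception to the party responsible for it (hypothetical helper)."""
    if isinstance(exc, AgentError):
        return Status.AGENT_ERROR
    if isinstance(exc, EnvError):
        return Status.ENVIRONMENT_ERROR
    if isinstance(exc, UserError):
        return Status.USER_ERROR
    # Anything else is not attributable to a specific party.
    return Status.UNKNOWN_EXECUTION_ERROR
```

Because classification relies on `isinstance`, a subclass such as `ToolSimulatorError` can carry extra context while still counting as an environment failure, which is what lets a scorer avoid penalizing the agent for it.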

**Documentation**

- Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)

**Benchmarks**

- MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)

**Benchmark**

- Added `execution_loop()` method to `Benchmark` base class enabling iterative agent-user interaction (PR: #13)
- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility) (PR: #13)
- Added abstract `get_model_adapter(model_id, **kwargs)` method to `Benchmark` base class as universal model factory to be used throughout the benchmarks. (PR: #13)

**User**

- Added `max_turns` and `stop_token` parameters to `User` base class for multi-turn support with early stopping. Same applied to `UserLLMSimulator`. (PR: #13)
- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to `User` base class (PR: #13)
- Added `get_initial_query()` method to `User` base class for LLM-generated initial messages (PR: #13)
- Added `initial_query` parameter in `User` base class to trigger the agentic system. (PR: #13)
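
A rough picture of how the `Benchmark` and `User` entries above could fit together. Everything in this sketch (`SketchUser`, `respond`, the free function `execution_loop`, the `</stop>` token) is a hypothetical stand-in rather than the MASEval implementation; only the names `max_turns`, `stop_token`, `max_invocations`, and `is_done()` come from the changelog.

```python
from typing import Callable, List, Optional


class SketchUser:
    """Hypothetical scripted user with a turn limit and a stop token."""

    def __init__(
        self,
        replies: List[str],
        max_turns: int = 3,
        stop_token: Optional[str] = "</stop>",
    ) -> None:
        self._replies = list(replies)
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self._stopped = False

    def is_done(self) -> bool:
        # Done when the stop token was seen or the turn budget is used up.
        return self._stopped or self.turns >= self.max_turns

    def respond(self, agent_answer: str) -> str:
        # A real simulator would read agent_answer; this one replays a script.
        reply = self._replies.pop(0) if self._replies else ""
        self.turns += 1
        if self.stop_token and self.stop_token in reply:
            self._stopped = True
            reply = reply.replace(self.stop_token, "").strip()
        return reply


def execution_loop(
    run_agents: Callable[[str], str],
    user: SketchUser,
    initial_query: str,
    max_invocations: int = 1,
) -> List[str]:
    """Alternate agent and user turns until the user is done or the budget ends."""
    answers: List[str] = []
    query = initial_query
    for _ in range(max_invocations):
        answers.append(run_agents(query))
        query = user.respond(answers[-1])
        if user.is_done():
            break
    return answers
```

With `max_invocations=1` the loop degenerates to a single agent call, which matches the changelog's note that the default of 1 keeps the previous single-shot behavior.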

**Environment**

- Added `Environment.get_tool(name)` method for single-tool lookup (PR: #13)

**Interface**

- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
- Supports async workflow execution with proper event loop handling
-- Added a new example: The `5_a_day_benchmark` (PR: #10)
- The `logs` property inside `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly filled. (PR: #3)

**Examples**

- Added a new example: The `5_a_day_benchmark` (PR: #10)

### Changed

-- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
-- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
**Exception Handling**

- Benchmark now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of generic `TASK_EXECUTION_FAILED` (PR: #13)
- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure (PR: #13)
- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure (PR: #13)

**Environment**

- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list` (PR: #13)
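
Returning a name-keyed dictionary is what makes a single-tool lookup such as `get_tool(name)` cheap. A minimal sketch of that shape, using a hypothetical `SketchEnvironment` with toy tools; whether the real `get_tool` raises or returns `None` for an unknown name is not stated in the changelog, so raising `KeyError` here is an assumption:

```python
from typing import Any, Dict


class SketchEnvironment:
    """Hypothetical environment that stores its tools by name."""

    def __init__(self) -> None:
        self.tools: Dict[str, Any] = self.create_tools()

    def create_tools(self) -> Dict[str, Any]:
        # Keys are tool names; values are any callable tool objects.
        return {
            "add": lambda a, b: a + b,
            "echo": lambda text: text,
        }

    def get_tool(self, name: str) -> Any:
        # Assumed behavior: fail loudly and list the valid names.
        try:
            return self.tools[name]
        except KeyError:
            available = ", ".join(sorted(self.tools))
            raise KeyError(f"Unknown tool {name!r}; available: {available}") from None
```

Code that previously iterated over a list of tools can iterate over `env.tools.values()` instead.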

**Benchmark**

- `Benchmark.run_agents()` signature changed: added `query: str` parameter (PR: #13)
- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles (PR: #13)
- `Benchmark` class now has a `fail_on_setup_error` flag that raises errors observed during setup of task (PR: #10)

**Callback**

- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.

**Evaluator**

- The `Evaluator` class now has a `filter_traces` base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).

**Simulator**

- The `LLMSimulator` now throws an exception when json cannot be decoded instead of returning the error message as text to the agent (PR: #13).
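
The difference between returning an error message as text and raising can be sketched as follows. `SimulatorOutputError` and `parse_simulator_output` are hypothetical names, not MASEval APIs; only `json.loads` and `json.JSONDecodeError` are real standard-library calls.

```python
import json
from typing import Any, Dict


class SimulatorOutputError(Exception):
    """Hypothetical error for simulator output that is not valid JSON."""


def parse_simulator_output(raw: str) -> Dict[str, Any]:
    """Parse simulator output, raising instead of passing error text to the agent."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Chain the original error so the position of the bad character is kept.
        raise SimulatorOutputError(f"Simulator returned invalid JSON: {exc.msg}") from exc
    if not isinstance(parsed, dict):
        raise SimulatorOutputError("Simulator output must be a JSON object")
    return parsed
```

Raising keeps a simulator failure out of the agent's context, so the run can be classified as an infrastructure problem rather than scored as an agent mistake.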

**Other**

- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
- Improved Quick Start Guide in `docs/getting-started/quickstart.md`. (PR: #10)
- `maseval.interface.agents` structure changed. Tools requiring framework imports (beyond just typing) now in `<framework>_optional.py` and imported dynamically from `<framework>.py`. (PR: #12)
- Various formatting improvements in the documentation (PR: #12)
- Added documentation for View Source Code pattern in `CONTRIBUTING.md` and `_optional.py` pattern in interface README (PR: #12)
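
As a general illustration of loading framework-dependent code only when it is needed (not the actual contents of any `<framework>_optional.py` file), a dynamic import with a helpful error might look like this. `load_optional` and its `extra` argument are hypothetical; `importlib.import_module` is the standard-library call.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import a module on demand and explain how to fix a missing dependency."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the optional dependency group {extra!r}."
        ) from exc
```

Deferring the import this way means that importing the package itself does not fail when an optional framework is absent; the error surfaces only when the framework-specific feature is used.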

### Fixed

**Interface**

- `LlamaIndexAgentAdapter` now supports multiple LlamaIndex agent types including `ReActAgent` (workflow-based), `FunctionAgent`, and legacy agents by checking for `.chat()`, `.query()`, and `.run()` methods in priority order (PR: #10)
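
Checking for methods "in priority order" amounts to duck-typed dispatch. The sketch below uses toy stand-in classes instead of LlamaIndex agents and ignores that some real entry points are asynchronous; `call_agent` and `_METHOD_PRIORITY` are hypothetical names.

```python
from typing import Any

# Order taken from the changelog entry: .chat(), then .query(), then .run().
_METHOD_PRIORITY = ("chat", "query", "run")


def call_agent(agent: Any, message: str) -> Any:
    """Call the first supported entry point found on the agent."""
    for name in _METHOD_PRIORITY:
        method = getattr(agent, name, None)
        if callable(method):
            return method(message)
    raise TypeError(
        f"{type(agent).__name__} has none of the methods {', '.join(_METHOD_PRIORITY)}"
    )


class ChatAndRunAgent:
    """Toy agent exposing two entry points; chat should win."""

    def chat(self, message: str) -> str:
        return f"chat:{message}"

    def run(self, message: str) -> str:
        return f"run:{message}"


class RunOnlyAgent:
    """Toy agent exposing only the lowest-priority entry point."""

    def run(self, message: str) -> str:
        return f"run:{message}"
```

An object with several entry points is always driven through the highest-priority one, so adding a new agent type only requires that it expose at least one of the three methods.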

**Other**

- Consistent naming of agent `adapter` over `wrapper` (PR: #3)
-- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shwon in documentation properly (#PR: 12)
+- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shown in documentation properly (#PR: 12)

### Removed

11 changes: 11 additions & 0 deletions docs/benchmark/index.md
@@ -0,0 +1,11 @@
# Benchmarks

MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.

## Adding Custom Benchmarks

You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.

## Licensing

For detailed source and licensing information for each benchmark's data, see [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md).
42 changes: 42 additions & 0 deletions docs/benchmark/macs.md
@@ -0,0 +1,42 @@
# MACS: Multi-Agent Collaboration Scenarios

The **Multi-Agent Collaboration Scenarios (MACS)** benchmark evaluates how well multi-agent systems collaborate to solve complex enterprise tasks across multiple domains.

## Overview

[Multi-Agent Collaboration Scenarios (MACS)](https://arxiv.org/abs/2412.05449) is designed to test collaborative problem-solving in realistic enterprise scenarios. The benchmark includes tasks spanning multiple domains such as travel planning, retail, and more. Each task involves multiple agents that must coordinate their actions to achieve user goals.

Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.

## Quick Start

```python
from maseval.benchmark.macs import (
    MACSBenchmark, MACSEnvironment, MACSEvaluator, MACSGenericTool,
    load_tasks, load_agent_config,
)

# Load data
tasks = load_tasks("travel", limit=5)
agent_config = load_agent_config("travel")

# Create your framework-specific benchmark subclass
class MyMACSBenchmark(MACSBenchmark):
    def setup_agents(self, agent_data, environment, task, user):
        # Your framework-specific agent creation
        ...

# Run
benchmark = MyMACSBenchmark(agent_data=agent_config, model=my_model)
results = benchmark.run(tasks)
```

::: maseval.benchmark.macs.MACSBenchmark

::: maseval.benchmark.macs.MACSUser

::: maseval.benchmark.macs.MACSEnvironment

::: maseval.benchmark.macs.MACSEvaluator

::: maseval.benchmark.macs.MACSGenericTool
9 changes: 9 additions & 0 deletions docs/examples/index.md
@@ -0,0 +1,9 @@
# Examples

Learn MASEval through hands-on examples covering common use cases and benchmarks.

| Example | Description |
| -------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| [Tutorial](tutorial.ipynb) | Introduction to MASEval's core concepts and basic usage |
| [Five-a-Day Benchmark](five_a_day_benchmark.ipynb) | Building a custom benchmark from scratch |
| [Multi-Agent Collaboration Scenario Benchmark (MACS)](https://github.com/parameterlab/MASEval/blob/main/examples/macs_benchmark/macs_benchmark.py) | An adaptation of the `maseval.benchmark.MACSBenchmark`. |