maseval · cemde · Dec 5, 2025 · Dec 3, 2025 · Dec 3, 2025 · Dec 3, 2025
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -29,9 +29,30 @@ jobs:
         run: |
           uv run pytest -m core -v
 
+  test-benchmark:
+    name: Benchmark Tests
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          pip install uv
+          uv sync --group dev
+      - name: Run benchmark tests
+        run: |
+          uv run pytest -m benchmark -v
+
   test-all:
     name: All Tests (With Optional Deps)
-    needs: test-core
+    needs: [test-core, test-benchmark]
     runs-on: ubuntu-latest
     strategy:
       matrix:

diff --git a/AGENTS.md b/AGENTS.md
@@ -30,10 +30,10 @@ uv run pytest tests/
 
 ```bash
 # Format code
-ruff format .
+uv run ruff format .
 
 # Lint and auto-fix issues
-ruff check . --fix
+uv run ruff check . --fix
 ```
 
 ## Testing Instructions
@@ -45,14 +45,14 @@ ruff check . --fix
 
 ```bash
 # Run all tests
-pytest -v
+uv run pytest -v
 
 # Core tests only (minimal dependencies)
-pytest -m core -v
+uv run pytest -m core -v
 
 # Specific integration tests
-pytest -m smolagents -v
-pytest -m interface -v
+uv run pytest -m smolagents -v
+uv run pytest -m interface -v
 ```
 
 ## Dependency Management
@@ -209,7 +209,7 @@ Example workflow:
 uv sync --all-extras --all-groups
 
 # Before committing
-ruff format . && ruff check . --fix && pytest -v
+uv run ruff format . && uv run ruff check . --fix && uv run pytest -v
 
 # Run example
 uv run python examples/amazon_collab.py
@@ -221,7 +221,7 @@ uv sync --all-extras --all-groups
 uv add --optional <extra-name> <package-name>
 
 # Check specific test file
-pytest tests/test_core/test_agent.py -v
+uv run pytest tests/test_core/test_agent.py -v
 ```
 
 ## Type Hinting
@@ -239,4 +239,87 @@ For lists and dictionaries, use `Dict[...,...]`, `List[...]`, `Sequence[...]` et
 
 ## Changelog
 
-When the task is completed, add your changes to the Changelog.
+When you complete a task, document your changes in the Changelog. Multiple tasks contribute to a single PR, and PRs are compiled into release changelogs.
+
+### User-Facing Documentation
+
+Write changelog entries from the **user's perspective** - describe what the change means for someone using the library, not what you did internally. Focus on features, fixes, and improvements they'll notice or benefit from.
+
+### Task-Level Documentation
+
+Add an entry for your completed task under the `## Unreleased` section.
+
+### Important Rules
+
+- If you modified something already listed under "Added" in `Unreleased`, **update that existing entry** instead of adding a new one under "Changed"
+- Keep entries focused on user impact, not implementation details
+- Multiple task entries will be grouped together under the same PR
+- PR changelogs are then compiled into release notes between versions
+
+### Format
+
+Brief description of the user-facing change (PR: #PR_NUMBER_PLACEHOLDER)
+
+### Example (User-Facing)
+
+**Good:**
+
+- Added support for custom retry strategies in API client with argument `retry` for `Client.__init__`. (PR: #13)
+- Fixed timeout errors when processing large datasets in `func` (PR: #4)
+
+**Bad (not user-focused):**
+
+- Refactored retry logic into separate module
+- Updated error handling in data_processor.py
+
+## Docstrings
+
+Write docstrings for **users**, not about your implementation process.
+
+### Rules
+
+- Describe what the code does and how to use it
+- Explain parameters, return values, and behavior
+- Never write narratives: "I did...", "First we...", "Then I..."
+- Never include quality claims: "rigorously tested", "well-optimized"
+- Omit implementation details users don't need
+
+### Bad (narrative, claims, implementation details)
+
+```python
+def calculate_average(numbers: list) -> float:
+    """
+    I implemented this to calculate averages. First I sum the numbers,
+    then divide by count. Rigorously tested and optimized.
+    """
+```
+
+### Good (clear, user-focused)
+
+```python
+def calculate_average(numbers: list) -> float:
+    """
+    Calculate the arithmetic mean of numbers.
+
+    Args:
+        numbers: List of numeric values
+
+    Returns:
+        Average as float
+
+    Raises:
+        ValueError: If list is empty
+    """
+```
+
+## Early-Release Status
+
+**This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
+
+- Break APIs if it improves design
+- Refactor poor implementations
+- Remove technical debt as soon as you identify it
+- Don't preserve bad patterns for compatibility reasons
+- Focus on getting it right, not keeping it the same
+
+We have zero obligation to maintain backwards compatibility. If you find code messy, propose a fix.
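
The "Good" docstring in the AGENTS.md hunk above documents a `ValueError` for an empty list, but the example stops at the docstring. A minimal sketch of a body that honours that contract could look like the following. The implementation and the `Sequence[float]` annotation are assumptions for illustration, not code from this PR.

```python
from typing import Sequence


def calculate_average(numbers: Sequence[float]) -> float:
    """
    Calculate the arithmetic mean of numbers.

    Args:
        numbers: Sequence of numeric values

    Returns:
        Average as float

    Raises:
        ValueError: If the sequence is empty
    """
    # Fail loudly on empty input instead of surfacing a ZeroDivisionError.
    if not numbers:
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)
```

Raising `ValueError` explicitly keeps the documented behaviour and the actual behaviour in sync, which is the point of the docstring rules.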
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
@@ -4,7 +4,7 @@ This document provides detailed information, sources, and licensing for all benc
 
 ---
 
-## 1. AWS Multi-Agent Collaboration Scenario
+## 1. Multi-Agent Collaboration Scenario Benchmark (MACS Benchmark)
 
 This benchmark is designed to test and evaluate the collaborative problem-solving capabilities of multi-agent systems. The implementation in this library provides the necessary code to set up and run these scenarios.
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,27 +9,95 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+**Exceptions and Error Classification**
+
+- Added `AgentError`, `EnvironmentError`, `UserError` exception hierarchy in `maseval.core.exceptions` for classifying execution failures by responsibility (PR: #13)
+- Added `TaskExecutionStatus.AGENT_ERROR`, `ENVIRONMENT_ERROR`, `USER_ERROR`, `UNKNOWN_EXECUTION_ERROR` for fine-grained error classification enabling fair scoring (PR: #13)
+- Added validation helpers: `validate_argument_type()`, `validate_required_arguments()`, `validate_no_extra_arguments()`, `validate_arguments_from_schema()` for tool implementers (PR: #13)
+- Added `ToolSimulatorError` and `UserSimulatorError` exception subclasses for simulator-specific context while inheriting proper classification (PR: #13)
+
+**Documentation**
+
+- Added Exception Handling guide explaining error classification, fair scoring, and rerunning failed tasks (PR: #13)
+
+**Benchmarks**
+
+- MACS Benchmark: Multi-Agent Collaboration Scenarios benchmark (PR: #13)
+
+**Benchmark**
+
+- Added `execution_loop()` method to `Benchmark` base class enabling iterative agent-user interaction (PR: #13)
+- Added `max_invocations` constructor parameter to `Benchmark` (default: 1 for backwards compatibility) (PR: #13)
+- Added abstract `get_model_adapter(model_id, **kwargs)` method to `Benchmark` base class as a universal model factory for use throughout the benchmarks (PR: #13)
+
+**User**
+
+- Added `max_turns` and `stop_token` parameters to `User` base class for multi-turn support with early stopping. The same applies to `UserLLMSimulator`. (PR: #13)
+- Added `is_done()`, `_check_stop_token()`, and `increment_turn()` methods to `User` base class (PR: #13)
+- Added `get_initial_query()` method to `User` base class for LLM-generated initial messages (PR: #13)
+- Added `initial_query` parameter in `User` base class to trigger the agentic system. (PR: #13)
+
+**Environment**
+
+- Added `Environment.get_tool(name)` method for single-tool lookup (PR: #13)
+
+**Interface**
+
 - [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
-  - Supports async workflow execution with proper event loop handling
-- Added a new example: The `5_a_day_benchmark` (PR: #10)
 - The `logs` property inside `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly filled. (PR: #3)
 
+**Examples**
+
+- Added a new example: The `5_a_day_benchmark` (PR: #10)
+
 ### Changed
 
-- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
-- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
+**Exception Handling**
+
+- Benchmark now classifies execution errors into `AGENT_ERROR` (agent's fault), `ENVIRONMENT_ERROR` (tool/infra failure), `USER_ERROR` (user simulator failure), or `UNKNOWN_EXECUTION_ERROR` (unclassified) instead of generic `TASK_EXECUTION_FAILED` (PR: #13)
+- `ToolLLMSimulator` now raises `ToolSimulatorError` (classified as `ENVIRONMENT_ERROR`) on failure (PR: #13)
+- `UserLLMSimulator` now raises `UserSimulatorError` (classified as `USER_ERROR`) on failure (PR: #13)
+
+**Environment**
+
+- `Environment.create_tools()` now returns `Dict[str, Any]` instead of `list` (PR: #13)
+
+**Benchmark**
+
+- `Benchmark.run_agents()` signature changed: added `query: str` parameter (PR: #13)
+- `Benchmark.run()` now uses `execution_loop()` internally to handle agent-user interaction cycles (PR: #13)
 - `Benchmark` class now has a `fail_on_setup_error` flag that raises errors observed during setup of task (PR: #10)
+
+**Callback**
+
+- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing log files.
+
+**Evaluator**
+
 - The `Evaluator` class now has a `filter_traces` base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).
+
+**Simulator**
+
+- The `LLMSimulator` now raises an exception when JSON cannot be decoded instead of returning the error message as text to the agent (PR: #13).
+
+**Other**
+
+- Documentation formatting improved. Added darkmode and links to `Github` (PR: #11).
 - Improved Quick Start Guide in `docs/getting-started/quickstart.md`. (PR: #10)
 - `maseval.interface.agents` structure changed. Tools requiring framework imports (beyond just typing) now in `<framework>_optional.py` and imported dynamically from `<framework>.py`. (PR: #12)
 - Various formatting improvements in the documentation (PR: #12)
 - Added documentation for View Source Code pattern in `CONTRIBUTING.md` and `_optional.py` pattern in interface README (PR: #12)
 
 ### Fixed
 
+**Interface**
+
 - `LlamaIndexAgentAdapter` now supports multiple LlamaIndex agent types including `ReActAgent` (workflow-based), `FunctionAgent`, and legacy agents by checking for `.chat()`, `.query()`, and `.run()` methods in priority order (PR: #10)
+
+**Other**
+
 - Consistent naming of agent `adapter` over `wrapper` (PR: #3)
-- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shwon in documentation properly (#PR: 12)
+- Fixed an issue that `LiteLLM` interface and `Mixin`s were not shown in documentation properly (PR: #12)
 
 ### Removed
 

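
The `User` and `Benchmark` entries in the changelog above mention `max_turns`, `stop_token`, `is_done()` and a `max_invocations` limit for iterative agent-user interaction. The sketch below illustrates how these pieces could fit together. `SketchUser`, `run_loop`, the scripted replies and the `"</stop>"` token are all hypothetical stand-ins, and the real `execution_loop()` may behave differently.

```python
from typing import Callable, List, Optional


class SketchUser:
    """Hypothetical scripted user; not the maseval `User` class."""

    def __init__(self, replies: List[str], max_turns: int = 3,
                 stop_token: Optional[str] = "</stop>") -> None:
        self._replies = list(replies)
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self._stopped = False

    def respond(self, agent_message: str) -> str:
        # Scripted replies stand in for an LLM-simulated user.
        reply = self._replies[self.turns] if self.turns < len(self._replies) else ""
        self.turns += 1
        # Early stopping: the user signals satisfaction with the stop token.
        if self.stop_token and self.stop_token in reply:
            self._stopped = True
            reply = reply.replace(self.stop_token, "").strip()
        return reply

    def is_done(self) -> bool:
        return self._stopped or self.turns >= self.max_turns


def run_loop(agent: Callable[[str], str], user: SketchUser,
             initial_query: str, max_invocations: int = 1) -> List[str]:
    """Alternate agent and user until the user is done or the limit is hit."""
    answers: List[str] = []
    query = initial_query
    for _ in range(max_invocations):
        answer = agent(query)
        answers.append(answer)
        query = user.respond(answer)
        if user.is_done():
            break
    return answers
```

With `max_invocations=1` the agent runs exactly once, which matches the single-shot default described in the changelog.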
diff --git a/docs/benchmark/index.md b/docs/benchmark/index.md
@@ -0,0 +1,11 @@
+# Benchmarks
+
+MASEval includes pre-implemented benchmarks for evaluating multi-agent systems.
+
+## Adding Custom Benchmarks
+
+You can also create your own benchmarks by subclassing the [`Benchmark`](../reference/benchmark.md) class. See the [Five-a-Day example](../examples/five_a_day_benchmark.ipynb) for a complete walkthrough.
+
+## Licensing
+
+For detailed source and licensing information for each benchmark's data, see [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md).
diff --git a/docs/benchmark/macs.md b/docs/benchmark/macs.md
@@ -0,0 +1,42 @@
+# MACS: Multi-Agent Collaboration Scenarios
+
+The **Multi-Agent Collaboration Scenarios (MACS)** benchmark evaluates how well multi-agent systems collaborate to solve complex enterprise tasks across multiple domains.
+
+## Overview
+
+[Multi-Agent Collaboration Scenarios (MACS)](https://arxiv.org/abs/2412.05449) is designed to test collaborative problem-solving in realistic enterprise scenarios. The benchmark includes tasks spanning multiple domains such as travel planning, retail, and more. Each task involves multiple agents that must coordinate their actions to achieve user goals.
+
+Check out the [BENCHMARKS.md](https://github.com/parameterlab/MASEval/blob/main/BENCHMARKS.md) file for more information including licenses.
+
+## Quick Start
+
+```python
+from maseval.benchmark.macs import (
+    MACSBenchmark, MACSEnvironment, MACSEvaluator, MACSGenericTool,
+    load_tasks, load_agent_config,
+)
+
+# Load data
+tasks = load_tasks("travel", limit=5)
+agent_config = load_agent_config("travel")
+
+# Create your framework-specific benchmark subclass
+class MyMACSBenchmark(MACSBenchmark):
+    def setup_agents(self, agent_data, environment, task, user):
+        # Your framework-specific agent creation
+        ...
+
+# Run (`my_model` is a placeholder for your own model; it is not defined in this snippet)
+benchmark = MyMACSBenchmark(agent_data=agent_config, model=my_model)
+results = benchmark.run(tasks)
+```
+
+::: maseval.benchmark.macs.MACSBenchmark
+
+::: maseval.benchmark.macs.MACSUser
+
+::: maseval.benchmark.macs.MACSEnvironment
+
+::: maseval.benchmark.macs.MACSEvaluator
+
+::: maseval.benchmark.macs.MACSGenericTool
diff --git a/docs/examples/index.md b/docs/examples/index.md
@@ -0,0 +1,9 @@
+# Examples
+
+Learn MASEval through hands-on examples covering common use cases and benchmarks.
+
+| Example                                                                                                                                            | Description                                             |
+| -------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| [Tutorial](tutorial.ipynb)                                                                                                                         | Introduction to MASEval's core concepts and basic usage |
+| [Five-a-Day Benchmark](five_a_day_benchmark.ipynb)                                                                                                 | Building a custom benchmark from scratch                |
+| [Multi-Agent Collaboration Scenario Benchmark (MACS)](https://github.com/parameterlab/MASEval/blob/main/examples/macs_benchmark/macs_benchmark.py) | An adaptation of the `maseval.benchmark.MACSBenchmark`. |