Commit 8dff0f9

Merge remote-tracking branch 'origin/main' into implement-multi-agent-bench
2 parents ebc0961 + c02eed0 commit 8dff0f9

76 files changed

Lines changed: 5792 additions & 1035 deletions

.pre-commit-config.yaml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
repos:
2+
  - repo: https://github.com/astral-sh/ruff-pre-commit
3+
    rev: v0.14.0 # Keep in sync with pyproject.toml
4+
    hooks:
5+
      - id: ruff-check
6+
        args: [--fix, --ignore, "F401,F841"]
7+
      - id: ruff-format

AGENTS.md

Lines changed: 80 additions & 0 deletions
@@ -387,6 +387,86 @@ def calculate_average(numbers: list) -> float:
387387
"""
388388
```
389389

390+
### mkdocs Rendering
391+
392+
This project uses mkdocstrings to render docstrings as HTML. Follow these rules to ensure proper rendering:
393+
394+
**Lists require a blank line before them:**
395+
396+
```python
397+
# Bad - renders as one paragraph
398+
"""Subclasses must provide:
399+
- method_one(): Description
400+
- method_two(): Description
401+
"""
402+
403+
# Good - renders as proper bullet list
404+
"""Subclasses must provide:
405+
406+
- `method_one()` - Description
407+
- `method_two()` - Description
408+
"""
409+
```
410+
411+
**Return descriptions must be single-line** (multi-line creates multiple table rows):
412+
413+
```python
414+
# Bad
415+
"""
416+
Returns:
417+
    TerminationReason indicating why is_done() returns True,
418+
    or NOT_TERMINATED if the interaction is still ongoing.
419+
"""
420+
421+
# Good
422+
"""
423+
Returns:
424+
    Why `is_done()` returns True, or `NOT_TERMINATED` if still ongoing.
425+
"""
426+
```
427+
428+
**For dictionary returns, document fields in the docstring body** using "Output fields:":
429+
430+
```python
431+
# Bad - creates multiple table rows in Returns
432+
"""
433+
Returns:
434+
    Dictionary containing:
435+
    - `name` - User identifier
436+
    - `profile` - User profile data
437+
"""
438+
439+
# Good - fields in body, single-line Returns
440+
"""
441+
Gather execution traces from this user.
442+
443+
Output fields:
444+
445+
- `name` - User identifier
446+
- `profile` - User profile data
447+
- `message_count` - Number of messages in history
448+
449+
Returns:
450+
    Dictionary containing user state and interaction data.
451+
"""
452+
```
453+
454+
**HTML-like strings must be in backticks** (otherwise stripped as HTML):
455+
456+
```python
457+
# Bad - </stop> disappears
458+
"""Uses "</stop>" to signal satisfaction."""
459+
460+
# Good
461+
"""Uses `"</stop>"` to signal satisfaction."""
462+
```
463+
464+
**Use backticks for code references** - method names, parameters, and values: `` `is_done()` ``, `` `stop_tokens` ``, `` `None` ``
465+
466+
## Seeding for Reproducibility
467+
468+
MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade from a global seed through all components, ensuring deterministic behavior when model providers support seeding. Study the code and documentation of `Benchmark` and `DefaultSeedGenerator` to understand how it works.
469+
390470
## Early-Release Status
391471

392472
**This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
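
As an illustration of the mkdocs Rendering rules added to AGENTS.md in this hunk, the sketch below combines them in one docstring: a blank line before the list, fields documented under "Output fields:", a single-line `Returns`, and backticks around the HTML-like string. The function `gather_traces` and its fields are hypothetical and are not part of MASEval.

```python
def gather_traces(name: str, messages: list[str]) -> dict:
    """Gather execution traces for one user (hypothetical example).

    Uses `"</stop>"` to detect the end of a conversation.

    Output fields:

    - `name` - User identifier
    - `message_count` - Number of messages in history
    - `stopped` - Whether the last message contains `"</stop>"`

    Args:
        name: User identifier.
        messages: Message history, oldest first.

    Returns:
        Dictionary containing user state and interaction data.
    """
    return {
        "name": name,
        "message_count": len(messages),
        # An empty history can never be "stopped".
        "stopped": bool(messages) and "</stop>" in messages[-1],
    }
```

How mkdocstrings renders this depends on the project's configured docstring style, so treat the layout as a starting point and check the built HTML.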

CHANGELOG.md

Lines changed: 29 additions & 1 deletion
@@ -9,8 +9,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
### Added
1111

12+
**Seeding System**
13+
14+
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
15+
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
16+
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
17+
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
18+
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
19+
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
20+
21+
**Interface**
22+
23+
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
24+
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
25+
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
26+
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
27+
1228
### Changed
1329

30+
**User**
31+
32+
- Refactored `User` class into abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
33+
- Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
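
To make the `User` refactor above concrete, here is a minimal sketch of an abstract interface with a scripted, non-LLM implementation. Only the three method names come from the changelog entry. The signatures, the `ScriptedUser` class, and its behaviour are assumptions made for illustration and may differ from MASEval's actual classes.

```python
from abc import ABC, abstractmethod


class User(ABC):
    """Interface sketch: method names from the changelog, signatures assumed."""

    @abstractmethod
    def get_initial_query(self) -> str: ...

    @abstractmethod
    def respond(self, message: str) -> str: ...

    @abstractmethod
    def is_done(self) -> bool: ...


class ScriptedUser(User):
    """Hypothetical non-LLM user that replays a fixed list of turns."""

    def __init__(self, turns: list[str]) -> None:
        if not turns:
            raise ValueError("turns must contain at least the initial query")
        self._turns = list(turns)
        self._next = 1  # index of the next scripted reply

    def get_initial_query(self) -> str:
        return self._turns[0]

    def respond(self, message: str) -> str:
        # The agent's message is ignored; the script decides the reply.
        if self.is_done():
            raise RuntimeError("script exhausted")
        reply = self._turns[self._next]
        self._next += 1
        return reply

    def is_done(self) -> bool:
        return self._next >= len(self._turns)
```

A scripted user like this needs no model at all, which is the kind of implementation the refactor is meant to allow.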
34+
35+
**Interface**
36+
37+
- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
38+
  - `SmolAgentUser` → `SmolAgentLLMUser`
39+
  - `LangGraphUser` → `LangGraphLLMUser`
40+
  - `LlamaIndexUser` → `LlamaIndexLLMUser`
41+
1442
### Fixed
1543

1644
### Removed
@@ -126,7 +154,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
126154

127155
**Interface**
128156

129-
- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
157+
- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexLLMUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
130158
- The `logs` properties of `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly populated. (PR: #3)
131159

132160
**Examples**

CONTRIBUTING.md

Lines changed: 12 additions & 0 deletions
@@ -85,6 +85,18 @@ ruff check . --fix
8585

8686
If you haven't activated your virtual environment, you can use `uv run ruff format .` and `uv run ruff check . --fix` instead.
8787

88+
For convenience, you can enable **pre-commit hooks** to automatically format and lint code on every commit:
89+
90+
```bash
91+
uv run pre-commit install
92+
```
93+
94+
This is optional—CI will catch any issues regardless. But if enabled, the hooks will:
95+
- **Format** code with `ruff format` (using project settings from `pyproject.toml`)
96+
- **Lint and auto-fix** issues with `ruff check --fix`
97+
98+
> **Note**: The pre-commit hooks intentionally skip removing unused imports (`F401`) and unused variables (`F841`) to avoid disrupting work-in-progress code. Run `uv run ruff check . --fix` manually before opening a PR to clean these up.
99+
88100
### 3. Dependency Management
89101

90102
Dependencies are defined in `pyproject.toml` and locked in `uv.lock`. Understanding the different dependency types is important:

README.md

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ Examples are available in the [Documentation](https://maseval.readthedocs.io/en/
114114

115115
## Contribute
116116

117-
We welcome any contributions. Please read the [CONTRIBUTING.md](https://github.com/parameterlab/MASEval/tree/fix-porting-issue?tab=contributing-ov-file) file to learn more!
117+
We welcome any contributions. Please read the [CONTRIBUTING.md](CONTRIBUTING.md) file to learn more!
118118

119119
## Benchmarks
120120

docs/getting-started/faq.md

Lines changed: 16 additions & 2 deletions
@@ -1,5 +1,19 @@
11
# FAQ
22

3-
## Q: Test
3+
## Q: Who is this library for?
44

5-
## A: Test
5+
Anyone! We had a few groups in mind when building MASEval.
6+
7+
1. **Benchmark Developers**: Researchers proposing new benchmarks for multi-agent systems can use MASEval to handle all the boilerplate.
8+
2. **Benchmark Consumers**: Researchers studying multi-agent systems can use MASEval as a unified interface across different benchmarks.
9+
3. **System Comparison**: Developers who want to test different agentic systems against each other can do so with MASEval.
10+
11+
## Q: I am looking for a specific feature, but I cannot find it.
12+
13+
1. Check this documentation.
14+
2. If the feature does not exist, please [open an issue on GitHub](https://github.com/parameterlab/MASEval/issues/new). Feature requests are welcome.
15+
3. Consider implementing it yourself. Check out the [contributing guide](contributing.md) for details.
16+
17+
## Q: Can I only test multi-agent systems?
18+
19+
No. MASEval works well for single-agent systems too. We designed the library to handle the complexity of multi-agent systems, but single-agent evaluation is fully supported. You can even run model comparisons, for example GPT against Claude.

docs/getting-started/quickstart.md

Lines changed: 36 additions & 1 deletion
@@ -166,7 +166,41 @@ See the [Agent Adapters](../interface/agents/smolagents.md) documentation for th
166166

167167
### Existing Benchmarks
168168

169-
Pre-built benchmarks for established evaluation suites are coming soon.
169+
MASEval includes pre-built benchmarks for established evaluation suites. See the [Benchmarks](../benchmark/index.md) section for the full list.
170+
171+
**Using a default agent:** For quick evaluation or baseline comparisons, use the default benchmark class directly:
172+
173+
```python
174+
from maseval.benchmark.tau2 import (
175+
    DefaultAgentTau2Benchmark, load_tasks, ensure_data_exists,
176+
)
177+
178+
ensure_data_exists(domain="retail")
179+
tasks = load_tasks("retail", split="base", limit=5)
180+
181+
benchmark = DefaultAgentTau2Benchmark(
182+
    agent_data={"model_id": "gpt-4o"},
183+
    n_task_repeats=4,
184+
)
185+
results = benchmark.run(tasks)
186+
```
187+
188+
**Plugging in your own agent:** Subclass the base benchmark to use your own agent implementation:
189+
190+
```python
191+
from maseval.benchmark.tau2 import Tau2Benchmark, load_tasks
192+
193+
class MyTau2Benchmark(Tau2Benchmark):
194+
    def setup_agents(self, agent_data, environment, task, user):
195+
        tools = environment.tools
196+
        # Create your agent with these tools
197+
        ...
198+
199+
benchmark = MyTau2Benchmark(agent_data={}, n_task_repeats=4)
200+
results = benchmark.run(tasks)
201+
```
202+
203+
The base class handles environment setup, user simulation, and evaluation—you only implement `setup_agents()` and `run_agents()`.
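
The division of labour described above, where the base class drives the loop and the subclass supplies `setup_agents()` and `run_agents()`, is a template-method pattern. The sketch below uses a toy `ToyBenchmark` base class written for this note. It is not MASEval's `Tau2Benchmark`, and the method signatures, task fields, and result format are simplified assumptions.

```python
class ToyBenchmark:
    """Stand-in base class (not MASEval's): owns the per-task loop."""

    def __init__(self, agent_data: dict, n_task_repeats: int = 1) -> None:
        self.agent_data = agent_data
        self.n_task_repeats = n_task_repeats

    # Hooks a subclass is expected to provide.
    def setup_agents(self, agent_data, task):
        raise NotImplementedError

    def run_agents(self, agents, task):
        raise NotImplementedError

    # Shared logic that subclasses get for free.
    def evaluate(self, task, answer) -> bool:
        return answer == task["expected"]

    def run(self, tasks):
        results = []
        for task in tasks:
            for repeat in range(self.n_task_repeats):
                agents = self.setup_agents(self.agent_data, task)
                answer = self.run_agents(agents, task)
                results.append(
                    {
                        "task": task["id"],
                        "repeat": repeat,
                        "passed": self.evaluate(task, answer),
                    }
                )
        return results


class EchoBenchmark(ToyBenchmark):
    """Hypothetical subclass: one 'agent' that prefixes the query."""

    def setup_agents(self, agent_data, task):
        return [lambda text: agent_data["prefix"] + text]

    def run_agents(self, agents, task):
        return agents[0](task["query"])
```

For the real hook signatures, refer to the `Tau2Benchmark` source and the Benchmarks section of the documentation.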
170204

171205
---
172206

@@ -187,6 +221,7 @@ Topic-based discussions covering specific features and best practices:
187221

188222
- [Message Tracing](../guides/message-tracing.md) — Capture and analyze agent conversations
189223
- [Configuration Gathering](../guides/config-gathering.md) — Collect reproducible experiment configurations
224+
- [Seeding](../guides/seeding.md) — Enable reproducible benchmark runs
190225

191226
### Reference
192227

docs/guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices.
77
| [Message Tracing](message-tracing.md) | Capture and inspect agent conversations during benchmark runs |
88
| [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility |
99
| [Exception Handling](exception-handling.md) | Distinguish agent errors from infrastructure failures |
10+
| [Seeding](seeding.md) | Enable reproducible benchmark runs with deterministic seeds |
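
The seeding entries in this commit describe seeds that cascade from a global seed via SHA-256-based derivation, but the diff does not show the scheme itself. The sketch below is one plausible way to derive per-component seeds. The function name `derive_seed`, the path format, and the 31-bit truncation are assumptions and may not match `DefaultSeedGenerator`.

```python
import hashlib


def derive_seed(global_seed: int, *path: str) -> int:
    """Derive a child seed from a global seed and a component path.

    Assumed scheme, not MASEval's: hash "<seed>/<part>/<part>..." with
    SHA-256 and keep 31 bits of the digest.
    """
    key = "/".join([str(global_seed), *path]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep 31 bits so the value fits APIs that expect a non-negative int32.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# The same inputs always give the same seed, independent of call order.
agent_seed = derive_seed(42, "task-0", "repeat-1", "agent")
user_seed = derive_seed(42, "task-0", "repeat-1", "user")
```

Hashing a path rather than drawing from a shared random stream means adding or reordering components does not shift the seeds of the others.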
