or NOT_TERMINATED if the interaction is still ongoing.
+"""
+
+# Good
+"""
+Returns:
+Why `is_done()` returns True, or `NOT_TERMINATED` if still ongoing.
+"""
+```
+
+**For dictionary returns, document fields in the docstring body** using "Output fields:":
+
+```python
+# Bad - creates multiple table rows in Returns
+"""
+Returns:
+Dictionary containing:
+- `name` - User identifier
+- `profile` - User profile data
+"""
+
+# Good - fields in body, single-line Returns
+"""
+Gather execution traces from this user.
+
+Output fields:
+
+- `name` - User identifier
+- `profile` - User profile data
+- `message_count` - Number of messages in history
+
+Returns:
+Dictionary containing user state and interaction data.
+"""
+```
+
+**HTML-like strings must be in backticks** (otherwise stripped as HTML):
+
+```python
+# Bad - </stop> disappears
+"""Uses "</stop>" to signal satisfaction."""
+
+# Good
+"""Uses `"</stop>"` to signal satisfaction."""
+```
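A rough check for this rule could be automated. The sketch below is not part of MASEval: `find_bare_tags` is a hypothetical helper, and it only understands single-backtick inline spans (not double-backtick spans or fenced blocks). It flags HTML-like tokens that appear outside backticks in a docstring.

```python
import re

# Inline code spans delimited by single backticks.
_CODE_SPAN = re.compile(r"`[^`]*`")
# Tokens that look like HTML tags, e.g. <stop> or </stop>.
_TAG = re.compile(r"</?[A-Za-z][\w-]*\s*/?>")


def find_bare_tags(docstring: str) -> list[str]:
    """Return HTML-like tokens that are not wrapped in backticks (hypothetical helper)."""
    # Remove code spans first so tags inside them are not reported.
    without_code = _CODE_SPAN.sub("", docstring)
    return _TAG.findall(without_code)


bad = find_bare_tags('Uses "</stop>" to signal satisfaction.')
good = find_bare_tags('Uses `"</stop>"` to signal satisfaction.')
```

Here `bad` contains the unprotected tag and `good` is empty, mirroring the Bad and Good docstrings above.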
+
+**Use backticks for code references** - method names, parameters, and values: `` `is_done()` ``, `` `stop_tokens` ``, `` `None` ``
+
+## Seeding for Reproducibility
+
+MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade from a global seed through all components, ensuring deterministic behavior when model providers support seeding. Study the code and documentation of `Benchmark` and `DefaultSeedGenerator` to understand how it works.
+
 ## Early-Release Status

 **This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
CHANGELOG.md
Lines changed: 29 additions & 1 deletion
@@ -9,8 +9,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added

+**Seeding System**
+
+- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
+- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
+- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
+- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
+- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
+- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
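To make the idea of SHA-256-based seed derivation concrete, here is a minimal sketch. It is not MASEval's `DefaultSeedGenerator`: the function name `derive_seed`, the path-style labels, and the truncation to 32 bits are all assumptions chosen for illustration.

```python
import hashlib


def derive_seed(global_seed: int, *path: str) -> int:
    """Derive a child seed from a global seed and a component path (hypothetical helper)."""
    # Join the global seed and the path labels into one string to hash.
    material = "/".join([str(global_seed), *path])
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    # Keep the first 4 bytes so the result fits in an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


# The same global seed and path always give the same child seed.
agent_seed = derive_seed(42, "task-0", "agents", "planner")
user_seed = derive_seed(42, "task-0", "user")
```

Hashing a path instead of drawing from a shared random stream means a component's seed does not depend on the order in which other components are set up, which is one plausible reason to prefer this style of derivation.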
+
+**Interface**
+
+- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
+- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
+- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
+- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
+
 ### Changed

+**User**
+
+- Refactored `User` class into abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
+- Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
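The interface named in the refactoring entry can be sketched as follows. Only the three method names come from the changelog. The signatures, return types, and the `ScriptedUser` class are assumptions for illustration and may differ from MASEval's actual `User`.

```python
from abc import ABC, abstractmethod


class User(ABC):
    """Stand-in for the abstract interface; signatures are assumed."""

    @abstractmethod
    def get_initial_query(self) -> str: ...

    @abstractmethod
    def respond(self, message: str) -> str: ...

    @abstractmethod
    def is_done(self) -> bool: ...


class ScriptedUser(User):
    """Hypothetical non-LLM user that replays a fixed list of turns."""

    def __init__(self, turns: list[str]) -> None:
        if not turns:
            raise ValueError("turns must contain at least the initial query")
        self._turns = list(turns)
        self._next = 0  # index of the next scripted turn to emit

    def get_initial_query(self) -> str:
        self._next = 1
        return self._turns[0]

    def respond(self, message: str) -> str:
        # The agent's message is ignored; the script decides the reply.
        if self.is_done():
            return ""
        reply = self._turns[self._next]
        self._next += 1
        return reply

    def is_done(self) -> bool:
        return self._next >= len(self._turns)


user = ScriptedUser(["Book a flight to Oslo.", "Economy, please."])
query = user.get_initial_query()
reply = user.respond("Which class?")
```

A scripted user like this needs no model calls, so it is deterministic without any seeding, which is the kind of non-LLM implementation the refactoring is meant to allow.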
+
+**Interface**
+
+- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
+  - `SmolAgentUser` → `SmolAgentLLMUser`
+  - `LangGraphUser` → `LangGraphLLMUser`
+  - `LlamaIndexUser` → `LlamaIndexLLMUser`
+
 ### Fixed

 ### Removed
@@ -126,7 +154,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 **Interface**

-- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
+- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexLLMUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
 - The `logs` property inside `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly filled. (PR: #3)
CONTRIBUTING.md
Lines changed: 12 additions & 0 deletions
@@ -85,6 +85,18 @@ ruff check . --fix

 If you haven't activated your virtual environment, you can use `uv run ruff format .` and `uv run ruff check . --fix` instead.

+For convenience, you can enable **pre-commit hooks** to automatically format and lint code on every commit:
+
+```bash
+uv run pre-commit install
+```
+
+This is optional; CI will catch any issues regardless. But if enabled, the hooks will:
+- **Format** code with `ruff format` (using project settings from `pyproject.toml`)
+- **Lint and auto-fix** issues with `ruff check --fix`
+
+> **Note**: The pre-commit hooks intentionally skip removing unused imports (`F401`) and unused variables (`F841`) to avoid disrupting work-in-progress code. Run `uv run ruff check . --fix` manually before opening a PR to clean these up.
+
 ### 3. Dependency Management

 Dependencies are defined in `pyproject.toml` and locked in `uv.lock`. Understanding the different dependency types is important:
README.md
Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ Examples are available in the [Documentation](https://maseval.readthedocs.io/en/

 ## Contribute

-We welcome any contributions. Please read the [CONTRIBUTING.md](https://github.com/parameterlab/MASEval/tree/fix-porting-issue?tab=contributing-ov-file) file to learn more!
+We welcome any contributions. Please read the [CONTRIBUTING.md](CONTRIBUTING.md) file to learn more!
Anyone! We had a few groups in mind when building MASEval.
+
+1. **Benchmark Developers**: Researchers proposing new benchmarks for multi-agent systems can use MASEval to handle all the boilerplate.
+2. **Benchmark Consumers**: Researchers studying multi-agent systems can use MASEval as a unified interface across different benchmarks.
+3. **System Comparison**: Developers who want to test different agentic systems against each other can do so with MASEval.
+
+## Q: I am looking for a specific feature, but I cannot find it.
+
+1. Check this documentation.
+2. If the feature does not exist, please [open an issue on GitHub](https://github.com/parameterlab/MASEval/issues/new). Feature requests are welcome.
+3. Consider implementing it yourself. Check out the [contributing guide](contributing.md) for details.
+
+## Q: Can I only test multi-agent systems?
+
+No. MASEval works well for single-agent systems too. We designed the library to handle the complexity of multi-agent systems, but single-agent evaluation is fully supported. You can even run model comparisons, for example GPT against Claude.