Merged
76 changes: 76 additions & 0 deletions AGENTS.md
Expand Up @@ -387,6 +387,82 @@ def calculate_average(numbers: list) -> float:
"""
```

### mkdocs Rendering

This project uses mkdocstrings to render docstrings as HTML. Follow these rules to ensure proper rendering:

**Lists require a blank line before them:**

```python
# Bad - renders as one paragraph
"""Subclasses must provide:
- method_one(): Description
- method_two(): Description
"""

# Good - renders as proper bullet list
"""Subclasses must provide:

- `method_one()` - Description
- `method_two()` - Description
"""
```

**Return descriptions must be single-line** (a multi-line description renders as multiple table rows):

```python
# Bad
"""
Returns:
    TerminationReason indicating why is_done() returns True,
    or NOT_TERMINATED if the interaction is still ongoing.
"""

# Good
"""
Returns:
    Why `is_done()` returns True, or `NOT_TERMINATED` if still ongoing.
"""
```

**For dictionary returns, document fields in the docstring body** using "Output fields:":

```python
# Bad - creates multiple table rows in Returns
"""
Returns:
    Dictionary containing:
    - `name` - User identifier
    - `profile` - User profile data
"""

# Good - fields in body, single-line Returns
"""
Gather execution traces from this user.

Output fields:

- `name` - User identifier
- `profile` - User profile data
- `message_count` - Number of messages in history

Returns:
    Dictionary containing user state and interaction data.
"""
```

**HTML-like strings must be in backticks** (otherwise stripped as HTML):

```python
# Bad - </stop> disappears
"""Uses "</stop>" to signal satisfaction."""

# Good
"""Uses `"</stop>"` to signal satisfaction."""
```

**Use backticks for code references** - method names, parameters, and values: `` `is_done()` ``, `` `stop_tokens` ``, `` `None` ``
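
The blank-line rule for lists can be checked mechanically. The following is a minimal sketch, not part of this project's tooling: the helper name `find_list_blank_line_violations` and the regular expression are assumptions made for illustration, and the check does not handle wrapped continuation lines inside a list item.

```python
import re

# Matches a Markdown bullet or numbered list item (illustrative pattern, an assumption).
_LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def find_list_blank_line_violations(docstring: str) -> list[int]:
    """Return 1-based line numbers of list items that directly follow a text line."""
    lines = docstring.splitlines()
    violations = []
    for index in range(1, len(lines)):
        current, previous = lines[index], lines[index - 1]
        starts_list = _LIST_ITEM.match(current) is not None
        # A violation needs a non-blank previous line that is not itself a list item.
        previous_is_text = previous.strip() != "" and _LIST_ITEM.match(previous) is None
        if starts_list and previous_is_text:
            violations.append(index + 1)
    return violations


bad = "Subclasses must provide:\n- method_one(): Description\n- method_two(): Description"
good = "Subclasses must provide:\n\n- `method_one()` - Description\n- `method_two()` - Description"

print(find_list_blank_line_violations(bad))   # [2]
print(find_list_blank_line_violations(good))  # []
```

A check like this could run over `__doc__` strings in a test, but whether it catches every rendering problem depends on the Markdown renderer in use.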

## Early-Release Status

**This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
16 changes: 15 additions & 1 deletion CHANGELOG.md
Expand Up @@ -9,8 +9,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

**Interface**

- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)

### Changed

**Interface**

- Renamed framework-specific `LLMUser` subclasses for clarity (PR: #22):
  - `SmolAgentUser` → `SmolAgentLLMUser`
  - `LangGraphUser` → `LangGraphLLMUser`
  - `LlamaIndexUser` → `LlamaIndexLLMUser`
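
Downstream code that still uses the old names could be updated mechanically. The sketch below is illustrative only: the `migrate_source` helper is hypothetical and not shipped with the library; only the old and new class names come from the list above.

```python
import re

# Old -> new class names, taken from the rename list above.
RENAMES = {
    "SmolAgentUser": "SmolAgentLLMUser",
    "LangGraphUser": "LangGraphLLMUser",
    "LlamaIndexUser": "LlamaIndexLLMUser",
}

# Whole-word match so that already-migrated names are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(source: str) -> str:
    """Replace whole-word occurrences of the old class names with the new ones."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


old = "from maseval.interface.agents.smolagents import SmolAgentUser"
print(migrate_source(old))
```

Because the pattern matches whole words only, running the helper twice gives the same result as running it once.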

### Fixed

### Removed
Expand Down Expand Up @@ -126,7 +140,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Interface**

- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexLLMUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
- The `logs` property of `SmolAgentAdapter` and `LanggraphAgentAdapter` is now properly populated. (PR: #3)

**Examples**
18 changes: 16 additions & 2 deletions docs/getting-started/faq.md
@@ -1,5 +1,19 @@
# FAQ

-## Q: Test
+## Q: Who is this library for?

-## A: Test
+Anyone! We had a few groups in mind when building MASEval.

1. **Benchmark Developers**: Researchers proposing new benchmarks for multi-agent systems can use MASEval to handle all the boilerplate.
2. **Benchmark Consumers**: Researchers studying multi-agent systems can use MASEval as a unified interface across different benchmarks.
3. **System Comparison**: Developers who want to test different agentic systems against each other can do so with MASEval.

## Q: I am looking for a specific feature, but I cannot find it.

1. Check this documentation.
2. If the feature does not exist, please [open an issue on GitHub](https://github.com/parameterlab/MASEval/issues/new). Feature requests are welcome.
3. Consider implementing it yourself. Check out the [contributing guide](contributing.md) for details.

## Q: Can I only test multi-agent systems?

No. MASEval works well for single-agent systems too. We designed the library to handle the complexity of multi-agent systems, but single-agent evaluation is fully supported. You can even run model comparisons, for example GPT against Claude.
34 changes: 34 additions & 0 deletions docs/interface/agents/camel.md
@@ -0,0 +1,34 @@
# CAMEL-AI

Adapter for the CAMEL-AI multi-agent framework.

- [Documentation](https://docs.camel-ai.org/)
- [Code Repository](https://github.com/camel-ai/camel)

## Installation

```bash
pip install maseval[camel]
```

Alternatively, install camel-ai directly:

```bash
pip install camel-ai
```

## API Reference

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/camel.py){ .md-source-file }

::: maseval.interface.agents.camel.CamelAgentAdapter

::: maseval.interface.agents.camel.CamelLLMUser

::: maseval.interface.agents.camel.CamelAgentUser

::: maseval.interface.agents.camel.camel_role_playing_execution_loop

::: maseval.interface.agents.camel.CamelRolePlayingTracer

::: maseval.interface.agents.camel.CamelWorkforceTracer
2 changes: 1 addition & 1 deletion docs/interface/agents/langgraph.md
Expand Up @@ -23,4 +23,4 @@ pip install langgraph

::: maseval.interface.agents.langgraph.LangGraphAgentAdapter

-::: maseval.interface.agents.langgraph.LangGraphUser
+::: maseval.interface.agents.langgraph.LangGraphLLMUser
2 changes: 1 addition & 1 deletion docs/interface/agents/llamaindex.md
Expand Up @@ -23,4 +23,4 @@ pip install llama-index-core

::: maseval.interface.agents.llamaindex.LlamaIndexAgentAdapter

-::: maseval.interface.agents.llamaindex.LlamaIndexUser
+::: maseval.interface.agents.llamaindex.LlamaIndexLLMUser
2 changes: 1 addition & 1 deletion docs/interface/agents/smolagents.md
Expand Up @@ -23,7 +23,7 @@ pip install smolagents

::: maseval.interface.agents.smolagents.SmolAgentAdapter

-::: maseval.interface.agents.smolagents.SmolAgentUser
+::: maseval.interface.agents.smolagents.SmolAgentLLMUser

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents_optional.py){ .md-source-file }

2 changes: 1 addition & 1 deletion docs/reference/environment.md
Expand Up @@ -14,4 +14,4 @@ Some agent adapters expose helper tools or user-simulation tools that can be use

::: maseval.interface.agents.smolagents.SmolAgentAdapter

-::: maseval.interface.agents.smolagents.SmolAgentUser
+::: maseval.interface.agents.smolagents.SmolAgentLLMUser
20 changes: 14 additions & 6 deletions docs/reference/user.md
@@ -1,27 +1,35 @@
# User

-In many real-world applications, Multi-Agent Systems (MAS) are designed to interact with human users to accomplish tasks. To effectively benchmark such systems, it is crucial to have a standardized way to simulate these interactions. The `User` class in MASEval provides this capability by acting as a programmable, LLM-driven user that can engage with the MAS in a realistic manner.
+In many real-world applications, Multi-Agent Systems (MAS) are designed to interact with human users to accomplish tasks. To effectively benchmark such systems, it is crucial to have a standardized way to simulate these interactions. MASEval provides this capability through a `User` hierarchy: the abstract `User` base class defines the interface, while `LLMUser` provides an LLM-driven implementation that can engage with the MAS in a realistic manner.

-The User is initialized with a persona and a scenario, both of which are typically defined within a Task. This tight integration allows for dynamic and context-aware simulations. For example, a Task might generate a random birthdate for the user. This birthdate is then passed to both the `User` and the `Evaluator`. The User will use this information in its conversation with the MAS, and the `Evaluator` will check if the MAS correctly processes and remembers this information. This mechanism enables the creation of sophisticated and reliable benchmarks that can assess the interactive capabilities of a MAS.
+The `LLMUser` is initialized with a persona and a scenario, both of which are typically defined within a Task. This tight integration allows for dynamic and context-aware simulations. For example, a Task might generate a random birthdate for the user. This birthdate is then passed to both the `LLMUser` and the `Evaluator`. The user will use this information in its conversation with the MAS, and the `Evaluator` will check if the MAS correctly processes and remembers this information. This mechanism enables the creation of sophisticated and reliable benchmarks that can assess the interactive capabilities of a MAS.
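
The birthdate example above can be pictured as one value shared by the user side and the evaluator side. The sketch below uses plain Python stand-ins rather than the MASEval classes: `SketchTask`, `make_task`, `user_message`, and `evaluate` are hypothetical names, and the date range is arbitrary.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SketchTask:
    """Hypothetical stand-in for a Task: holds data shared by user and evaluator."""

    birthdate: date


def make_task(rng: random.Random) -> SketchTask:
    # Draw a random birthdate between 1950-01-01 and 1999-12-31 (arbitrary range).
    start = date(1950, 1, 1)
    span_days = (date(1999, 12, 31) - start).days
    return SketchTask(birthdate=start + timedelta(days=rng.randint(0, span_days)))


def user_message(task: SketchTask) -> str:
    # The simulated user reveals the birthdate during the conversation.
    return f"My date of birth is {task.birthdate.isoformat()}."


def evaluate(task: SketchTask, system_output: str) -> bool:
    # The evaluator checks that the system's output contains the same birthdate.
    return task.birthdate.isoformat() in system_output


task = make_task(random.Random(0))
message = user_message(task)
print(evaluate(task, f"Booking confirmed for customer born {task.birthdate.isoformat()}."))  # True
print(evaluate(task, "Booking confirmed."))  # False
```

The point of the sketch is the data flow: because both sides read the same task field, the evaluator can verify a fact that only became known to the system through the conversation.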

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/user.py){ .md-source-file }

::: maseval.core.user.User

-::: maseval.core.user.AgenticUser
+::: maseval.core.user.LLMUser
+
+::: maseval.core.user.AgenticLLMUser

## Interfaces

Some integrations provide convenience user/tool implementations for specific agent frameworks. For example:

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/smolagents.py){ .md-source-file }

-::: maseval.interface.agents.smolagents.SmolAgentUser
+::: maseval.interface.agents.smolagents.SmolAgentLLMUser

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/langgraph.py){ .md-source-file }

-::: maseval.interface.agents.langgraph.LangGraphUser
+::: maseval.interface.agents.langgraph.LangGraphLLMUser

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/llamaindex.py){ .md-source-file }

-::: maseval.interface.agents.llamaindex.LlamaIndexUser
+::: maseval.interface.agents.llamaindex.LlamaIndexLLMUser

[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/agents/camel.py){ .md-source-file }

::: maseval.interface.agents.camel.CamelLLMUser

::: maseval.interface.agents.camel.CamelAgentUser
4 changes: 2 additions & 2 deletions examples/macs_benchmark/macs_benchmark.py
Expand Up @@ -148,7 +148,7 @@ class UserInputTool(SmolagentsTool):
     output_type = "string"

     def forward(self, question: str) -> str:
-        return user.simulate_response(question)
+        return user.respond(question)

 return UserInputTool()

Expand Down Expand Up @@ -393,7 +393,7 @@ def get_tool(self):

     def user_input(question: str) -> str:
         """Ask the user a question to understand their complete requirements."""
-        return self.simulate_response(question)
+        return self.respond(question)

     return StructuredTool.from_function(
         func=user_input,
6 changes: 3 additions & 3 deletions examples/tau2_benchmark/tau2_benchmark.py
Expand Up @@ -146,7 +146,7 @@ def get_tool(self) -> Dict[str, Any]:

     def ask_user(question: str) -> str:
         """Ask the customer a question to clarify their request or get additional information."""
-        return self.simulate_response(question)
+        return self.respond(question)

     return {"ask_user": ask_user}
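
The pattern in this hunk, a user object exposing its `respond` method as a named tool, can be sketched in isolation. `ScriptedUser` below is a hypothetical stand-in that returns canned answers instead of calling an LLM; only the `respond` and `get_tool` names mirror the code above.

```python
from typing import Callable, Dict


class ScriptedUser:
    """Hypothetical stand-in for a simulated user; not a MASEval class."""

    def __init__(self, answers: Dict[str, str]) -> None:
        self._answers = answers

    def respond(self, question: str) -> str:
        # Look up a canned answer instead of querying a model.
        return self._answers.get(question, "I'm not sure.")

    def get_tool(self) -> Dict[str, Callable[[str], str]]:
        def ask_user(question: str) -> str:
            """Ask the customer a question to clarify their request."""
            return self.respond(question)

        # The agent framework receives a plain callable keyed by tool name.
        return {"ask_user": ask_user}


user = ScriptedUser({"Which flight?": "The 9am flight to Berlin."})
tools = user.get_tool()
print(tools["ask_user"]("Which flight?"))
```

Keeping the tool a thin closure over `respond` means that renaming the underlying method, as this PR does, only touches one line per tool.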

Expand Down Expand Up @@ -345,7 +345,7 @@ class UserInputTool(SmolagentsTool):
     output_type = "string"

     def forward(self, question: str) -> str:
-        return user.simulate_response(question)
+        return user.respond(question)

 return UserInputTool()

Expand Down Expand Up @@ -506,7 +506,7 @@ def get_tool(self):

     def user_input(question: str) -> str:
         """Ask the customer a question to clarify their request."""
-        return self.simulate_response(question)
+        return self.respond(question)

     return StructuredTool.from_function(
         func=user_input,
4 changes: 3 additions & 1 deletion maseval/__init__.py
Expand Up @@ -33,7 +33,7 @@
     UserSimulatorError,
 )
 from .core.model import ModelAdapter, ChatResponse
-from .core.user import User, TerminationReason
+from .core.user import User, LLMUser, AgenticLLMUser, TerminationReason
 from .core.evaluator import Evaluator
 from .core.history import MessageHistory, ToolInvocationHistory
 from .core.tracing import TraceableMixin
Expand Down Expand Up @@ -75,6 +75,8 @@
"UserSimulatorError",
# User simulation
"User",
"LLMUser",
"AgenticLLMUser",
"TerminationReason",
# Evaluation
"Evaluator",
7 changes: 4 additions & 3 deletions maseval/benchmark/macs/macs.py
Expand Up @@ -48,6 +48,7 @@ def get_model_adapter(self, model_id, **kwargs):
 from maseval import (
     AgentAdapter,
     Benchmark,
+    User,
     Environment,
     Evaluator,
     MessageHistory,
Expand All @@ -56,7 +57,7 @@ def get_model_adapter(self, model_id, **kwargs):
     TaskExecutionStatus,
     ToolInvocationHistory,
     ToolLLMSimulator,
-    User,
+    LLMUser,
     AgentError,
     EnvironmentError,
     validate_arguments_from_schema,
Expand Down Expand Up @@ -456,10 +457,10 @@ def _compute_gsr(self, report: List[Dict[str, Any]]) -> Tuple[float, float]:
# =============================================================================


-class MACSUser(User):
+class MACSUser(LLMUser):
     """MACS-specific user simulator with conversation limits.

-    Extends the base User class with MACS-specific behavior:
+    Extends the LLMUser class with MACS-specific behavior:
     - Maximum 5 turns of interaction (as per MACS paper)
     - </stop> token detection for natural conversation ending
     - User profile and scenario-aware responses
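
The two termination rules named in the docstring above, a five-turn cap and `</stop>` detection, can be sketched independently of the library. This is not the `MACSUser` implementation: the class `TurnLimitedUser` and its `record_reply` method are hypothetical, and only the limit of 5 and the `</stop>` token come from the docstring.

```python
class TurnLimitedUser:
    """Hypothetical sketch of turn-capped, stop-token-aware termination."""

    def __init__(self, max_turns: int = 5, stop_token: str = "</stop>") -> None:
        self.max_turns = max_turns
        self.stop_token = stop_token
        self.turns = 0
        self.stopped = False

    def record_reply(self, reply: str) -> str:
        """Count one user turn and strip the stop token from the visible text."""
        self.turns += 1
        if self.stop_token in reply:
            self.stopped = True
            reply = reply.replace(self.stop_token, "").strip()
        return reply

    def is_done(self) -> bool:
        # Done when the user signalled satisfaction or the turn budget is spent.
        return self.stopped or self.turns >= self.max_turns


user = TurnLimitedUser()
print(user.record_reply("Yes, that works. </stop>"))  # Yes, that works.
print(user.is_done())  # True
```

Stripping the token before returning the reply keeps the control signal out of the text the agent sees, which is one possible design; the actual class may handle this differently.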
6 changes: 3 additions & 3 deletions maseval/benchmark/tau2/tau2.py
Expand Up @@ -61,7 +61,7 @@ def get_model_adapter(self, model_id, **kwargs):
from typing import Any, Dict, List, Optional, Sequence, Tuple, Callable

from maseval import AgentAdapter, Benchmark, Evaluator, ModelAdapter, Task, User
-from maseval.core.user import AgenticUser
+from maseval.core.user import AgenticLLMUser
from maseval.core.callback import BenchmarkCallback

from maseval.benchmark.tau2.environment import Tau2Environment
Expand All @@ -73,10 +73,10 @@ def get_model_adapter(self, model_id, **kwargs):
# =============================================================================


-class Tau2User(AgenticUser):
+class Tau2User(AgenticLLMUser):
     """Tau2-specific user simulator with customer service personas.

-    Extends the AgenticUser class with tau2-specific behavior:
+    Extends the AgenticLLMUser class with tau2-specific behavior:
     - Customer personas from user_scenario
     - Domain-aware responses (airline, retail, telecom)
     - Multi-turn interaction support