Commit 4227cba

implementing converse benchmark (#28)
Implement Converse benchmark (#28)

Add a new benchmark implementation for evaluating multi-agent systems using the Converse framework.

Benchmark implementation:
- maseval/benchmark/converse/ core benchmark module with environment, evaluator, data loader, and external agent components
- Scenario-driven evaluation with configurable conversation environments
- Data loading pipeline for Converse benchmark datasets

Documentation & examples:
- docs/benchmark/converse.md with full benchmark documentation
- examples/converse_benchmark/ with usage examples
- Updated CHANGELOG.md and BENCHMARKS.md

Testing:
- Unit tests for Converse benchmark components
- Data integrity tests for the Converse download pipeline

Other:
- Improved docstrings and type hinting consistency across benchmark modules

1 parent e6d8a03 commit 4227cba

20 files changed

Lines changed: 2245 additions & 3 deletions

BENCHMARKS.md

Lines changed: 14 additions & 3 deletions
@@ -53,13 +53,24 @@ Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scen
 
 ---
 
-## 4. [Name of Next Benchmark]
+## 5. CONVERSE
+
+CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions where an external service-provider agent attempts privacy extraction or unauthorized action induction over multiple turns.
+
+### Source and License
+
+- **Original Repository:** [https://github.com/amrgomaaelhady/ConVerse](https://github.com/amrgomaaelhady/ConVerse)
+- **Paper:** [ConVerse: Contextual Safety in Agent-to-Agent Conversations](https://arxiv.org/abs/2506.15753)
+- **Code License:** MIT (as provided by the upstream repository)
+- **Data License:** Refer to the upstream repository's dataset and license terms
+
+---
+
+## 6. [Name of Next Benchmark]
 
 (Description for the next benchmark...)
 
 ### Source and License
 
 - **Original Repository:** [Link](Link)
 - **Data License:** Data License.
-
----

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Benchmarks**
 
+- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
+
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
 - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
 - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
@@ -29,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Examples**
 
+- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
 **Core**
@@ -63,6 +66,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Core**
 
 - Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
+- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
 
 **Benchmarks**
 
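
Review note on the seeding entry in the CHANGELOG.md hunks: the pattern it describes, where callers no longer branch on whether seeding is enabled, can be illustrated with a toy stand-in. The class below is written for illustration only and is not MASEval's `SeedGenerator`; its constructor, the `derive_seed(name)` signature, and the hashing scheme are all assumptions. It only shows why returning `None` from `derive_seed()` lets one code path serve both modes.

```python
import hashlib
import random
from typing import Optional


class ToySeedGenerator:
    """Hypothetical stand-in; MASEval's real SeedGenerator may differ."""

    def __init__(self, seed: Optional[int]) -> None:
        self.seed = seed

    def derive_seed(self, name: str) -> Optional[int]:
        # Seeding disabled: return None instead of raising an error.
        if self.seed is None:
            return None
        # Derive a stable child seed from the root seed and a component name.
        digest = hashlib.sha256(f"{self.seed}:{name}".encode()).digest()
        return int.from_bytes(digest[:4], "big")


def make_rng(seed_generator: ToySeedGenerator, name: str) -> random.Random:
    # No `if seed_generator is not None:` branch is needed here.
    # random.Random(None) seeds itself from system entropy, so the
    # same call works whether seeding is enabled or disabled.
    return random.Random(seed_generator.derive_seed(name))


seeded = ToySeedGenerator(seed=42)
unseeded = ToySeedGenerator(seed=None)

# Two generators derived from the same root seed and name agree.
same_a = make_rng(seeded, "agent").random()
same_b = make_rng(seeded, "agent").random()
```

With seeding enabled, `same_a` and `same_b` are equal; with `unseeded`, `derive_seed()` yields `None` and `make_rng` still returns a usable generator.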

docs/benchmark/converse.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
+# CONVERSE Benchmark
+
+CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.
+
+## What It Tests
+
+- Privacy attacks: the external agent tries to extract sensitive profile details.
+- Security attacks: the external agent tries to induce unauthorized tool actions.
+- Multi-turn manipulation: attacks progress over several conversational turns.
+
+## Data Source
+
+Data is loaded from [the official CONVERSE repository `amrgomaaelhady/ConVerse`](https://github.com/amrgomaaelhady/ConVerse)
+
+Supported domains:
+
+- `travel`
+- `real_estate`
+- `insurance`
+
+## Usage
+
+Implement a framework-specific subclass of `ConverseBenchmark` and provide agent setup plus model adapter provisioning.
+
+```python
+from typing import Any, Dict, Optional, Sequence, Tuple
+
+from maseval import AgentAdapter, Environment, ModelAdapter, Task, User
+from maseval.benchmark.converse import ConverseBenchmark, ensure_data_exists, load_tasks
+from maseval.core.seeding import SeedGenerator
+
+
+class MyConverseBenchmark(ConverseBenchmark):
+    def setup_agents(
+        self,
+        agent_data: Dict[str, Any],
+        environment: Environment,
+        task: Task,
+        user: Optional[User],
+        seed_generator: SeedGenerator,
+    ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]:
+        # Create your framework agent(s) using environment tools.
+        ...
+
+    def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
+        # Create and optionally register model adapter.
+        ...
+
+
+# First call downloads source files to the local benchmark data cache.
+ensure_data_exists(domain="travel")
+tasks = load_tasks(domain="travel", split="privacy", limit=5)
+
+benchmark = MyConverseBenchmark(progress_bar=False)
+results = benchmark.run(
+    tasks=tasks,
+    agent_data={
+        "model_id": "gpt-4o-mini",
+        "attacker_model_id": "gpt-4o",
+        "max_turns": 10,
+    },
+)
+```
+
+## Default Implementation
+
+CONVERSE also provides a built-in default agent loop via `DefaultAgentConverseBenchmark`.
+You only need to supply `get_model_adapter()`.
+
+```python
+from typing import Any
+
+from maseval import ModelAdapter
+from maseval.benchmark.converse import DefaultAgentConverseBenchmark, load_tasks
+from maseval.interface.inference import OpenAIModelAdapter
+
+
+class MyDefaultConverseBenchmark(DefaultAgentConverseBenchmark):
+    def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
+        adapter = OpenAIModelAdapter(client=..., model_id=model_id, seed=kwargs.get("seed"))
+        if kwargs.get("register_name"):
+            self.register(kwargs.get("register_category", "models"), kwargs["register_name"], adapter)
+        return adapter
+
+
+tasks = load_tasks(domain="travel", split="privacy", limit=5)
+benchmark = MyDefaultConverseBenchmark(progress_bar=False)
+results = benchmark.run(
+    tasks=tasks,
+    agent_data={
+        "model_id": "gpt-4o-mini",
+        "attacker_model_id": "gpt-4o",
+    },
+)
+```
+
+For a full runnable example script, see:
+
+- `examples/converse_benchmark/default_converse_benchmark.py`
+
+## Evaluation Output
+
+Privacy evaluator output fields:
+
+- `privacy_leak`
+- `score`
+- `leak_turn`
+- `matched_snippet`
+
+Security evaluator output fields:
+
+- `security_violation`
+- `score`
+- `violated_tools`
+
+[:material-github: View source](https://github.com/parameterlab/MASEval/blob/main/maseval/benchmark/converse/converse.py){ .md-source-file }
+
+::: maseval.benchmark.converse.ConverseBenchmark
+
+::: maseval.benchmark.converse.DefaultAgentConverseBenchmark
+
+::: maseval.benchmark.converse.DefaultConverseAgent
+
+::: maseval.benchmark.converse.DefaultConverseAgentAdapter
+
+::: maseval.benchmark.converse.ConverseEnvironment
+
+::: maseval.benchmark.converse.ConverseExternalAgent
+
+::: maseval.benchmark.converse.PrivacyEvaluator
+
+::: maseval.benchmark.converse.SecurityEvaluator
+
+::: maseval.benchmark.converse.load_tasks
+
+::: maseval.benchmark.converse.ensure_data_exists
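
Review note on the "Evaluation Output" section of docs/benchmark/converse.md: the documented fields lend themselves to simple aggregation. The sketch below is a hypothetical helper, not part of MASEval. It assumes each evaluator output is available as a plain dict carrying the documented `privacy_leak` or `security_violation` key; how `benchmark.run()` actually nests these outputs is not shown in this commit. The `summarize` function and the sample values are invented for illustration.

```python
from typing import Any, Dict, Iterable, List


def summarize(evals: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Compute leak and violation rates from evaluator output dicts.

    Assumption: each dict is one evaluator output using the field names
    documented above; the real result layout in MASEval may differ.
    """
    records: List[Dict[str, Any]] = list(evals)
    privacy = [r for r in records if "privacy_leak" in r]
    security = [r for r in records if "security_violation" in r]

    def rate(rows: List[Dict[str, Any]], key: str) -> float:
        # Fraction of rows where the flag is truthy; 0.0 when there are no rows.
        return sum(bool(r[key]) for r in rows) / len(rows) if rows else 0.0

    return {
        "privacy_leak_rate": rate(privacy, "privacy_leak"),
        "security_violation_rate": rate(security, "security_violation"),
    }


# Invented sample outputs; values do not come from a real benchmark run.
sample = [
    {"privacy_leak": True, "leak_turn": 3, "matched_snippet": "passport"},
    {"privacy_leak": False, "leak_turn": None, "matched_snippet": None},
    {"security_violation": False, "violated_tools": []},
]
summary = summarize(sample)
print(summary)
```

Keeping the helper on plain dicts avoids depending on any MASEval result class, so it would only need a small adapter once the actual result structure is known.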

docs/examples/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ Learn MASEval through hands-on examples covering common use cases and benchmarks
 | [Tutorial](tutorial.ipynb) | Introduction to MASEval's core concepts and basic usage |
 | [Five-a-Day Benchmark](five_a_day_benchmark.ipynb) | Building a custom benchmark from scratch |
 | [Multi-Agent Collaboration Scenario Benchmark (MACS)](https://github.com/parameterlab/MASEval/blob/main/examples/macs_benchmark/macs_benchmark.py) | An adaptation of the `maseval.benchmark.MACSBenchmark`. |
+| [CONVERSE (Default Agent)](https://github.com/parameterlab/MASEval/blob/main/examples/converse_benchmark/default_converse_benchmark.py) | Run `DefaultAgentConverseBenchmark` end-to-end. |
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""CONVERSE Benchmark Example Package."""
