parameterlab
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
Lines changed: 14 additions & 3 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 4 additions & 0 deletions
diff --git a/docs/benchmark/converse.md b/docs/benchmark/converse.md
Lines changed: 136 additions & 0 deletions
diff --git a/docs/examples/index.md b/docs/examples/index.md
Lines changed: 1 addition & 0 deletions
diff --git a/examples/converse_benchmark/__init__.py b/examples/converse_benchmark/__init__.py
Lines changed: 1 addition & 0 deletions
@@ -53,13 +53,24 @@ Gaia2 is a benchmark for evaluating LLM-based agents on dynamic, multi-step scen
 
 ---
 
-## 4. [Name of Next Benchmark]
+## 5. CONVERSE
+
+CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses on adversarial interactions in which an external service-provider agent tries, over multiple turns, to extract private information or induce unauthorized actions.
+
+### Source and License
+
+- **Original Repository:** [https://github.com/amrgomaaelhady/ConVerse](https://github.com/amrgomaaelhady/ConVerse)
+- **Paper:** [ConVerse: Contextual Safety in Agent-to-Agent Conversations](https://arxiv.org/abs/2506.15753)
+- **Code License:** MIT (as provided by the upstream repository)
+- **Data License:** Refer to the upstream repository's dataset and license terms
+
+---
+
+## 6. [Name of Next Benchmark]
 
 (Description for the next benchmark...)
 
 ### Source and License
 
 - **Original Repository:** [Link](Link)
 - **Data License:** Data License.
-
----
 
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Benchmarks**
 
+- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
+
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
   - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
   - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
@@ -29,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Examples**
 
+- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
 **Core**
@@ -63,6 +66,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Core**
 
 - Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
+- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
 
 **Benchmarks**
 
 
@@ -0,0 +1,136 @@
+# CONVERSE Benchmark
+
+CONVERSE evaluates privacy and security robustness in agent-to-agent conversations where the external counterpart is adversarial.
+
+## What It Tests
+
+- Privacy attacks: the external agent tries to extract sensitive profile details.
+- Security attacks: the external agent tries to induce unauthorized tool actions.
+- Multi-turn manipulation: attacks progress over several conversational turns.
+
+## Data Source
+
+Data is loaded from [the official CONVERSE repository `amrgomaaelhady/ConVerse`](https://github.com/amrgomaaelhady/ConVerse).
+
+Supported domains:
+
+- `travel`
+- `real_estate`
+- `insurance`
+
+## Usage
+
+Implement a framework-specific subclass of `ConverseBenchmark` that overrides `setup_agents()` to build your agents and `get_model_adapter()` to supply model adapters.
+
+```python
+from typing import Any, Dict, Optional, Sequence, Tuple
+
+from maseval import AgentAdapter, Environment, ModelAdapter, Task, User
+from maseval.benchmark.converse import ConverseBenchmark, ensure_data_exists, load_tasks
+from maseval.core.seeding import SeedGenerator
+
+
+class MyConverseBenchmark(ConverseBenchmark):
+    def setup_agents(
+        self,
+        agent_data: Dict[str, Any],
+        environment: Environment,
+        task: Task,
+        user: Optional[User],
+        seed_generator: SeedGenerator,
+    ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]:
+        # Create your framework agent(s) using environment tools.
+        ...
+
+    def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
+        # Create and optionally register model adapter.
+        ...
+
+
+# First call downloads source files to the local benchmark data cache.
+ensure_data_exists(domain="travel")
+tasks = load_tasks(domain="travel", split="privacy", limit=5)
+
+benchmark = MyConverseBenchmark(progress_bar=False)
+results = benchmark.run(
+    tasks=tasks,
+    agent_data={
+        "model_id": "gpt-4o-mini",
+        "attacker_model_id": "gpt-4o",
+        "max_turns": 10,
+    },
+)
+```
+
+## Default Implementation
+
+CONVERSE also provides a built-in default agent loop via `DefaultAgentConverseBenchmark`.
+You only need to supply `get_model_adapter()`.
+
+```python
+from typing import Any
+
+from maseval import ModelAdapter
+from maseval.benchmark.converse import DefaultAgentConverseBenchmark, load_tasks
+from maseval.interface.inference import OpenAIModelAdapter
+
+
+class MyDefaultConverseBenchmark(DefaultAgentConverseBenchmark):
+    def get_model_adapter(self, model_id: str, **kwargs: Any) -> ModelAdapter:
+        adapter = OpenAIModelAdapter(client=..., model_id=model_id, seed=kwargs.get("seed"))
+        if kwargs.get("register_name"):
+            self.register(kwargs.get("register_category", "models"), kwargs["register_name"], adapter)
+        return adapter
+
+
+tasks = load_tasks(domain="travel", split="privacy", limit=5)
+benchmark = MyDefaultConverseBenchmark(progress_bar=False)
+results = benchmark.run(
+    tasks=tasks,
+    agent_data={
+        "model_id": "gpt-4o-mini",
+        "attacker_model_id": "gpt-4o",
+    },
+)
+```
+
+For a full runnable example script, see:
+
+- `examples/converse_benchmark/default_converse_benchmark.py`
+
+## Evaluation Output
+
+Privacy evaluator output fields:
+
+- `privacy_leak`
+- `score`
+- `leak_turn`
+- `matched_snippet`
+
+Security evaluator output fields:
+
+- `security_violation`
+- `score`
+- `violated_tools`
+
+[:material-github: View source](https://github.com/parameterlab/MASEval/blob/main/maseval/benchmark/converse/converse.py){ .md-source-file }
+
+::: maseval.benchmark.converse.ConverseBenchmark
+
+::: maseval.benchmark.converse.DefaultAgentConverseBenchmark
+
+::: maseval.benchmark.converse.DefaultConverseAgent
+
+::: maseval.benchmark.converse.DefaultConverseAgentAdapter
+
+::: maseval.benchmark.converse.ConverseEnvironment
+
+::: maseval.benchmark.converse.ConverseExternalAgent
+
+::: maseval.benchmark.converse.PrivacyEvaluator
+
+::: maseval.benchmark.converse.SecurityEvaluator
+
+::: maseval.benchmark.converse.load_tasks
+
+::: maseval.benchmark.converse.ensure_data_exists
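
As a rough illustration of how the evaluator output fields listed under "Evaluation Output" in `docs/benchmark/converse.md` might be aggregated across tasks, here is a minimal sketch. It assumes each evaluator returns a plain dict, that `privacy_leak` and `security_violation` are booleans, that `leak_turn` is an integer or `None`, and that `violated_tools` is a list of tool names. None of these types are stated in the diff, and `summarize_privacy` and `summarize_security` are hypothetical helpers, not part of `maseval`.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def summarize_privacy(outputs: List[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Aggregate per-task privacy evaluator dicts (assumed shape, see the note above)."""
    leaks = [o for o in outputs if o.get("privacy_leak")]
    turns = [o["leak_turn"] for o in leaks if o.get("leak_turn") is not None]
    return {
        # Fraction of tasks in which a leak was flagged.
        "leak_rate": len(leaks) / len(outputs) if outputs else None,
        # Average turn at which the leak was first detected.
        "mean_leak_turn": sum(turns) / len(turns) if turns else None,
    }


def summarize_security(outputs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-task security evaluator dicts (assumed shape, see the note above)."""
    violations = [o for o in outputs if o.get("security_violation")]
    tool_counts = Counter(t for o in violations for t in o.get("violated_tools", []))
    return {
        "violation_rate": len(violations) / len(outputs) if outputs else None,
        "violated_tool_counts": dict(tool_counts),
    }


# Hand-written sample values for illustration only; not real benchmark output.
privacy_outputs = [
    {"privacy_leak": True, "score": 0.0, "leak_turn": 3, "matched_snippet": "passport"},
    {"privacy_leak": False, "score": 1.0, "leak_turn": None, "matched_snippet": None},
]
security_outputs = [
    {"security_violation": True, "score": 0.0, "violated_tools": ["book_flight"]},
    {"security_violation": False, "score": 1.0, "violated_tools": []},
]

privacy_summary = summarize_privacy(privacy_outputs)
security_summary = summarize_security(security_outputs)
```

Keeping the aggregation outside the evaluators means the per-task dicts stay unchanged, so the same helpers could be pointed at whatever structure `benchmark.run()` actually returns once that shape is confirmed.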
@@ -7,3 +7,4 @@ Learn MASEval through hands-on examples covering common use cases and benchmarks
 | [Tutorial](tutorial.ipynb)                                                                                                                         | Introduction to MASEval's core concepts and basic usage |
 | [Five-a-Day Benchmark](five_a_day_benchmark.ipynb)                                                                                                 | Building a custom benchmark from scratch                |
 | [Multi-Agent Collaboration Scenario Benchmark (MACS)](https://github.com/parameterlab/MASEval/blob/main/examples/macs_benchmark/macs_benchmark.py) | An adaptation of the `maseval.benchmark.MACSBenchmark`. |
+| [CONVERSE (Default Agent)](https://github.com/parameterlab/MASEval/blob/main/examples/converse_benchmark/default_converse_benchmark.py)         | Run `DefaultAgentConverseBenchmark` end-to-end.         |
@@ -0,0 +1 @@
+"""CONVERSE Benchmark Example Package."""
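
The CHANGELOG entry for PR #27 describes a seeding API in which `seed_generator` is always non-None and `derive_seed()` returns `None` when seeding is disabled. The toy class below sketches that contract. `ToySeedGenerator` and `setup_component` are hypothetical stand-ins, and the hash-based derivation and the `name` argument are assumptions; this is not the `maseval.core.seeding.SeedGenerator` implementation.

```python
import hashlib
import random
from typing import Optional


class ToySeedGenerator:
    """Hypothetical stand-in that mimics the contract described in the CHANGELOG."""

    def __init__(self, seed: Optional[int]) -> None:
        self.seed = seed

    def derive_seed(self, name: str) -> Optional[int]:
        # Seeding disabled: return None instead of raising.
        if self.seed is None:
            return None
        # Assumed derivation: hash the root seed together with a component name.
        digest = hashlib.sha256(f"{self.seed}:{name}".encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")


def setup_component(name: str, seed_generator: ToySeedGenerator) -> random.Random:
    # No `if seed_generator is not None:` branch is needed. random.Random(None)
    # falls back to system entropy, so one code path covers both cases.
    return random.Random(seed_generator.derive_seed(name))


enabled = ToySeedGenerator(seed=42)
disabled = ToySeedGenerator(seed=None)

# With seeding enabled, the same component name reproduces the same stream.
first = setup_component("agent", enabled).random()
second = setup_component("agent", enabled).random()

# With seeding disabled, the call still works but is not reproducible.
unseeded_rng = setup_component("agent", disabled)
```

The point of the design is that callers pass the derived value straight through: a `None` seed is a valid input to most random number generators, so the disabled case needs no special handling.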