maseval
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 7 additions & 0 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 80 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 29 additions & 1 deletion
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 12 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/getting-started/faq.md b/docs/getting-started/faq.md
Lines changed: 16 additions & 2 deletions
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
Lines changed: 36 additions & 1 deletion
diff --git a/docs/guides/index.md b/docs/guides/index.md
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1,7 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.14.0  # Keep in sync with pyproject.toml
+    hooks:
+      - id: ruff-check
+        args: [--fix, --ignore, "F401,F841"]
+      - id: ruff-format
@@ -387,6 +387,86 @@ def calculate_average(numbers: list) -> float:
     """
 ```
 
+### mkdocs Rendering
+
+This project uses mkdocstrings to render docstrings as HTML. Follow these rules to ensure proper rendering:
+
+**Lists require a blank line before them:**
+
+```python
+# Bad - renders as one paragraph
+"""Subclasses must provide:
+- method_one(): Description
+- method_two(): Description
+"""
+
+# Good - renders as proper bullet list
+"""Subclasses must provide:
+
+- `method_one()` - Description
+- `method_two()` - Description
+"""
+```
+
+**Return descriptions must be single-line** (multi-line creates multiple table rows):
+
+```python
+# Bad
+"""
+Returns:
+    TerminationReason indicating why is_done() returns True,
+    or NOT_TERMINATED if the interaction is still ongoing.
+"""
+
+# Good
+"""
+Returns:
+    Why `is_done()` returns True, or `NOT_TERMINATED` if still ongoing.
+"""
+```
+
+**For dictionary returns, document fields in the docstring body** using "Output fields:":
+
+```python
+# Bad - creates multiple table rows in Returns
+"""
+Returns:
+    Dictionary containing:
+    - `name` - User identifier
+    - `profile` - User profile data
+"""
+
+# Good - fields in body, single-line Returns
+"""
+Gather execution traces from this user.
+
+Output fields:
+
+- `name` - User identifier
+- `profile` - User profile data
+- `message_count` - Number of messages in history
+
+Returns:
+    Dictionary containing user state and interaction data.
+"""
+```
+
+**HTML-like strings must be in backticks** (otherwise stripped as HTML):
+
+```python
+# Bad - </stop> disappears
+"""Uses "</stop>" to signal satisfaction."""
+
+# Good
+"""Uses `"</stop>"` to signal satisfaction."""
+```
+
+**Use backticks for code references** - method names, parameters, and values: `` `is_done()` ``, `` `stop_tokens` ``, `` `None` ``
+
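The blank-line rule for lists is easy to miss in review, so a small check can help. The following is a minimal sketch, not part of this project: `lists_missing_blank_line` is a hypothetical helper, and its regular expression only approximates what the Markdown renderer treats as a list item.

```python
import re

# Matches lines that start a bullet ("-", "*", "+") or numbered ("1.") list item.
_LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def lists_missing_blank_line(docstring: str) -> list[int]:
    """Return 1-based line numbers of list items that directly follow prose.

    Hypothetical helper: flags a list item whose previous line is neither
    blank nor itself a list item.
    """
    lines = docstring.splitlines()
    problems = []
    for i in range(1, len(lines)):
        prev, cur = lines[i - 1], lines[i]
        if _LIST_ITEM.match(cur) and prev.strip() and not _LIST_ITEM.match(prev):
            problems.append(i + 1)
    return problems


bad = "Subclasses must provide:\n- method_one(): Description\n- method_two(): Description\n"
good = "Subclasses must provide:\n\n- `method_one()` - Description\n- `method_two()` - Description\n"

print(lists_missing_blank_line(bad))   # line 2 starts a list right after prose
print(lists_missing_blank_line(good))  # no findings
```

This is a heuristic: a list item that follows a wrapped continuation line of the previous item would also be flagged, so treat findings as hints rather than errors.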
+## Seeding for Reproducibility
+
+MASEval provides a seeding system for reproducible benchmark runs. Seeds cascade from a global seed through all components, ensuring deterministic behavior when model providers support seeding. Study the code and documentation of `Benchmark` and `DefaultSeedGenerator` to understand how it works.
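To illustrate what "cascading" from a global seed can look like, here is a minimal sketch of hash-based seed derivation. It is an assumption-laden illustration, not the `DefaultSeedGenerator` implementation: the `derive_seed` function, the path components, and the 31-bit truncation are all hypothetical choices. The changelog only states that derivation is SHA-256-based.

```python
import hashlib


def derive_seed(global_seed: int, *path: str) -> int:
    """Derive a child seed from a global seed and a component path.

    Hypothetical helper: the key layout and truncation are assumptions.
    """
    key = "/".join([str(global_seed), *path]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep 31 bits so the result fits APIs that expect a non-negative 32-bit int.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# The same global seed and path always yield the same child seed,
# while different components get independent-looking seeds.
agent_seed = derive_seed(42, "task_0", "repeat_1", "agents")
user_seed = derive_seed(42, "task_0", "repeat_1", "user")
print(agent_seed, user_seed)
```

Deriving seeds from a hash of the path, rather than drawing them sequentially from one random generator, keeps each component's seed stable even if other components are added, removed, or reordered.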
+
 ## Early-Release Status
 
 **This project is early-release. Clean, maintainable code is the priority - not backwards compatibility.**
 
@@ -9,8 +9,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+**Seeding System**
+
+- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
+- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
+- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
+- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
+- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
+- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
+
+**Interface**
+
+- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
+- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
+- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
+- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
+
 ### Changed
 
+**User**
+
+- Refactored `User` class into abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
+- Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
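As a rough illustration of what the new hierarchy enables, the sketch below defines a scripted, non-LLM user. It is self-contained and does not import MASEval: the `User` class here is a stand-in for the real abstract base class, and the method signatures (argument and return types) are assumptions. Only the three method names come from the entry above. `ScriptedUser` is a hypothetical name.

```python
from abc import ABC, abstractmethod


class User(ABC):
    """Stand-in for the abstract user interface; signatures are assumed."""

    @abstractmethod
    def get_initial_query(self) -> str: ...

    @abstractmethod
    def respond(self, message: str) -> str: ...

    @abstractmethod
    def is_done(self) -> bool: ...


class ScriptedUser(User):
    """Hypothetical non-LLM user that replays a fixed list of turns."""

    def __init__(self, turns: list[str]) -> None:
        if not turns:
            raise ValueError("turns must contain at least the initial query")
        self._turns = list(turns)
        self._next = 0  # index of the next turn to emit

    def get_initial_query(self) -> str:
        self._next = 1
        return self._turns[0]

    def respond(self, message: str) -> str:
        # Ignore the agent's message and replay the next scripted turn.
        if self.is_done():
            return ""
        reply = self._turns[self._next]
        self._next += 1
        return reply

    def is_done(self) -> bool:
        return self._next >= len(self._turns)


user = ScriptedUser(["I want to return my order.", "Order 123, please."])
first = user.get_initial_query()
second = user.respond("Which order would you like to return?")
```

A scripted user like this is deterministic by construction, which can be handy for regression tests that should not depend on an LLM.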
+
+**Interface**
+
+- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
+  - `SmolAgentUser` → `SmolAgentLLMUser`
+  - `LangGraphUser` → `LangGraphLLMUser`
+  - `LlamaIndexUser` → `LlamaIndexLLMUser`
+
 ### Fixed
 
 ### Removed
@@ -126,7 +154,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Interface**
 
-- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
+- [LlamaIndex](https://github.com/run-llama/llama_index) integration: `LlamaIndexAgentAdapter` and `LlamaIndexLLMUser` for evaluating LlamaIndex workflow-based agents (PR: #7)
 - The `logs` property inside `SmolAgentAdapter` and `LanggraphAgentAdapter` are now properly filled. (PR: #3)
 
 **Examples**
 
@@ -85,6 +85,18 @@ ruff check . --fix
 
 If you haven't activated your virtual environment, you can use `uv run ruff format .` and `uv run ruff check . --fix` instead.
 
+For convenience, you can enable **pre-commit hooks** to automatically format and lint code on every commit:
+
+```bash
+uv run pre-commit install
+```
+
+This is optional—CI will catch any issues regardless. But if enabled, the hooks will:
+- **Format** code with `ruff format` (using project settings from `pyproject.toml`)
+- **Lint and auto-fix** issues with `ruff check --fix`
+
+> **Note**: The pre-commit hooks intentionally skip removing unused imports (`F401`) and unused variables (`F841`) to avoid disrupting work-in-progress code. Run `uv run ruff check . --fix` manually before opening a PR to clean these up.
+
 ### 3. Dependency Management
 
 Dependencies are defined in `pyproject.toml` and locked in `uv.lock`. Understanding the different dependency types is important:
 
@@ -114,7 +114,7 @@ Examples are available in the [Documentation](https://maseval.readthedocs.io/en/
 
 ## Contribute
 
-We welcome any contributions. Please read the [CONTRIBUTING.md](https://github.com/parameterlab/MASEval/tree/fix-porting-issue?tab=contributing-ov-file) file to learn more!
+We welcome any contributions. Please read the [CONTRIBUTING.md](CONTRIBUTING.md) file to learn more!
 
 ## Benchmarks
 
 
@@ -1,5 +1,19 @@
 # FAQ
 
-## Q: Test
+## Q: Who is this library for?
 
-## A: Test
+Anyone! We had a few groups in mind when building MASEval.
+
+1. **Benchmark Developers**: Researchers proposing new benchmarks for multi-agent systems can use MASEval to handle all the boilerplate.
+2. **Benchmark Consumers**: Researchers studying multi-agent systems can use MASEval as a unified interface across different benchmarks.
+3. **System Comparison**: Developers who want to test different agentic systems against each other can do so with MASEval.
+
+## Q: I am looking for a specific feature, but I cannot find it.
+
+1. Check this documentation.
+2. If the feature does not exist, please [open an issue on GitHub](https://github.com/parameterlab/MASEval/issues/new). Feature requests are welcome.
+3. Consider implementing it yourself. Check out the [contributing guide](contributing.md) for details.
+
+## Q: Can I only test multi-agent systems?
+
+No. MASEval works well for single-agent systems too. We designed the library to handle the complexity of multi-agent systems, but single-agent evaluation is fully supported. You can even run model comparisons, for example GPT against Claude.
@@ -166,7 +166,41 @@ See the [Agent Adapters](../interface/agents/smolagents.md) documentation for th
 
 ### Existing Benchmarks
 
-Pre-built benchmarks for established evaluation suites are coming soon.
+MASEval includes pre-built benchmarks for established evaluation suites. See the [Benchmarks](../benchmark/index.md) section for the full list.
+
+**Using a default agent:** For quick evaluation or baseline comparisons, use the default benchmark class directly:
+
+```python
+from maseval.benchmark.tau2 import (
+    DefaultAgentTau2Benchmark, load_tasks, ensure_data_exists,
+)
+
+ensure_data_exists(domain="retail")
+tasks = load_tasks("retail", split="base", limit=5)
+
+benchmark = DefaultAgentTau2Benchmark(
+    agent_data={"model_id": "gpt-4o"},
+    n_task_repeats=4,
+)
+results = benchmark.run(tasks)
+```
+
+**Plugging in your own agent:** Subclass the base benchmark to use your own agent implementation:
+
+```python
+from maseval.benchmark.tau2 import Tau2Benchmark, load_tasks
+
+class MyTau2Benchmark(Tau2Benchmark):
+    def setup_agents(self, agent_data, environment, task, user):
+        tools = environment.tools
+        # Create your agent with these tools
+        ...
+
+tasks = load_tasks("retail", split="base", limit=5)
+benchmark = MyTau2Benchmark(agent_data={}, n_task_repeats=4)
+results = benchmark.run(tasks)
+```
+
+The base class handles environment setup, user simulation, and evaluation; you only implement `setup_agents()` and `run_agents()` (the snippet above sketches only the former).
 
 ---
 
@@ -187,6 +221,7 @@ Topic-based discussions covering specific features and best practices:
 
 - [Message Tracing](../guides/message-tracing.md) — Capture and analyze agent conversations
 - [Configuration Gathering](../guides/config-gathering.md) — Collect reproducible experiment configurations
+- [Seeding](../guides/seeding.md) — Enable reproducible benchmark runs
 
 ### Reference
 
 
@@ -7,3 +7,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices.
 | [Message Tracing](message-tracing.md)          | Capture and inspect agent conversations during benchmark runs |
 | [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility          |
 | [Exception Handling](exception-handling.md)    | Distinguish agent errors from infrastructure failures         |
+| [Seeding](seeding.md)                          | Enable reproducible benchmark runs with deterministic seeds   |