Skip to content

Commit deadf25

Browse files
committed
added example notebooks
1 parent 46dac83 commit deadf25

10 files changed

Lines changed: 1058 additions & 59 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1919
- `FileResultLogger` now accepts `pathlib.Path` for argument `output_dir` and has an `overwrite` argument to prevent overwriting of existing logs files.
2020
- `Benchmark` class now has a `fail_on_setup_error` flag that raises errors observed during setup of task (PR: #10)
2121
- The `Evaluator` class now has a `filter_traces` base method to conveniently adapt the same evaluator to different entities in the traces (PR: #10).
22+
- Improved Quick Start Guide in `docs/getting-started/quickstart.md`. (PR: #10)
2223

2324
### Fixed
2425

docs/examples/example1.md

Lines changed: 0 additions & 8 deletions
This file was deleted.

docs/examples/example2.md

Lines changed: 0 additions & 8 deletions
This file was deleted.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
../../examples/five_a_day_benchmark/five_a_day_benchmark.ipynb

docs/examples/tutorial.ipynb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
../../examples/introduction/tutorial.ipynb

docs/getting-started/quickstart.md

Lines changed: 167 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Getting Started
22

3-
This guide will help you get started with MASEval.
3+
This guide introduces the core concepts of MASEval and helps you get started quickly.
44

55
## Installation
66

@@ -23,48 +23,190 @@ This includes all core functionality for defining benchmarks, tasks, and evaluat
2323

2424
### Optional Dependencies
2525

26-
Install additional integrations based on your agent framework and tooling. These can also be installed separately, but are offered here for convenience.
27-
28-
**Agent Frameworks:**
26+
Install additional integrations based on your agent framework. For example,
2927

3028
```bash
29+
# Agent framework integrations
3130
pip install "maseval[smolagents]" # SmolAgents integration
32-
```
3331

34-
**LLM Providers:**
35-
36-
```bash
32+
# LLM providers
3733
pip install "maseval[openai]" # OpenAI models
38-
pip install "maseval[google]" # Google GenAI models
34+
35+
# Combine multiple extras
36+
pip install "maseval[smolagents,openai]"
37+
38+
# Install everything (for examples or development)
39+
pip install "maseval[all]"
40+
pip install "maseval[examples]"
3941
```
4042

41-
**Observability & Tracing:**
43+
---
4244

43-
```bash
44-
# TODO
45+
## Using the Library
46+
47+
### Philosophy
48+
49+
MASEval follows a clear separation of concerns:
50+
51+
1. **You implement your agents** using any framework (LangChain, AutoGen, smolagents, custom code, etc.)
52+
2. **MASEval provides the evaluation infrastructure** — benchmarks, tasks, environments, and metrics
53+
3. **Adapters bridge the gap** — thin wrappers that connect your agent to MASEval's interface
54+
55+
Think of MASEval like pytest for agents: you bring the code, MASEval runs the tests.
56+
57+
### Key Concepts
58+
59+
| Term | Description |
60+
| ---------------- | ------------------------------------------------------------------------------------------------------ |
61+
| **Benchmark** | Orchestrates the evaluation lifecycle: setup, execution, and measurement across a collection of tasks. |
62+
| **Task** | A single evaluation unit with a query, expected outcome, and evaluation criteria. |
63+
| **Environment** | The context in which agents operate (e.g., simulated tools, databases, file systems). |
64+
| **AgentAdapter** | Wraps your agent to provide a unified interface for MASEval. |
65+
| **Evaluator** | Measures agent performance by comparing outputs or states to expected results. |
66+
| **Callback** | Hooks into the evaluation lifecycle for logging, tracing, or custom metrics. |
67+
68+
### Implementing a Benchmark
69+
70+
To create your own benchmark, subclass `Benchmark` and implement the required abstract methods. Here's the typical workflow:
71+
72+
1. **Agents and environment** Define your agents and environment using any tool you prefer to use. Wrap them in `Environment` and `AgentAdapter`.
73+
2. **Create your tasks** as `Task` objects with queries and evaluation data
74+
3. **Subclass `Benchmark`** and implement the abstract setup/run/evaluate methods
75+
4. **Call `Benchmark.run(tasks)`** to execute the complete benchmark
76+
77+
```python
78+
from maseval import Benchmark, AgentAdapter, Environment, Evaluator, Task
79+
80+
class MyBenchmark(Benchmark):
81+
"""Custom benchmark for evaluating agents on my tasks."""
82+
83+
def setup_environment(self, agent_data, task) -> Environment:
84+
# Initialize the environment for this task
85+
# e.g., set up tools, databases, or simulated systems
86+
...
87+
88+
def setup_user(self, agent_data, environment, task):
89+
# Optional: create a user simulator for interactive tasks
90+
# Return None if not needed
91+
return None
92+
93+
def setup_agents(self, agent_data, environment, task, user):
94+
# Create your agent(s) and wrap them in AgentAdapter
95+
# Returns a tuple: (agents_to_run, agents_dict)
96+
# - agents_to_run: list of agents to invoke in run_agents()
97+
# - agents_dict: dict mapping names to all agents for tracing
98+
...
99+
100+
def setup_evaluators(self, environment, task, agents, user):
101+
# Define how success is measured
102+
# Return: list of Evaluator instances
103+
...
104+
105+
def run_agents(self, agents, task, environment):
106+
# Execute your agent system to solve the task
107+
# Return the final answer (message traces are captured automatically)
108+
...
109+
110+
def evaluate(self, evaluators, agents, final_answer, traces):
111+
# Run each evaluator with the execution data
112+
# Return: list of evaluation result dicts
113+
...
45114
```
46115

47-
**Combine Multiple Extras:**
116+
Once implemented, run your benchmark:
48117

49-
```bash
50-
pip install "maseval[smolagents,openai,wandb]"
118+
```python
119+
# Define your tasks
120+
tasks = TaskCollection([Task(query="...", expected="..."), ...])
121+
122+
# Configure your agents (e.g., model parameters, tool settings)
123+
agent_config = {"model": "gpt-4", "temperature": 0.7}
124+
125+
# Instantiate and run the evaluation
126+
benchmark = MyBenchmark(agent_data=agent_config)
127+
reports = benchmark.run(tasks)
51128
```
52129

53-
**Install Everything:**
130+
For the complete interface and lifecycle details, see the [Benchmark reference](../reference/benchmark.md).
54131

55-
```bash
56-
pip install "maseval[all]" # All integrations
57-
pip install "maseval[examples]" # All dependencies needed for examples
132+
### Adapters
133+
134+
Adapters are lightweight wrappers that connect your agent implementation to MASEval. They provide:
135+
136+
- A unified `run()` method for executing agents
137+
- Message history tracking for tracing
138+
- Callback hooks for monitoring
139+
140+
**Creating an adapter:**
141+
142+
```python
143+
from maseval import AgentAdapter
144+
145+
class MyAgentAdapter(AgentAdapter):
146+
"""Adapter for my custom agent framework."""
147+
148+
def _run_agent(self, query: str):
149+
# Call your agent's execution method
150+
result = self.agent.execute(query)
151+
152+
# Return the final answer (message history is tracked separately)
153+
return result
154+
155+
def get_messages(self):
156+
# Return the conversation history from your agent
157+
return self.agent.get_conversation_history()
58158
```
59159

60-
## Use the Library
160+
MASEval provides built-in adapters for popular frameworks in `maseval.interface.agents`. For example:
161+
162+
- `SmolAgentsAdapter` — for HuggingFace smolagents
163+
- `LangGraphAdapter` — for LangGraph agents
164+
165+
See the [Agent Adapters](../interface/agents/smolagents.md) documentation for the full list.
166+
167+
### Existing Benchmarks
168+
169+
Pre-built benchmarks for established evaluation suites are coming soon.
170+
171+
---
172+
173+
## Using the Documentation
174+
175+
This documentation is organized to help you find what you need quickly:
176+
177+
### Examples
178+
179+
End-to-end walkthroughs demonstrating complete evaluation pipelines. Start here to see MASEval in action with real agent implementations.
180+
181+
- [Tiny Tutorial](../examples/tutorial.ipynb)
182+
- [5-A-Day Benchmark](../examples/five_a_day_benchmark.ipynb)
183+
184+
### Guides
185+
186+
Topic-based discussions covering specific features and best practices:
187+
188+
- [Message Tracing](../guides/message-tracing.md) — Capture and analyze agent conversations
189+
- [Configuration Gathering](../guides/config-gathering.md) — Collect reproducible experiment configurations
190+
191+
### Reference
192+
193+
Formal API documentation for all MASEval components. The reference is split into two sections:
194+
195+
**Core** — The fundamental building blocks (no optional dependencies):
61196

62-
Start with examples.
197+
- [Benchmark](../reference/benchmark.md) — Evaluation orchestration
198+
- [Task](../reference/task.md) — Individual evaluation units
199+
- [Environment](../reference/environment.md) — Agent execution context
200+
- [AgentAdapter](../reference/agent.md) — Agent interface wrappers
201+
- [Evaluator](../reference/evaluator.md) — Performance measurement
202+
- [Callback](../reference/callback.md) — Lifecycle hooks
203+
- [MessageHistory](../reference/history.md) — Conversation tracking
63204

64-
TODO: Insert examples
205+
**Interface** — Optional integrations for specific frameworks:
65206

66-
## Use the Docs
207+
- [Agent Adapters](../interface/agents/smolagents.md) — Pre-built adapters (smolagents, langgraph, etc.)
208+
- [Inference Providers](../interface/inference/openai.md) — LLM provider integrations
67209

68-
The docs are hosted here: TODO
210+
## Next Steps
69211

70-
For comprehensive documentation on how to piece together the library's components—including detailed explanations of the execution lifecycle, setup methods, and best practices — see the [`Benchmark`](../reference/benchmark.md) class documentation.
212+
Work through the examples listed above. 3. **Explore the [`examples/five_a_day_benchmark/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark) folder** for tool implementations, evaluators, and the CLI script (`five_a_day_benchmark.py`) 4. **Build your own benchmark** using the patterns you've learned

0 commit comments

Comments
 (0)