This guide introduces the core concepts of MASEval and helps you get started quickly.
MASEval is designed with a modular architecture:
- Core: Framework-agnostic benchmark infrastructure (tasks, evaluation, simulation)
- Interface: Optional adapters for specific agent frameworks (smolagents, langgraph, etc.)
Install only what you need to keep dependencies minimal.
Install the base package from PyPI:
pip install masevalThis includes all core functionality for defining benchmarks, tasks, and evaluators.
Install additional integrations based on your agent framework. For example,
# Agent framework integrations
pip install "maseval[smolagents]" # SmolAgents integration
# LLM providers
pip install "maseval[openai]" # OpenAI models
# Combine multiple extras
pip install "maseval[smolagents,openai]"
# Install everything (for examples or development)
pip install "maseval[all]"
pip install "maseval[examples]"MASEval follows a clear separation of concerns:
- You implement your agents using any framework (LangChain, AutoGen, smolagents, custom code, etc.)
- MASEval provides the evaluation infrastructure — benchmarks, tasks, environments, and metrics
- Adapters bridge the gap — thin wrappers that connect your agent to MASEval's interface
Think of MASEval like pytest for agents: you bring the code, MASEval runs the tests.
| Term | Description |
|---|---|
| Benchmark | Orchestrates the evaluation lifecycle: setup, execution, and measurement across a collection of tasks. |
| Task | A single evaluation unit with a query, expected outcome, and evaluation criteria. |
| Environment | The context in which agents operate (e.g., simulated tools, databases, file systems). |
| AgentAdapter | Wraps your agent to provide a unified interface for MASEval. |
| Evaluator | Measures agent performance by comparing outputs or states to expected results. |
| Callback | Hooks into the evaluation lifecycle for logging, tracing, or custom metrics. |
To create your own benchmark, subclass Benchmark and implement the required abstract methods. Here's the typical workflow:
- Agents and environment Define your agents and environment using any tool you prefer to use. Wrap them in
EnvironmentandAgentAdapter. - Create your tasks as
Taskobjects with queries and evaluation data - Subclass
Benchmarkand implement the abstract setup/run/evaluate methods - Call
Benchmark.run(tasks)to execute the complete benchmark
from maseval import Benchmark, AgentAdapter, Environment, Evaluator, Task
class MyBenchmark(Benchmark):
"""Custom benchmark for evaluating agents on my tasks."""
def setup_environment(self, agent_data, task) -> Environment:
# Initialize the environment for this task
# e.g., set up tools, databases, or simulated systems
...
def setup_user(self, agent_data, environment, task):
# Optional: create a user simulator for interactive tasks
# Return None if not needed
return None
def setup_agents(self, agent_data, environment, task, user):
# Create your agent(s) and wrap them in AgentAdapter
# Returns a tuple: (agents_to_run, agents_dict)
# - agents_to_run: list of agents to invoke in run_agents()
# - agents_dict: dict mapping names to all agents for tracing
...
def setup_evaluators(self, environment, task, agents, user):
# Define how success is measured
# Return: list of Evaluator instances
...
def run_agents(self, agents, task, environment):
# Execute your agent system to solve the task
# Return the final answer (message traces are captured automatically)
...
def evaluate(self, evaluators, agents, final_answer, traces):
# Run each evaluator with the execution data
# Return: list of evaluation result dicts
...Once implemented, run your benchmark:
# Define your tasks
tasks = TaskCollection([Task(query="...", expected="..."), ...])
# Configure your agents (e.g., model parameters, tool settings)
agent_config = {"model": "gpt-4", "temperature": 0.7}
# Instantiate and run the evaluation
benchmark = MyBenchmark(agent_data=agent_config)
reports = benchmark.run(tasks)For the complete interface and lifecycle details, see the Benchmark reference.
Adapters are lightweight wrappers that connect your agent implementation to MASEval. They provide:
- A unified
run()method for executing agents - Message history tracking for tracing
- Callback hooks for monitoring
Creating an adapter:
from maseval import AgentAdapter
class MyAgentAdapter(AgentAdapter):
"""Adapter for my custom agent framework."""
def _run_agent(self, query: str):
# Call your agent's execution method
result = self.agent.execute(query)
# Return the final answer (message history is tracked separately)
return result
def get_messages(self):
# Return the conversation history from your agent
return self.agent.get_conversation_history()MASEval provides built-in adapters for popular frameworks in maseval.interface.agents. For example:
SmolAgentsAdapter— for HuggingFace smolagentsLangGraphAdapter— for LangGraph agents
See the Agent Adapters documentation for the full list.
Pre-built benchmarks for established evaluation suites are coming soon.
This documentation is organized to help you find what you need quickly:
End-to-end walkthroughs demonstrating complete evaluation pipelines. Start here to see MASEval in action with real agent implementations.
Topic-based discussions covering specific features and best practices:
- Message Tracing — Capture and analyze agent conversations
- Configuration Gathering — Collect reproducible experiment configurations
Formal API documentation for all MASEval components. The reference is split into two sections:
Core — The fundamental building blocks (no optional dependencies):
- Benchmark — Evaluation orchestration
- Task — Individual evaluation units
- Environment — Agent execution context
- AgentAdapter — Agent interface wrappers
- Evaluator — Performance measurement
- Callback — Lifecycle hooks
- MessageHistory — Conversation tracking
Interface — Optional integrations for specific frameworks:
- Agent Adapters — Pre-built adapters (smolagents, langgraph, etc.)
- Inference Providers — LLM provider integrations
Work through the examples listed above. 3. Explore the examples/five_a_day_benchmark/ folder for tool implementations, evaluators, and the CLI script (five_a_day_benchmark.py) 4. Build your own benchmark using the patterns you've learned