Getting Started

This guide introduces the core concepts of MASEval and helps you get started quickly.

Installation

MASEval is designed with a modular architecture:

Core: Framework-agnostic benchmark infrastructure (tasks, evaluation, simulation)
Interface: Optional adapters for specific agent frameworks (smolagents, langgraph, etc.)

Install only what you need to keep dependencies minimal.

Core Installation

Install the base package from PyPI:

pip install maseval

This includes all core functionality for defining benchmarks, tasks, and evaluators.

Optional Dependencies

Install additional integrations based on your agent framework. For example,

# Agent framework integrations
pip install "maseval[smolagents]"  # SmolAgents integration

# LLM providers
pip install "maseval[openai]"      # OpenAI models

# Combine multiple extras
pip install "maseval[smolagents,openai]"

# Install everything (for examples or development)
pip install "maseval[all]"
pip install "maseval[examples]"

Using the Library

Philosophy

MASEval follows a clear separation of concerns:

You implement your agents using any framework (LangChain, AutoGen, smolagents, custom code, etc.)
MASEval provides the evaluation infrastructure — benchmarks, tasks, environments, and metrics
Adapters bridge the gap — thin wrappers that connect your agent to MASEval's interface

Think of MASEval like pytest for agents: you bring the code, MASEval runs the tests.

Key Concepts

Term	Description
Benchmark	Orchestrates the evaluation lifecycle: setup, execution, and measurement across a collection of tasks.
Task	A single evaluation unit with a query, expected outcome, and evaluation criteria.
Environment	The context in which agents operate (e.g., simulated tools, databases, file systems).
AgentAdapter	Wraps your agent to provide a unified interface for MASEval.
Evaluator	Measures agent performance by comparing outputs or states to expected results.
Callback	Hooks into the evaluation lifecycle for logging, tracing, or custom metrics.

Implementing a Benchmark

To create your own benchmark, subclass Benchmark and implement the required abstract methods. Here's the typical workflow:

Agents and environment Define your agents and environment using any tool you prefer to use. Wrap them in Environment and AgentAdapter.
Create your tasks as Task objects with queries and evaluation data
Subclass Benchmark and implement the abstract setup/run/evaluate methods
Call Benchmark.run(tasks) to execute the complete benchmark

from maseval import Benchmark, AgentAdapter, Environment, Evaluator, Task

class MyBenchmark(Benchmark):
    """Custom benchmark for evaluating agents on my tasks."""

    def setup_environment(self, agent_data, task) -> Environment:
        # Initialize the environment for this task
        # e.g., set up tools, databases, or simulated systems
        ...

    def setup_user(self, agent_data, environment, task):
        # Optional: create a user simulator for interactive tasks
        # Return None if not needed
        return None

    def setup_agents(self, agent_data, environment, task, user):
        # Create your agent(s) and wrap them in AgentAdapter
        # Returns a tuple: (agents_to_run, agents_dict)
        #   - agents_to_run: list of agents to invoke in run_agents()
        #   - agents_dict: dict mapping names to all agents for tracing
        ...

    def setup_evaluators(self, environment, task, agents, user):
        # Define how success is measured
        # Return: list of Evaluator instances
        ...

    def run_agents(self, agents, task, environment):
        # Execute your agent system to solve the task
        # Return the final answer (message traces are captured automatically)
        ...

    def evaluate(self, evaluators, agents, final_answer, traces):
        # Run each evaluator with the execution data
        # Return: list of evaluation result dicts
        ...

Once implemented, run your benchmark:

# Define your tasks
tasks = TaskCollection([Task(query="...", expected="..."), ...])

# Configure your agents (e.g., model parameters, tool settings)
agent_config = {"model": "gpt-4", "temperature": 0.7}

# Instantiate and run the evaluation
benchmark = MyBenchmark(agent_data=agent_config)
reports = benchmark.run(tasks)

For the complete interface and lifecycle details, see the Benchmark reference.

Adapters

Adapters are lightweight wrappers that connect your agent implementation to MASEval. They provide:

A unified run() method for executing agents
Message history tracking for tracing
Callback hooks for monitoring

Creating an adapter:

from maseval import AgentAdapter

class MyAgentAdapter(AgentAdapter):
    """Adapter for my custom agent framework."""

    def _run_agent(self, query: str):
        # Call your agent's execution method
        result = self.agent.execute(query)

        # Return the final answer (message history is tracked separately)
        return result

    def get_messages(self):
        # Return the conversation history from your agent
        return self.agent.get_conversation_history()

MASEval provides built-in adapters for popular frameworks in maseval.interface.agents. For example:

SmolAgentsAdapter — for HuggingFace smolagents
LangGraphAdapter — for LangGraph agents

See the Agent Adapters documentation for the full list.

Existing Benchmarks

Pre-built benchmarks for established evaluation suites are coming soon.

Using the Documentation

This documentation is organized to help you find what you need quickly:

Examples

End-to-end walkthroughs demonstrating complete evaluation pipelines. Start here to see MASEval in action with real agent implementations.

Guides

Topic-based discussions covering specific features and best practices:

Message Tracing — Capture and analyze agent conversations
Configuration Gathering — Collect reproducible experiment configurations

Reference

Formal API documentation for all MASEval components. The reference is split into two sections:

Core — The fundamental building blocks (no optional dependencies):

Benchmark — Evaluation orchestration
Task — Individual evaluation units
Environment — Agent execution context
AgentAdapter — Agent interface wrappers
Evaluator — Performance measurement
Callback — Lifecycle hooks
MessageHistory — Conversation tracking

Interface — Optional integrations for specific frameworks:

Agent Adapters — Pre-built adapters (smolagents, langgraph, etc.)
Inference Providers — LLM provider integrations

Next Steps

Work through the examples listed above. 3. Explore the examples/five_a_day_benchmark/ folder for tool implementations, evaluators, and the CLI script (five_a_day_benchmark.py) 4. Build your own benchmark using the patterns you've learned

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting Started

Installation

Core Installation

Optional Dependencies

Using the Library

Philosophy

Key Concepts

Implementing a Benchmark

Adapters

Existing Benchmarks

Using the Documentation

Examples

Guides

Reference

Next Steps

FilesExpand file tree

quickstart.md

Latest commit

History

quickstart.md

File metadata and controls

Getting Started

Installation

Core Installation

Optional Dependencies

Using the Library

Philosophy

Key Concepts

Implementing a Benchmark

Adapters

Existing Benchmarks

Using the Documentation

Examples

Guides

Reference

Next Steps