Five-A-Day Benchmark

Like the health advice to eat "5 a day" (fruits and vegetables), this benchmark provides 5 tasks a day to keep your agent evaluation healthy. An educational example demonstrating maseval through 5 diverse tasks across smolagents, LangGraph, and LlamaIndex.

What it shows: Framework-agnostic tools, diverse evaluation types (assertions, LLM judges, unit tests, optimization), single vs multi-agent patterns, and reproducible seeding.

Quick Start

export GOOGLE_API_KEY="your-api-key-here"

# Run all 5 tasks
uv run python examples/five_a_day_benchmark/five_a_day_benchmark.py

# Try different frameworks
uv run python examples/five_a_day_benchmark/five_a_day_benchmark.py --framework langgraph
uv run python examples/five_a_day_benchmark/five_a_day_benchmark.py --framework llamaindex

# Single task, multi-agent, with seed
uv run python examples/five_a_day_benchmark/five_a_day_benchmark.py --task 0 --config-type multi --seed 42

Options: --framework {smolagents,langgraph,llamaindex}, --config-type {single,multi}, --task N, --seed N, --limit N, --temperature N

The 5 Tasks

Task	Description	Tools	Evaluation Focus
0. Email & Banking	Verify payment, draft email	`email`, `banking`	Financial accuracy, email quality, privacy
1. Finance Calc	Stock inheritance split	`family_info`, `stock_price`, `calculator`	Arithmetic, data retrieval
2. Code Gen	House robber algorithm	`python_executor`	Unit tests, complexity, code quality
3. Calendar	Find meeting slots (MCP)	`my_calendar_mcp`, `other_calendar_mcp`	Slot matching, MCP integration
4. Hotel	Optimize hotel selection	`hotel_search`, `calculator`	Multi-criteria optimization

Architecture

Environment: FiveADayEnvironment loads task data from JSON and creates framework-agnostic tools using a collection pattern.

Tools: BaseTool subclasses with .to_smolagents(), .to_langgraph(), .to_llamaindex() methods. Tool adapters support tracing.

Tasks: Defined in data/tasks.json with query, environment data, evaluation criteria, and metadata.

Evaluators: 13 evaluators across 5 tasks using assertions, LLM judges, unit tests, pattern matching, static analysis, and optimization comparisons.

Agents: Single-agent (one agent, all tools) or multi-agent (orchestrator + specialists). Framework-specific adapters provide unified interface.

Seeding: derive_seed() creates deterministic per-agent seeds using SHA256 hashing of base_seed + task_id + agent_id.

Project Structure

five_a_day_benchmark/
├── five_a_day_benchmark.py       # Main benchmark (FiveADayBenchmark class)
├── utils.py                       # derive_seed(), sanitize_name()
├── tools/                         # BaseTool subclasses with framework converters
├── evaluators/                    # Task-specific Evaluator subclasses
└── data/
    ├── tasks.json                # Task definitions
    ├── singleagent.json          # Single-agent configs
    └── multiagent.json           # Multi-agent configs

Results saved to results/*.jsonl with full message traces.

Dependencies

maseval, litellm, smolagents, langgraph, langchain-litellm, llama-index-core, llama-index-llms-litellm, RestrictedPython

Set GOOGLE_API_KEY for Gemini models.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Five-A-Day Benchmark

Quick Start

The 5 Tasks

Architecture

Project Structure

Dependencies

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Five-A-Day Benchmark

Quick Start

The 5 Tasks

Architecture

Project Structure

Dependencies