Simplify seeding API with null-safe pattern

Change seed_generator from Optional[SeedGenerator] to SeedGenerator in all
benchmark setup methods. When seeding is disabled (seed=None), derive_seed()
returns None instead of the generator being None. This eliminates conditional
checks throughout the codebase - the same code works whether seeding is
enabled or disabled.

- Update Benchmark base class and all setup method signatures
- Update DefaultSeedGenerator to accept global_seed=None
- Update all benchmarks (GAIA2, MACS, MultiAgentBench, Tau2)
- Update seeding documentation and examples
- Update all tests to use new pattern
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -49,6 +49,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 
+**Core**
+
+- Simplified seeding API: `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks - the same code path works whether seeding is enabled or disabled. (PR: #27)
+
 **Benchmarks**
 
 - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
 This creates a `DefaultSeedGenerator` internally and passes it to all setup methods.
 
+### Disabling Seeding
+
+If you don't need seeding, you can simply ignore the seed generators. However, in workflows where you mix seeded and non-seeded runs, you can disable seeding without writing `if/else` statements to check whether a seed is provided.
+
+To disable seeding, omit the `seed` parameter when creating your `Benchmark` or `DefaultSeedGenerator` (or pass `seed=None`):
+
+1. A `DefaultSeedGenerator(global_seed=None)` is still created internally
+2. Setup methods still receive a `seed_generator` parameter
+3. `derive_seed()` returns `None` instead of an integer
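
The three points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the maseval implementation: `SketchSeedGenerator`, its constructor, and its SHA-256 hashing scheme are hypothetical stand-ins for `DefaultSeedGenerator`. Only the names `child()`, `derive_seed()`, and `global_seed` are taken from the diff.

```python
import hashlib
from typing import Dict, Optional


class SketchSeedGenerator:
    """Hypothetical stand-in for a seed generator (not the maseval class)."""

    def __init__(self, global_seed: Optional[int] = None, path: str = "") -> None:
        self.global_seed = global_seed
        self.path = path

    def _join(self, name: str) -> str:
        # Build a hierarchical path such as "agents/orchestrator".
        return f"{self.path}/{name}" if self.path else name

    def child(self, name: str) -> "SketchSeedGenerator":
        # A child shares the global seed but lives under a longer path.
        return SketchSeedGenerator(self.global_seed, self._join(name))

    def derive_seed(self, name: str) -> Optional[int]:
        # Null-safe behaviour: with seeding disabled, return None
        # so callers never have to check the generator itself.
        if self.global_seed is None:
            return None
        # Assumed scheme: hash the global seed together with the full path.
        key = f"{self.global_seed}:{self._join(name)}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def derive_agent_seeds(seed_generator: SketchSeedGenerator) -> Dict[str, Optional[int]]:
    # One code path for both cases: no `if seed_generator is not None:` check.
    agent_gen = seed_generator.child("agents")
    return {name: agent_gen.derive_seed(name) for name in ("orchestrator", "specialist")}


seeded = derive_agent_seeds(SketchSeedGenerator(global_seed=42))
unseeded = derive_agent_seeds(SketchSeedGenerator(global_seed=None))
```

With seeding disabled, every derived seed is `None` and can be passed straight to components that accept an optional seed. With a fixed `global_seed`, repeated calls return the same integers.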
-All setup methods receive an optional `seed_generator` parameter. Use it to derive seeds for your components:
+All setup methods receive a `seed_generator` parameter. Use it to derive seeds for your components. When seeding is disabled (no `seed` passed to benchmark), `derive_seed()` returns `None`:
 
 ```python
 from maseval import Benchmark, SeedGenerator
-from typing import Optional
 
 class MyBenchmark(Benchmark):
     def setup_agents(
@@ -53,18 +74,16 @@ class MyBenchmark(Benchmark):
         environment,
         task,
         user,
-        seed_generator: Optional[SeedGenerator] = None,
+        seed_generator: SeedGenerator,
     ):
         # Derive a seed for your agent using hierarchical paths
-        agent_seed = None
-        if seed_generator is not None:
-            # Use child() to create logical namespaces - results in "agents/orchestrator"
@@ -75,18 +94,17 @@ Seeds are derived from hierarchical paths, so `derive_seed("orchestrator")` with
 When running multiple repetitions of the same task, you may want some components to vary while others remain constant. The `per_repetition` flag controls this:
--- a/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb
+++ b/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb
@@ -523,7 +523,7 @@
    "id": "70c66cd0",
    "metadata": {},
    "outputs": [],
-   "source": "class FiveADayBenchmark(Benchmark):\n    \"\"\"5-A-Day benchmark with multi-agent support.\"\"\"\n\n    def setup_environment(self, agent_data: Dict[str, Any], task: Task, seed_generator: Optional[SeedGenerator] = None) -> Environment:\n        \"\"\"Create environment from task data.\"\"\"\n        task_data = {\n            \"environment_data\": task.environment_data,\n            \"query\": task.query,\n            \"evaluation_data\": task.evaluation_data,\n            \"metadata\": task.metadata,\n        }\n\n        environment = FiveADayEnvironment(task_data)\n\n        # Register all tools for tracing\n        for tool_name, tool_adapter in environment.get_tools().items():\n            self.register(\"tools\", tool_name, tool_adapter)\n\n        return environment\n\n    def setup_agents(\n        self,\n        agent_data: Dict[str, Any],\n        environment: Environment,\n        task: Task,\n        user=None,\n        seed_generator: Optional[SeedGenerator] = None,\n    ) -> tuple[list[SmolAgentAdapter], Dict[str, SmolAgentAdapter]]:\n        \"\"\"Create multi-agent system with orchestrator and specialists.\n\n        If seed_generator is provided, seeds are derived for each agent\n        using the benchmark's seeding system with hierarchical paths.\n        \"\"\"\n        # Build seeds dict if seed_generator is available\n        # Use child(\"agents\") to create logical paths like \"agents/primary_agent\"\n        seeds = None\n        if seed_generator is not None:\n            agent_gen = seed_generator.child(\"agents\")\n            seeds = {}\n            for agent_spec in agent_data[\"agents\"]:\n                seeds[agent_spec[\"agent_id\"]] = agent_gen.derive_seed(agent_spec[\"agent_id\"])\n\n        agents_to_run, agents_to_monitor = build_agents(agent_data, environment, seeds)\n\n        # Create adapters for the primary agent(s) to run\n        adapters_to_run = [SmolAgentAdapter(agent, agent.name) for agent in agents_to_run]\n\n        # This ensures all agent traces are collected by the benchmark\n        all_agents = {agent.name: agent for agent in agents_to_run} | agents_to_monitor\n        adapters_to_monitor = {name: SmolAgentAdapter(agent, name) for name, agent in all_agents.items()}\n        return adapters_to_run, adapters_to_monitor\n\n    def setup_evaluators(self, environment, task, agents, user, seed_generator: Optional[SeedGenerator] = None) -> Sequence[Evaluator]:\n        \"\"\"Create evaluators based on task's evaluation criteria.\"\"\"\n        if not task.evaluation_data[\"evaluators\"]:\n            return []\n\n        evaluator_instances = []\n        for name in task.evaluation_data[\"evaluators\"]:\n            evaluator_class = getattr(evaluators, name)\n            evaluator_instances.append(evaluator_class(task, environment, user))\n\n        return evaluator_instances\n\n    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Sequence[Any]:\n        \"\"\"Execute agents and return their final answers.\"\"\"\n        answers = [agent.run(query) for agent in agents]\n        return answers\n\n    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:\n        \"\"\"Return a model adapter for benchmark components that need LLM access.\n\n        This benchmark doesn't use simulated tools, user simulators, or LLM judges,\n        so this method is not called during execution.\n        \"\"\"\n        raise NotImplementedError(\"This benchmark doesn't use model adapters for tools/users/evaluators.\")\n\n    def evaluate(\n        self,\n        evaluators: Sequence[Evaluator],\n        agents: Dict[str, AgentAdapter],\n        final_answer: Any,\n        traces: Dict[str, Any],\n    ) -> list[Dict[str, Any]]:\n        \"\"\"Evaluate agent performance.\"\"\"\n        results = []\n        for evaluator in evaluators:\n            filtered_traces = evaluator.filter_traces(traces)\n            results.append(evaluator(filtered_traces, final_answer))\n        return results"
+   "source": "class FiveADayBenchmark(Benchmark):\n    \"\"\"5-A-Day benchmark with multi-agent support.\"\"\"\n\n    def setup_environment(self, agent_data: Dict[str, Any], task: Task, seed_generator: SeedGenerator) -> Environment:\n        \"\"\"Create environment from task data.\"\"\"\n        task_data = {\n            \"environment_data\": task.environment_data,\n            \"query\": task.query,\n            \"evaluation_data\": task.evaluation_data,\n            \"metadata\": task.metadata,\n        }\n\n        environment = FiveADayEnvironment(task_data)\n\n        # Register all tools for tracing\n        for tool_name, tool_adapter in environment.get_tools().items():\n            self.register(\"tools\", tool_name, tool_adapter)\n\n        return environment\n\n    def setup_agents(\n        self,\n        agent_data: Dict[str, Any],\n        environment: Environment,\n        task: Task,\n        user,\n        seed_generator: SeedGenerator,\n    ) -> tuple[list[SmolAgentAdapter], Dict[str, SmolAgentAdapter]]:\n        \"\"\"Create multi-agent system with orchestrator and specialists.\n\n        Seeds are derived for each agent using the benchmark's seeding system\n        with hierarchical paths. derive_seed() returns None if seeding is disabled.\n        \"\"\"\n        # Build seeds dict using seed_generator\n        # Use child(\"agents\") to create logical paths like \"agents/primary_agent\"\n        agent_gen = seed_generator.child(\"agents\")\n        seeds = {}\n        for agent_spec in agent_data[\"agents\"]:\n            seeds[agent_spec[\"agent_id\"]] = agent_gen.derive_seed(agent_spec[\"agent_id\"])\n\n        agents_to_run, agents_to_monitor = build_agents(agent_data, environment, seeds)\n\n        # Create adapters for the primary agent(s) to run\n        adapters_to_run = [SmolAgentAdapter(agent, agent.name) for agent in agents_to_run]\n\n        # This ensures all agent traces are collected by the benchmark\n        all_agents = {agent.name: agent for agent in agents_to_run} | agents_to_monitor\n        adapters_to_monitor = {name: SmolAgentAdapter(agent, name) for name, agent in all_agents.items()}\n        return adapters_to_run, adapters_to_monitor\n\n    def setup_evaluators(self, environment, task, agents, user, seed_generator: SeedGenerator) -> Sequence[Evaluator]:\n        \"\"\"Create evaluators based on task's evaluation criteria.\"\"\"\n        if not task.evaluation_data[\"evaluators\"]:\n            return []\n\n        evaluator_instances = []\n        for name in task.evaluation_data[\"evaluators\"]:\n            evaluator_class = getattr(evaluators, name)\n            evaluator_instances.append(evaluator_class(task, environment, user))\n\n        return evaluator_instances\n\n    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Sequence[Any]:\n        \"\"\"Execute agents and return their final answers.\"\"\"\n        answers = [agent.run(query) for agent in agents]\n        return answers\n\n    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:\n        \"\"\"Return a model adapter for benchmark components that need LLM access.\n\n        This benchmark doesn't use simulated tools, user simulators, or LLM judges,\n        so this method is not called during execution.\n        \"\"\"\n        raise NotImplementedError(\"This benchmark doesn't use model adapters for tools/users/evaluators.\")\n\n    def evaluate(\n        self,\n        evaluators: Sequence[Evaluator],\n        agents: Dict[str, AgentAdapter],\n        final_answer: Any,\n        traces: Dict[str, Any],\n    ) -> list[Dict[str, Any]]:\n        \"\"\"Evaluate agent performance.\"\"\"\n        results = []\n        for evaluator in evaluators:\n            filtered_traces = evaluator.filter_traces(traces)\n            results.append(evaluator(filtered_traces, final_answer))\n        return results"