Skip to content

Commit d242b95

Browse files
authored
New Parallel Adaptive and Timed Running Engine (#14)
Implement parallel task execution in Benchmark.run() using ThreadPoolExecutor with num_workers parameter. Add thread-safe ComponentRegistry with thread-local storage for component isolation across workers. Introduce task scheduling abstractions: - TaskQueue ABC with iterator interface - SequentialQueue for FIFO ordering - PriorityQueue for priority-based scheduling - AdaptiveTaskQueue for feedback-based adaptive scheduling Add cooperative timeout handling: - TaskContext with check_timeout(), elapsed, remaining, is_expired - TaskProtocol with timeout_seconds, timeout_action, max_retries, priority, tags - TimeoutAction enum (SKIP, RETRY, RAISE) - TaskTimeoutError exception with partial trace preservation - TASK_TIMEOUT status in TaskExecutionStatus
1 parent 3cf9c43 commit d242b95

51 files changed

Lines changed: 3913 additions & 1269 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/issue_template.md

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
## Type
2+
3+
- [ ] Bug
4+
- [ ] Feature request
5+
- [ ] Question
6+
7+
## Summary
8+
9+
<!-- One-sentence description -->
10+
11+
## Details
12+
13+
<!--
14+
For bugs: Steps to reproduce, expected vs actual behavior
15+
For features: Motivation and proposed design
16+
For questions: Context and what you've tried
17+
-->
18+
19+
## Environment (if applicable)
20+
21+
- maseval version:
22+
- Python version:

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
### Added
1111

12+
**Parallel Execution**
13+
14+
- Added parallel task execution with `num_workers` parameter in `Benchmark.run()` using `ThreadPoolExecutor` (PR: #14)
15+
- Added `ComponentRegistry` class for thread-safe component registration with thread-local storage (PR: #14)
16+
- Added `TaskContext` for cooperative timeout checking with `check_timeout()`, `elapsed`, `remaining`, and `is_expired` properties (PR: #14)
17+
- Added `TaskProtocol` dataclass with `timeout_seconds`, `timeout_action`, `max_retries`, `priority`, and `tags` fields for task-level execution control (PR: #14)
18+
- Added `TimeoutAction` enum (`SKIP`, `RETRY`, `RAISE`) for configurable timeout behavior (PR: #14)
19+
- Added `TaskTimeoutError` exception with `elapsed`, `timeout`, and `partial_traces` attributes (PR: #14)
20+
- Added `TASK_TIMEOUT` to `TaskExecutionStatus` enum for timeout classification (PR: #14)
21+
22+
**Task Queue Abstraction**
23+
24+
- Added `TaskQueue` abstract base class with iterator interface for flexible task scheduling (PR: #14)
25+
- Added `SequentialQueue` for simple FIFO task ordering (PR: #14)
26+
- Added `PriorityQueue` for priority-based task scheduling using `TaskProtocol.priority` (PR: #14)
27+
- Added `AdaptiveTaskQueue` abstract base class for feedback-based adaptive scheduling with `initial_state()`, `select_next_task(remaining, state)`, and `update_state(task, report, state)` methods (PR: #14)
28+
1229
**ModelAdapter Chat Interface**
1330

1431
- Added `chat()` method to `ModelAdapter` as the primary interface for LLM inference, accepting a list of messages in OpenAI format and returning a `ChatResponse` object and accepting tools
@@ -48,6 +65,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4865
**Benchmark**
4966

5067
- `Benchmark.agent_data` parameter is now optional (defaults to empty dict) (PR: #16)
68+
- Refactored `Benchmark` to delegate registry operations to `ComponentRegistry` class (PR: #)
69+
- `Benchmark.run()` now accepts optional `queue` parameter (`BaseTaskQueue`) for custom task scheduling (PR: #14)
5170

5271
**Task**
5372

docs/getting-started/quickstart.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -117,7 +117,7 @@ Once implemented, run your benchmark:
117117

118118
```python
119119
# Define your tasks
120-
tasks = TaskCollection([Task(query="...", expected="..."), ...])
120+
tasks = TaskQueue([Task(query="..."), ...])
121121

122122
# Configure your agents (e.g., model parameters, tool settings)
123123
agent_config = {"model": "gpt-4", "temperature": 0.7}

docs/guides/exception-handling.md

Lines changed: 32 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,12 @@ When running benchmarks, tasks can fail for different reasons. MASEval provides
1212

1313
MASEval defines three error categories:
1414

15-
| Exception | Source | Default Scoring | Example |
16-
| ------------------ | -------------- | --------------- | ------------------------------- |
17-
| `AgentError` | Agent input | Included | Agent passed wrong type to tool |
18-
| `EnvironmentError` | Infrastructure | Excluded | Database connection failed |
19-
| `UserError` | User simulator | Excluded | LLM API unreachable |
15+
| Exception | Source | Default Scoring | Example |
16+
| ------------------- | -------------- | --------------- | ------------------------------- |
17+
| `AgentError` | Agent input | Included | Agent passed wrong type to tool |
18+
| `EnvironmentError` | Infrastructure | Excluded | Database connection failed |
19+
| `UserError` | User simulator | Excluded | LLM API unreachable |
20+
| `TaskTimeoutError` | Timeout | Excluded | Task exceeded deadline |
2021

2122
### AgentError
2223

@@ -90,27 +91,22 @@ class SimulatedUser:
9091

9192
One approach to exception handling places the boundary between agent responsibility and infrastructure responsibility at input validation:
9293

93-
```
94-
┌─────────────────────────────────────────────────────────────┐
95-
│ TOOL EXECUTION │
96-
├─────────────────────────────────────────────────────────────┤
97-
│ │
98-
│ ┌─────────────────┐ │
99-
│ │ INPUT │ Agent passes arguments │
100-
│ │ VALIDATION │ │
101-
│ │ │ ❌ Fails → AgentError │
102-
│ │ │ ✓ Passes ↓ │
103-
│ └─────────────────┘ │
104-
│ │ │
105-
│ ▼ │
106-
│ ┌─────────────────┐ │
107-
│ │ EXECUTION │ Tool runs its logic │
108-
│ │ │ │
109-
│ │ │ ❌ Fails → EnvironmentError │
110-
│ │ │ ✓ Passes → Result │
111-
│ └─────────────────┘ │
112-
│ │
113-
└─────────────────────────────────────────────────────────────┘
94+
```mermaid
95+
flowchart TD
96+
subgraph TOOL_EXECUTION[" "]
97+
A[Agent passes arguments] --> B{INPUT VALIDATION}
98+
B -->|Fails| C[AgentError]
99+
B -->|Passes| D{EXECUTION}
100+
D -->|Fails| E[EnvironmentError]
101+
D -->|Passes| F[Result]
102+
end
103+
104+
style TOOL_EXECUTION fill:none,stroke:#888
105+
style B fill:#f5f5f5,stroke:#333
106+
style D fill:#f5f5f5,stroke:#333
107+
style C fill:#ffebee,stroke:#c62828
108+
style E fill:#ffebee,stroke:#c62828
109+
style F fill:#e8f5e9,stroke:#2e7d32
114110
```
115111

116112
With this pattern:
@@ -153,15 +149,16 @@ Suggestion: Provide limit as an integer, e.g., 10
153149

154150
Each completed task has a status indicating what happened:
155151

156-
| Status | Description |
157-
| ------------------------- | ------------------------------ |
158-
| `success` | Task completed normally |
159-
| `agent_error` | AgentError was raised |
160-
| `environment_error` | EnvironmentError was raised |
161-
| `user_error` | UserError was raised |
162-
| `evaluation_failed` | Evaluator raised an exception |
163-
| `setup_failed` | Task setup raised an exception |
164-
| `unknown_execution_error` | Unclassified exception |
152+
| Status | Description |
153+
| ------------------------- | ----------------------------------- |
154+
| `success` | Task completed normally |
155+
| `agent_error` | AgentError was raised |
156+
| `environment_error` | EnvironmentError was raised |
157+
| `user_error` | UserError was raised |
158+
| `task_timeout` | Task exceeded configured timeout |
159+
| `evaluation_failed` | Evaluator raised an exception |
160+
| `setup_failed` | Task setup raised an exception |
161+
| `unknown_execution_error` | Unclassified exception |
165162

166163
## Scoring Considerations
167164

docs/reference/exceptions.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,8 @@ Exception classes for error classification in benchmark execution.
1010
MASEvalError (base)
1111
├── AgentError - Agent violated contract (agent's fault)
1212
├── EnvironmentError - Environment/tool failed (not agent's fault)
13-
└── UserError - User simulator failed (not agent's fault)
13+
├── UserError - User simulator failed (not agent's fault)
14+
└── TaskTimeoutError - Task exceeded configured timeout
1415
1516
SimulatorError (base for simulators)
1617
├── ToolSimulatorError - Also inherits EnvironmentError
@@ -27,6 +28,8 @@ SimulatorError (base for simulators)
2728

2829
::: maseval.core.exceptions.UserError
2930

31+
::: maseval.core.exceptions.TaskTimeoutError
32+
3033
## Simulator Exceptions
3134

3235
::: maseval.core.simulator.SimulatorError

docs/reference/task.md

Lines changed: 30 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,35 @@
1-
# Task
1+
# Tasks
22

3-
Tasks define individual benchmark scenarios including inputs, expected outputs, and any metadata needed for evaluation. TaskCollections group related tasks together.
3+
Tasks define individual benchmark scenarios including inputs, expected outputs, and metadata for evaluation. Task queues control execution order and scheduling strategy.
44

5-
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py){ .md-source-file }
5+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L55){ .md-source-file }
66

77
::: maseval.core.task.Task
88

9-
::: maseval.core.task.TaskCollection
9+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L27){ .md-source-file }
10+
11+
::: maseval.core.task.TaskProtocol
12+
13+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L18){ .md-source-file }
14+
15+
::: maseval.core.task.TimeoutAction
16+
17+
## Task Queues
18+
19+
Task queues determine the order in which tasks are executed. Pass a queue to `Benchmark.run(queue=...)` to customize scheduling.
20+
21+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L86){ .md-source-file }
22+
23+
::: maseval.core.task.BaseTaskQueue
24+
25+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L256){ .md-source-file }
26+
27+
::: maseval.core.task.SequentialTaskQueue
28+
29+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L276){ .md-source-file }
30+
31+
::: maseval.core.task.PriorityTaskQueue
32+
33+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/task.py#L322){ .md-source-file }
34+
35+
::: maseval.core.task.AdaptiveTaskQueue

examples/five_a_day_benchmark/five_a_day_benchmark.ipynb

Lines changed: 6 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -124,7 +124,7 @@
124124
"from smolagents import ToolCallingAgent, LiteLLMModel, FinalAnswerTool\n",
125125
"\n",
126126
"# MASEval core components\n",
127-
"from maseval import Benchmark, Environment, Task, TaskCollection, AgentAdapter, Evaluator, ModelAdapter\n",
127+
"from maseval import Benchmark, Environment, Task, TaskQueue, AgentAdapter, Evaluator, ModelAdapter\n",
128128
"from maseval.interface.agents.smolagents import SmolAgentAdapter\n",
129129
"\n",
130130
"# Import evaluators module (dynamically loaded later)\n",
@@ -139,7 +139,7 @@
139139
" limit: int | None = None,\n",
140140
" seed: int | None = None,\n",
141141
" task_indices: list[int] | None = None,\n",
142-
") -> tuple[TaskCollection, list[Dict[str, Any]]]:\n",
142+
") -> tuple[TaskQueue, list[Dict[str, Any]]]:\n",
143143
" \"\"\"Load tasks and agent configurations.\n",
144144
"\n",
145145
" Args:\n",
@@ -152,7 +152,7 @@
152152
" task_indices: Optional list of task indices to load (e.g., [0, 2, 4])\n",
153153
"\n",
154154
" Returns:\n",
155-
" Tuple of (TaskCollection, list of agent configs)\n",
155+
" Tuple of (TaskQueue, list of agent configs)\n",
156156
" \"\"\"\n",
157157
" data_dir = Path(\"examples/five_a_day_benchmark/data\")\n",
158158
"\n",
@@ -200,7 +200,7 @@
200200
"\n",
201201
" configs_data.append(config)\n",
202202
"\n",
203-
" return TaskCollection(tasks_data), configs_data"
203+
" return TaskQueue(tasks_data), configs_data"
204204
]
205205
},
206206
{
@@ -745,17 +745,7 @@
745745
"id": "3764c0be",
746746
"metadata": {},
747747
"outputs": [],
748-
"source": [
749-
"# Create and run benchmark (will take approx. 2 min)\n",
750-
"benchmark = FiveADayBenchmark(\n",
751-
" agent_data=agent_configs,\n",
752-
" fail_on_setup_error=True,\n",
753-
" fail_on_task_error=True,\n",
754-
" fail_on_evaluation_error=True,\n",
755-
")\n",
756-
"\n",
757-
"results = benchmark.run(tasks=tasks)"
758-
]
748+
"source": "# Create and run benchmark (will take approx. 2 min)\nbenchmark = FiveADayBenchmark(\n fail_on_setup_error=True,\n fail_on_task_error=True,\n fail_on_evaluation_error=True,\n)\n\nresults = benchmark.run(tasks=tasks, agent_data=agent_configs)"
759749
},
760750
{
761751
"cell_type": "markdown",
@@ -899,4 +889,4 @@
899889
},
900890
"nbformat": 4,
901891
"nbformat_minor": 5
902-
}
892+
}

examples/five_a_day_benchmark/five_a_day_benchmark.py

Lines changed: 5 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,7 @@
2626

2727
from utils import derive_seed, sanitize_name # type: ignore[unresolved-import]
2828

29-
from maseval import Benchmark, Environment, Evaluator, Task, TaskCollection, AgentAdapter, ModelAdapter
29+
from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue, AgentAdapter, ModelAdapter
3030
from maseval.core.callbacks.result_logger import FileResultLogger
3131

3232
# Import tool implementations
@@ -825,7 +825,7 @@ def load_benchmark_data(
825825
limit: Optional[int] = None,
826826
specific_task: Optional[int] = None,
827827
seed: Optional[int] = None,
828-
) -> tuple[TaskCollection, List[Dict[str, Any]]]:
828+
) -> tuple[TaskQueue, List[Dict[str, Any]]]:
829829
"""Load tasks and agent configurations with validation.
830830
831831
Args:
@@ -838,7 +838,7 @@ def load_benchmark_data(
838838
seed: Base random seed for reproducibility (None for non-deterministic)
839839
840840
Returns:
841-
Tuple of (TaskCollection, agent_configs_list)
841+
Tuple of (TaskQueue, agent_configs_list)
842842
"""
843843
if limit is not None and specific_task is not None:
844844
raise ValueError("Cannot specify both limit and specific_task")
@@ -897,7 +897,7 @@ def load_benchmark_data(
897897

898898
print(f"Loaded {len(tasks_data)} tasks and {len(configs_data)} agent configs\n")
899899

900-
return TaskCollection(tasks_data), configs_data
900+
return TaskQueue(tasks_data), configs_data
901901

902902

903903
# ============================================================================
@@ -935,13 +935,12 @@ def load_benchmark_data(
935935
)
936936

937937
benchmark = FiveADayBenchmark(
938-
agent_data=agent_configs,
939938
callbacks=[logger],
940939
fail_on_setup_error=True,
941940
fail_on_task_error=True,
942941
fail_on_evaluation_error=True,
943942
)
944-
results = benchmark.run(tasks=tasks)
943+
results = benchmark.run(tasks=tasks, agent_data=agent_configs)
945944

946945
print("\n--- Benchmark Complete ---")
947946
print(f"Total tasks: {len(tasks)}")

examples/introduction/tutorial.ipynb

Lines changed: 3 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -330,7 +330,7 @@
330330
"metadata": {},
331331
"outputs": [],
332332
"source": [
333-
"from maseval import Benchmark, Environment, Evaluator, Task, TaskCollection\n",
333+
"from maseval import Benchmark, Environment, Evaluator, Task, TaskQueue\n",
334334
"from maseval.interface.agents.smolagents import SmolAgentAdapter\n",
335335
"\n",
336336
"print(\"MASEval components imported successfully!\")"
@@ -634,23 +634,7 @@
634634
"id": "b3ee60a7",
635635
"metadata": {},
636636
"outputs": [],
637-
"source": [
638-
"# Create benchmark instance with agent configuration\n",
639-
"agent_data = {\"model_id\": \"gemini/gemini-2.5-flash\", \"temperature\": 0.7}\n",
640-
"\n",
641-
"benchmark = SimpleBenchmark(agent_data=agent_data, progress_bar=False)\n",
642-
"\n",
643-
"# Create task collection\n",
644-
"tasks = TaskCollection([task])\n",
645-
"\n",
646-
"# Run the benchmark\n",
647-
"print(\"Running benchmark...\\n\")\n",
648-
"reports = benchmark.run(tasks=tasks)\n",
649-
"\n",
650-
"print(\"\\n\" + \"=\" * 60)\n",
651-
"print(\"BENCHMARK COMPLETE\")\n",
652-
"print(\"=\" * 60)"
653-
]
637+
"source": "# Create benchmark instance\nagent_data = {\"model_id\": \"gemini/gemini-2.5-flash\", \"temperature\": 0.7}\n\nbenchmark = SimpleBenchmark(progress_bar=False)\n\n# Create task queue\ntasks = TaskQueue([task])\n\n# Run the benchmark\nprint(\"Running benchmark...\\n\")\nreports = benchmark.run(tasks=tasks, agent_data=agent_data)\n\nprint(\"\\n\" + \"=\" * 60)\nprint(\"BENCHMARK COMPLETE\")\nprint(\"=\" * 60)"
654638
},
655639
{
656640
"cell_type": "markdown",
@@ -746,4 +730,4 @@
746730
},
747731
"nbformat": 4,
748732
"nbformat_minor": 5
749-
}
733+
}

examples/macs_benchmark/macs_benchmark.py

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -737,7 +737,6 @@ def run_benchmark(
737737
# Get benchmark class and instantiate
738738
BenchmarkClass = get_benchmark_class(framework)
739739
benchmark = BenchmarkClass(
740-
agent_data=agent_config,
741740
callbacks=[logger],
742741
n_task_repeats=n_task_repeats,
743742
fail_on_setup_error=True,
@@ -747,7 +746,7 @@ def run_benchmark(
747746

748747
# Run benchmark
749748
print(f"\nRunning {framework} benchmark on {domain} domain...")
750-
results = benchmark.run(tasks=tasks)
749+
results = benchmark.run(tasks=tasks, agent_data=agent_config)
751750

752751
# Compute summary metrics
753752
summary = compute_benchmark_metrics(results)

0 commit comments

Comments
 (0)