Commit 1c52f23

Fixing inconsistent error traces across failure modes. (#61)
Emit a uniform report schema on every task outcome

`Benchmark.run()` documents a stable report schema, but the setup-failure and setup-timeout early returns in `_execute_task_repetition`, plus the `except Exception` fallback in `_run_parallel`, built reports without the `usage` and `task` keys (and with empty `traces`/`config`). Rows that failed in setup were therefore structurally different from rows that succeeded, breaking consumers that index `report["task"]` / `report["usage"]` on every row.

Route every report through a new `Benchmark._build_report()` helper so all reports always carry `task_id`, `repeat_idx`, `status`, `error`, `traces`, `config`, `usage`, `eval`, and `task`; `error` is `None` only for `SUCCESS` and is otherwise always populated.

Also fix parallel fail-fast: `_run_parallel` swallowed the deliberate re-raise triggered by `fail_on_setup_error` / `fail_on_task_error` / `fail_on_evaluation_error` into a degraded report and kept going, making those flags no-ops under `num_workers > 1`. It now re-raises (cancelling queued work) to abort the run like the sequential path; only genuinely unexpected worker failures become a full-schema `UNKNOWN_EXECUTION_ERROR` report so the rest of the batch continues.

Add tests/test_core/test_benchmark/test_report_schema.py covering schema invariance across success / setup / execution / evaluation failures in both sequential and parallel mode, plus the parallel fail-fast behaviour.
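The schema gap the message describes can be illustrated with a small key-set check. This is only a sketch: the two rows are hypothetical, the status string is a placeholder rather than a real `TaskExecutionStatus` value, and `missing_keys` is an illustrative helper that does not exist in the library.

```python
from typing import Any, Dict, Set

# Top-level keys the commit message says every report carries.
CANONICAL_KEYS: Set[str] = {
    "task_id", "repeat_idx", "status", "error",
    "traces", "config", "usage", "eval", "task",
}


def missing_keys(report: Dict[str, Any]) -> Set[str]:
    """Return the canonical keys that this report lacks."""
    return CANONICAL_KEYS - set(report)


# Hypothetical rows. The status string is a placeholder, not a real enum value.
pre_fix_setup_failure = {
    "task_id": "t1",
    "repeat_idx": 0,
    "status": "setup-failure-placeholder",
    "error": {"error_type": "RuntimeError", "error_message": "boom", "traceback": ""},
    "traces": {},
    "config": {},
    "eval": None,
}
post_fix_setup_failure = {
    **pre_fix_setup_failure,
    "usage": None,
    "task": {"query": "q", "metadata": {}, "protocol": {}},
}

print(sorted(missing_keys(pre_fix_setup_failure)))   # ['task', 'usage']
print(sorted(missing_keys(post_fix_setup_failure)))  # []
```

A consumer that does `report["task"]` on the first row would raise `KeyError`; on the second row it would not.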
1 parent dc2315a commit 1c52f23

3 files changed

Lines changed: 388 additions & 50 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed

 - Fixed MACS real-data tests passing `{"environment_data": task.environment_data}` instead of `task.environment_data` directly, which caused `setup_state` to silently receive an empty tools list. (PR: #58)
+- Benchmark reports from `Benchmark.run()` now have a consistent schema across every outcome. Setup failures, setup timeouts, and unexpected worker failures in parallel runs previously produced reports missing the `usage` and `task` keys (with empty `traces`/`config`). Every report now always includes `task_id`, `repeat_idx`, `status`, `error`, `traces`, `config`, `usage`, `eval`, and `task`, and `report["error"]` is always populated whenever `status` is not `SUCCESS`. (PR: #61)
+- `fail_on_setup_error`, `fail_on_task_error`, and `fail_on_evaluation_error` now abort a parallel `Benchmark.run()` the same way they abort a sequential run. Previously a parallel run swallowed the failure into a degraded report and kept going. (PR: #61)

 ### Removed

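The fail-fast behaviour in the second new changelog entry can be sketched with the standard library alone. This is a minimal stand-in, not the library's `_run_parallel`: the function `run_parallel`, its `fail_fast` flag, and the job callables are all hypothetical, and `cancel_futures` assumes Python 3.9 or newer.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, List, Tuple


def run_parallel(
    jobs: List[Callable[[], Any]], fail_fast: bool, num_workers: int = 2
) -> Tuple[List[Any], List[str]]:
    """Run zero-argument jobs in threads and return (results, recorded_errors)."""
    results: List[Any] = []
    errors: List[str] = []
    executor = ThreadPoolExecutor(max_workers=num_workers)
    try:
        futures = [executor.submit(job) for job in jobs]
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                if fail_fast:
                    # Drop work that has not started yet, then propagate
                    # the original exception to abort the whole run.
                    executor.shutdown(wait=False, cancel_futures=True)
                    raise
                # Without fail-fast, record the failure and keep going.
                errors.append(repr(exc))
    finally:
        # Calling shutdown a second time is harmless; this waits for
        # any job that was already running.
        executor.shutdown(wait=True)
    return results, errors


def ok() -> int:
    return 1


def boom() -> int:
    raise ValueError("setup failed")


results, errors = run_parallel([ok, boom, ok], fail_fast=False)
print(sorted(results), len(errors))  # [1, 1] 1
```

With `fail_fast=True` the same call raises `ValueError` instead of returning, which is the sequential-style abort the entry describes.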
maseval/core/benchmark.py

Lines changed: 105 additions & 50 deletions
@@ -519,6 +519,64 @@ def _invoke_callbacks(self, method_name: str, *args, suppress_errors: bool = Tru

         return errors

+    def _build_report(
+        self,
+        task: Task,
+        repeat_idx: int,
+        status: TaskExecutionStatus,
+        *,
+        error: Optional[Dict[str, Any]] = None,
+        traces: Optional[Dict[str, Any]] = None,
+        config: Optional[Dict[str, Any]] = None,
+        usage: Optional[Dict[str, Any]] = None,
+        eval_results: Any = None,
+    ) -> Dict[str, Any]:
+        """Build a task-repetition report with the canonical schema.
+
+        Every report — success or failure, from setup, execution or evaluation,
+        and from sequential or parallel runs — carries the same top-level keys so
+        downstream consumers can rely on a stable schema. ``error`` is ``None``
+        only for ``SUCCESS``; for any other status it is always populated (a
+        generic placeholder is synthesised if a caller forgot to supply one).
+
+        Args:
+            task: The task this report describes.
+            repeat_idx: Repetition index (0 to n_task_repeats-1).
+            status: Final execution status for this repetition.
+            error: Error details dict (``error_type``/``error_message``/``traceback``,
+                plus any extra fields), or ``None`` if the repetition succeeded.
+            traces: Collected execution traces (defaults to ``{}`` when not available).
+            config: Collected component/benchmark configuration (defaults to ``{}``).
+            usage: Collected usage totals, or ``None`` if not collected.
+            eval_results: Evaluation results, or ``None`` if evaluation was skipped or failed.
+
+        Returns:
+            Report dictionary with keys: ``task_id``, ``repeat_idx``, ``status``,
+            ``error``, ``traces``, ``config``, ``usage``, ``eval``, ``task``.
+        """
+        if status is not TaskExecutionStatus.SUCCESS and error is None:
+            # Defensive: a non-success report must always carry error details.
+            error = {
+                "error_type": "UnknownError",
+                "error_message": f"Task ended with status '{status.value}' but no error details were recorded.",
+                "traceback": "",
+            }
+        return {
+            "task_id": str(task.id),
+            "repeat_idx": repeat_idx,
+            "status": status.value,
+            "error": error,
+            "traces": traces if traces is not None else {},
+            "config": config if config is not None else {},
+            "usage": usage,
+            "eval": eval_results,
+            "task": {
+                "query": task.query,
+                "metadata": dict(task.metadata),
+                "protocol": task.protocol.to_dict(),
+            },
+        }
+
     def _append_report_safe(self, report: Dict[str, Any]) -> None:
         """Append a report to the reports list (thread-safe).

@@ -1089,16 +1147,14 @@ def _execute_task_repetition(
                 "traceback": "".join(traceback.format_exception(type(e), e, e.__traceback__)),
             }

-            # Create a minimal report for this timeout
-            report = {
-                "task_id": str(task.id),
-                "repeat_idx": repeat_idx,
-                "status": execution_status.value,
-                "error": error_info,
-                "traces": e.partial_traces,
-                "config": {},
-                "eval": None,
-            }
+            # Create a minimal report for this timeout (canonical schema)
+            report = self._build_report(
+                task,
+                repeat_idx,
+                execution_status,
+                error=error_info,
+                traces=e.partial_traces,
+            )
             self.clear_registry()
             return report

@@ -1111,22 +1167,13 @@ def _execute_task_repetition(
                 "traceback": "".join(traceback.format_exception(type(e), e, e.__traceback__)),
             }

-            # Create a minimal report for this failed setup
-            report = {
-                "task_id": str(task.id),
-                "repeat_idx": repeat_idx,
-                "status": execution_status.value,
-                "error": error_info,
-                "traces": {},
-                "config": {},
-                "eval": None,
-            }
             self.clear_registry()

             if self.fail_on_setup_error:
                 raise

-            return report
+            # Create a minimal report for this failed setup (canonical schema)
+            return self._build_report(task, repeat_idx, execution_status, error=error_info)

         # 2. Execute agent system with optional user interaction loop
         try:
@@ -1265,21 +1312,16 @@ def _execute_task_repetition(
             eval_results = None

         # 5. Build report — all keys always present for consistent schema
-        report: Dict[str, Any] = {
-            "task_id": str(task.id),
-            "repeat_idx": repeat_idx,
-            "status": execution_status.value,
-            "error": error_info,
-            "traces": execution_traces,
-            "config": execution_configs,
-            "usage": execution_usage,
-            "eval": eval_results,
-            "task": {
-                "query": task.query,
-                "metadata": dict(task.metadata),
-                "protocol": task.protocol.to_dict(),
-            },
-        }
+        report = self._build_report(
+            task,
+            repeat_idx,
+            execution_status,
+            error=error_info,
+            traces=execution_traces,
+            config=execution_configs,
+            usage=execution_usage,
+            eval_results=eval_results,
+        )

         # Clear registry after task repetition completes
         self.clear_registry()
@@ -1391,20 +1433,26 @@ def submit_task_repeats(task: Task) -> None:
                 try:
                     report = future.result()
                 except Exception as e:
-                    # Create error report for unexpected failures
-                    report = {
-                        "task_id": task_id,
-                        "repeat_idx": repeat_idx,
-                        "status": TaskExecutionStatus.UNKNOWN_EXECUTION_ERROR.value,
-                        "error": {
+                    # A deliberate fail-fast re-raise from _execute_task_repetition
+                    # (fail_on_setup_error / fail_on_task_error / fail_on_evaluation_error)
+                    # must abort the parallel run, matching sequential semantics — rather
+                    # than being swallowed into a degraded report and letting the run continue.
+                    if self.fail_on_setup_error or self.fail_on_task_error or self.fail_on_evaluation_error:
+                        executor.shutdown(wait=False, cancel_futures=True)
+                        raise
+                    # Otherwise this is an unexpected failure inside the worker itself
+                    # (not handled by _execute_task_repetition) — record it with the
+                    # canonical report schema and carry on with the remaining tasks.
+                    report = self._build_report(
+                        task,
+                        repeat_idx,
+                        TaskExecutionStatus.UNKNOWN_EXECUTION_ERROR,
+                        error={
                             "error_type": type(e).__name__,
                             "error_message": str(e),
                             "traceback": "".join(traceback.format_exception(type(e), e, e.__traceback__)),
                         },
-                        "traces": {},
-                        "config": {},
-                        "eval": None,
-                    }
+                    )

                 self._append_report_safe(report)

@@ -1447,20 +1495,27 @@ def run(
             model parameters, agent architecture details, and tool specifications.

         Returns:
-            List of report dictionaries, one per task repetition. Each report contains:
+            List of report dictionaries, one per task repetition. Every report carries the
+            same keys (consistent schema) regardless of success or failure:
             - task_id: Task identifier (UUID)
             - repeat_idx: Repetition index (0 to n_task_repeats-1)
             - status: Execution status (one of TaskExecutionStatus enum values)
-            - traces: Execution traces from all registered components
-            - config: Configuration from all registered components and benchmark level
+            - traces: Execution traces from all registered components (``{}`` if unavailable, e.g. setup failure)
+            - config: Configuration from all registered components and benchmark level (``{}`` if unavailable)
+            - usage: Aggregated usage from all registered components (``None`` if not collected)
             - eval: Evaluation results (None if task or evaluation failed)
-            - error: Error details dict (only present if status is not SUCCESS), containing:
+            - task: Task summary dict with ``query``, ``metadata``, and ``protocol``
+            - error: Error details dict — ``None`` only when status is SUCCESS; otherwise always populated, containing:
                 - error_type: Exception class name
                 - error_message: Exception message
                 - traceback: Full traceback string
+                - (plus any error-specific extras, e.g. ``component``, ``elapsed``, ``timeout``)

         Raises:
             ValueError: If agent_data length doesn't match number of tasks (when agent_data is an iterable).
+            Exception: If a ``fail_on_setup_error`` / ``fail_on_task_error`` / ``fail_on_evaluation_error``
+                flag is set and the corresponding failure occurs, the original exception is re-raised
+                and the run is aborted (this applies to both sequential and parallel execution).

         How to use:
             This is the framework's main orchestration method that runs your entire benchmark. It

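Because the updated `Returns` section guarantees that `error` is `None` only on success and otherwise always has an `error_type`, a consumer can tally failures without first checking whether the key exists. The sketch below assumes only those documented fields; `count_error_types` is an illustrative helper and the rows are hypothetical, reduced to the fields the helper reads.

```python
from collections import Counter
from typing import Any, Dict, List


def count_error_types(reports: List[Dict[str, Any]]) -> Counter:
    """Count error_type values over all rows whose error is populated."""
    return Counter(
        r["error"]["error_type"] for r in reports if r["error"] is not None
    )


# Hypothetical rows, reduced to the fields this sketch reads.
reports = [
    {"task_id": "t1", "error": None},
    {"task_id": "t2", "error": {"error_type": "TimeoutError", "error_message": "", "traceback": ""}},
    {"task_id": "t3", "error": {"error_type": "ValueError", "error_message": "", "traceback": ""}},
    {"task_id": "t4", "error": {"error_type": "TimeoutError", "error_message": "", "traceback": ""}},
]

counts = count_error_types(reports)
print(counts["TimeoutError"], counts["ValueError"])  # 2 1
```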