Fixing inconsistent error traces across failure modes. (#61)
Emit a uniform report schema on every task outcome
`Benchmark.run()` documents a stable report schema, but the setup-failure and setup-timeout early returns in
`_execute_task_repetition`, plus the `except Exception` fallback in `_run_parallel`, built reports without the `usage` and `task` keys (and with empty `traces`/`config`). Rows that failed in setup were therefore structurally different from rows that succeeded, breaking consumers that index `report["task"]` / `report["usage"]` on every row.
Route every report through a new `Benchmark._build_report()` helper so all reports always carry `task_id`, `repeat_idx`, `status`, `error`, `traces`, `config`, `usage`, `eval`, and `task`; `error` is `None` only for `SUCCESS` and is otherwise always populated.
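As a rough illustration of the idea, a single builder that always emits the same nine keys might look like the sketch below. This is not the actual `Benchmark._build_report()` implementation: the function signature, the default values, the `SETUP_FAILED` status name, and the shape of the fallback `error` payload are all assumptions made for the example.

```python
from typing import Any, Dict, Optional

# The nine keys named in this commit message.
REPORT_KEYS = (
    "task_id", "repeat_idx", "status", "error",
    "traces", "config", "usage", "eval", "task",
)


def build_report(
    task_id: str,
    repeat_idx: int,
    status: str,
    error: Optional[Dict[str, Any]] = None,
    traces: Optional[Dict[str, Any]] = None,
    config: Optional[Dict[str, Any]] = None,
    usage: Optional[Dict[str, Any]] = None,
    eval_result: Optional[Any] = None,
    task: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for the helper: same keys on every outcome."""
    if status == "SUCCESS":
        # `error` is None only for SUCCESS.
        error = None
    elif error is None:
        # Assumed fallback so non-SUCCESS rows always carry an error payload.
        error = {"message": f"no error details recorded for status {status}"}
    return {
        "task_id": task_id,
        "repeat_idx": repeat_idx,
        "status": status,
        "error": error,
        "traces": traces if traces is not None else {},
        "config": config if config is not None else {},
        "usage": usage if usage is not None else {},
        "eval": eval_result,
        "task": task if task is not None else {},
    }


ok = build_report("t1", 0, "SUCCESS", usage={"tokens": 10}, task={"id": "t1"})
failed = build_report("t1", 1, "SETUP_FAILED")  # status name is illustrative
```

With a builder like this, a consumer can index `report["task"]` or `report["usage"]` on every row without first checking which code path produced it.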
Also fix parallel fail-fast: `_run_parallel` swallowed the deliberate re-raise triggered by `fail_on_setup_error` / `fail_on_task_error` / `fail_on_evaluation_error` into a degraded report and kept going, making those flags no-ops under `num_workers > 1`. It now re-raises (cancelling queued work) to abort the run like the sequential path; only genuinely unexpected worker failures become a full-schema `UNKNOWN_EXECUTION_ERROR` report so the rest of the batch continues.
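The distinction between a deliberate abort and an unexpected worker failure can be sketched with `concurrent.futures`. This is a minimal sketch, not the project's `_run_parallel`: `FailFastError`, `run_parallel`, and the `worker` callable are hypothetical names, and the degraded report is trimmed to three keys for brevity where the real one carries the full schema. `cancel_futures` requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FailFastError(Exception):
    """Hypothetical marker for a deliberate re-raise caused by a fail_on_* flag."""


def run_parallel(jobs, worker, num_workers=2):
    """Run worker(job) for each job; abort on FailFastError, degrade on anything else."""
    reports = []
    executor = ThreadPoolExecutor(max_workers=num_workers)
    try:
        futures = {executor.submit(worker, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                reports.append(future.result())
            except FailFastError:
                # Deliberate abort: propagate so the whole run stops.
                raise
            except Exception as exc:
                # Unexpected worker failure: record it and keep the batch going.
                reports.append({
                    "task_id": job,
                    "status": "UNKNOWN_EXECUTION_ERROR",
                    "error": repr(exc),
                })
        return reports
    finally:
        # Cancels anything still queued when we are aborting; on normal
        # completion every future is already done, so this is a plain shutdown.
        executor.shutdown(wait=True, cancel_futures=True)
```

Catching `FailFastError` before the broad `except Exception` is what keeps the fail-fast flags effective; with only the broad handler, the abort would be folded into a report, which is the behaviour this commit removes.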
Add `tests/test_core/test_benchmark/test_report_schema.py`, covering schema invariance across success and across setup, execution, and evaluation failures, in both sequential and parallel mode, plus the parallel fail-fast behaviour.
CHANGELOG.md: 2 additions, 0 deletions
@@ -22,6 +22,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Fixed
- Fixed MACS real-data tests passing `{"environment_data": task.environment_data}` instead of `task.environment_data` directly, which caused `setup_state` to silently receive an empty tools list. (PR: #58)
+- Benchmark reports from `Benchmark.run()` now have a consistent schema across every outcome. Setup failures, setup timeouts, and unexpected worker failures in parallel runs previously produced reports missing the `usage` and `task` keys (with empty `traces`/`config`). Every report now always includes `task_id`, `repeat_idx`, `status`, `error`, `traces`, `config`, `usage`, `eval`, and `task`, and `report["error"]` is always populated whenever `status` is not `SUCCESS`. (PR: #61)
+- `fail_on_setup_error`, `fail_on_task_error`, and `fail_on_evaluation_error` now abort a parallel `Benchmark.run()` the same way they abort a sequential run. Previously a parallel run swallowed the failure into a degraded report and kept going. (PR: #61)