fix(dab): isolate per-query validator failures in verify_batch#19
Merged
Conversation
A validator raising (e.g. validate_qN calling .lower() on a non-string answer when the agent returns a JSON list) aborted emit_reward for the whole dataset, so no reward.json was written and the entire dataset was dropped from the run as a RewardFileNotFoundError trial error — silently removing it from the stratified Pass@1 denominator rather than scoring 0. Wrap the per-query validate_fn call in try/except: score the offending query 0.0 with the exception as the reason and continue grading the rest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the DAB batch verifier so that a single query’s validator crashing at runtime no longer aborts reward emission for the entire dataset, preventing dropped datasets and inflated aggregate metrics.
Changes:
- Wrap per-query
validate_fn(answer)calls inverify_batch.pywithtry/exceptto convert runtime validator exceptions into a0.0reward and an explanatory reason, while continuing to grade remaining queries. - Add a unit test ensuring a malformed answer for one query does not prevent sibling queries from being graded and does not prevent
reward.jsonfrom being written.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/razorback-plugin-dab/src/razorback_plugin_dab/verify/verify_batch.py | Adds per-query runtime exception isolation around validator execution, recording a zero reward + reason instead of crashing. |
| packages/razorback-plugin-dab/tests/unit/test_verify_batch_reward_shape.py | Adds coverage to assert validator runtime errors are isolated per query and artifacts are still written. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In DAB batch mode,
verify_batch.pyrunsvalidate_qN.pyfor each query inside oneemit_rewardloop with no per-query error isolation. If a single validator raises atcall time, the whole script crashes — no
reward.jsonis written — and the harness recordsthe trial as a
RewardFileNotFoundErrorerror, which drops the entire dataset from therun. It silently leaves the stratified-Pass@1 denominator rather than scoring 0.
Observed on
PATENTS: the agent returned theq1answer as a JSON list;validate_q1.pycalls
llm_output.lower()→AttributeError: 'list' object has no attribute 'lower'. All ofq1/q2/q3lost their rewards and PATENTS vanished from the summary — inflating the reportedstratified Pass@1 (computed over the surviving 11 datasets instead of 12).
Fix
Wrap the per-query
validate_fn(answer)call intry/except: score the offending query0.0with the exception text as thereason, and continue grading the remaining queries.A malformed (e.g. non-string) answer is a content failure → reward 0, not a harness error.
Scope is deliberately narrow: the
exceptonly covers the runtime call. Validatorimport failures (e.g. a missing
common_scaffolddependency, which indicates a brokenverifier setup affecting every query) still crash loudly — preserved by the existing
test_batch_verify_does_not_mask_validator_import_errors.Tests
test_batch_verify_isolates_per_query_runtime_validator_error: a list-typed answerscores its query 0 (with reason), a sibling well-typed query still grades,
reward.jsoniswritten (mean 0.5).
🤖 Generated with Claude Code