fix(dab): isolate per-query validator failures in verify_batch by kentwelcome · Pull Request #19 · spacedock-dev/razorback

kentwelcome · 2026-06-21T11:41:55Z

Problem

In DAB batch mode, verify_batch.py runs validate_qN.py for each query inside one
emit_reward loop with no per-query error isolation. If a single validator raises at
call time, the whole script crashes and no reward.json is written. The harness then records
the trial as a RewardFileNotFoundError, which drops the entire dataset from the
run: the dataset silently leaves the stratified Pass@1 denominator instead of scoring 0.
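To make the denominator effect concrete, here is a small sketch. It assumes the stratified Pass@1 is a plain mean of per-dataset scores, and the dataset names and score values are invented for illustration; none of them come from the actual run.

```python
from statistics import mean

# Hypothetical per-dataset Pass@1 values for the 11 datasets that survived.
# These numbers are illustrative only, not results from the real run.
surviving = {f"dataset_{i}": 0.5 for i in range(11)}

# Case 1: the crashed dataset is dropped and never enters the average.
dropped = mean(surviving.values())

# Case 2: the crashed dataset is scored 0 and stays in the denominator.
scored_zero = mean([*surviving.values(), 0.0])

print(f"dropped: {dropped:.4f}, scored as 0: {scored_zero:.4f}")
```

Under these assumptions, dropping the dataset reports a higher aggregate than scoring it 0, which is the inflation described above.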

Observed on PATENTS: the agent returned the q1 answer as a JSON list; validate_q1.py
calls llm_output.lower(), which raises AttributeError: 'list' object has no attribute 'lower'.
All three queries (q1, q2, q3) lost their rewards and PATENTS vanished from the summary, inflating
the reported stratified Pass@1 (computed over the surviving 11 datasets instead of 12).
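The failure is easy to reproduce in isolation. The validator below is a hypothetical stand-in, not the real validate_q1.py; the only thing it shares with the report is that it calls .lower() on its argument.

```python
import json


def validate_q1(llm_output):
    # Hypothetical validator body: assumes llm_output is a str.
    return "expected" in llm_output.lower()


# The agent returned the answer as a JSON list rather than a string.
answer = json.loads('["expected"]')

try:
    validate_q1(answer)
except AttributeError as exc:
    error_text = str(exc)
    print(error_text)
```

Without the try/except around the call, this exception would propagate out of the grading loop and end the script before any reward file is written.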

Fix

Wrap the per-query validate_fn(answer) call in try/except: score the offending query
0.0 with the exception text as the reason, and continue grading the remaining queries.
A malformed (e.g. non-string) answer is a content failure → reward 0, not a harness error.

Scope is deliberately narrow: the except only covers the runtime call. Validator
import failures (e.g. a missing common_scaffold dependency, which indicates a broken
verifier setup affecting every query) still crash loudly — preserved by the existing
test_batch_verify_does_not_mask_validator_import_errors.
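The shape of the fix can be sketched as follows. This is not the code from verify_batch.py: grade_queries, load_validator, the `validate` attribute name, and the result dictionary layout are all assumptions made for illustration. What it shows is the narrow scope, with the validator call inside the try and the import outside it.

```python
import importlib


def load_validator(module_name):
    # Deliberately NOT wrapped in try/except: a failed import means the
    # verifier setup is broken for every query, so it should crash loudly.
    module = importlib.import_module(module_name)
    return module.validate  # hypothetical attribute name


def grade_queries(validators, answers):
    """Grade each query independently; a raising validator scores 0.0."""
    results = {}
    for query_id, validate_fn in validators.items():
        try:
            # Only the runtime call is covered by the except clause.
            passed = bool(validate_fn(answers.get(query_id)))
            results[query_id] = {"reward": 1.0 if passed else 0.0, "reason": ""}
        except Exception as exc:
            # Content failure: score 0 with the exception text as the reason.
            results[query_id] = {
                "reward": 0.0,
                "reason": f"{type(exc).__name__}: {exc}",
            }
    return results


# Toy validators that, like the one in the report, call .lower() on the answer.
validators = {
    "q1": lambda out: out.lower() == "alpha",
    "q2": lambda out: out.lower() == "beta",
}
results = grade_queries(validators, {"q1": ["alpha"], "q2": "Beta"})
print(results)
```

Here q1 receives a list and is scored 0.0 with an AttributeError reason, while q2 is still graded normally. Calling load_validator on a module that does not exist still raises, which mirrors the behaviour the existing import-error test protects.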

Tests

New test_batch_verify_isolates_per_query_runtime_validator_error: a list-typed answer
scores its query 0 (with a reason), a sibling well-typed query is still graded, and reward.json is
written (mean 0.5).
The existing import-error-stays-loud and happy-path tests are unchanged, and all three tests pass (3/3).
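A test along these lines might look like the sketch below. The emit_reward here is a simplified stand-in with an invented signature, and the reward.json layout (a top-level "reward" mean plus a "per_query" map) is an assumption; the real test in test_verify_batch_reward_shape.py may be structured quite differently.

```python
import json
import tempfile
from pathlib import Path


def emit_reward(validators, answers, out_dir):
    # Stand-in for the batch verifier; not the real emit_reward signature.
    per_query = {}
    for qid, validate_fn in validators.items():
        try:
            reward = float(bool(validate_fn(answers[qid])))
            per_query[qid] = {"reward": reward, "reason": ""}
        except Exception as exc:
            per_query[qid] = {"reward": 0.0, "reason": repr(exc)}
    mean_reward = sum(r["reward"] for r in per_query.values()) / len(per_query)
    path = Path(out_dir) / "reward.json"
    # Hypothetical file layout, chosen only for this illustration.
    path.write_text(json.dumps({"reward": mean_reward, "per_query": per_query}))
    return path


def test_isolates_per_query_runtime_validator_error():
    validators = {
        "q1": lambda out: out.lower() == "a",
        "q2": lambda out: out.lower() == "b",
    }
    answers = {"q1": ["a"], "q2": "B"}  # q1 is list-typed, q2 is well-typed
    with tempfile.TemporaryDirectory() as tmp:
        path = emit_reward(validators, answers, tmp)
        payload = json.loads(path.read_text())
    assert payload["per_query"]["q1"]["reward"] == 0.0
    assert "lower" in payload["per_query"]["q1"]["reason"]
    assert payload["per_query"]["q2"]["reward"] == 1.0
    assert payload["reward"] == 0.5
    return payload


payload = test_isolates_per_query_runtime_validator_error()
```

The three properties checked match the ones listed above: the offending query scores 0 with a reason, the sibling query is graded, and the reward file exists with a mean of 0.5.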

🤖 Generated with Claude Code

A validator raising (e.g. validate_qN calling .lower() on a non-string answer when the agent returns a JSON list) aborted emit_reward for the whole dataset, so no reward.json was written and the entire dataset was dropped from the run as a RewardFileNotFoundError trial error, silently removing it from the stratified Pass@1 denominator rather than scoring 0. Wrap the per-query validate_fn call in try/except: score the offending query 0.0 with the exception as the reason and continue grading the rest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR hardens the DAB batch verifier so that a single query’s validator crashing at runtime no longer aborts reward emission for the entire dataset, preventing dropped datasets and inflated aggregate metrics.

Changes:

Wrap per-query validate_fn(answer) calls in verify_batch.py with try/except to convert runtime validator exceptions into a 0.0 reward and an explanatory reason, while continuing to grade remaining queries.
Add a unit test ensuring a malformed answer for one query does not prevent sibling queries from being graded and does not prevent reward.json from being written.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
packages/razorback-plugin-dab/src/razorback_plugin_dab/verify/verify_batch.py	Adds per-query runtime exception isolation around validator execution, recording a zero reward + reason instead of crashing.
packages/razorback-plugin-dab/tests/unit/test_verify_batch_reward_shape.py	Adds coverage to assert validator runtime errors are isolated per query and artifacts are still written.

kentwelcome and others added 2 commits June 21, 2026 11:40

test(dab): per-query runtime validator error is isolated, not fatal

e50dab8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 21, 2026 11:41

Copilot started reviewing on behalf of kentwelcome June 21, 2026 11:42

Copilot AI reviewed Jun 21, 2026

kentwelcome merged commit d9fc8a8 into main Jun 21, 2026
1 check passed

kentwelcome deleted the fix/dab-verify-batch-per-query-isolation branch June 21, 2026 11:44
