spider2-dbt — duckdb_match verifier emitting binary reward.json#15
Merged
kentwelcome merged 26 commits intoJun 18, 2026
Merged
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…C-3) Emitted test.sh predicted-DB path consumes resolve_spider2_db_name (RIDER) instead of hardcoding /app/spider2.duckdb; covered by a NON-spider2 db-name test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(AC-3) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sion + duckdb_match faithfulness) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ess + leakage regression
…+ real eval-spec schema
Re-derive compare_duckdb against xlang-ai/Spider2 spider2-dbt/evaluation_suite/
eval_utils.duckdb_match + compare_pandas_table: transpose to column-vectors,
each gold condition_col column must match SOME pred column-vector, per-column
sort under ignore_order, numeric compare via math.isclose(abs_tol=1e-2), AND
across condition_tabs, missing pred table = mismatch. Replaces the prior
row-tuple/exact-== Counter compare.
EvalSpec now models the real gold-line shape
{instance_id, evaluation:{func, parameters:{gold, condition_tabs,
condition_cols:List[List[int]], ignore_orders:List[bool]}}} with per-table
parallel lists (was a flat dict + single bool that would mis-parse real gold).
Tests cover the reviewer's verdict-flipping cases (float within 1e-2, column
reorder, extra pred columns) plus a real multi-table gold line.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…de verify-time tests/) The verifier-only tests/ assets (gold.duckdb, eval spec, comparator) tripped test_translate_spider2_dbt.py's content-leakage scan. Harbor uploads tests/ to the container only at verify time and wipes/recreates it around the agent run (harbor verifier.py uploads inside verify(); trial.py _verify_step reset_dirs removes tests_dir first), so tests/ is never agent-visible. Scope _leakage_hits to everything OUTSIDE tests/ — real agent-view protection is unchanged. Add a guard test planting gold-content in BOTH an agent-visible path and tests/, proving the scanner still fires on the former and only excludes the latter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…B1 scoping sound) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A zero-table EvalSpec makes compare_duckdb's AND-loop never run and return True, so a corrupted/truncated/schema-drifted spider2_eval.jsonl silently scored every prediction reward 1.0 (fail-open). Now: - EvalSpec rejects empty condition_tabs at construction. - load_eval_spec raises on an empty file and on a non-duckdb_match evaluation.func. - emit_reward wraps load_eval_spec/compare in try/except -> reward 0, never crash-into-pass and never silent 1.0. Adds negative regression tests proving an empty file, a missing condition_tabs, and a wrong func each raise (loader) or emit reward 0 (CLI), not 1. Cycle-1 comparator/leakage fixes, the rider, and AC-3 reward shape untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eal + load-bearing) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cycle-3 validation finding: the verifier hardcoded gold.duckdb and discarded the eval spec's parameters.gold, so real Spider2 tasks (which name the gold DB per task, e.g. playbook.duckdb) would score against the wrong/missing file. - EvalSpec gains a gold field; load_eval_spec parses parameters.gold and fails closed when a real wrapped duckdb_match spec omits it. - _ensure_verifier_assets resolves the gold basename from the spec, copies the EXACT named file from tests/gold/<basename>, emits --gold-db /tests/<basename>, and raises if the named gold is missing (no scoring against a missing file). - Regression tests: non-default basename (playbook.duckdb, no gold.duckdb) is copied + scored + leakage-clean; missing named gold fails closed; loader parses/validates gold. Cycle-1 comparator + leakage scoping, cycle-2 fail-closed validation, the predicted-DB resolver rider, and the AC-3 reward shape are untouched. uv run pytest -k spider2_dbt (ignoring the pre-existing-broken scoring test) = 78 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…name real + load-bearing) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aversal guard) evaluation.parameters.gold is external Spider2 input that the materializer joins onto a path (tests/gold/<gold>) and emits into test.sh (--gold-db /tests/<gold>). Path preserves '..'/absolute components, so a malformed or hostile spec could read/write outside the verifier-only gold area. Fail closed at the trust boundary: require Path(g).name == g and reject '.'/'..'. One parametrized regression test (../, absolute, separators, ., ..). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction The cycle-4 basename guard rejected separators/'..'/absolute but still admitted POSIX-valid names with shell metacharacters (e.g. `x.duckdb; echo ... #`), which were interpolated UNQUOTED into the verifier test.sh --gold-db argument and would execute as shell during verification. Replace it with a conservative allowlist `[A-Za-z0-9._-]+\.duckdb` at the trust boundary in load_eval_spec — subsumes the path checks and blocks whitespace/metacharacters. Adds ;/space/$() regression cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ion) The cycle-5 gold allowlist only guarded --gold-db. The --predicted-db arg interpolated db_name (resolved from the task's profiles.yml `path:`, external input) unquoted into test.sh, leaving the symmetric shell-injection boundary open. shlex.quote BOTH args at the emission point in _ensure_verifier_assets so neither external-input value can be interpreted as shell during verification. Gold allowlist retained (also guards path traversal). Adds a load-bearing regression with a profiles.yml `$()` DuckDB path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckdb-native duckdb_match imported pandas at module level, but verify.py runs INSIDE the task image and that image guarantees only duckdb (the build-time preflight imports it), not pandas — pandas is not a project dep and no injected layer installs it. An image without pandas would crash verify.py on import before emit_reward() writes reward.json, scoring valid runs as infra failures. Re-port the column-containment compare onto duckdb's .fetchall()/.description + stdlib (sorted, math.isclose, an _isna helper). Spider2 duckdb_match semantics preserved verbatim (containment, isclose 1e-2, sort key, condition_cols, multi-table AND, missing-table->0, NA==NA); the behavioral comparator suite stays green unchanged as the faithfulness net. Verifier now imports only duckdb — zero undeclared runtime deps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_fetch_columns interpolated condition_tabs (external eval-spec input) raw into SELECT * FROM "<table>". A `realt"; select 999; --` value breaks out; DuckDB runs the injected statement and .fetchall() returns its rows, so a hostile spec forces gold+pred to return identical rows -> reward 1.0 over mismatched DBs. Add _quote_ident (double embedded ") and use it in _fetch_columns: a breakout value becomes an unresolvable identifier, the gold fetch raises, and emit_reward scores 0 (fail-closed). Adds a load-bearing regression (verified live: raw -> [(999,)], fix -> CatalogException). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng gold Two findings from adversarial review: 1. [high] DECIMAL tolerance regression (from the cycle-7 pandas removal). DuckDB native fetchall() returns decimal.Decimal for DECIMAL/NUMERIC, which is numbers.Number but NOT numbers.Real, so _vectors_match's (int,float) tolerance branch skipped it -> exact != compare. Spider2's pandas fetchdf() coerced DECIMAL->float64, so within-1e-2 DECIMAL outputs scored 0. Normalize Decimal->float in _fetch_columns (one place), comparator otherwise unchanged. 2. [medium] Missing tests/gold/ fail-open. _ensure_verifier_assets early-returned when gold was absent, leaving the source test.sh (e.g. a stub exit 0) -> an unscored / trivially-passing spider2-dbt task. Now fails closed (raises); fixture-002 gains a real gold so multi-instance/explain tests materialize two scored tasks. Load-bearing regressions for both; spider2_dbt suite 92 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hrough _ensure_verifier_assets copied the comparator modules + gold DB + eval spec into view/tests/ with bare shutil.copy2. In view_mode="link" a colliding source-provided name (tests/verify.py, tests/<gold>.duckdb, etc.) is a symlink back to the source, so copy2 follows it and overwrites the source task — spider2 translation binds to link mode, so this is on the live path. Route all 5 copies through _copy_into_view, which unlinks a symlink dst first (same guard already on the Dockerfile/preflight/test.sh writes). Adds a load-bearing link-mode regression seeding a source tests/verify.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent validation of HEAD 965f2ea: 3 ACs green from clean checkout (gating 93 passed, acceptance -k spider2_dbt_verify 49 passed), all dispatch focus areas confirmed by live probes (comparator faithfulness, duckdb-only, fail-closed guards, shell+SQL injection sealed, symlink write-through, reward.json shape). 4 full-suite failures pre-existing on merge-base 9c39af2 (verified). Code review: 0 blocking, 3 minor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_012s6B1EQB2346B7KiM4hXPX
There was a problem hiding this comment.
Pull request overview
This PR adds a Spider2-dbt Harbor verifier that scores agent-produced DuckDB outputs against gold using a DuckDB-only, Spider2-faithful duckdb_match comparator, emitting Harbor-shaped {"reward": <float>} artifacts while failing closed on malformed inputs and hardening against injection and link-mode symlink write-through.
Changes:
- Introduces
duckdb_matchcomparator + eval-spec loader +verify.pyCLI, and wires verifier asset emission into the spider2-dbt Harbor task materializer. - Adds safety hardening: fail-closed spec handling, gold basename allowlist, shell quoting, SQL identifier quoting, and symlink-unlink copy guards in link mode.
- Adds comprehensive unit/integration tests, fixtures, and validation documentation for comparator faithfulness and verifier emission.
Reviewed changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_translate_spider2_dbt.py | Scopes leakage scanning to agent-visible paths (excludes verify-only tests/ subtree) and adds a guard test. |
| tests/unit/test_spider2_dbt_verify_comparator.py | New unit suite covering eval-spec parsing, fail-closed behavior, and comparator semantics (tolerance, containment, ordering). |
| tests/unit/test_spider2_dbt_verify_cli.py | New unit tests for emit_reward and verifier asset emission behavior. |
| tests/unit/test_spider2_dbt_harbor_view.py | Updates fixtures for fail-closed gold requirements and adds link-mode symlink overwrite regression coverage. |
| tests/integration/test_spider2_dbt_verify_test_sh.py | New integration tests exercising emitted verifier assets and test.sh contract. |
| tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/spider2_eval.jsonl | Adds a gold eval spec fixture for the second minimal task. |
| tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/spider2_eval.jsonl | Adds/updates gold eval spec fixture for the first minimal task. |
| tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/build_gold.py | Adds a script to regenerate the committed gold DuckDB fixture deterministically. |
| src/razorback/benchmarks/spider2_dbt/verify.py | New verifier CLI that writes Harbor-shaped reward.json and fails closed on spec/compare errors. |
| src/razorback/benchmarks/spider2_dbt/harbor_view.py | Emits verifier assets into view tests/ (verify-only), quotes shell args, copies spec-named gold DB, and guards link-mode symlink writes. |
| src/razorback/benchmarks/spider2_dbt/eval_spec.py | New eval-spec model/loader with fail-closed guards and gold basename allowlist. |
| src/razorback/benchmarks/spider2_dbt/duckdb_match.py | New DuckDB-only comparator implementing Spider2 duckdb_match semantics with tolerance and containment. |
| docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md | Adds a validation report documenting AC verification, probes, and known pre-existing failures. |
| docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md | Extends workflow/implementation log with multi-cycle validation history and final status. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+160
to
+166
| condition_tabs = list(params.get("condition_tabs", [])) | ||
| raw_cols = params.get("condition_cols") | ||
| condition_cols = ( | ||
| [list(inner) for inner in raw_cols] if raw_cols is not None else [] | ||
| ) | ||
| raw_orders = params.get("ignore_orders") | ||
| ignore_orders = [bool(b) for b in raw_orders] if raw_orders is not None else [] |
kentwelcome
added a commit
that referenced
this pull request
Jun 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds the spider2-dbt Harbor verifier that scores agent dbt output against gold via faithful binary duckdb_match, so the benchmark produces trustworthy 1.0/0.0 rewards.
What changed
Evidence
Review guidance
Risk surface is the comparator + verifier emission: src/razorback/benchmarks/spider2_dbt/{duckdb_match.py,harbor_view.py}.
r5