spider2-dbt — duckdb_match verifier emitting binary reward.json#15

Merged

kentwelcome merged 26 commits into

mainfrom

spacedock-ensign/spider2-dbt-duckdb-match-verifier

Jun 18, 2026

kentwelcome commented Jun 18, 2026

Contributor

Adds the spider2-dbt Harbor verifier that scores agent dbt output against gold via faithful binary duckdb_match, so the benchmark produces trustworthy 1.0/0.0 rewards.

What changed

Add duckdb-only comparator reproducing Spider2 duckdb_match (column-containment, isclose 1e-2).
Emit tests/test.sh + bundled verifier writing Harbor {"reward": } to /logs/verifier.
Fail closed on empty/malformed/missing-gold eval specs and unsafe gold basenames.
Quote shell args and SQL identifiers; guard link-mode symlink writes.
Resolve gold + predicted DB names from the eval spec and dbt profiles.

Evidence

uv run pytest -k spider2_dbt: 93/93 passed (acceptance subset 49/49 passed).
Independent code review against the upstream Spider2 oracle: 0 blocking; no input forcing a false reward 1.0.

Review guidance

Risk surface is the comparator + verifier emission: src/razorback/benchmarks/spider2_dbt/{duckdb_match.py,harbor_view.py}.

kentwelcome and others added 26 commits

June 18, 2026 19:12


          feat(spider2): eval-spec + duckdb_match comparator (AC-1, AC-2)

c72d551

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(spider2): verify.py CLI + materialize verifier assets/test.sh (A…

06be8be

…C-3)

Emitted test.sh predicted-DB path consumes resolve_spider2_db_name (RIDER)
instead of hardcoding /app/spider2.duckdb; covered by a NON-spider2 db-name test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          test(spider2): end-to-end emitted verifier writes harbor reward.json …

1a3b612

…(AC-3)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): implementation stage report

7ab0bf7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): validation report — REJECTED (translate leakage regres…

637aac2

…sion + duckdb_match faithfulness)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feedback: validation gate rejected (cycle 1) — duckdb_match faithfuln…

98604e5

…ess + leakage regression


          fix(spider2): B2 — faithful duckdb_match (column-containment + 1e-2) …

61a1b9c

…+ real eval-spec schema

Re-derive compare_duckdb against xlang-ai/Spider2 spider2-dbt/evaluation_suite/
eval_utils.duckdb_match + compare_pandas_table: transpose to column-vectors,
each gold condition_col column must match SOME pred column-vector, per-column
sort under ignore_order, numeric compare via math.isclose(abs_tol=1e-2), AND
across condition_tabs, missing pred table = mismatch. Replaces the prior
row-tuple/exact-== Counter compare.

EvalSpec now models the real gold-line shape
{instance_id, evaluation:{func, parameters:{gold, condition_tabs,
condition_cols:List[List[int]], ignore_orders:List[bool]}}} with per-table
parallel lists (was a flat dict + single bool that would mis-parse real gold).

Tests cover the reviewer's verdict-flipping cases (float within 1e-2, column
reorder, extra pred columns) plus a real multi-table gold line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2): B1 — scope leakage scan to agent-visible portion (exclu…

22b9b6c

…de verify-time tests/)

The verifier-only tests/ assets (gold.duckdb, eval spec, comparator) tripped
test_translate_spider2_dbt.py's content-leakage scan. Harbor uploads tests/ to
the container only at verify time and wipes/recreates it around the agent run
(harbor verifier.py uploads inside verify(); trial.py _verify_step reset_dirs
removes tests_dir first), so tests/ is never agent-visible. Scope _leakage_hits
to everything OUTSIDE tests/ — real agent-view protection is unchanged.

Add a guard test planting gold-content in BOTH an agent-visible path and tests/,
proving the scanner still fires on the former and only excludes the latter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): implementation stage report (feedback cycle 1)

f878967

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): validation re-review (cycle 2) — PASSED (B2 faithful, …

0b64e92

…B1 scoping sound)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feedback: validation gate rejected (cycle 2) — fail-closed eval-spec …

f25597c

…validation


          fix(spider2-dbt): fail closed on empty/malformed gold eval spec

869918d

A zero-table EvalSpec makes compare_duckdb's AND-loop never run and return
True, so a corrupted/truncated/schema-drifted spider2_eval.jsonl silently
scored every prediction reward 1.0 (fail-open). Now:

- EvalSpec rejects empty condition_tabs at construction.
- load_eval_spec raises on an empty file and on a non-duckdb_match
  evaluation.func.
- emit_reward wraps load_eval_spec/compare in try/except -> reward 0,
  never crash-into-pass and never silent 1.0.

Adds negative regression tests proving an empty file, a missing
condition_tabs, and a wrong func each raise (loader) or emit reward 0
(CLI), not 1. Cycle-1 comparator/leakage fixes, the rider, and AC-3
reward shape untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2-dbt): cycle-2 implementation stage report (fail-closed)

11c66f4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): validation re-review (cycle 3) — PASSED (fail-closed r…

6698e26

…eal + load-bearing)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feedback: validation gate rejected (cycle 3) — honor eval-spec gold b…

c6749fd

…asename


          fix(spider2-dbt): drive gold DB basename from evaluation.parameters.gold

a677c36

Cycle-3 validation finding: the verifier hardcoded gold.duckdb and discarded
the eval spec's parameters.gold, so real Spider2 tasks (which name the gold DB
per task, e.g. playbook.duckdb) would score against the wrong/missing file.

- EvalSpec gains a gold field; load_eval_spec parses parameters.gold and
  fails closed when a real wrapped duckdb_match spec omits it.
- _ensure_verifier_assets resolves the gold basename from the spec, copies the
  EXACT named file from tests/gold/<basename>, emits --gold-db /tests/<basename>,
  and raises if the named gold is missing (no scoring against a missing file).
- Regression tests: non-default basename (playbook.duckdb, no gold.duckdb) is
  copied + scored + leakage-clean; missing named gold fails closed; loader
  parses/validates gold.

Cycle-1 comparator + leakage scoping, cycle-2 fail-closed validation, the
predicted-DB resolver rider, and the AC-3 reward shape are untouched.
uv run pytest -k spider2_dbt (ignoring the pre-existing-broken scoring test) = 78 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs: cycle-3 implementation stage report (gold basename fix)

bac0705

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): validation re-review (cycle-3 fix) — PASSED (gold base…

78b4a52

…name real + load-bearing)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): reject non-basename gold in load_eval_spec (path-tr…

bfed74c

…aversal guard)

evaluation.parameters.gold is external Spider2 input that the materializer
joins onto a path (tests/gold/<gold>) and emits into test.sh
(--gold-db /tests/<gold>). Path preserves '..'/absolute components, so a
malformed or hostile spec could read/write outside the verifier-only gold
area. Fail closed at the trust boundary: require Path(g).name == g and reject
'.'/'..'. One parametrized regression test (../, absolute, separators, ., ..).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): allowlist gold basename to block test.sh shell inje…

b774bb7

…ction

The cycle-4 basename guard rejected separators/'..'/absolute but still
admitted POSIX-valid names with shell metacharacters (e.g.
`x.duckdb; echo ... #`), which were interpolated UNQUOTED into the verifier
test.sh --gold-db argument and would execute as shell during verification.
Replace it with a conservative allowlist `[A-Za-z0-9._-]+\.duckdb` at the
trust boundary in load_eval_spec — subsumes the path checks and blocks
whitespace/metacharacters. Adds ;/space/$() regression cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): shlex.quote both verifier args (predicted-db inject…

843b4c1

…ion)

The cycle-5 gold allowlist only guarded --gold-db. The --predicted-db arg
interpolated db_name (resolved from the task's profiles.yml `path:`, external
input) unquoted into test.sh, leaving the symmetric shell-injection boundary
open. shlex.quote BOTH args at the emission point in _ensure_verifier_assets
so neither external-input value can be interpreted as shell during
verification. Gold allowlist retained (also guards path traversal). Adds a
load-bearing regression with a profiles.yml `$()` DuckDB path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): drop pandas from verifier; re-port comparator on du…

0ce7d51

…ckdb-native

duckdb_match imported pandas at module level, but verify.py runs INSIDE the
task image and that image guarantees only duckdb (the build-time preflight
imports it), not pandas — pandas is not a project dep and no injected layer
installs it. An image without pandas would crash verify.py on import before
emit_reward() writes reward.json, scoring valid runs as infra failures.

Re-port the column-containment compare onto duckdb's .fetchall()/.description
+ stdlib (sorted, math.isclose, an _isna helper). Spider2 duckdb_match
semantics preserved verbatim (containment, isclose 1e-2, sort key,
condition_cols, multi-table AND, missing-table->0, NA==NA); the behavioral
comparator suite stays green unchanged as the faithfulness net. Verifier now
imports only duckdb — zero undeclared runtime deps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): quote spec-supplied table names to block SQL injection

6c76cbe

_fetch_columns interpolated condition_tabs (external eval-spec input) raw into
SELECT * FROM "<table>". A `realt"; select 999; --` value breaks out; DuckDB
runs the injected statement and .fetchall() returns its rows, so a hostile spec
forces gold+pred to return identical rows -> reward 1.0 over mismatched DBs.

Add _quote_ident (double embedded ") and use it in _fetch_columns: a breakout
value becomes an unresolvable identifier, the gold fetch raises, and emit_reward
scores 0 (fail-closed). Adds a load-bearing regression (verified live: raw ->
[(999,)], fix -> CatalogException).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): DECIMAL tolerance regression + fail closed on missi…

f35a13b

…ng gold

Two findings from adversarial review:

1. [high] DECIMAL tolerance regression (from the cycle-7 pandas removal).
   DuckDB native fetchall() returns decimal.Decimal for DECIMAL/NUMERIC, which
   is numbers.Number but NOT numbers.Real, so _vectors_match's (int,float)
   tolerance branch skipped it -> exact != compare. Spider2's pandas fetchdf()
   coerced DECIMAL->float64, so within-1e-2 DECIMAL outputs scored 0. Normalize
   Decimal->float in _fetch_columns (one place), comparator otherwise unchanged.

2. [medium] Missing tests/gold/ fail-open. _ensure_verifier_assets early-returned
   when gold was absent, leaving the source test.sh (e.g. a stub exit 0) ->
   an unscored / trivially-passing spider2-dbt task. Now fails closed (raises);
   fixture-002 gains a real gold so multi-instance/explain tests materialize two
   scored tasks.

Load-bearing regressions for both; spider2_dbt suite 92 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(spider2-dbt): guard verifier-asset copies against symlink write-t…

965f2ea

…hrough

_ensure_verifier_assets copied the comparator modules + gold DB + eval spec
into view/tests/ with bare shutil.copy2. In view_mode="link" a colliding
source-provided name (tests/verify.py, tests/<gold>.duckdb, etc.) is a symlink
back to the source, so copy2 follows it and overwrites the source task —
spider2 translation binds to link mode, so this is on the live path. Route all
5 copies through _copy_into_view, which unlinks a symlink dst first (same guard
already on the Dockerfile/preflight/test.sh writes). Adds a load-bearing
link-mode regression seeding a source tests/verify.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(spider2): final validation (cycles 1-10) — PASSED

5cdcea3

Independent validation of HEAD 965f2ea: 3 ACs green from clean checkout
(gating 93 passed, acceptance -k spider2_dbt_verify 49 passed), all
dispatch focus areas confirmed by live probes (comparator faithfulness,
duckdb-only, fail-closed guards, shell+SQL injection sealed, symlink
write-through, reward.json shape). 4 full-suite failures pre-existing on
merge-base 9c39af2 (verified). Code review: 0 blocking, 3 minor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012s6B1EQB2346B7KiM4hXPX

Copilot AI review requested due to automatic review settings

June 18, 2026 14:44

Copilot started reviewing on behalf of kentwelcome

June 18, 2026 14:44

Copilot AI reviewed

View reviewed changes

Copilot AI left a comment

Pull request overview

This PR adds a Spider2-dbt Harbor verifier that scores agent-produced DuckDB outputs against gold using a DuckDB-only, Spider2-faithful duckdb_match comparator, emitting Harbor-shaped {"reward": <float>} artifacts while failing closed on malformed inputs and hardening against injection and link-mode symlink write-through.

Changes:

Introduces duckdb_match comparator + eval-spec loader + verify.py CLI, and wires verifier asset emission into the spider2-dbt Harbor task materializer.
Adds safety hardening: fail-closed spec handling, gold basename allowlist, shell quoting, SQL identifier quoting, and symlink-unlink copy guards in link mode.
Adds comprehensive unit/integration tests, fixtures, and validation documentation for comparator faithfulness and verifier emission.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/unit/test_translate_spider2_dbt.py	Scopes leakage scanning to agent-visible paths (excludes verify-only `tests/` subtree) and adds a guard test.
tests/unit/test_spider2_dbt_verify_comparator.py	New unit suite covering eval-spec parsing, fail-closed behavior, and comparator semantics (tolerance, containment, ordering).
tests/unit/test_spider2_dbt_verify_cli.py	New unit tests for `emit_reward` and verifier asset emission behavior.
tests/unit/test_spider2_dbt_harbor_view.py	Updates fixtures for fail-closed gold requirements and adds link-mode symlink overwrite regression coverage.
tests/integration/test_spider2_dbt_verify_test_sh.py	New integration tests exercising emitted verifier assets and `test.sh` contract.
tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/spider2_eval.jsonl	Adds a gold eval spec fixture for the second minimal task.
tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/spider2_eval.jsonl	Adds/updates gold eval spec fixture for the first minimal task.
tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/build_gold.py	Adds a script to regenerate the committed gold DuckDB fixture deterministically.
src/razorback/benchmarks/spider2_dbt/verify.py	New verifier CLI that writes Harbor-shaped `reward.json` and fails closed on spec/compare errors.
src/razorback/benchmarks/spider2_dbt/harbor_view.py	Emits verifier assets into view `tests/` (verify-only), quotes shell args, copies spec-named gold DB, and guards link-mode symlink writes.
src/razorback/benchmarks/spider2_dbt/eval_spec.py	New eval-spec model/loader with fail-closed guards and gold basename allowlist.
src/razorback/benchmarks/spider2_dbt/duckdb_match.py	New DuckDB-only comparator implementing Spider2 `duckdb_match` semantics with tolerance and containment.
docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md	Adds a validation report documenting AC verification, probes, and known pre-existing failures.
docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md	Extends workflow/implementation log with multi-cycle validation history and final status.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/razorback/benchmarks/spider2_dbt/eval_spec.py

Comment on lines +160 to +166

+                  condition_tabs = list(params.get("condition_tabs", []))
+                  raw_cols = params.get("condition_cols")
+                  condition_cols = (
+                      [list(inner) for inner in raw_cols] if raw_cols is not None else []
+                  )
+                  raw_orders = params.get("ignore_orders")
+                  ignore_orders = [bool(b) for b in raw_orders] if raw_orders is not None else []

kentwelcome merged commit f74dc9f into main

1 check passed

kentwelcome added a commit that referenced this pull request


          done: spider2-dbt-duckdb-match-verifier merged (#15) -> archived (ver…

a4ca5df

…dict PASSED)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet