Skip to content

spider2-dbt — duckdb_match verifier emitting binary reward.json#15

Merged
kentwelcome merged 26 commits into
mainfrom
spacedock-ensign/spider2-dbt-duckdb-match-verifier
Jun 18, 2026
Merged

spider2-dbt — duckdb_match verifier emitting binary reward.json#15
kentwelcome merged 26 commits into
mainfrom
spacedock-ensign/spider2-dbt-duckdb-match-verifier

Conversation

@kentwelcome

Copy link
Copy Markdown
Contributor

Adds the spider2-dbt Harbor verifier that scores agent dbt output against gold via faithful binary duckdb_match, so the benchmark produces trustworthy 1.0/0.0 rewards.

What changed

  • Add duckdb-only comparator reproducing Spider2 duckdb_match (column-containment, isclose 1e-2).
  • Emit tests/test.sh + bundled verifier writing Harbor {"reward": } to /logs/verifier.
  • Fail closed on empty/malformed/missing-gold eval specs and unsafe gold basenames.
  • Quote shell args and SQL identifiers; guard link-mode symlink writes.
  • Resolve gold + predicted DB names from the eval spec and dbt profiles.

Evidence

  • uv run pytest -k spider2_dbt: 93/93 passed (acceptance subset 49/49 passed).
  • Independent code review against the upstream Spider2 oracle: 0 blocking; no input forcing a false reward 1.0.

Review guidance

Risk surface is the comparator + verifier emission: src/razorback/benchmarks/spider2_dbt/{duckdb_match.py,harbor_view.py}.


r5

kentwelcome and others added 26 commits June 18, 2026 19:12
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…C-3)

Emitted test.sh predicted-DB path consumes resolve_spider2_db_name (RIDER)
instead of hardcoding /app/spider2.duckdb; covered by a NON-spider2 db-name test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(AC-3)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sion + duckdb_match faithfulness)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+ real eval-spec schema

Re-derive compare_duckdb against xlang-ai/Spider2 spider2-dbt/evaluation_suite/
eval_utils.duckdb_match + compare_pandas_table: transpose to column-vectors,
each gold condition_col column must match SOME pred column-vector, per-column
sort under ignore_order, numeric compare via math.isclose(abs_tol=1e-2), AND
across condition_tabs, missing pred table = mismatch. Replaces the prior
row-tuple/exact-== Counter compare.

EvalSpec now models the real gold-line shape
{instance_id, evaluation:{func, parameters:{gold, condition_tabs,
condition_cols:List[List[int]], ignore_orders:List[bool]}}} with per-table
parallel lists (was a flat dict + single bool that would mis-parse real gold).

Tests cover the reviewer's verdict-flipping cases (float within 1e-2, column
reorder, extra pred columns) plus a real multi-table gold line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…de verify-time tests/)

The verifier-only tests/ assets (gold.duckdb, eval spec, comparator) tripped
test_translate_spider2_dbt.py's content-leakage scan. Harbor uploads tests/ to
the container only at verify time and wipes/recreates it around the agent run
(harbor verifier.py uploads inside verify(); trial.py _verify_step reset_dirs
removes tests_dir first), so tests/ is never agent-visible. Scope _leakage_hits
to everything OUTSIDE tests/ — real agent-view protection is unchanged.

Add a guard test planting gold-content in BOTH an agent-visible path and tests/,
proving the scanner still fires on the former and only excludes the latter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…B1 scoping sound)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A zero-table EvalSpec makes compare_duckdb's AND-loop never run and return
True, so a corrupted/truncated/schema-drifted spider2_eval.jsonl silently
scored every prediction reward 1.0 (fail-open). Now:

- EvalSpec rejects empty condition_tabs at construction.
- load_eval_spec raises on an empty file and on a non-duckdb_match
  evaluation.func.
- emit_reward wraps load_eval_spec/compare in try/except -> reward 0,
  never crash-into-pass and never silent 1.0.

Adds negative regression tests proving an empty file, a missing
condition_tabs, and a wrong func each raise (loader) or emit reward 0
(CLI), not 1. Cycle-1 comparator/leakage fixes, the rider, and AC-3
reward shape untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eal + load-bearing)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cycle-3 validation finding: the verifier hardcoded gold.duckdb and discarded
the eval spec's parameters.gold, so real Spider2 tasks (which name the gold DB
per task, e.g. playbook.duckdb) would score against the wrong/missing file.

- EvalSpec gains a gold field; load_eval_spec parses parameters.gold and
  fails closed when a real wrapped duckdb_match spec omits it.
- _ensure_verifier_assets resolves the gold basename from the spec, copies the
  EXACT named file from tests/gold/<basename>, emits --gold-db /tests/<basename>,
  and raises if the named gold is missing (no scoring against a missing file).
- Regression tests: non-default basename (playbook.duckdb, no gold.duckdb) is
  copied + scored + leakage-clean; missing named gold fails closed; loader
  parses/validates gold.

Cycle-1 comparator + leakage scoping, cycle-2 fail-closed validation, the
predicted-DB resolver rider, and the AC-3 reward shape are untouched.
uv run pytest -k spider2_dbt (ignoring the pre-existing-broken scoring test) = 78 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…name real + load-bearing)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aversal guard)

evaluation.parameters.gold is external Spider2 input that the materializer
joins onto a path (tests/gold/<gold>) and emits into test.sh
(--gold-db /tests/<gold>). Path preserves '..'/absolute components, so a
malformed or hostile spec could read/write outside the verifier-only gold
area. Fail closed at the trust boundary: require Path(g).name == g and reject
'.'/'..'. One parametrized regression test (../, absolute, separators, ., ..).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction

The cycle-4 basename guard rejected separators/'..'/absolute but still
admitted POSIX-valid names with shell metacharacters (e.g.
`x.duckdb; echo ... #`), which were interpolated UNQUOTED into the verifier
test.sh --gold-db argument and would execute as shell during verification.
Replace it with a conservative allowlist `[A-Za-z0-9._-]+\.duckdb` at the
trust boundary in load_eval_spec — subsumes the path checks and blocks
whitespace/metacharacters. Adds ;/space/$() regression cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ion)

The cycle-5 gold allowlist only guarded --gold-db. The --predicted-db arg
interpolated db_name (resolved from the task's profiles.yml `path:`, external
input) unquoted into test.sh, leaving the symmetric shell-injection boundary
open. shlex.quote BOTH args at the emission point in _ensure_verifier_assets
so neither external-input value can be interpreted as shell during
verification. Gold allowlist retained (also guards path traversal). Adds a
load-bearing regression with a profiles.yml `$()` DuckDB path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckdb-native

duckdb_match imported pandas at module level, but verify.py runs INSIDE the
task image and that image guarantees only duckdb (the build-time preflight
imports it), not pandas — pandas is not a project dep and no injected layer
installs it. An image without pandas would crash verify.py on import before
emit_reward() writes reward.json, scoring valid runs as infra failures.

Re-port the column-containment compare onto duckdb's .fetchall()/.description
+ stdlib (sorted, math.isclose, an _isna helper). Spider2 duckdb_match
semantics preserved verbatim (containment, isclose 1e-2, sort key,
condition_cols, multi-table AND, missing-table->0, NA==NA); the behavioral
comparator suite stays green unchanged as the faithfulness net. Verifier now
imports only duckdb — zero undeclared runtime deps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_fetch_columns interpolated condition_tabs (external eval-spec input) raw into
SELECT * FROM "<table>". A `realt"; select 999; --` value breaks out; DuckDB
runs the injected statement and .fetchall() returns its rows, so a hostile spec
forces gold+pred to return identical rows -> reward 1.0 over mismatched DBs.

Add _quote_ident (double embedded ") and use it in _fetch_columns: a breakout
value becomes an unresolvable identifier, the gold fetch raises, and emit_reward
scores 0 (fail-closed). Adds a load-bearing regression (verified live: raw ->
[(999,)], fix -> CatalogException).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng gold

Two findings from adversarial review:

1. [high] DECIMAL tolerance regression (from the cycle-7 pandas removal).
   DuckDB native fetchall() returns decimal.Decimal for DECIMAL/NUMERIC, which
   is numbers.Number but NOT numbers.Real, so _vectors_match's (int,float)
   tolerance branch skipped it -> exact != compare. Spider2's pandas fetchdf()
   coerced DECIMAL->float64, so within-1e-2 DECIMAL outputs scored 0. Normalize
   Decimal->float in _fetch_columns (one place), comparator otherwise unchanged.

2. [medium] Missing tests/gold/ fail-open. _ensure_verifier_assets early-returned
   when gold was absent, leaving the source test.sh (e.g. a stub exit 0) ->
   an unscored / trivially-passing spider2-dbt task. Now fails closed (raises);
   fixture-002 gains a real gold so multi-instance/explain tests materialize two
   scored tasks.

Load-bearing regressions for both; spider2_dbt suite 92 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hrough

_ensure_verifier_assets copied the comparator modules + gold DB + eval spec
into view/tests/ with bare shutil.copy2. In view_mode="link" a colliding
source-provided name (tests/verify.py, tests/<gold>.duckdb, etc.) is a symlink
back to the source, so copy2 follows it and overwrites the source task —
spider2 translation binds to link mode, so this is on the live path. Route all
5 copies through _copy_into_view, which unlinks a symlink dst first (same guard
already on the Dockerfile/preflight/test.sh writes). Adds a load-bearing
link-mode regression seeding a source tests/verify.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent validation of HEAD 965f2ea: 3 ACs green from clean checkout
(gating 93 passed, acceptance -k spider2_dbt_verify 49 passed), all
dispatch focus areas confirmed by live probes (comparator faithfulness,
duckdb-only, fail-closed guards, shell+SQL injection sealed, symlink
write-through, reward.json shape). 4 full-suite failures pre-existing on
merge-base 9c39af2 (verified). Code review: 0 blocking, 3 minor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012s6B1EQB2346B7KiM4hXPX
Copilot AI review requested due to automatic review settings June 18, 2026 14:44

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a Spider2-dbt Harbor verifier that scores agent-produced DuckDB outputs against gold using a DuckDB-only, Spider2-faithful duckdb_match comparator, emitting Harbor-shaped {"reward": <float>} artifacts while failing closed on malformed inputs and hardening against injection and link-mode symlink write-through.

Changes:

  • Introduces duckdb_match comparator + eval-spec loader + verify.py CLI, and wires verifier asset emission into the spider2-dbt Harbor task materializer.
  • Adds safety hardening: fail-closed spec handling, gold basename allowlist, shell quoting, SQL identifier quoting, and symlink-unlink copy guards in link mode.
  • Adds comprehensive unit/integration tests, fixtures, and validation documentation for comparator faithfulness and verifier emission.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/unit/test_translate_spider2_dbt.py Scopes leakage scanning to agent-visible paths (excludes verify-only tests/ subtree) and adds a guard test.
tests/unit/test_spider2_dbt_verify_comparator.py New unit suite covering eval-spec parsing, fail-closed behavior, and comparator semantics (tolerance, containment, ordering).
tests/unit/test_spider2_dbt_verify_cli.py New unit tests for emit_reward and verifier asset emission behavior.
tests/unit/test_spider2_dbt_harbor_view.py Updates fixtures for fail-closed gold requirements and adds link-mode symlink overwrite regression coverage.
tests/integration/test_spider2_dbt_verify_test_sh.py New integration tests exercising emitted verifier assets and test.sh contract.
tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/spider2_eval.jsonl Adds a gold eval spec fixture for the second minimal task.
tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/spider2_eval.jsonl Adds/updates gold eval spec fixture for the first minimal task.
tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/build_gold.py Adds a script to regenerate the committed gold DuckDB fixture deterministically.
src/razorback/benchmarks/spider2_dbt/verify.py New verifier CLI that writes Harbor-shaped reward.json and fails closed on spec/compare errors.
src/razorback/benchmarks/spider2_dbt/harbor_view.py Emits verifier assets into view tests/ (verify-only), quotes shell args, copies spec-named gold DB, and guards link-mode symlink writes.
src/razorback/benchmarks/spider2_dbt/eval_spec.py New eval-spec model/loader with fail-closed guards and gold basename allowlist.
src/razorback/benchmarks/spider2_dbt/duckdb_match.py New DuckDB-only comparator implementing Spider2 duckdb_match semantics with tolerance and containment.
docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md Adds a validation report documenting AC verification, probes, and known pre-existing failures.
docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md Extends workflow/implementation log with multi-cycle validation history and final status.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +160 to +166
condition_tabs = list(params.get("condition_tabs", []))
raw_cols = params.get("condition_cols")
condition_cols = (
[list(inner) for inner in raw_cols] if raw_cols is not None else []
)
raw_orders = params.get("ignore_orders")
ignore_orders = [bool(b) for b in raw_orders] if raw_orders is not None else []
@kentwelcome kentwelcome merged commit f74dc9f into main Jun 18, 2026
1 check passed
kentwelcome added a commit that referenced this pull request Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants