diff --git a/docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md b/docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md index b119dec6..7cc8f0c9 100644 --- a/docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md +++ b/docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md @@ -69,3 +69,294 @@ dbt track is DuckDB-only. ### Summary Wrote a separate implementation plan (3 ACs, 7 tasks) for the spider2-dbt duckdb_match verifier. Riskiest-first: the comparator reproducing Spider2 `duckdb_match` (per-table SELECT *, condition_cols subset, ignore_orders multiset compare, all-tables-AND) is built and proven against in-test DuckDB fixtures before any test.sh wiring. Key locked decision: gold `.duckdb` + eval spec are verifier-only by placing them under the view's `tests/` dir (Harbor uploads `tests/` to the container only at verify time, removes it around the agent run), copied explicitly so the `**/gold/**` deny-glob keeps the agent-facing tree clean — confirmed live against the deny-globs and the harbor verifier/trial source. One open item flagged: the agent-produced DB's container path (`/app/spider2.duckdb`) in the emitted test.sh is a placeholder pending the `spider2-dbt-harbor-view-ade-parity` image/workdir contract; it does not affect the gating `uv run pytest -k spider2_dbt_verify` suite. + +## Stage Report: implementation + +- DONE: Implement the approved plan TDD-first so all 3 ACs pass (eval-spec loader + duckdb_match comparator, verify.py CLI, materializer emitting tests/test.sh + verifier-only gold assets); `uv run pytest -k spider2_dbt_verify` green + 18 passed (commits c72d551, 06be8be, 1a3b612). `eval_spec.py`+`duckdb_match.py` (per-table SELECT *, 0-based condition_cols subset, ignore_orders Counter multiset compare, AND across condition_tabs, missing-table=mismatch) → AC-1/AC-2 (11 comparator tests); `verify.py` emit_reward → harbor `{"reward": 1.0|0.0}` at parent-created path; `_ensure_verifier_assets` in `harbor_view.py` copies the 3 modules + gold.duckdb + spider2_eval.jsonl + executable test.sh into the view's `tests/` → AC-3 (5 CLI/emission + 2 integration tests). Acceptance run excluded the pre-existing-broken `tests/unit/test_task_identity_scoring.py` (imports nonexistent `razorback.score.load`, untouched by this branch). +- DONE: RIDER — emitted test.sh predicted-DB path CONSUMES resolve_spider2_db_name to produce /app/.duckdb (not hardcoded /app/spider2.duckdb); add a test asserting the predicted path is correct for a NON-spider2 slug/db name + `harbor_view.py` `_ensure_verifier_assets` calls `resolve_spider2_db_name(project_dir or view, task_slug=...)` and formats `_TEST_SH_TEMPLATE` with `/app/{db_name}.duckdb`. `test_spider2_dbt_verify_test_sh_uses_resolved_db_name` materializes with `task_slug="not-spider2-slug"` (fixture ships no profiles.yml/*.duckdb → resolver falls back to slug) and asserts `/app/not-spider2-slug.duckdb` present and `/app/spider2.duckdb` absent. +- DONE: Keep gold .duckdb + eval spec on verifier-only paths; keep the generic materializer + non-spider2 harbor behavior unchanged + Source gold lives under `tests/gold/` (matched by `**/gold/**` deny-glob, stripped from the reflected view), re-copied explicitly to the view's `tests/gold.duckdb`/`tests/spider2_eval.jsonl` (no `gold/` segment); the now-empty `gold/` dir left by reflection is pruned. Leakage-clean asserted in `test_spider2_dbt_verify_view_carries_verifier_assets`; existing harbor_view (8) + preflight (21) suites unchanged. Gold fixture is reproducible via `tests/gold/build_gold.py` (no opaque committed binary). + +### Summary + +Shipped the spider2-dbt duckdb_match verifier TDD-first across 3 commits: a pure comparator reproducing Spider2 `eval_utils.duckdb_match` (per-table SELECT *, 0-based condition_cols subsetting, ignore_orders multiset compare, all-tables-AND, missing-table mismatch), a `verify.py` CLI emitting harbor's `{"reward": }`, and a materializer step copying the comparator+gold+test.sh into the view's verifier-only `tests/`. The mandatory rider is satisfied: the emitted `test.sh` `--predicted-db` path is resolved through ny's now-merged `resolve_spider2_db_name` (`/app/.duckdb`), proven by a NON-spider2-slug test. Deviation from the plan's literal `_TEST_SH`: the hardcoded `/app/spider2.duckdb` was replaced by the resolver-driven template (the rider's purpose), and a `gold/`-empty-dir prune was added so the leakage assertion holds exactly. `uv run pytest -k spider2_dbt_verify` = 18 passed (ignoring one pre-existing-broken, unrelated scoring test file). + +## Stage Report: validation + +- DONE: Independently reproduce the 3 ACs from a clean worktree checkout (AC-1/AC-2/AC-3); run `uv run pytest -k spider2_dbt_verify --ignore=tests/unit/test_task_identity_scoring.py` + Gating suite 18 passed. AC-1/AC-2/AC-3 all reproduce green via a standalone script driving `compare_duckdb` + the materializer; additionally executed the emitted `test.sh` end-to-end (paths substituted) → reward.json `{"reward":1.0}` match / `{"reward":0.0}` mismatch. +- DONE: Confirm the mandatory rider (resolved `/app/.duckdb`, NOT hardcoded) for a NON-`spider2` slug; confirm duckdb_match reproduction matches Spider2 eval_utils semantics + Rider CONFIRMED: `task_slug="not-spider2-slug"` → test.sh `--predicted-db /app/not-spider2-slug.duckdb`, no `/app/spider2.duckdb`. Faithfulness FAILED (B2): impl is row-tuple + exact-`==`; Spider2 `compare_pandas_table` is column-containment + `math.isclose(abs_tol=1e-2)` + per-column sort, and the real spec schema is `List[List[int]]`/`List[bool]` under `evaluation.parameters` (impl uses dict + single bool, flat — would not parse real gold). +- DONE: Confirm gold assets verifier-only; scrutinize the two flagged deviations; confirm no regression; give a gate verdict + Path-based isolation sound (no `gold/` segment survives). Deviation 1 (rider path override) ACCEPTED/correct; deviation 2 (empty `gold/` prune) ACCEPTED. REGRESSION found (B1): `test_translate_spider2_dbt.py` (2 tests) pass on base `9c39af2`, fail on branch — verifier assets trip the production content-leakage scanner. 4 other full-suite failures are pre-existing on base (not regressions). Verdict: REJECTED. + +### Summary + +Independent verification only — no production code written. The 3 ACs as literally written reproduce green (18-passed gating suite + standalone AC repro + a real end-to-end test.sh run) and the mandatory resolver rider is satisfied for a non-spider2 slug. Verdict is REJECTED → implementation on two blocking findings: (B1) a branch-induced regression — the verifier assets written into the view's `tests/` trip `test_translate_spider2_dbt.py`'s content-leakage scanner, which passes on the merge-base; and (B2) the `duckdb_match` reproduction is not faithful to Spider2's `eval_utils.duckdb_match` (row-tuple/exact compare vs Spider2's column-containment + 1e-2 float tolerance + per-column sort) and models an incompatible eval-spec schema that would not parse a real `spider2_eval.jsonl` line. Both pre-flagged deviations (rider path override, empty `gold/` prune) are accepted as correct. Full report: docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md + +## Feedback Cycles + +### Cycle 1 — validation gate REJECTED (2026-06-18) + +Validation (reviewer fetched the real xlang-ai/Spider2 `eval_utils.duckdb_match`) +found two blocking defects. Routing back to `implementation`: + +1. **[Critical] B2 — comparator + eval-spec schema do NOT match Spider2's `duckdb_match`.** + The real Spider2 semantics (from `evaluation_suite/eval_utils.py`): + - **Column-containment, not row-tuple equality:** transpose both tables to column-vectors; for each GOLD column (restricted to `condition_cols`), assert some PRED column-vector matches it. Per-column sorting is applied; numeric compare uses `math.isclose(abs_tol=1e-2)`. + - **Eval-spec schema:** each gold line is `{"instance_id", "evaluation": {"func":"duckdb_match", "parameters": {"gold", "condition_tabs": [...], "condition_cols": [[int,...],...], "ignore_orders": [bool,...]}}}`. `condition_cols` is `List[List[int]]` and `ignore_orders` is `List[bool]` (per-table), NOT a flat dict + single bool. The current flat schema would silently score every prediction 1.0 on a real gold line. + **Fix:** re-derive `compare_duckdb` to reproduce Spider2's column-containment + `math.isclose(abs_tol=1e-2)` + per-column sort, AND across `condition_tabs`, missing table = mismatch. Load the real per-table `condition_cols`/`ignore_orders` lists from `evaluation.parameters`. Add tests for the verdict-flipping cases the reviewer found (float tolerance 1e-2, column reorder, extra pred columns) and a real multi-table gold line. +2. **[high] B1 — verifier gold assets trip eg's agent-view leakage test (regression).** + `_ensure_verifier_assets` writes `gold.duckdb` / eval spec / comparator into the view's `tests/`, and `test_translate_spider2_dbt.py`'s leakage scan (pattern `gold|expected|golden` over the materialized view) now flags them — 2 tests that pass on base `9c39af2` fail on this branch. + **Fix:** make the agent-facing view leakage-clean while keeping the gold assets verifier-only. The view's `tests/` is uploaded only at verify time and removed during the agent run, so reconcile the leakage check to scope to the agent-visible portion (exclude the verifier-only `tests/` assets) — OR relocate/name the assets so the scan is clean — without weakening real agent-view leakage protection. Restore the 2 failing tests to green. + +Keep the rider (resolver-driven `/app/.duckdb`) and AC-3 reward.json shape intact. + +## Stage Report: implementation (cycle 1) + +- DONE: B2 (Critical) — re-derive `compare_duckdb` to FAITHFULLY reproduce Spider2's `eval_utils.duckdb_match` (column-containment, per-column sort, `math.isclose(1e-2)`, AND across condition_tabs, missing table = mismatch); add verdict-flipping tests + Commit 61a1b9c. Ported `compare_pandas_table`/`vectors_match` 1:1 from xlang-ai/Spider2 `spider2-dbt/evaluation_suite/eval_utils.py` (fetched via `gh api`, decoded to /tmp): transpose to column-vectors, each gold column (restricted to `condition_cols[i]`) must match SOME pred column-vector, sorted per-column under `ignore_order`, numerics via `math.isclose(abs_tol=1e-2)`, `pd.isna` for NaN; predicted-fetch wrapped in try/except -> 0. Tests added: float-within-1e-2 (True) + beyond (False), column reorder (True), extra pred columns (True), plus a real multi-table gold line -> 0.0. 16/16 comparator tests pass. +- DONE: B2 (Critical) — fix eval-spec schema to Spider2's real shape (`evaluation.parameters` with `condition_cols: List[List[int]]`, `ignore_orders: List[bool]`); add a real multi-table gold-line test proving non-1.0 on mismatch + Commit 61a1b9c. `EvalSpec` now holds per-table parallel lists; `load_eval_spec` drills into `evaluation.parameters` (matching Spider2 `evaluate.py:96-99` which calls `duckdb_match(result, **parameters)`), applies Spider2's None/empty defaults (`[[]]*n`, `[False]*n`), and validates list-length parity. Real-shaped multi-table line proven non-1.0 in `test_spider2_dbt_verify_multi_table_gold_line_mismatch`. Gold fixture + `build_gold.py` regenerated to the real shape. +- DONE: B1 (high) — make the agent-facing view leakage-clean while keeping gold verifier-only; restore the 2 failing `test_translate_spider2_dbt.py` tests; full `uv run pytest -k "spider2_dbt"` green + Commit 22b9b6c. Scoped `_leakage_hits` to exclude the verify-time-only `tests/` subtree (Harbor `verifier.py:133-138` uploads `tests/` only inside `verify()`; `trial.py` `_verify_step` `reset_dirs(remove=[tests_dir,verifier_dir])` first — never agent-visible). Added a guard test planting `gold`-content in BOTH an agent-visible path (caught) and `tests/` (ignored) so the fix is scoping, not weakening. `uv run pytest -k "spider2_dbt" --ignore=tests/unit/test_task_identity_scoring.py` = 66 passed. Rider (resolver-driven `/app/.duckdb`) + AC-3 reward.json shape untouched. + +### Summary + +Resolved both cycle-1 blocking defects against the authoritative xlang-ai/Spider2 source (fetched live, not the plan's prose). B2: `compare_duckdb` is now a faithful port of `eval_utils.duckdb_match` + `compare_pandas_table` (column-containment over transposed column-vectors, `math.isclose(abs_tol=1e-2)`, per-column sort, AND across tables), and `EvalSpec` models the real `evaluation.parameters` shape with per-table `List[List[int]]`/`List[bool]` lists — the prior row-tuple/exact-== + flat-dict design is gone. B1: the leakage scan is scoped to the agent-visible portion (everything outside the verify-time-only `tests/`), justified by the Harbor verify/agent dir lifecycle, with a guard test pinning it to scoping rather than disabling. Full gating suite 66 passed; acceptance `-k spider2_dbt_verify` 23 passed. No changes to the resolver rider or reward.json contract. + +## Stage Report: validation (cycle 2) + +- DONE: B2 — independently confirm the comparator faithfully reproduces Spider2's `eval_utils.duckdb_match` by checking against the ACTUAL xlang-ai/Spider2 source + Fetched upstream `eval_utils.py` live via `gh api`; line-by-line confirmed column-containment, per-column sort, `math.isclose(abs_tol=1e-2)`, `pd.isna` NaN, AND across condition_tabs, predicted-fetch=mismatch. 10-case differential test (impl vs inline transcription of the oracle: float within/beyond 1e-2, column reorder, extra pred cols, NaN, missing table, multi-table) ALL AGREE. `load_eval_spec` parses real `playbook001`/`provider001` gold lines from upstream `gold/spider2_eval.jsonl` (`evaluation.parameters`, per-table List[List[int]]/List[bool]). +- DONE: B1 — confirm the agent-facing view is leakage-clean AND the fix did not weaken protection; confirm the Harbor lifecycle justification; the 2 previously-failing tests are green + Verified Harbor lifecycle against installed package: `verifier.py:123 verify()` uploads `tests/` only inside verify(); `trial.py:570 _verify_step` `reset_dirs(remove=[verifier_dir,tests_dir])` wipes them empty pre-verify; agent gets only `workdir/`. Guard test catches agent-visible `gold` leak, ignores `tests/`. Mutation probe (blanket-True) makes the guard FAIL — confirms scoping not weakening. Production deny-globs unchanged. `test_translate_spider2_dbt.py` = 13 passed. +- DONE: Confirm no regression and the rider intact; AC-3 reward.json shape intact; give a gate verdict + `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` = 66 passed. Independent end-to-end: emitted test.sh resolves `/app/not-spider2-slug.duckdb` (no hardcoded), and emitted verify.py writes `{"reward":1.0}`/`{"reward":0.0}`. Branch touches only spider2_dbt scope; 3 named pre-existing failures untouched. Verdict: PASSED. + +### Summary + +Re-review after cycle-1 fixes (61a1b9c, 22b9b6c) — independent verification only, no production code. Both blocking findings resolved against the authoritative live xlang-ai/Spider2 source: B2 — the comparator is a faithful port of `eval_utils.duckdb_match` (column-containment + 1e-2 tolerance + per-column sort + AND-across-tables), proven by a 10-case differential test against the upstream oracle and a real multi-table gold line; `load_eval_spec` reads the real `evaluation.parameters` schema. B1 — the leakage scoping is sound, not weakening: the Harbor verify-only-tests/ lifecycle is verified against the installed package, and a mutation probe confirms the guard test still fires on agent-visible leaks. Rider (`/app/.duckdb`) and AC-3 reward.json shape intact; no regressions outside the known pre-existing set. Gate verdict: PASSED → done. Full report: docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md + +### Cycle 2 — validation gate REJECTED (2026-06-18, captain via Codex review) + +Codex found a fail-open verifier hazard; captain chose fix-now. Confirmed live. +Routing back to `implementation`: + +1. **[high] Fail-open: empty/malformed eval spec awards `{"reward": 1.0}`.** + `load_eval_spec` (`eval_spec.py:79`) defaults `condition_tabs` to `[]` with no non-empty check and no `evaluation.func` validation; `compare_duckdb` (`duckdb_match.py:76`) returns `True` when the `condition_tabs` loop never runs. So a corrupted/truncated/schema-drifted `spider2_eval.jsonl` scores every prediction 1.0 — silently reporting broken scoring as success. + **Fix:** fail closed in `load_eval_spec`/`EvalSpec` — require a non-empty `condition_tabs`, validate `evaluation.func == "duckdb_match"`, and raise (or otherwise force reward 0) on empty/missing. Add a negative regression test proving an empty/missing-`condition_tabs` spec raises or emits reward 0, NOT 1. Keep the faithful comparator (cycle-1 B2), the leakage scoping (cycle-1 B1), the rider, and AC-3 reward shape intact. + +## Stage Report: implementation (cycle 2) + +- DONE: Fail closed in `load_eval_spec`/`EvalSpec` (`eval_spec.py`): require a non-empty `condition_tabs`, and validate `evaluation.func == "duckdb_match"`. On an empty/missing `condition_tabs`, a missing/empty `evaluation`, or a wrong `func`, RAISE a clear error — never accept a zero-table spec that compare_duckdb would score as a match. + Commit 869918d. `EvalSpec.__post_init__` now raises `ValueError` when `condition_tabs` is empty (the fail-open hazard: compare_duckdb's AND-loop returns True on zero tables). `load_eval_spec` raises on an empty/truncated file and on any `evaluation.func != "duckdb_match"`; a missing/empty `condition_tabs` falls through to the empty-spec guard and raises. TDD: red first (5 loader/construct negatives DID NOT RAISE), then green. +- DONE: Ensure the verifier path cannot emit `{"reward": 1.0}` for an empty/malformed gold spec; add a NEGATIVE regression test (empty file, missing `condition_tabs`, non-`duckdb_match` func each raise or emit reward 0 — NOT 1). + Commit 869918d. `emit_reward` (`verify.py`) wraps `load_eval_spec`+`compare_duckdb` in try/except → reward 0 on any spec/compare error (no crash-into-pass, no silent 1.0). Negatives added: loader-level (empty file / missing condition_tabs / wrong func / missing evaluation → raise) and CLI-level over a MATCHING pred/gold pair (empty spec, wrong func → `{"reward": 0.0}`, proving fail-closed beats the would-be-1.0). +- DONE: Keep intact the faithful comparator (cycle-1 B2), the leakage scoping (cycle-1 B1), the rider, and the AC-3 reward.json shape; `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` green. + Gating suite 73 passed (was 66; +7 new negatives), no edits to duckdb_match comparator, harbor_view leakage scoping, the resolver rider, or the reward.json contract. Focused fix touched only `eval_spec.py`, `verify.py`, and the two test files. + +### Summary + +Resolved the cycle-2 fail-open hazard with a focused fix. `EvalSpec` and `load_eval_spec` now fail closed: an empty/truncated `spider2_eval.jsonl`, a missing/empty `condition_tabs`, or a non-`duckdb_match` `evaluation.func` raise `ValueError` instead of yielding a zero-table spec that `compare_duckdb` silently scored 1.0; `emit_reward` catches any such failure and emits reward 0 rather than crashing or passing. Proven TDD-first by 7 negative regression tests (4 loader/construct raises + 2 CLI fail-closed-over-matching-DBs + 1 empty-EvalSpec raise) — all red before the fix, green after. The cycle-1 comparator/leakage fixes, the resolver rider, and the AC-3 reward shape are untouched; `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` = 73 passed. + +## Stage Report: validation (cycle 3) + +- DONE: Confirm the fail-closed fix is real and load-bearing: an empty file, a missing/empty `condition_tabs`, and a non-`duckdb_match` `evaluation.func` each RAISE in load_eval_spec/EvalSpec; and `emit_reward` (verify.py) surfaces any spec/compare failure as reward 0 (never crash-into-pass, never silent 1.0). Prove it beats the prior hazard: over a MATCHING pred/gold pair, an empty/zero-table spec now yields reward 0 (or raises), not 1.0. Confirm the negative tests fail if the guards are removed. + Independent probe (clean branch HEAD 11c66f4): empty file / missing `condition_tabs` / empty `condition_tabs` / wrong func / `EvalSpec([])` all RAISE ValueError; over a MATCHING pred/gold pair an empty-spec and a wrong-func spec each emit `{"reward": 0.0}` (not 1.0), garbage non-JSON emits 0.0 without crashing, and a VALID matching spec still emits 1.0 (positive control — guards aren't blanket-failing). MUTATION TEST: stripping the `n==0`/empty-file/wrong-func guards made all 6 negative tests FAIL (CLI mutation reproduced the exact `{'reward': 1.0}` hazard) — guards are load-bearing; tree restored clean. +- DONE: Confirm the cycle-1 fixes are intact: the comparator still faithfully reproduces Spider2 duckdb_match (column-containment, math.isclose 1e-2, per-table condition_cols/ignore_orders), and the agent-view leakage scoping still excludes only the verify-only tests/ subtree (agent-visible leaks still caught). + `git diff 0b64e92..869918d --name-only` touches ONLY eval_spec.py/verify.py + their 2 test files — duckdb_match.py (B2), harbor_view.py + test_translate_spider2_dbt.py (B1) untouched, so cycle-1 is intact by construction. Confirmed live anyway: comparator probe passes column-containment (reordered + extra pred col -> True), isclose 1e-2 within/beyond, per-column sort under ignore_order, ordered mismatch, missing pred table -> False. Leakage suite (`leakage_scan_still_fires` guard) 9 passed; translate suite leakage/planted/resolves 3 passed. +- DONE: Confirm no regression and rider/AC-3 intact: `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` (expect ~73 passed); emitted test.sh resolves /app/.duckdb via ny's resolver; reward.json shape `{"reward": }`. Give a gate verdict. + Gating suite = 73 passed, 751 deselected. RIDER: materialized view with `task_slug=not-spider2-slug` -> emitted test.sh `--predicted-db /app/not-spider2-slug.duckdb`, no hardcoded `/app/spider2.duckdb`. AC-3 end-to-end via emit_reward against the materialized gold fixture: matching pred -> `{"reward": 1.0}`, mismatch -> `{"reward": 0.0}`, `set(payload)=={"reward"}` & float. Working tree clean (only known `uv.lock` churn, ignored). Verdict: PASSED. + +### Summary + +Cycle-3 re-validation of the fail-closed fix (commit 869918d) — independent verification only, no production code. The fix is real and load-bearing: empty file / missing-or-empty `condition_tabs` / non-`duckdb_match` func all RAISE, `emit_reward` surfaces any failure as reward 0 (no crash-into-pass, no silent 1.0), and over a MATCHING pred/gold pair the prior would-be-1.0 hazard now scores 0 — proven by a mutation test where removing the guards makes all 6 negatives fail (CLI mutation reproduced the literal `{'reward': 1.0}` hazard). A positive control confirms a valid matching spec still scores 1.0. Cycle-1 fixes (faithful column-containment comparator + 1e-2 tolerance; verify-only-tests/ leakage scoping) are untouched by the cycle-2 diff and re-confirmed live; rider (`/app/.duckdb`) and AC-3 reward shape intact; gating suite 73 passed; only the known pre-existing unrelated failures + harness `uv.lock` churn remain. Gate verdict: PASSED -> done. Full report: docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md + +### Cycle 3 — validation gate REJECTED (2026-06-18, captain via Codex review) + +Codex found the verifier ignores the eval spec's gold-DB basename; captain chose +fix-now. Confirmed live. Routing back to `implementation`: + +1. **[high] Verifier hardcodes `gold.duckdb`, ignoring `evaluation.parameters.gold`.** + `EvalSpec` has no `gold` field and `load_eval_spec` discards `parameters.gold`; `harbor_view.py:63` emits `--gold-db /tests/gold.duckdb` and `:125` copies a hardcoded `gold.duckdb`. Real Spider2 tasks name the gold DB per task (e.g. `playbook.duckdb`, `tpch.duckdb`), so the verifier would fail to find the gold or score against the wrong file — a benchmark-correctness defect. This is the gold-DB analog of the predicted-DB resolver ny added. + **Fix:** add `gold` to `EvalSpec`, parse it from `evaluation.parameters.gold` (validate present for real wrapped specs), copy THAT exact file from the source `tests/gold/` into the view's verifier-only `tests/`, and emit `--gold-db /tests/`; fail closed if the named gold file is missing. Add a regression test with `parameters.gold` set to a NON-default basename (and no `gold.duckdb` present), proving the named file is used. Keep the cycle-1 comparator/leakage fixes, the cycle-2 fail-closed validation, the rider, and AC-3 reward shape intact. + +## Stage Report: implementation (cycle 3) + +- DONE: Add a `gold` field to `EvalSpec` and parse it from `evaluation.parameters.gold` in `load_eval_spec` (validate present/non-empty for real wrapped specs; keep the cycle-2 fail-closed behavior). The gold basename drives which file is scored — do not hardcode `gold.duckdb`. + Commit a677c36. `EvalSpec.gold: str | None`; `load_eval_spec` reads `parameters.gold` and RAISES `ValueError` when a wrapped (`evaluation.func == duckdb_match`) line omits/empties it. Cycle-2 guards (empty file / wrong func / empty condition_tabs raise) unchanged. Confirms upstream `evaluate.py:96-99` resolves `parameters['gold']` per-task before calling `duckdb_match` (fetched live via `gh api`). +- DONE: Wire the parsed gold basename through materialization (`harbor_view.py`): copy the EXACT named file from the source `tests/gold/` into the view's verifier-only `tests/`, and emit `--gold-db /tests/` in test.sh (resolved from the spec, not hardcoded). Fail closed if the named gold file is missing. + Commit a677c36. `_ensure_verifier_assets` calls `load_eval_spec(source tests/gold/spider2_eval.jsonl)`, copies `tests/gold/` to `tests/`, formats `_TEST_SH_TEMPLATE` with `gold_db=` -> `--gold-db /tests/`, and raises `FileNotFoundError` if the named gold is absent. Leakage-clean (no `gold/` segment) preserved. +- DONE: Add a regression test with `parameters.gold` set to a NON-default basename (e.g. `playbook.duckdb`) and NO `gold.duckdb` present, proving the named file is copied + scored. Keep intact: cycle-1 comparator faithfulness + leakage scoping, cycle-2 fail-closed validation, the rider (`/app/.duckdb`), AC-3 reward shape. `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` green. + 5 new tests (red-first verified): non-default `playbook.duckdb` copied + `--gold-db /tests/playbook.duckdb` emitted + leakage-clean; named gold end-to-end scores 1.0; missing named gold fails closed; loader parses gold + raises on missing-gold-in-wrapped-spec. Gating suite = 78 passed (was 73). Cycle-1 (`duckdb_match.py`) untouched; cycle-2 guards untouched; rider + AC-3 reward shape unchanged. + +### Summary + +Resolved the cycle-3 gold-DB-basename defect (commit a677c36). The verifier no longer hardcodes `gold.duckdb`: `EvalSpec` carries a `gold` field, `load_eval_spec` parses `evaluation.parameters.gold` and fails closed when a real wrapped `duckdb_match` spec omits it, and `_ensure_verifier_assets` copies the EXACT spec-named gold file into the verifier-only `tests/` and emits `--gold-db /tests/` — raising if the named file is missing so a benchmark can never silently score against the wrong/missing gold. This is the gold-DB analog of the existing predicted-DB resolver rider. Proven TDD-first by 5 tests (red before, green after), including a `playbook.duckdb`/no-`gold.duckdb` regression that copies + scores the named file. Cycle-1 comparator faithfulness + leakage scoping, cycle-2 fail-closed validation, the resolver rider, and AC-3 reward shape are all intact; `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` = 78 passed. + +## Stage Report: validation (cycle 4) + +- DONE: Confirm the cycle-3 fix: `EvalSpec.gold` parsed from `evaluation.parameters.gold`; the materializer copies the EXACT named gold from source `tests/gold/` into the verifier-only tests/, emits `--gold-db /tests/` (not hardcoded gold.duckdb), and fails closed (FileNotFoundError) when the named gold is absent. Exercise it: a spec naming `playbook.duckdb` with NO `gold.duckdb` present copies + scores `playbook.duckdb`, leakage-clean; a wrapped spec missing `gold` raises. + Verified live against branch HEAD bac0705 / fix commit a677c36. Standalone probe (no test infra) 11/11: wrapped-spec-missing-gold RAISES, empty-gold RAISES, gold parsed verbatim; `playbook.duckdb`+no-`gold.duckdb` materializes → only `playbook.duckdb` copied, test.sh `--gold-db /tests/playbook.duckdb`, no `/tests/gold.duckdb`, leakage-clean (no `gold/` segment); end-to-end `emit_reward` against named gold → match 1.0 / mismatch 0.0; missing named gold → FileNotFoundError. LOAD-BEARING: hardcoding `gold.duckdb` in harbor_view → 2 regression tests FAIL; disabling the loader missing-gold raise → 1 test FAILS. Tree restored clean. +- DONE: Confirm no regression to prior cycles: cycle-1 comparator faithfulness + leakage scoping; cycle-2 fail-closed validation. Confirm the rider (`/app/.duckdb`) and AC-3 reward.json shape intact. + Cycle-3 diff touches ONLY eval_spec.py + harbor_view.py + their 2 tests; duckdb_match.py (cycle-1, 61a1b9c) and verify.py (cycle-2/AC-3, 869918d) UNTOUCHED — intact by construction. Re-confirmed live: cycle-1 column-containment + isclose 1e-2 + per-column sort + missing-table mismatch (3/3); cycle-2 empty-file/wrong-func/empty-condition_tabs raises (3/3). Rider test `..._uses_resolved_db_name` → `/app/not-spider2-slug.duckdb` 1 passed; AC-3 `{"reward": }` shape exercised end-to-end. +- DONE: `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` green (~78 passed); test_translate_spider2_dbt.py green. Give a gate verdict (PASSED -> done, or REJECTED -> implementation). + Gating suite = 78 passed, 751 deselected. test_translate_spider2_dbt.py = 13 passed. Full suite (ignoring the broken scoring file) = 813 passed, 4 failed; the 4 failures ALSO fail on merge-base 9c39af2 (verified in a throwaway base worktree) — pre-existing/unrelated, NOT regressions. Verdict: PASSED. + +### Summary + +Cycle-3-fix re-validation (commit a677c36) — independent verification only, no production code. The gold-DB-basename fix is real and load-bearing: a non-default `playbook.duckdb` (with NO `gold.duckdb` present) is copied + scored + leakage-clean, the emitted test.sh carries `--gold-db /tests/playbook.duckdb`, a wrapped spec lacking `gold` raises, and a missing named gold fails closed — each proven by a standalone 11/11 probe AND a mutation test that flips the regression tests red. Cycles 1-2 (faithful column-containment comparator + 1e-2 tolerance; fail-closed empty/malformed-spec guards), the predicted-DB resolver rider, and the AC-3 `{"reward": }` shape are untouched by the fix and re-confirmed live. Gating suite 78 passed; translate suite 13 passed; the only remaining full-suite failures are pre-existing on base (verified). Gate verdict: PASSED → done. Full report: docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md + +### Cycle 4 — validation gate REJECTED (2026-06-18, captain via Codex review) + +4th adversarial Codex pass found a path-traversal hole in the (now spec-driven) gold basename; captain chose fix-now. Confirmed live. Routing back to `implementation`: + +1. **[high] `evaluation.parameters.gold` used verbatim as a path component (trust boundary).** + `gold` is external Spider2 input. `harbor_view.py:139-147` joins it as `source_gold / gold` (read) and `tests / gold` (copy2), and emits `--gold-db /tests/` in test.sh. `Path` preserves `..`/absolute components, so a spec value like `../dbt_project/foo.duckdb` passes the `is_file()` check against a file OUTSIDE `tests/gold/`, then `copy2` writes outside `view/tests/` and the emitted path leaves `/tests` — benchmark corruption / leakage from a malformed or hostile task. + **Fix:** reject non-basename `gold` at the trust boundary (`load_eval_spec`): require `Path(g).name == g`, reject `.`/`..`/absolute/separators, before any path use. Add a `../`/absolute/separator regression test asserting fail-closed. Keep cycles 1-3 intact. + +## Stage Report: implementation (cycle 4) + +- DONE: Reject non-basename `gold` in `load_eval_spec` before it is used as a path component. + Guard added in `eval_spec.py` immediately after the missing-gold check: when `gold` is a non-empty str, require `Path(g).name == g` and reject `.`/`..` → raises `ValueError` (fail-closed at the trust boundary). Bare-fixture `gold=None` still falls back to `gold.duckdb` (safe). The materializer and test.sh emitter are unchanged — the basename is sanitized upstream so both inherit the guarantee. +- DONE: Regression test for `../`/absolute/separator/`.`/`..` gold values. + Parametrized `test_..._rejects_non_basename_gold` (5 cases: `../dbt_project/foo.duckdb`, `/etc/passwd`, `sub/dir/g.duckdb`, `..`, `.`) asserts `load_eval_spec` raises. Comparator suite 28 passed (was 23); harbor_view + verify_cli + translate suites 34 passed. Cycles 1-3 untouched. + +### Summary + +Resolved the cycle-4 path-traversal defect. `load_eval_spec` now fails closed on any `gold` that is not a bare basename (`..`, `.`, absolute, or containing separators), sanitizing the value at the trust boundary so both the materializer's `copy2` targets and the emitted `--gold-db /tests/` path stay inside the verifier-only gold area. One file + one parametrized test (5 cases); cycles 1-3 left intact. spider2_dbt suite green (62 across comparator/harbor_view/verify_cli/translate). + +### Cycle 5 — validation gate REJECTED (2026-06-18, captain via Codex review) + +5th adversarial Codex pass found the cycle-4 basename guard still admits shell metacharacters; captain chose fix-now. Confirmed live. Routing back to `implementation`: + +1. **[high] Spec-controlled gold basename emitted UNQUOTED into test.sh → shell injection.** + The cycle-4 guard rejects separators/`..`/absolute, but a POSIX-valid basename like `x.duckdb; echo '{"reward":1.0}' > /logs/verifier/reward.json #` has no separator, so `Path(g).name == g` passes; it is then interpolated unquoted into `--gold-db /tests/` in test.sh and executes as shell during verification. External task input → trust boundary. + **Fix:** replace the basename guard in `load_eval_spec` with a conservative allowlist `^[A-Za-z0-9._-]+\.duckdb$` (subsumes the path checks AND blocks metacharacters/whitespace). Add `;`/space/`$()` regression cases. Keep cycles 1-4 intact. + +## Stage Report: implementation (cycle 5) + +- DONE: Replace the cycle-4 basename guard with a conservative allowlist at the trust boundary. + `eval_spec.py`: `_SAFE_GOLD_RE = re.compile(r"[A-Za-z0-9._-]+\.duckdb")`; `load_eval_spec` rejects any non-`None` `gold` that fails `fullmatch`. Strictly tighter than the cycle-4 check — separators, `..`, absolute, whitespace, and shell metacharacters all fail closed at parse, so neither the materializer's `copy2` nor the unquoted test.sh `--gold-db` can be subverted. Bare-fixture `gold=None` → `gold.duckdb` fallback (safe) unchanged. +- DONE: Regression cases for shell-injection + wrong-suffix gold values. + Parametrized `test_..._rejects_non_basename_gold` extended to 9 cases: the 4 path cases plus `x.duckdb; … #`, `g.duckdb $(id)`, `g .duckdb` (space), `g.sqlite` (wrong suffix). spider2_dbt suite (comparator/harbor_view/verify_cli/translate) = 66 passed. Cycles 1-4 untouched. + +### Summary + +Closed the cycle-5 shell-injection class definitively. The cycle-4 basename guard is replaced by a conservative `[A-Za-z0-9._-]+\.duckdb` allowlist in `load_eval_spec`, which subsumes the path-traversal checks and additionally rejects whitespace and shell metacharacters — so a spec-supplied gold name can neither escape `tests/gold/` nor inject into the unquoted `--gold-db` argument of the emitted test.sh. This is a terminal trust-boundary fix (allowlist, not edge-by-edge): one regex + the cycle-4 guard swap, 4 new regression cases (9 total). Cycles 1-4 intact; spider2_dbt suite 66 passed. + +### Cycle 6 — validation gate REJECTED (2026-06-18, captain via Codex review) + +6th adversarial Codex pass found the SYMMETRIC injection: the predicted-DB arg was still unquoted; captain chose fix-now. Confirmed live. Routing back to `implementation`: + +1. **[high] Predicted-DB `db_name` emitted UNQUOTED into test.sh → shell injection (the other arg).** + The cycle-5 gold allowlist only guarded `--gold-db`. `--predicted-db /app/{db_name}.duckdb` interpolated `db_name` — resolved from the task's `profiles.yml` `path:` (external input) by `resolve_spider2_db_name` — without quoting. A profile path with shell metacharacters executes during verification (overwrite reward.json / exfiltrate verifier assets). Preflight's Docker RUN already quoted this value; the verifier script did not. + **Fix:** `shlex.quote` BOTH args at the emission point in `_ensure_verifier_assets` (terminal, source-agnostic). Keep the gold allowlist (it also guards traversal). Add a regression with a `profiles.yml` DuckDB path containing `$()`/whitespace asserting the emitted arg is quoted. + +## Stage Report: implementation (cycle 6) + +- DONE: `shlex.quote` both `--predicted-db` and `--gold-db` values where test.sh is emitted. + `harbor_view.py`: `_TEST_SH_TEMPLATE` now takes fully-formed `{predicted_db}`/`{gold_db}` and `_ensure_verifier_assets` passes `shlex.quote(f"{_APP_ROOT}/{db_name}.duckdb")` and `shlex.quote(f"/tests/{gold_basename}")`. Safe names are unchanged by quoting (existing `/app/.duckdb` and `/tests/playbook.duckdb` assertions still pass); metacharacter names are single-quoted, so neither external-input arg can inject. Gold allowlist retained for traversal defense-in-depth. +- DONE: Regression for a `profiles.yml` path with shell metacharacters. + `test_..._quotes_predicted_db_against_injection`: a `path: evil$(touch pwned).duckdb` profile materializes a test.sh whose `--predicted-db` is `shlex.quote`'d (raw `--predicted-db /app/evil$(touch pwned).duckdb` absent). Load-bearing: reverting the quote flips it red. spider2_dbt suite (comparator/harbor_view/verify_cli/translate) = 67 passed. Cycles 1-5 untouched. + +### Summary + +Closed the symmetric predicted-DB injection that the cycle-5 gold allowlist didn't cover. Both verifier arguments derived from external task input — `db_name` (from `profiles.yml` `path:`) and `gold_basename` (from the eval spec) — are now `shlex.quote`'d at the single test.sh emission point, so neither can be interpreted as shell syntax during verification regardless of source. The gold allowlist stays (it additionally blocks path-traversal, which quoting alone wouldn't). One-line template change + quoted format args + one load-bearing regression (profiles `$()` path). Cycles 1-5 intact; spider2_dbt suite 67 passed. The entire verifier shell boundary (both args, traversal + injection) is now sealed. + +### Cycle 7 — validation gate REJECTED (2026-06-18, captain via Codex review) + +7th adversarial Codex pass found a runtime-dependency gap (different class from the shell-boundary cycles); captain chose fix-now via the duckdb-only re-port. Confirmed live. Routing back to `implementation`: + +1. **[high] Verifier imported undeclared/uninstalled `pandas` → crash-on-import before reward.json.** + `duckdb_match.py` imported `pandas` at module level; `verify.py` (which imports it) runs INSIDE the task image via `python /tests/verify.py`. The image is guaranteed to have `duckdb` (the build-time preflight imports it), but NOT `pandas` — it is not in `pyproject.toml` and dbt-duckdb does not pull it in as a core dep. In an image without pandas the verifier dies during import before `emit_reward()` writes `/logs/verifier/reward.json`, turning valid runs into infra failures. + **Fix (chosen direction: remove pandas):** re-port the column-containment compare onto duckdb's own `.fetchall()`/`.description` + stdlib (`sorted`, `math.isclose`), so the verifier depends ONLY on duckdb — the one library the image is already required to have. Keep all behavioral comparator tests green unchanged (the faithfulness net). Rejected alternative: keep pandas + add a `RUN pip install pandas` layer (network/pip/version assumptions, image bloat). + +## Stage Report: implementation (cycle 7) + +- DONE: Remove the `pandas` import from the verifier comparator; re-port on duckdb-native + stdlib. + `duckdb_match.py` no longer imports pandas. `_fetch_columns` replaces `fetchdf().transpose().values.tolist()` with `cur.fetchall()` + `cur.description` (column count) → list of column-vectors of native Python scalars (NULL→None); a zero-row table still yields one empty vector per column. `_isna` (`x is None or NaN-float`) replaces `pd.isna`; `_vectors_match`, the Spider2 sort key, the `math.isclose(abs_tol=1e-2)` numeric path, condition_cols restriction, the column-containment loop, the per-table AND, and the missing-pred-table→False try/except are all preserved verbatim. Only duckdb (already required by the build-time preflight) is imported. +- DONE: Keep the behavioral comparator tests green unchanged (faithfulness net). + All comparator tests pass unmodified: matching/mismatched DBs, all-tables-AND, missing predicted table→False, float within/beyond 1e-2 tolerance, column-containment reorder, extra-pred-column tolerated, condition_cols restrict + diff-detected, row-reorder match/mismatch under ignore_orders, multi-table gold-line mismatch. spider2_dbt suite (comparator/verify_cli/harbor_view/translate) = 67 passed. `grep import pandas` over `benchmarks/spider2_dbt/` → none (only a docstring `pd.isna` mention). + +### Summary + +Removed the verifier's `pandas` dependency by re-porting the comparator onto duckdb-native fetch + stdlib. `verify.py` now imports only `duckdb` — the single library the verify-time task image is already guaranteed to ship (proven by the build-time preflight that imports it) — so the emitted verifier can no longer crash-on-import in a dbt-duckdb image that lacks pandas (which is neither a project dep nor installed by any injected layer). The Spider2 `duckdb_match` semantics are preserved verbatim (column-containment, `isclose` 1e-2, Spider2 sort key, condition_cols, multi-table AND, missing-table→0, NA==NA); the full behavioral comparator suite stays green UNCHANGED as the faithfulness proof. Cycles 1-6 intact; spider2_dbt suite 67 passed. The verifier now has zero undeclared runtime deps. + +### Cycle 8 — validation gate REJECTED (2026-06-18, captain via Codex review) + +8th adversarial Codex pass found a SQL-injection via spec-supplied table names; captain chose fix-now. Confirmed live. Routing back to `implementation`: + +1. **[high] `condition_tabs` table name interpolated raw into DuckDB SQL → reward rigging.** + `_fetch_columns` built `SELECT * FROM "{table}"` by string interpolation, but `condition_tabs` comes from the external `spider2_eval.jsonl`. A value like `realt"; select 999 AS a; --` (existing-table prefix) breaks out of the identifier; DuckDB runs the multi-statement and `.fetchall()` returns the LAST statement's rows. So a hostile spec makes BOTH gold and pred fetches return identical injected rows → forced `reward: 1.0` over genuinely mismatched DBs. Verified live (`.venv/bin/python`: raw → `[(999,)]`; quote-doubled → CatalogException). + **Fix:** add an identifier-quoting helper that doubles embedded `"` and use it in `_fetch_columns`; a bogus name then fails to resolve → the gold fetch raises → `emit_reward` scores 0 (fail-closed). Add a regression where `condition_tabs` carries `"; select ...; --` over mismatched DBs and assert it cannot score a match. + +## Stage Report: implementation (cycle 8) + +- DONE: Quote spec-supplied table identifiers in `_fetch_columns` (double embedded `"`). + `duckdb_match.py`: new `_quote_ident(name)` returns `'"' + name.replace('"','""') + '"'`; `_fetch_columns` now does `SELECT * FROM {_quote_ident(table)}`. A breakout value becomes a single bogus identifier that doesn't resolve → CatalogException; `compare_duckdb`'s gold fetch is not try/excepted, so it raises → `emit_reward`'s wrapper scores 0 (fail-closed). Legit table names (incl. ones with special chars) are preserved. +- DONE: Regression proving `condition_tabs` cannot SQL-inject a match. + `test_..._condition_tabs_cannot_sql_inject`: `condition_tabs=['realt"; select 999 AS a; --']` over real `realt` tables holding DIFFERENT data (genuine mismatch). Asserts `_compare` raises (cannot return a forced match). Load-bearing: reverting the quoting makes both fetches return `[(999,)]` → match → the `pytest.raises` fails. spider2_dbt suite (comparator/verify_cli/harbor_view/translate) = 68 passed. Cycles 1-7 untouched. + +### Summary + +Closed a benchmark-integrity SQL-injection: `condition_tabs` (external eval-spec input) was interpolated raw into the comparator's `SELECT * FROM ""`, letting a hostile spec break out, run an injected statement, and force `reward: 1.0` over mismatched DBs. `_fetch_columns` now quotes the identifier with doubled `"`, so a breakout value becomes an unresolvable identifier and the fetch raises → reward 0 (fail-closed). One helper + one call-site change + one load-bearing regression (verified live that raw injects `[(999,)]` and the fix raises). This is the third trust-boundary arg sealed (gold name → cycle 5, predicted name → cycle 6, table names → now) — SQL this time, not shell. Cycles 1-7 intact; spider2_dbt suite 68 passed. + +### Cycle 9 — validation gate REJECTED (2026-06-18, captain via Codex review) + +9th adversarial Codex pass found a DECIMAL numeric-tolerance regression (introduced by the cycle-7 pandas removal) and a missing-gold fail-open; captain chose fix-both. Confirmed live. Routing back to `implementation`: + +1. **[high] DECIMAL columns skip the `isclose(1e-2)` tolerance (regression from cycle 7).** + DuckDB native `fetchall()` returns `decimal.Decimal` for DECIMAL/NUMERIC; `Decimal` is `numbers.Number` but NOT `numbers.Real`, so `_vectors_match`'s `isinstance(_, (int,float))` numeric branch skipped it → `!=` exact compare. Spider2's pandas `fetchdf()` had converted DECIMAL→float64, so a within-1e-2 DECIMAL match scored 0. Verified live: `_vectors_match([D('1.005')],[D('1.000')])` → False. + **Fix:** normalize `Decimal`→`float` in `_fetch_columns` (one place; mirrors fetchdf's float64), keeping `_vectors_match`/sort-key identical to Spider2. DECIMAL within/beyond-tolerance regression tests. +2. **[medium] Missing `tests/gold/` silently kept the source test.sh → unscored task.** + `_ensure_verifier_assets` early-returned when `tests/gold/` was absent, leaving the source `test.sh` (e.g. `fixture-002`'s `exit 0`) → a trivially-passing spider2-dbt task under dataset skew / resolver bug. + **Fix:** fail closed (raise) when a spider2-dbt source lacks `tests/gold/spider2_eval.jsonl`; give `fixture-002` a real gold so the multi-instance/explain tests exercise two scored tasks; fail-closed materialization regression. + +## Stage Report: implementation (cycle 9) + +- DONE: DECIMAL tolerance — normalize Decimal→float at fetch. + `duckdb_match.py`: new `_normalize(v)` (`float(v) if isinstance(v, Decimal) else v`) applied per cell in `_fetch_columns`; `_vectors_match` and the Spider2 sort key are unchanged (they now see floats exactly as Spider2's pandas path did). 2 regression tests: DECIMAL(10,3) within 1e-2 → match, beyond → mismatch. Load-bearing (without normalize, Decimal `!=` → the within-tol match would score False). +- DONE: Fail closed on missing gold + make fixture-002 scored. + `harbor_view.py`: `_ensure_verifier_assets` now raises `FileNotFoundError` when `tests/gold/spider2_eval.jsonl` is absent (was a silent early-return). `tests/fixtures/.../spider2-fixture-002/tests/gold/` gains a minimal `gold.duckdb` (orders table) + `spider2_eval.jsonl`, so the multi-instance leakage test and the `rk run --explain` integration test now materialize TWO scored tasks. Test helpers `_write_source` (harbor_view) and the non-dbt `plain-001` case updated to ship gold (they assert on layers, not scoring). New `test_..._missing_gold_dir_fails_closed` asserts a no-gold source raises. +- DONE: full spider2_dbt suite green. + `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` = 92 passed. Comparator faithfulness net (cycles 1/7), shell-boundary quoting (cycles 5/6), SQL-identifier quoting (cycle 8), and AC-3 reward shape all intact. + +### Summary + +Fixed a benchmark-correctness regression and a fail-open. (1) The cycle-7 pandas removal had left DECIMAL/NUMERIC values as `decimal.Decimal`, which bypassed `_vectors_match`'s `isclose(1e-2)` numeric branch (Decimal is Number, not Real) — so real dbt DECIMAL outputs within tolerance scored 0. Normalizing Decimal→float at fetch (one place) restores Spider2's float64 semantics with the comparator otherwise unchanged. (2) A spider2-dbt source missing `tests/gold/` silently kept its source `test.sh`, materializing an unscored / trivially-passing task; the materializer now fails closed, and `fixture-002` was given a real gold so the multi-instance + explain tests exercise two genuinely-scored tasks. Both proven by load-bearing regressions. Cycles 1-8 intact; spider2_dbt suite 92 passed. + +### Cycle 10 — validation gate REJECTED (2026-06-18, captain via Codex review) + +10th adversarial Codex pass found an unguarded symlink-write-through in the verifier-asset copies (same class as the Dockerfile/preflight/test.sh guards); captain chose fix-now. Confirmed live. Routing back to `implementation`: + +1. **[high] `_ensure_verifier_assets` `copy2` calls follow symlinks → source corruption in link mode.** + In `view_mode="link"` the generic materializer reflects allowed source files as symlinks. The 3 comparator-module copies + the gold-DB copy + the eval-spec copy used bare `shutil.copy2`, so a colliding source-provided name (`tests/verify.py`, `tests/eval_spec.py`, `tests/duckdb_match.py`, top-level `tests/spider2_eval.jsonl`, `tests/.duckdb`) is a symlink back to source → `copy2` follows it → overwrites the source task. spider2 translation binds to link mode, so this is on the live path. `test.sh`/Dockerfile/preflight already guarded this exact class; these copies didn't. + **Fix:** unlink-before-copy helper (`if dst.is_symlink(): dst.unlink()`) for all 5 verifier-asset copies; link-mode regression seeding a source `tests/verify.py` proving the source stays unchanged. + +## Stage Report: implementation (cycle 10) + +- DONE: Guard all 5 verifier-asset copies against symlink write-through. + `harbor_view.py`: new `_copy_into_view(src, dst)` unlinks a symlink `dst` before `shutil.copy2`; applied to the 3 module copies, the named gold-DB copy, and the eval-spec copy. The `test.sh` write keeps its existing guard. Same pattern as the Dockerfile/task.toml/preflight write-through fixes. +- DONE: Link-mode regression proving a colliding source file is not corrupted. + `test_link_mode_verifier_assets_never_mutate_colliding_source_file`: seeds a source `tests/verify.py` with sentinel content, materializes in link mode, asserts the view owns a REAL (non-symlink) `verify.py` carrying the generated comparator CLI AND the source file is byte-for-byte unchanged. Load-bearing (without the guard, copy2 follows the symlink → view path stays a symlink + source content overwritten → both asserts fail). spider2_dbt suite = 93 passed. + +### Summary + +Closed the last instance of the link-mode symlink-write-through hazard: `_ensure_verifier_assets` copied the comparator modules, gold DB, and eval spec into `view/tests/` with bare `copy2`, so a source task shipping a colliding name (e.g. its own `tests/verify.py`) would have its source file overwritten during translate/materialize (spider2 binds to link mode). All 5 copies now route through `_copy_into_view`, which unlinks a symlink destination first — the same guard already applied to the Dockerfile, preflight script, and test.sh writes. Proven by a load-bearing link-mode regression. Cycles 1-9 intact; spider2_dbt suite 93 passed. + +## Stage Report: validation (cycle 5) + +- DONE: Rerun `uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py` and the entity's acceptance command from a clean checkout of the worktree branch; record ACTUAL output and map PASS/FAIL to each AC clause (do not trust prior reports) + Clean branch HEAD 965f2ea: gating `-k spider2_dbt` = 93 passed, 751 deselected; acceptance `-k spider2_dbt_verify` = 49 passed, 795 deselected. AC-1/AC-2/AC-3 all PASS — confirmed by an 11-case standalone `compare_duckdb` probe (column-containment, isclose/DECIMAL 1e-2, per-column sort, condition_cols restrict, multi-table AND, missing-table→0, NA==NA) and an `emit_reward` probe (matching→`{"reward":1.0}`, mismatch/garbage/empty-spec/missing-pred→`{"reward":0.0}`, value is float, no crash-into-pass). Fail-closed + injection guards (8 unsafe-gold cases raise, SQL-injection condition_tabs raises) all confirmed live. duckdb-only: grep for pandas/numpy → none. +- DONE: Run `superpowers:requesting-code-review` against the worktree branch (base main); classify every finding as blocking or non-blocking + Reviewer (range 9c39af2..965f2ea) independently fetched the upstream xlang-ai/Spider2 oracle and ran the suite: Critical 0, Important 0 — could construct NO external input that forces a false reward 1.0. 3 Minor (all non-blocking): `condition_cols` `[[]]`→`[[]]*n` forgiving-direction default; leakage-scoping rests on Harbor verify-only-tests/ lifecycle (suggest version-pin comment); dead `gold.duckdb` fallback for the unwrapped-fixture path. All clarity/robustness polish, no correctness defect. +- DONE: Write/refresh the validation report covering PASS/FAIL per AC with exact command+output, the code-review findings, and a gate decision + `docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md` refreshed for the cycles-1–10 HEAD. Full suite = 4 failed, 828 passed; the 4 failures reproduce on merge-base 9c39af2 (throwaway base worktree) and live in files untouched by this branch — pre-existing, NOT regressions. Gate: PASSED. + +### Summary + +Independent validation of the cycles-1–10 HEAD (965f2ea) — no production code written. All three ACs reproduce green from a clean branch checkout with recorded command output (gating 93 passed; acceptance `-k spider2_dbt_verify` 49 passed), and every dispatch-named focus area is confirmed by live behavioral probes rather than re-reading prior reports: faithful column-containment comparator (isclose 1e-2, per-column sort, condition_cols, multi-table AND, missing-table→0, NA==NA, DECIMAL tolerance), duckdb-only (no pandas/numpy import), fail-closed spec guards + gold-basename allowlist, sealed shell (shlex.quote both args) and SQL (identifier-quoted condition_tabs) injection boundaries, link-mode symlink-write-through guards on all 5 asset copies, and the `{"reward": }` shape with no crash-into-pass. The 4 full-suite failures are pre-existing on the merge-base and outside this branch's scope (verified). The independent code review (which fetched the upstream Spider2 oracle) found zero blocking issues — 3 minor polish notes only. Gate verdict: PASSED → done. Full report: docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md diff --git a/docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md b/docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md new file mode 100644 index 00000000..c2fc2b48 --- /dev/null +++ b/docs/razorback-implementation/validation/spider2-dbt-duckdb-match-verifier.md @@ -0,0 +1,147 @@ +# Validation report — spider2-dbt duckdb_match verifier (cycles 1–10, final) + +- Entity: `docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md` +- Branch: `spacedock-ensign/spider2-dbt-duckdb-match-verifier` +- Validated HEAD: `965f2ea` (includes cycles 1–10) +- Merge-base with `main`: `9c39af2` +- Method: independent validation of the current worktree HEAD — fresh suite + runs from the clean branch checkout, standalone behavioral probes (not a + re-read of prior reports), and `superpowers:requesting-code-review`. + +## Acceptance criteria — PASS/FAIL with command + output + +### AC-1 — comparator scores 1.0 on a matching DB, 0.0 on a mismatch — PASS + +Verified by the comparator unit suite and a standalone live probe driving +`compare_duckdb` over in-test DuckDB fixtures. + +``` +$ uv run pytest -k spider2_dbt_verify --ignore=tests/unit/test_task_identity_scoring.py -q +49 passed, 795 deselected in 2.84s +``` + +Standalone probe drives `compare_duckdb` directly: matching DB → True, +mismatched DB → False; missing predicted table → False; multi-table AND with one +table mismatched → overall False. All PASS. + +### AC-2 — column subsetting + ignore_orders honor duckdb_match semantics — PASS + +Same probe, exercised live: + +- `ignore_orders=True` over a row-reordered table → match (True); same data with + `ignore_orders=False` → mismatch (False). +- A difference in a column NOT in `condition_cols` → still match (True); the same + difference WITH that column in `condition_cols` → mismatch (False). +- Column-containment: pred columns reordered + an extra pred column → still match. +- `math.isclose(abs_tol=1e-2)`: within 1e-2 → match, beyond → mismatch. +- DECIMAL(10,3) within 1e-2 → match (cycle-9 `_normalize` Decimal→float). +- NULL==NULL → match (NA==NA). + +All 11 probe cases PASS. Independently re-confirmed against the upstream +`xlang-ai/Spider2` `eval_utils.duckdb_match`/`compare_pandas_table` by the code +reviewer (fetched live during review) — line-accurate port, including the +`(x is None, str(x), isinstance(x,(int,float)))` sort key. + +### AC-3 — emitted test.sh writes a Harbor-shaped reward.json — PASS + +Verified by the integration suite and a standalone `emit_reward` probe: + +- Matching pred/gold → reward.json parses to `{"reward": 1.0}`, + `set(payload) == {"reward"}`, value is a `float`. +- Mismatch → `{"reward": 0.0}`. +- `emit_reward` never crashes-into-pass: garbage (non-JSON) spec over MATCHING + DBs → `{"reward": 0.0}`; empty-`condition_tabs` wrapped spec over matching DBs + → `{"reward": 0.0}` (the would-be-1.0 fail-open is closed); missing predicted + DB → `{"reward": 0.0}`. + +Integration test (`test_spider2_dbt_verify_test_sh.py`) executes the emitted +`test.sh` end-to-end and asserts the reward.json shape — green within the +`spider2_dbt_verify` 49-passed run above. + +## Stage-checklist verification of the dispatch focus areas + +- **Comparator faithfulness** — PASS. Column-containment, isclose 1e-2, per-column + sort, condition_cols restriction, multi-table AND, missing-table → 0, NA==NA, + DECIMAL within tolerance all confirmed live (probe) + reviewer's line-by-line + oracle comparison. +- **duckdb-only dependency** — PASS. + `grep -rn "import pandas\|from pandas\|import numpy\|from numpy" + src/razorback/benchmarks/spider2_dbt/` → no matches. Only `duckdb` + stdlib are + imported by the comparator/verifier. +- **Fail-closed guards** — PASS. Live probe: empty file, wrong `evaluation.func`, + empty/missing `condition_tabs`, and a wrapped spec missing `parameters.gold` + each RAISE; the gold-basename allowlist rejects all traversal/metachar cases + (`../…`, `/etc/passwd`, `sub/dir/g.duckdb`, `..`, `.`, `x.duckdb; … #`, + `g.duckdb $(id)`, `g .duckdb`, `g.sqlite`). Missing `tests/gold/` → + `_ensure_verifier_assets` raises `FileNotFoundError`. +- **Shell/SQL injection sealed** — PASS. `harbor_view.py` `shlex.quote`s BOTH + `--predicted-db` and `--gold-db` at the single emission point; + `condition_tabs` is identifier-quoted (doubled `"`) in `_fetch_columns`, and a + breakout value (`realt"; select 999 AS a; --`) over genuinely-mismatched DBs + RAISES (cannot force a match) — confirmed live. +- **Link-mode symlink-write-through** — PASS. All 5 verifier-asset copies route + through `_copy_into_view` (unlink-symlink-then-copy); test.sh, Dockerfile, and + preflight writes carry the same guard. Regression + `..._never_mutate_colliding_source_file` is green. +- **AC-3 reward.json shape / never-crash-into-pass** — PASS (see AC-3). + +## Suite results (clean branch checkout) + +``` +$ uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py -q +93 passed, 751 deselected in 5.34s + +$ uv run pytest --ignore=tests/unit/test_task_identity_scoring.py -q +4 failed, 828 passed, 12 skipped, 80 warnings in 68.05s +``` + +The 4 full-suite failures are pre-existing on the merge-base `9c39af2` +(verified by running exactly those 4 node-ids in a throwaway base worktree → +`4 failed`) and live in files UNTOUCHED by this branch +(`git diff 9c39af2..HEAD --name-only` is spider2_dbt scope + docs only). +NOT regressions: + +- `test_spacedock_solver_freeze_dir_mechanism.py::test_codex_runtime_dispatch_constructs_inner_agent` +- `test_worktree_teardown_preserves_runs.py::test_worktree_remove_force_does_not_destroy_runs` +- `test_generate_matrix_specs.py::test_matrix_specs_carry_query_mode_batch` +- `test_rk_research_new.py::test_rk_research_new_creates_scaffold_tree` + +## Code review findings (superpowers:requesting-code-review) + +Reviewer dispatched against `9c39af2..965f2ea`; it independently fetched the +upstream Spider2 oracle and ran the suite. + +- **Critical:** none. The reviewer could not construct any external input + (eval-spec JSON, gold basename, condition_tabs, task profile) that forces a + false reward 1.0. +- **Important:** none. +- **Minor (non-blocking):** + 1. `eval_spec.py:73` — `condition_cols` `[[]]`/`[None]` is expanded to + `[[]]*n` for any `n`; this is the *forgiving* direction and mirrors + upstream `compare_multi_pandas_table` defaulting (stricter `duckdb_match` + would assert). No correctness risk for well-formed specs; a clarifying + comment was suggested. + 2. The leakage-scan scoping that excludes the verify-only `tests/` subtree + rests on the Harbor "tests/ uploaded only at verify time, reset around the + agent run" lifecycle; a version-pinned comment was suggested so a future + Harbor bump can't silently invalidate it. The guard test proves the change + is scoping not weakening. + 3. `harbor_view.py:163` `spec.gold or "gold.duckdb"` fallback is effectively + dead for wrapped specs (they already raise without gold); reachable only + for the unwrapped-fixture path. A clarifying comment was suggested. + +All three are clarity/robustness polish, not correctness defects — classified +non-blocking. No production code was changed during validation. + +## Gate decision + +**PASSED → done.** + +Rationale: all three ACs reproduce green from a clean checkout with actual +command output (gating suite 93 passed; acceptance `-k spider2_dbt_verify` 49 +passed), every dispatch-named focus area is confirmed by live behavioral probes +(comparator faithfulness, duckdb-only, fail-closed guards, shell+SQL injection +sealed, symlink write-through, reward.json shape), the only full-suite failures +are pre-existing on the merge-base and outside this branch's scope, and the +independent code review found zero blocking issues (3 minor polish notes). No +external input forces a false reward 1.0. diff --git a/src/razorback/benchmarks/spider2_dbt/duckdb_match.py b/src/razorback/benchmarks/spider2_dbt/duckdb_match.py new file mode 100644 index 00000000..1b87a05b --- /dev/null +++ b/src/razorback/benchmarks/spider2_dbt/duckdb_match.py @@ -0,0 +1,158 @@ +# ABOUTME: spider2-dbt comparator faithfully reproducing Spider2 eval_utils.duckdb_match. +# ABOUTME: Column-containment over column-vectors, math.isclose(1e-2), AND across tables. +# Depends ONLY on duckdb (the one library the verify-time image is guaranteed to +# have — the build-time preflight already imports it); no pandas, so the emitted +# verifier cannot crash-on-import in a task image that ships dbt-duckdb but not pandas. +from __future__ import annotations + +import math +from decimal import Decimal +from pathlib import Path + +import duckdb + +try: + from razorback.benchmarks.spider2_dbt.eval_spec import EvalSpec +except ModuleNotFoundError: # running flat from /tests in the verifier container + from eval_spec import EvalSpec # type: ignore[no-redef] + +# Spider2 eval_utils.compare_pandas_table numeric tolerance. +_TOLERANCE = 1e-2 + + +def _isna(x) -> bool: + """stdlib stand-in for pandas.isna over scalar cells fetched from DuckDB. + + DuckDB returns SQL NULL as Python ``None``; a float ``NaN`` is the only other + NA-like scalar. Mirrors Spider2's ``pd.isna(...)`` NaN==NaN handling. + """ + return x is None or (isinstance(x, float) and math.isnan(x)) + + +def _vectors_match(v1, v2, *, tol: float = _TOLERANCE, ignore_order: bool = False) -> bool: + """Port of Spider2 eval_utils.compare_pandas_table.vectors_match. + + Compares two column-vectors element-wise: NA==NA, numerics via + math.isclose(abs_tol=tol), everything else by ``!=``. When ignore_order + is set, both vectors are sorted first with Spider2's exact key. + """ + try: + if ignore_order: + key = lambda x: (x is None, str(x), isinstance(x, (int, float))) + v1 = sorted(v1, key=key) + v2 = sorted(v2, key=key) + if len(v1) != len(v2): + return False + for a, b in zip(v1, v2): + if _isna(a) and _isna(b): + continue + elif isinstance(a, (int, float)) and isinstance(b, (int, float)): + if not math.isclose(float(a), float(b), abs_tol=tol): + return False + elif a != b: + return False + return True + except Exception: + return False + + +def _compare_table( + pred_cols: list[list], + gold_cols: list[list], + *, + condition_cols: list[int], + ignore_order: bool, +) -> bool: + """Port of Spider2 eval_utils.compare_pandas_table. + + Column-CONTAINMENT (not positional row equality): restrict gold to + ``condition_cols`` (all columns when empty), then require that EVERY gold + column-vector matches SOME pred column-vector. Extra pred columns are + therefore tolerated. Inputs are already transposed to column-vectors. + """ + if condition_cols: + gold_vecs = [gold_cols[c] for c in condition_cols] + else: + gold_vecs = gold_cols + + for gold_vec in gold_vecs: + if not any( + _vectors_match(gold_vec, pred_vec, ignore_order=ignore_order) + for pred_vec in pred_cols + ): + return False + return True + + +def _quote_ident(name: str) -> str: + """Quote a SQL identifier, doubling embedded double-quotes. + + ``condition_tabs`` table names come from the external eval spec, so a value + like ``realt"; select 999; --`` would otherwise break out of the quoted + identifier and DuckDB would execute the injected statement, letting a hostile + spec rig gold/pred fetches to return identical rows and award reward 1.0. + Doubling keeps the value a single (bogus) identifier; it then fails to + resolve and the fetch raises -> emit_reward scores 0 (fail-closed). + """ + return '"' + name.replace('"', '""') + '"' + + +def _fetch_columns(con: duckdb.DuckDBPyConnection, table: str) -> list[list]: + """``SELECT * FROM `table``` as a list of column-vectors (transposed rows). + + Replaces Spider2's ``fetchdf().transpose().values.tolist()``: DuckDB's + ``.fetchall()`` yields native Python scalars in row order (NULL -> None), + and ``.description`` fixes the column count so a zero-row table still + yields one empty vector per column. + """ + cur = con.execute(f"SELECT * FROM {_quote_ident(table)}") + rows = cur.fetchall() + ncols = len(cur.description) + return [[_normalize(row[j]) for row in rows] for j in range(ncols)] + + +def _normalize(v): + """Mirror Spider2's pandas fetchdf() DECIMAL->float64 coercion. + + DuckDB native fetchall() returns ``decimal.Decimal`` for DECIMAL/NUMERIC, + but Decimal is ``numbers.Number`` yet NOT ``numbers.Real``, so it would skip + ``_vectors_match``'s tolerance branch and a within-1e-2 DECIMAL match would + wrongly score 0. Normalize to float in one place so downstream compares see + the same scalar types Spider2's pandas path produced. + """ + return float(v) if isinstance(v, Decimal) else v + + +def compare_duckdb(*, predicted_db: Path, gold_db: Path, spec: EvalSpec) -> bool: + """Faithfully reproduce Spider2 eval_utils.duckdb_match. + + For each table in ``spec.condition_tabs``: load gold + predicted via + ``SELECT *``, run column-containment with the table's + ``spec.condition_cols[i]`` / ``spec.ignore_orders[i]``. Overall match is + the AND across all tables. A missing/unreadable predicted table is a + mismatch (Spider2 wraps the predicted-table fetch in try/except -> 0). + """ + gold_con = duckdb.connect(str(gold_db), read_only=True) + try: + gold_tables = [_fetch_columns(gold_con, t) for t in spec.condition_tabs] + finally: + gold_con.close() + + pred_con = duckdb.connect(str(predicted_db), read_only=True) + try: + try: + pred_tables = [_fetch_columns(pred_con, t) for t in spec.condition_tabs] + except Exception: + return False + finally: + pred_con.close() + + for i, (gold_cols, pred_cols) in enumerate(zip(gold_tables, pred_tables)): + if not _compare_table( + pred_cols, + gold_cols, + condition_cols=spec.condition_cols[i], + ignore_order=spec.ignore_orders[i], + ): + return False + return True diff --git a/src/razorback/benchmarks/spider2_dbt/eval_spec.py b/src/razorback/benchmarks/spider2_dbt/eval_spec.py new file mode 100644 index 00000000..f7449877 --- /dev/null +++ b/src/razorback/benchmarks/spider2_dbt/eval_spec.py @@ -0,0 +1,173 @@ +# ABOUTME: spider2-dbt gold eval-spec model + loader (one line of spider2_eval.jsonl). +# ABOUTME: Mirrors Spider2 evaluate.py: evaluation.parameters drives duckdb_match. +from __future__ import annotations + +import json +import re +from dataclasses import dataclass, field +from pathlib import Path + +# Conservative allowlist for the spec-supplied gold DB basename. A real Spider2 +# gold filename is a plain `.duckdb`; restricting to this set blocks path +# traversal AND shell injection at the trust boundary (see load_eval_spec). +_SAFE_GOLD_RE = re.compile(r"[A-Za-z0-9._-]+\.duckdb") + + +@dataclass(frozen=True) +class EvalSpec: + """One Spider2-dbt gold eval entry (a line of spider2_eval.jsonl). + + Mirrors the real Spider2 gold-line shape consumed by + ``spider2-dbt/evaluation_suite/evaluate.py``:: + + {"instance_id": "...", + "evaluation": {"func": "duckdb_match", + "parameters": {"gold": "", + "condition_tabs": ["t1", "t2"], + "condition_cols": [[0, 2], []], + "ignore_orders": [true, false]}}} + + The three list fields are positional and parallel to ``condition_tabs``: + + gold: basename of the gold ``.duckdb`` to score against, taken verbatim + from ``evaluation.parameters.gold``. Real Spider2 tasks name the gold + DB per task (e.g. ``playbook.duckdb``, ``tpch.duckdb``), so this drives + which file the verifier compares — it is NOT hardcoded to + ``gold.duckdb``. Mirrors Spider2 ``evaluate.py:97``, which resolves + ``parameters['gold']`` to a per-task gold path before calling + ``duckdb_match``. ``None`` only for bare-``parameters`` fixtures with no + ``evaluation`` wrapper. + condition_tabs: gold/pred table names to compare (``List[str]``). + condition_cols: per-table 0-based gold column indices to restrict the + gold side to before column-containment (``List[List[int]]``). An + empty inner list means "use all gold columns" for that table. This + mirrors Spider2 ``duckdb_match``'s ``condition_cols[i]`` argument to + ``compare_pandas_table``. + ignore_orders: per-table flag for order-insensitive per-column compare + (``List[bool]``). + + Per Spider2 ``duckdb_match`` defaults: a missing/empty ``condition_cols`` + becomes ``[[]] * len(condition_tabs)`` and a missing ``ignore_orders`` + becomes ``[False] * len(condition_tabs)``. + """ + + condition_tabs: list[str] + condition_cols: list[list[int]] = field(default_factory=list) + ignore_orders: list[bool] = field(default_factory=list) + gold: str | None = None + + def __post_init__(self) -> None: + n = len(self.condition_tabs) + # Fail closed: a zero-table spec would make compare_duckdb's AND-loop + # never run and return True — silently scoring every prediction 1.0. + # A corrupted/truncated/schema-drifted gold line must NOT score as a + # match, so refuse to construct an empty-condition_tabs spec. + if n == 0: + raise ValueError( + "EvalSpec requires a non-empty condition_tabs; a zero-table " + "spec would score every prediction as a match (fail-open)" + ) + # Normalize to Spider2 duckdb_match defaults so the comparator can + # index condition_cols[i] / ignore_orders[i] for every table. + cols = self.condition_cols + if not cols or cols in ([[]], [None]): + object.__setattr__(self, "condition_cols", [[] for _ in range(n)]) + orders = self.ignore_orders + if not orders: + object.__setattr__(self, "ignore_orders", [False] * n) + if len(self.condition_cols) != n: + raise ValueError( + f"condition_cols ({len(self.condition_cols)}) must be parallel " + f"to condition_tabs ({n})" + ) + if len(self.ignore_orders) != n: + raise ValueError( + f"ignore_orders ({len(self.ignore_orders)}) must be parallel " + f"to condition_tabs ({n})" + ) + + +def load_eval_spec(path: Path) -> EvalSpec: + """Load a single-task gold eval spec from a spider2_eval.jsonl line. + + Reads the first line (one task per line), drills into + ``evaluation.parameters`` (the shape Spider2 ``evaluate.py`` passes to + ``duckdb_match``), and returns an :class:`EvalSpec`. Tolerates a bare + ``parameters``-shaped object (no ``evaluation`` wrapper) for fixtures. + + Fails closed on a corrupted/truncated/schema-drifted gold line: an empty + file, a wrong ``evaluation.func``, a missing/empty ``condition_tabs``, or + (for a real wrapped spec) a missing/empty ``parameters.gold`` each raise + ``ValueError`` rather than yielding a spec that ``compare_duckdb`` would + silently score as a match or that would score against the wrong gold file. + """ + text = Path(path).read_text().strip() + if not text: + raise ValueError( + f"empty gold eval spec at {path}: cannot score a missing/truncated " + "spider2_eval.jsonl (fail-open guard)" + ) + first_line = text.splitlines()[0] + raw = json.loads(first_line) + + # When the real Spider2 evaluation wrapper is present, the func MUST be + # duckdb_match — any other func means this verifier cannot faithfully score + # the line, so refuse rather than fall through to a match. + evaluation = raw.get("evaluation") + wrapped = evaluation is not None + if wrapped: + func = evaluation.get("func") + if func != "duckdb_match": + raise ValueError( + f"unsupported evaluation.func {func!r} in {path}: this verifier " + "only scores 'duckdb_match' (fail-open guard)" + ) + params = evaluation.get("parameters", {}) + else: + params = raw + + gold = params.get("gold") + # A real wrapped Spider2 gold line MUST name its per-task gold DB (Spider2 + # evaluate.py resolves parameters['gold'] to a per-task path). A missing or + # empty gold basename means we cannot know which file to score against, so + # fail closed rather than fall back to a hardcoded gold.duckdb. + if wrapped and not (isinstance(gold, str) and gold.strip()): + raise ValueError( + f"missing/empty evaluation.parameters.gold in {path}: a wrapped " + "duckdb_match spec must name its per-task gold .duckdb (fail-open " + "guard)" + ) + # `gold` is external Spider2 input that the materializer joins onto a path + # (tests/gold/) and emits UNQUOTED into the verifier test.sh + # (--gold-db /tests/). A conservative allowlist closes the whole class + # at the trust boundary: it subsumes the path checks (no separators, `..`, or + # absolute paths get through) AND rejects shell metacharacters/whitespace, so + # a malformed/hostile spec can neither escape tests/gold/ nor inject shell + # syntax into the verifier script. Real Spider2 gold names are plain + # `.duckdb`, so this is tighter than the data requires by design. + if isinstance(gold, str) and gold.strip(): + g = gold.strip() + if not _SAFE_GOLD_RE.fullmatch(g): + raise ValueError( + f"unsafe evaluation.parameters.gold {gold!r} in {path}: must " + "match [A-Za-z0-9._-]+.duckdb (a bare .duckdb basename, no path " + "separators, '..', whitespace, or shell metacharacters) — " + "refusing to score outside tests/gold/ or inject into test.sh " + "(fail-closed)" + ) + gold = g + + condition_tabs = list(params.get("condition_tabs", [])) + raw_cols = params.get("condition_cols") + condition_cols = ( + [list(inner) for inner in raw_cols] if raw_cols is not None else [] + ) + raw_orders = params.get("ignore_orders") + ignore_orders = [bool(b) for b in raw_orders] if raw_orders is not None else [] + + return EvalSpec( + condition_tabs=condition_tabs, + condition_cols=condition_cols, + ignore_orders=ignore_orders, + gold=gold if isinstance(gold, str) else None, + ) diff --git a/src/razorback/benchmarks/spider2_dbt/harbor_view.py b/src/razorback/benchmarks/spider2_dbt/harbor_view.py index 1f869f5f..f41ed32d 100644 --- a/src/razorback/benchmarks/spider2_dbt/harbor_view.py +++ b/src/razorback/benchmarks/spider2_dbt/harbor_view.py @@ -2,9 +2,13 @@ import shlex import shutil +import stat from pathlib import Path from typing import Literal +from razorback.benchmarks.spider2_dbt import duckdb_match as _duckdb_match_mod +from razorback.benchmarks.spider2_dbt import eval_spec as _eval_spec_mod +from razorback.benchmarks.spider2_dbt import verify as _verify_mod from razorback.benchmarks.spider2_dbt.preflight import ( preflight_script_text, resolve_spider2_db_name, @@ -46,6 +50,21 @@ "# Razorback: validate spider2-dbt source DuckDB before agent runtime." ) +# Emitted into the view's tests/ — Harbor uploads tests/ to the container only +# at verify time, so the gold .duckdb + eval spec reach the verifier but never +# the agent. `{predicted_db}` is filled from `resolve_spider2_db_name` (the +# SHARED `/app/.duckdb` contract) so the verifier scores the SAME +# DuckDB the build-time preflight validated — NOT a hardcoded /app/spider2.duckdb. +_TEST_SH_TEMPLATE = """#!/bin/sh +set -eu +mkdir -p /logs/verifier +python /tests/verify.py \\ + --predicted-db {predicted_db} \\ + --gold-db {gold_db} \\ + --eval-spec /tests/spider2_eval.jsonl \\ + --reward-out /logs/verifier/reward.json +""" + def materialize_spider2_harbor_task_view( *, @@ -75,9 +94,121 @@ def materialize_spider2_harbor_task_view( _ensure_spider2_build_context_layer(view) _ensure_dbt_deps_image_layer(view) _ensure_workspace_preflight_image_layer(view, task_slug=task_slug) + _ensure_verifier_assets( + view, source_task_dir=Path(source_task_dir), task_slug=task_slug + ) return view +def _copy_into_view(src: Path, dst: Path) -> None: + """copy2 src->dst, first replacing any symlink dst with a view-owned file. + + In link mode the generic materializer reflects allowed source files as + symlinks, so a view path that collides with a source-provided name (e.g. a + source `tests/verify.py` or top-level `tests/.duckdb`) is a symlink + back to the source; a bare copy2 would follow it and overwrite the source + task. Same write-through class the Dockerfile/preflight/test.sh writes guard. + """ + if dst.is_symlink(): + dst.unlink() + shutil.copy2(src, dst) + + +def _ensure_verifier_assets( + view_dir: Path, *, source_task_dir: Path, task_slug: str +) -> None: + """Copy the comparator + gold data + test.sh into the view's tests/ dir. + + Gold assets are read from the SOURCE task's tests/gold/ (the reflected view + stripped them via the **/gold/** deny-glob) and written WITHOUT a `gold/` + path segment so `assert_no_denied_paths` stays green while the + verifier-uploaded tests/ dir still carries them. The emitted test.sh's + `--predicted-db` is `/app/.duckdb` where `` comes from the + SHARED `resolve_spider2_db_name` resolver (RIDER, Codex r5-plan finding) — + never a hardcoded `/app/spider2.duckdb`. + + The GOLD DB basename is driven by the eval spec's + ``evaluation.parameters.gold`` (real Spider2 tasks name the gold per task, + e.g. ``playbook.duckdb``) — the EXACT named file is copied and + ``--gold-db /tests/`` is emitted. Fails closed (raises) if the + spec names a gold file that does not exist under ``tests/gold/``, so the + verifier never scores against a missing/wrong gold. + """ + # Every spider2-dbt task is duckdb_match-scored, so its source MUST ship + # tests/gold/spider2_eval.jsonl (+ the named gold DB). A missing gold dir is + # NOT a "skip scoring" signal — silently returning would leave the source + # test.sh in place (e.g. a stub `exit 0`), materializing an unscored / + # trivially-passing task under dataset skew or a resolver bug. Fail closed. + source_gold = Path(source_task_dir) / "tests" / "gold" + source_spec = source_gold / "spider2_eval.jsonl" + if not source_spec.is_file(): + raise FileNotFoundError( + f"spider2-dbt task {task_slug!r} is missing its gold eval spec " + f"{source_spec} — every spider2-dbt task is duckdb_match-scored and " + "must ship tests/gold/spider2_eval.jsonl + its named gold DB; " + "refusing to materialize an unscored task (fail-closed)" + ) + + tests = view_dir / "tests" + tests.mkdir(parents=True, exist_ok=True) + for mod in (_duckdb_match_mod, _eval_spec_mod, _verify_mod): + src = Path(mod.__file__) + _copy_into_view(src, tests / src.name) + + # Resolve the gold DB basename from the spec (NOT hardcoded gold.duckdb). + # Real Spider2 tasks name the gold per task; scoring the wrong/missing file + # is a benchmark-correctness defect. A wrapped spec without parameters.gold + # fails closed inside load_eval_spec. + spec = _eval_spec_mod.load_eval_spec(source_spec) + gold_basename = spec.gold or "gold.duckdb" + source_gold_db = source_gold / gold_basename + if not source_gold_db.is_file(): + raise FileNotFoundError( + f"spider2-dbt gold spec names gold DB {gold_basename!r} but " + f"{source_gold_db} does not exist; refusing to emit a verifier that " + "would score against a missing gold (fail-closed)" + ) + _copy_into_view(source_gold_db, tests / gold_basename) + _copy_into_view(source_spec, tests / "spider2_eval.jsonl") + + # Resolve the agent-facing DuckDB stem from the dbt project (or the slug + # fallback) so the verifier compares the SAME `/app/.duckdb` the + # preflight validated and the agent operates against. + project_dir = _dbt_project_dir(view_dir) + db_name = resolve_spider2_db_name( + project_dir if project_dir is not None else view_dir, + task_slug=task_slug, + ) + test_sh = tests / "test.sh" + if test_sh.is_symlink(): + test_sh.unlink() + # shlex.quote BOTH args: db_name is resolved from the task's profiles.yml + # path (external input) and gold_basename from the eval spec; emitting either + # unquoted into the verifier shell script is a command-injection boundary. + # gold_basename is also allowlisted in load_eval_spec (traversal defense); + # quoting closes the injection half for both arguments at the emission point. + test_sh.write_text( + _TEST_SH_TEMPLATE.format( + predicted_db=shlex.quote(f"{_APP_ROOT}/{db_name}.duckdb"), + gold_db=shlex.quote(f"/tests/{gold_basename}"), + ) + ) + test_sh.chmod( + test_sh.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH + ) + + # The deny-glob reflection strips gold FILES but leaves the now-empty + # `gold/` directory behind. Prune empty `gold/`-named dirs so no `gold/` + # path segment survives anywhere in the agent-facing view. + for gold_dir in sorted( + (p for p in view_dir.rglob("gold") if p.is_dir()), + key=lambda p: len(p.parts), + reverse=True, + ): + if not any(gold_dir.iterdir()): + gold_dir.rmdir() + + def _has_dbt_project(view_dir: Path) -> bool: """spider2-dbt nests the dbt project under `dbt_project/` (or under `environment/dbt_project/`).""" diff --git a/src/razorback/benchmarks/spider2_dbt/verify.py b/src/razorback/benchmarks/spider2_dbt/verify.py new file mode 100644 index 00000000..6d68424a --- /dev/null +++ b/src/razorback/benchmarks/spider2_dbt/verify.py @@ -0,0 +1,70 @@ +# ABOUTME: spider2-dbt verifier — compares predicted vs gold .duckdb via +# ABOUTME: duckdb_match semantics and writes harbor's {"reward": } file. +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path + +try: + from razorback.benchmarks.spider2_dbt.duckdb_match import compare_duckdb + from razorback.benchmarks.spider2_dbt.eval_spec import load_eval_spec +except ModuleNotFoundError: # running flat from /tests in the verifier container + from duckdb_match import compare_duckdb # type: ignore[no-redef] + from eval_spec import load_eval_spec # type: ignore[no-redef] + + +def emit_reward( + *, + predicted_db: Path, + gold_db: Path, + eval_spec: Path, + reward_out: Path, +) -> None: + """Compute the binary duckdb_match reward and write harbor's reward.json.""" + err: str | None = None + if not Path(predicted_db).exists(): + is_match = False + err = f"predicted DB not found ({predicted_db})" + else: + # Fail closed: an empty / malformed / schema-drifted gold spec must + # surface as reward 0, never crash-into-pass and never silently 1.0. + try: + spec = load_eval_spec(Path(eval_spec)) + is_match = compare_duckdb( + predicted_db=Path(predicted_db), + gold_db=Path(gold_db), + spec=spec, + ) + except Exception as exc: # noqa: BLE001 — any spec/compare failure = non-match + is_match = False + err = f"eval-spec/compare error: {exc}" + payload = {"reward": 1.0 if is_match else 0.0} + Path(reward_out).parent.mkdir(parents=True, exist_ok=True) + Path(reward_out).write_text(json.dumps(payload) + "\n") + if not is_match: + detail = f": {err}" if err else "" + sys.stderr.write( + f"spider2-dbt verify: mismatch (predicted={predicted_db}){detail}\n" + ) + + +def main() -> int: + parser = argparse.ArgumentParser() + parser.add_argument("--predicted-db", type=Path, required=True) + parser.add_argument("--gold-db", type=Path, required=True) + parser.add_argument("--eval-spec", type=Path, required=True) + parser.add_argument("--reward-out", type=Path, required=True) + args = parser.parse_args() + emit_reward( + predicted_db=args.predicted_db, + gold_db=args.gold_db, + eval_spec=args.eval_spec, + reward_out=args.reward_out, + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/build_gold.py b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/build_gold.py new file mode 100644 index 00000000..7211be1a --- /dev/null +++ b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/build_gold.py @@ -0,0 +1,44 @@ +# ABOUTME: Rebuilds the gold .duckdb for the spider2-dbt verifier fixture. +# ABOUTME: Run: uv run python (regenerates gold.duckdb next to it). +import json +from pathlib import Path + +import duckdb + +HERE = Path(__file__).parent + + +def build() -> None: + db = HERE / "gold.duckdb" + if db.exists(): + db.unlink() + con = duckdb.connect(str(db)) + try: + con.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)") + con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 200)]) + finally: + con.close() + # Real Spider2 gold-line shape (spider2-dbt/evaluation_suite/evaluate.py): + # instance_id + evaluation.parameters with per-table List[List[int]] + # condition_cols and List[bool] ignore_orders. + (HERE / "spider2_eval.jsonl").write_text( + json.dumps( + { + "instance_id": "spider2-fixture-001", + "evaluation": { + "func": "duckdb_match", + "parameters": { + "gold": "gold.duckdb", + "condition_tabs": ["orders"], + "condition_cols": [[0, 1]], + "ignore_orders": [True], + }, + }, + } + ) + + "\n" + ) + + +if __name__ == "__main__": + build() diff --git a/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/gold.duckdb b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/gold.duckdb new file mode 100644 index 00000000..2c038e71 Binary files /dev/null and b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/gold.duckdb differ diff --git a/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/spider2_eval.jsonl b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/spider2_eval.jsonl new file mode 100644 index 00000000..ea685123 --- /dev/null +++ b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-001/tests/gold/spider2_eval.jsonl @@ -0,0 +1 @@ +{"instance_id": "spider2-fixture-001", "evaluation": {"func": "duckdb_match", "parameters": {"gold": "gold.duckdb", "condition_tabs": ["orders"], "condition_cols": [[0, 1]], "ignore_orders": [true]}}} diff --git a/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/gold.duckdb b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/gold.duckdb new file mode 100644 index 00000000..cd81945e Binary files /dev/null and b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/gold.duckdb differ diff --git a/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/spider2_eval.jsonl b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/spider2_eval.jsonl new file mode 100644 index 00000000..91a6c01c --- /dev/null +++ b/tests/fixtures/spider2_dbt/harbor_task_minimal/spider2-fixture-002/tests/gold/spider2_eval.jsonl @@ -0,0 +1 @@ +{"instance_id": "spider2-fixture-002", "evaluation": {"func": "duckdb_match", "parameters": {"gold": "gold.duckdb", "condition_tabs": ["orders"], "condition_cols": [[0, 1]], "ignore_orders": [true]}}} diff --git a/tests/integration/test_spider2_dbt_verify_test_sh.py b/tests/integration/test_spider2_dbt_verify_test_sh.py new file mode 100644 index 00000000..2da9c81e --- /dev/null +++ b/tests/integration/test_spider2_dbt_verify_test_sh.py @@ -0,0 +1,90 @@ +# ABOUTME: AC-3 end-to-end — the emitted verifier assets produce a harbor-shaped +# ABOUTME: reward.json. Exercises the materialized tests/ dir, not a re-import. +import json +import subprocess +import sys +from pathlib import Path + +import duckdb + +from razorback.benchmarks.spider2_dbt.harbor_view import ( + materialize_spider2_harbor_task_view, +) + +_SOURCE = ( + Path(__file__).resolve().parents[1] + / "fixtures" + / "spider2_dbt" + / "harbor_task_minimal" + / "spider2-fixture-001" +) + + +def _build_predicted_matching_gold(path: Path) -> None: + con = duckdb.connect(str(path)) + try: + con.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)") + # rows reordered vs gold; ignore_orders=True -> still a match + con.executemany("INSERT INTO orders VALUES (?, ?)", [(2, 200), (1, 100)]) + finally: + con.close() + + +def test_spider2_dbt_verify_emitted_assets_write_reward_json(tmp_path): + view = materialize_spider2_harbor_task_view( + source_task_dir=_SOURCE, + view_root=tmp_path / "views", + task_slug="spider2-fixture-001", + ) + tests = view / "tests" + predicted = tmp_path / "spider2-fixture-001.duckdb" + _build_predicted_matching_gold(predicted) + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + + result = subprocess.run( + [ + sys.executable, + str(tests / "verify.py"), + "--predicted-db", + str(predicted), + "--gold-db", + str(tests / "gold.duckdb"), + "--eval-spec", + str(tests / "spider2_eval.jsonl"), + "--reward-out", + str(reward_out), + ], + capture_output=True, + text=True, + cwd=str(tests), # flat-import fallback resolves duckdb_match/eval_spec here + ) + assert result.returncode == 0, result.stderr + payload = json.loads(reward_out.read_text()) + assert set(payload) == {"reward"} + assert isinstance(payload["reward"], float) + assert payload["reward"] == 1.0 # reordered rows match under ignore_orders + + +def test_spider2_dbt_verify_emitted_test_sh_is_runnable(tmp_path): + # Proves the emitted test.sh is a valid shell script with the right shape: + # it references verify.py, the resolved /app/.duckdb predicted path, + # and the harbor reward path. A full container run is out of scope (no docker + # in unit/integration); this checks the contract. + view = materialize_spider2_harbor_task_view( + source_task_dir=_SOURCE, + view_root=tmp_path / "v", + task_slug="spider2-fixture-001", + ) + text = (view / "tests" / "test.sh").read_text() + assert "verify.py" in text + assert "/logs/verifier/reward.json" in text + # RIDER: predicted path is the resolver's /app/.duckdb, not hardcoded + assert "/app/spider2-fixture-001.duckdb" in text + assert "/app/spider2.duckdb" not in text + # the script is syntactically valid sh + check = subprocess.run( + ["sh", "-n", str(view / "tests" / "test.sh")], + capture_output=True, + text=True, + ) + assert check.returncode == 0, check.stderr diff --git a/tests/unit/test_spider2_dbt_harbor_view.py b/tests/unit/test_spider2_dbt_harbor_view.py index b48b484e..b3914fcb 100644 --- a/tests/unit/test_spider2_dbt_harbor_view.py +++ b/tests/unit/test_spider2_dbt_harbor_view.py @@ -51,6 +51,23 @@ def _write_source( conn.execute("CREATE TABLE orders (id INTEGER)") finally: conn.close() + # Every spider2-dbt task is duckdb_match-scored, so the materializer now + # fails closed without tests/gold/. Give the source a minimal gold so these + # layer-focused tests still materialize (they don't assert on the verifier). + import duckdb as _dk + + gold = source / "tests" / "gold" + gold.mkdir(parents=True) + _conn = _dk.connect(str(gold / "gold.duckdb")) + try: + _conn.execute("CREATE TABLE orders (id INTEGER); INSERT INTO orders VALUES (1)") + finally: + _conn.close() + (gold / "spider2_eval.jsonl").write_text( + '{"instance_id": "spider2-fixture-001", "evaluation": {"func": ' + '"duckdb_match", "parameters": {"gold": "gold.duckdb", "condition_tabs": ' + '["orders"], "condition_cols": [[0]], "ignore_orders": [true]}}}\n' + ) lines = dockerfile_lines or [ "FROM python:3.12", "WORKDIR /app", @@ -264,6 +281,22 @@ def test_preflight_layer_absent_when_not_a_dbt_project(tmp_path): (source / "environment" / "Dockerfile").write_text( "FROM python:3.12\nCMD [\"bash\"]\n" ) + # Gold is required (the materializer fails closed without it); this test is + # about the preflight layer being absent for a non-dbt source, not scoring. + import duckdb as _dk + + gold = source / "tests" / "gold" + gold.mkdir(parents=True) + _conn = _dk.connect(str(gold / "gold.duckdb")) + try: + _conn.execute("CREATE TABLE orders (id INTEGER); INSERT INTO orders VALUES (1)") + finally: + _conn.close() + (gold / "spider2_eval.jsonl").write_text( + '{"instance_id": "plain-001", "evaluation": {"func": "duckdb_match", ' + '"parameters": {"gold": "gold.duckdb", "condition_tabs": ["orders"], ' + '"condition_cols": [[0]], "ignore_orders": [true]}}}\n' + ) view = materialize_spider2_harbor_task_view( source_task_dir=source, @@ -384,3 +417,36 @@ def test_link_mode_preflight_script_never_mutates_source_named_file(tmp_path): # The SOURCE file is byte-for-byte unchanged — no write followed the symlink. assert source_script.read_text() == source_script_before + + +def test_link_mode_verifier_assets_never_mutate_colliding_source_file(tmp_path): + """A source task shipping a colliding tests/verify.py must not be corrupted. + + `_ensure_verifier_assets` copies the comparator modules + gold + eval spec + into view/tests/. Under `view_mode="link"` a source-provided tests/verify.py + is reflected as a symlink back into the source, so an unguarded copy2 would + follow it and overwrite the user's source file — same write-through class as + the Dockerfile/preflight/test.sh guards. + """ + source = _write_source( + tmp_path / "source", with_packages=True, with_duckdb=True + ) + source_verify = source / "tests" / "verify.py" + source_verify_before = "# user's own tests/verify.py, not the generated one\n" + source_verify.write_text(source_verify_before) + + view = materialize_spider2_harbor_task_view( + source_task_dir=source, + view_root=tmp_path / "views", + task_slug="spider2-fixture-001", + view_mode="link", + ) + + # The view owns a real verify.py carrying the generated comparator CLI. + view_verify = view / "tests" / "verify.py" + assert view_verify.is_file() + assert not view_verify.is_symlink() + assert "emit_reward" in view_verify.read_text() + + # The SOURCE file is byte-for-byte unchanged — no copy2 followed the symlink. + assert source_verify.read_text() == source_verify_before diff --git a/tests/unit/test_spider2_dbt_verify_cli.py b/tests/unit/test_spider2_dbt_verify_cli.py new file mode 100644 index 00000000..bb096bcb --- /dev/null +++ b/tests/unit/test_spider2_dbt_verify_cli.py @@ -0,0 +1,360 @@ +# ABOUTME: spider2-dbt verify.py CLI + materializer-emission tests (AC-3). +# ABOUTME: emit_reward writes harbor-shaped reward.json; the view carries verifier assets. +import json +from pathlib import Path + +import duckdb + +from razorback.benchmarks.spider2_dbt.harbor_view import ( + materialize_spider2_harbor_task_view, +) +from razorback.benchmarks.spider2_dbt.verify import emit_reward + +_SOURCE = ( + Path(__file__).parent.parent + / "fixtures" + / "spider2_dbt" + / "harbor_task_minimal" + / "spider2-fixture-001" +) + + +def _build_db(path, rows): + con = duckdb.connect(str(path)) + try: + con.execute("CREATE TABLE orders (a INTEGER)") + con.executemany("INSERT INTO orders VALUES (?)", [(r,) for r in rows]) + finally: + con.close() + + +def _spec(path): + # Real Spider2 gold-line shape: evaluation.parameters with per-table lists. + path.write_text( + json.dumps( + { + "instance_id": "cli-001", + "evaluation": { + "func": "duckdb_match", + "parameters": { + "gold": "gold.duckdb", + "condition_tabs": ["orders"], + "condition_cols": [[]], + "ignore_orders": [True], + }, + }, + } + ) + + "\n" + ) + return path + + +def test_spider2_dbt_verify_cli_emits_reward_one_on_match(tmp_path): + _build_db(tmp_path / "pred.duckdb", [1, 2]) + _build_db(tmp_path / "gold.duckdb", [2, 1]) + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + emit_reward( + predicted_db=tmp_path / "pred.duckdb", + gold_db=tmp_path / "gold.duckdb", + eval_spec=_spec(tmp_path / "spider2_eval.jsonl"), + reward_out=reward_out, + ) + assert json.loads(reward_out.read_text()) == {"reward": 1.0} + + +def test_spider2_dbt_verify_cli_emits_reward_zero_on_mismatch(tmp_path): + _build_db(tmp_path / "pred.duckdb", [1, 2]) + _build_db(tmp_path / "gold.duckdb", [9, 9]) + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + emit_reward( + predicted_db=tmp_path / "pred.duckdb", + gold_db=tmp_path / "gold.duckdb", + eval_spec=_spec(tmp_path / "spider2_eval.jsonl"), + reward_out=reward_out, + ) + payload = json.loads(reward_out.read_text()) + assert payload == {"reward": 0.0} + # parent dir was created by emit_reward (mirrors dab/verify.py:31) + assert reward_out.parent.is_dir() + + +def test_spider2_dbt_verify_cli_missing_predicted_scores_zero(tmp_path): + # A predicted DB the agent never produced is a 0.0 reward, not a crash. + _build_db(tmp_path / "gold.duckdb", [1]) + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + emit_reward( + predicted_db=tmp_path / "does-not-exist.duckdb", + gold_db=tmp_path / "gold.duckdb", + eval_spec=_spec(tmp_path / "spider2_eval.jsonl"), + reward_out=reward_out, + ) + assert json.loads(reward_out.read_text()) == {"reward": 0.0} + + +def test_spider2_dbt_verify_cli_empty_spec_scores_zero_not_one(tmp_path): + # FAIL-CLOSED (cycle-2 B1): a real predicted DB matching a gold DB, but with + # an empty / malformed gold spec, must score 0.0 — never silently 1.0. An + # empty condition_tabs is the hazard (compare_duckdb's AND-loop returns True + # on zero tables), so emit_reward must surface it as a NON-match. + _build_db(tmp_path / "pred.duckdb", [1, 2]) + _build_db(tmp_path / "gold.duckdb", [1, 2]) # identical -> would be 1.0 if scored + empty_spec = tmp_path / "spider2_eval.jsonl" + empty_spec.write_text("") + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + emit_reward( + predicted_db=tmp_path / "pred.duckdb", + gold_db=tmp_path / "gold.duckdb", + eval_spec=empty_spec, + reward_out=reward_out, + ) + assert json.loads(reward_out.read_text()) == {"reward": 0.0} + + +def test_spider2_dbt_verify_cli_wrong_func_scores_zero_not_one(tmp_path): + # A schema-drifted spec (non-duckdb_match func) over a matching DB pair must + # also fail closed to 0.0, not crash and not pass. + _build_db(tmp_path / "pred.duckdb", [1, 2]) + _build_db(tmp_path / "gold.duckdb", [1, 2]) + bad_spec = tmp_path / "spider2_eval.jsonl" + bad_spec.write_text( + json.dumps( + { + "instance_id": "x", + "evaluation": { + "func": "string_match", + "parameters": {"gold": "gold.duckdb", "condition_tabs": ["orders"]}, + }, + } + ) + + "\n" + ) + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + emit_reward( + predicted_db=tmp_path / "pred.duckdb", + gold_db=tmp_path / "gold.duckdb", + eval_spec=bad_spec, + reward_out=reward_out, + ) + assert json.loads(reward_out.read_text()) == {"reward": 0.0} + + +def test_spider2_dbt_verify_view_carries_verifier_assets(tmp_path): + view = materialize_spider2_harbor_task_view( + source_task_dir=_SOURCE, + view_root=tmp_path / "views", + task_slug="spider2-fixture-001", + ) + tests = view / "tests" + # comparator + cli + spec modules and gold data are present for the verifier + for name in ( + "duckdb_match.py", + "eval_spec.py", + "verify.py", + "gold.duckdb", + "spider2_eval.jsonl", + "test.sh", + ): + assert (tests / name).is_file(), f"missing verifier asset: {name}" + # test.sh is executable + assert (tests / "test.sh").stat().st_mode & 0o111 + # leakage-clean: no `gold/` path segment survived in the agent-facing view + assert not (view / "tests" / "gold").exists() + assert not list(view.rglob("gold/*")) + + +def _write_gold_source( + source: Path, *, gold_basename: str, with_default_gold: bool +) -> Path: + # A minimal spider2-dbt source task whose gold line names a NON-default gold + # DB (e.g. playbook.duckdb), with NO gold.duckdb present. Mirrors a real + # Spider2 task that names its gold per task. + (source / "environment").mkdir(parents=True) + (source / "dbt_project" / "models").mkdir(parents=True) + (source / "task.toml").write_text( + "\n".join( + [ + 'schema_version = "1.0"', + "[environment]", + 'os = "linux"', + "cpus = 1", + "memory_mb = 1024", + "storage_mb = 1024", + "", + ] + ) + ) + (source / "instruction.md").write_text("Fix the dbt project.\n") + (source / "dbt_project" / "dbt_project.yml").write_text( + "name: example\nprofile: example\n" + ) + (source / "dbt_project" / "models" / "example.sql").write_text("select 1\n") + (source / "environment" / "Dockerfile").write_text( + "FROM python:3.12\nWORKDIR /app\nCMD [\"bash\"]\n" + ) + gold_dir = source / "tests" / "gold" + gold_dir.mkdir(parents=True) + # The NAMED gold DB (the one the spec points at). + _build_db(gold_dir / gold_basename, [1, 2]) + if with_default_gold: + _build_db(gold_dir / "gold.duckdb", [9, 9]) # a decoy default + (gold_dir / "spider2_eval.jsonl").write_text( + json.dumps( + { + "instance_id": "playbook001", + "evaluation": { + "func": "duckdb_match", + "parameters": { + "gold": gold_basename, + "condition_tabs": ["orders"], + "condition_cols": [[]], + "ignore_orders": [True], + }, + }, + } + ) + + "\n" + ) + return source + + +def test_spider2_dbt_verify_view_uses_non_default_gold_basename(tmp_path): + # REGRESSION (cycle-3): the gold DB basename is driven by parameters.gold, + # NOT hardcoded gold.duckdb. A task naming playbook.duckdb (and shipping NO + # gold.duckdb) must copy playbook.duckdb into the verifier-only tests/ and + # emit --gold-db /tests/playbook.duckdb. + source = _write_gold_source( + tmp_path / "source", gold_basename="playbook.duckdb", with_default_gold=False + ) + view = materialize_spider2_harbor_task_view( + source_task_dir=source, + view_root=tmp_path / "views", + task_slug="playbook001", + ) + tests = view / "tests" + # The NAMED gold file is copied; the hardcoded gold.duckdb is NOT invented. + assert (tests / "playbook.duckdb").is_file() + assert not (tests / "gold.duckdb").exists() + # test.sh scores against the named file. + test_sh = (tests / "test.sh").read_text() + assert "--gold-db /tests/playbook.duckdb" in test_sh + assert "/tests/gold.duckdb" not in test_sh + # leakage-clean: no gold/ segment survives. + assert not (view / "tests" / "gold").exists() + assert not list(view.rglob("gold/*")) + + +def test_spider2_dbt_verify_view_named_gold_actually_scores(tmp_path): + # End-to-end: a predicted DB matching the NAMED gold scores 1.0 through the + # emitted verifier, proving the named file (not a missing/decoy gold.duckdb) + # is the one actually scored. + source = _write_gold_source( + tmp_path / "source", gold_basename="playbook.duckdb", with_default_gold=False + ) + view = materialize_spider2_harbor_task_view( + source_task_dir=source, + view_root=tmp_path / "views", + task_slug="playbook001", + ) + tests = view / "tests" + predicted = tmp_path / "pred.duckdb" + _build_db(predicted, [2, 1]) # reordered; ignore_orders -> match + reward_out = tmp_path / "logs" / "verifier" / "reward.json" + emit_reward( + predicted_db=predicted, + gold_db=tests / "playbook.duckdb", + eval_spec=tests / "spider2_eval.jsonl", + reward_out=reward_out, + ) + assert json.loads(reward_out.read_text()) == {"reward": 1.0} + + +def test_spider2_dbt_verify_view_missing_named_gold_fails_closed(tmp_path): + # Fail closed: if the spec names a gold DB that does NOT exist under + # tests/gold/, materialization must raise rather than silently emit a + # verifier that scores against a missing file. + source = _write_gold_source( + tmp_path / "source", gold_basename="playbook.duckdb", with_default_gold=False + ) + # Delete the named gold so the spec points at a missing file. + (source / "tests" / "gold" / "playbook.duckdb").unlink() + import pytest + + with pytest.raises((FileNotFoundError, ValueError)): + materialize_spider2_harbor_task_view( + source_task_dir=source, + view_root=tmp_path / "views", + task_slug="playbook001", + ) + + +def test_spider2_dbt_verify_view_missing_gold_dir_fails_closed(tmp_path): + # Every spider2-dbt task is duckdb_match-scored, so a source with NO + # tests/gold/spider2_eval.jsonl must be a hard materialization error — NOT a + # silent pass-through that leaves the source test.sh (e.g. a stub `exit 0`) + # in place and yields an unscored / trivially-passing task. + import shutil as _shutil + + import pytest + + source = _write_gold_source( + tmp_path / "source", gold_basename="g.duckdb", with_default_gold=False + ) + _shutil.rmtree(source / "tests" / "gold") + with pytest.raises(FileNotFoundError): + materialize_spider2_harbor_task_view( + source_task_dir=source, + view_root=tmp_path / "views", + task_slug="playbook001", + ) + + +def test_spider2_dbt_verify_test_sh_uses_resolved_db_name(tmp_path): + # RIDER: the emitted test.sh predicted-DB path must come from + # resolve_spider2_db_name, NOT a hardcoded /app/spider2.duckdb. This + # fixture ships no profiles.yml / *.duckdb, so the resolver falls back to + # the task slug -> the predicted path is /app/.duckdb (a NON-spider2 + # db name), proving the resolver is consumed. + view = materialize_spider2_harbor_task_view( + source_task_dir=_SOURCE, + view_root=tmp_path / "views", + task_slug="not-spider2-slug", + ) + test_sh = (view / "tests" / "test.sh").read_text() + assert "/app/not-spider2-slug.duckdb" in test_sh + assert "/app/spider2.duckdb" not in test_sh + + +def test_spider2_dbt_verify_test_sh_quotes_predicted_db_against_injection(tmp_path): + # SECURITY (cycle-5 symmetric): db_name is resolved from the task's + # profiles.yml `path:` (external input). A path carrying shell metacharacters + # must NOT be emitted raw into the verifier test.sh --predicted-db arg, or it + # executes during verification. shlex.quote at the emission point neutralizes + # it (the same class the gold allowlist blocks for --gold-db). + import shlex + + source = _write_gold_source( + tmp_path / "source", gold_basename="g.duckdb", with_default_gold=False + ) + (source / "dbt_project" / "profiles.yml").write_text( + "\n".join( + [ + "example:", + " outputs:", + " dev:", + " type: duckdb", + " path: evil$(touch pwned).duckdb", + " target: dev", + "", + ] + ) + ) + view = materialize_spider2_harbor_task_view( + source_task_dir=source, + view_root=tmp_path / "views", + task_slug="playbook001", + ) + test_sh = (view / "tests" / "test.sh").read_text() + # The metacharacter path is present only in its safely-quoted form. + assert shlex.quote("/app/evil$(touch pwned).duckdb") in test_sh + assert "--predicted-db /app/evil$(touch pwned).duckdb" not in test_sh diff --git a/tests/unit/test_spider2_dbt_verify_comparator.py b/tests/unit/test_spider2_dbt_verify_comparator.py new file mode 100644 index 00000000..1d83a828 --- /dev/null +++ b/tests/unit/test_spider2_dbt_verify_comparator.py @@ -0,0 +1,507 @@ +# ABOUTME: spider2-dbt duckdb_match comparator + eval-spec unit tests (AC-1/AC-2). +# ABOUTME: Faithful to Spider2 eval_utils: column-containment, 1e-2 tolerance, per-table lists. +import json +from pathlib import Path + +import duckdb +import pytest + +from razorback.benchmarks.spider2_dbt.duckdb_match import compare_duckdb +from razorback.benchmarks.spider2_dbt.eval_spec import EvalSpec, load_eval_spec + + +# -------------------------------------------------------------------------- +# eval-spec loader (real Spider2 evaluation.parameters shape) +# -------------------------------------------------------------------------- + + +def test_spider2_dbt_verify_loads_eval_spec_real_shape(tmp_path: Path): + # Real Spider2 gold line: instance_id + evaluation.parameters with + # per-table List[List[int]] condition_cols and List[bool] ignore_orders. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "task-001", + "evaluation": { + "func": "duckdb_match", + "parameters": { + "gold": "gold.duckdb", + "condition_tabs": ["orders", "customers"], + "condition_cols": [[0, 2], []], + "ignore_orders": [True, False], + }, + }, + } + ) + + "\n" + ) + spec = load_eval_spec(spec_path) + assert spec == EvalSpec( + gold="gold.duckdb", + condition_tabs=["orders", "customers"], + condition_cols=[[0, 2], []], + ignore_orders=[True, False], + ) + + +def test_spider2_dbt_verify_load_eval_spec_parses_gold_basename(tmp_path: Path): + # Real Spider2 tasks name the gold DB per task (playbook.duckdb, tpch.duckdb, + # ...). load_eval_spec must carry parameters.gold through so the verifier + # scores against the NAMED file, not a hardcoded gold.duckdb. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "playbook001", + "evaluation": { + "func": "duckdb_match", + "parameters": { + "gold": "playbook.duckdb", + "condition_tabs": ["orders"], + }, + }, + } + ) + + "\n" + ) + spec = load_eval_spec(spec_path) + assert spec.gold == "playbook.duckdb" + + +def test_spider2_dbt_verify_load_eval_spec_missing_gold_in_wrapped_spec_raises( + tmp_path: Path, +): + # A real wrapped (evaluation.func == duckdb_match) gold line MUST name its + # gold DB. A missing/empty parameters.gold is a schema drift the verifier + # cannot score against the right file -> fail closed. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "t", + "evaluation": { + "func": "duckdb_match", + "parameters": {"condition_tabs": ["orders"]}, + }, + } + ) + + "\n" + ) + with pytest.raises(ValueError): + load_eval_spec(spec_path) + + +def test_spider2_dbt_verify_eval_spec_defaults(tmp_path: Path): + # Missing condition_cols/ignore_orders default per-table to []/False + # (mirrors Spider2 duckdb_match's None-handling). + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "evaluation": { + "func": "duckdb_match", + "parameters": {"gold": "g.duckdb", "condition_tabs": ["t1", "t2"]}, + } + } + ) + + "\n" + ) + spec = load_eval_spec(spec_path) + assert spec.condition_tabs == ["t1", "t2"] + assert spec.condition_cols == [[], []] + assert spec.ignore_orders == [False, False] + + +def test_spider2_dbt_verify_eval_spec_rejects_mismatched_lengths(): + with pytest.raises(ValueError): + EvalSpec(condition_tabs=["a", "b"], condition_cols=[[0]], ignore_orders=[True]) + + +# -------------------------------------------------------------------------- +# Fail-closed: a zero-table / malformed gold spec MUST NOT score as a match +# (cycle-2 B1). compare_duckdb returns True on an empty condition_tabs loop, +# so an empty spec would silently award 1.0 -> reject it at load/construct time. +# -------------------------------------------------------------------------- + + +def test_spider2_dbt_verify_eval_spec_rejects_empty_condition_tabs(): + # A zero-table spec is the fail-open hazard: compare_duckdb's AND-loop never + # runs and returns True. EvalSpec must refuse to construct one. + with pytest.raises(ValueError): + EvalSpec(condition_tabs=[]) + + +def test_spider2_dbt_verify_load_eval_spec_empty_file_raises(tmp_path: Path): + # An empty / truncated spider2_eval.jsonl must NOT yield a zero-table spec. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text("") + with pytest.raises(ValueError): + load_eval_spec(spec_path) + + +def test_spider2_dbt_verify_load_eval_spec_missing_condition_tabs_raises(tmp_path: Path): + # evaluation.parameters present but no condition_tabs -> fail closed. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "t", + "evaluation": {"func": "duckdb_match", "parameters": {"gold": "g.duckdb"}}, + } + ) + + "\n" + ) + with pytest.raises(ValueError): + load_eval_spec(spec_path) + + +def test_spider2_dbt_verify_load_eval_spec_wrong_func_raises(tmp_path: Path): + # A non-duckdb_match evaluation func is a schema drift we must not score. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "t", + "evaluation": { + "func": "string_match", + "parameters": {"gold": "g.duckdb", "condition_tabs": ["orders"]}, + }, + } + ) + + "\n" + ) + with pytest.raises(ValueError): + load_eval_spec(spec_path) + + +def test_spider2_dbt_verify_load_eval_spec_missing_evaluation_raises(tmp_path: Path): + # A line with no evaluation wrapper and no condition_tabs must fail closed + # rather than fall through to a zero-table (match-everything) spec. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text(json.dumps({"instance_id": "t"}) + "\n") + with pytest.raises(ValueError): + load_eval_spec(spec_path) + + +@pytest.mark.parametrize( + "bad_gold", + [ + # path traversal / separators / absolute + "../dbt_project/foo.duckdb", + "/etc/passwd", + "sub/dir/g.duckdb", + "..", + ".", + # shell injection into the unquoted test.sh --gold-db arg + "x.duckdb; echo '{\"reward\":1.0}' > /logs/verifier/reward.json #", + "g.duckdb $(id)", + "g .duckdb", + "g.sqlite", # wrong suffix + ], +) +def test_spider2_dbt_verify_load_eval_spec_rejects_non_basename_gold( + tmp_path: Path, bad_gold: str +): + # `gold` is external Spider2 input the materializer joins onto a path and + # emits UNQUOTED into test.sh. A `..`/absolute/separator value escapes + # tests/gold/; a shell-metacharacter/whitespace value injects into the + # verifier script. The conservative allowlist must fail both closed at parse. + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "t", + "evaluation": { + "func": "duckdb_match", + "parameters": {"gold": bad_gold, "condition_tabs": ["orders"]}, + }, + } + ) + + "\n" + ) + with pytest.raises(ValueError): + load_eval_spec(spec_path) + + +# -------------------------------------------------------------------------- +# comparator fixtures +# -------------------------------------------------------------------------- + + +def _build_db(path, tables: dict): + """tables: name -> (column_names, column_types, rows). Builds a .duckdb.""" + con = duckdb.connect(str(path)) + try: + for name, (cols, types, rows) in tables.items(): + col_defs = ", ".join(f"{c} {t}" for c, t in zip(cols, types)) + con.execute(f"CREATE TABLE {name} ({col_defs})") + if rows: + placeholders = ", ".join( + ["(" + ", ".join(["?"] * len(cols)) + ")"] * len(rows) + ) + flat = [v for row in rows for v in row] + con.execute(f"INSERT INTO {name} VALUES {placeholders}", flat) + finally: + con.close() + + +def _ints(names, rows): + return (names, ["INTEGER"] * len(names), rows) + + +def _compare(tmp_path, *, pred, gold, spec): + _build_db(tmp_path / "pred.duckdb", pred) + _build_db(tmp_path / "gold.duckdb", gold) + return compare_duckdb( + predicted_db=tmp_path / "pred.duckdb", + gold_db=tmp_path / "gold.duckdb", + spec=spec, + ) + + +# -------------------------------------------------------------------------- +# AC-1: 1.0 on match, 0.0 on mismatch +# -------------------------------------------------------------------------- + + +def test_spider2_dbt_verify_matching_db_scores_true(tmp_path): + tables = {"orders": _ints(["a", "b"], [(1, 2), (3, 4)])} + spec = EvalSpec(condition_tabs=["orders"]) + assert _compare(tmp_path, pred=tables, gold=tables, spec=spec) is True + + +def test_spider2_dbt_verify_mismatched_db_scores_false(tmp_path): + spec = EvalSpec(condition_tabs=["orders"]) + assert ( + _compare( + tmp_path, + pred={"orders": _ints(["a", "b"], [(1, 2), (9, 9)])}, + gold={"orders": _ints(["a", "b"], [(1, 2), (3, 4)])}, + spec=spec, + ) + is False + ) + + +def test_spider2_dbt_verify_all_tables_must_match(tmp_path): + # One table matches, the other differs -> overall False (AND across tables). + spec = EvalSpec(condition_tabs=["t1", "t2"]) + assert ( + _compare( + tmp_path, + pred={"t1": _ints(["a"], [(1,)]), "t2": _ints(["a"], [(2,)])}, + gold={"t1": _ints(["a"], [(1,)]), "t2": _ints(["a"], [(99,)])}, + spec=spec, + ) + is False + ) + + +def test_spider2_dbt_verify_missing_predicted_table_scores_false(tmp_path): + # Spider2 wraps the predicted-table fetch in try/except -> 0 on a + # missing/unreadable table, not a crash. + spec = EvalSpec(condition_tabs=["orders"]) + assert ( + _compare( + tmp_path, + pred={"other": _ints(["a"], [(1,)])}, + gold={"orders": _ints(["a"], [(1,)])}, + spec=spec, + ) + is False + ) + + +# -------------------------------------------------------------------------- +# AC-2 + reviewer's verdict-flipping cases +# -------------------------------------------------------------------------- + + +def test_spider2_dbt_verify_float_within_tolerance_matches(tmp_path): + # Reviewer case: a numeric column-vector within math.isclose(abs_tol=1e-2) + # must still match. Row-tuple exact-== would have flipped this to a 0. + spec = EvalSpec(condition_tabs=["m"]) + assert ( + _compare( + tmp_path, + pred={"m": (["v"], ["DOUBLE"], [(1.005,), (2.0,)])}, + gold={"m": (["v"], ["DOUBLE"], [(1.0,), (2.0,)])}, + spec=spec, + ) + is True + ) + + +def test_spider2_dbt_verify_float_beyond_tolerance_mismatches(tmp_path): + # Just outside 1e-2 -> mismatch (proves the tolerance actually bounds). + spec = EvalSpec(condition_tabs=["m"]) + assert ( + _compare( + tmp_path, + pred={"m": (["v"], ["DOUBLE"], [(1.5,)])}, + gold={"m": (["v"], ["DOUBLE"], [(1.0,)])}, + spec=spec, + ) + is False + ) + + +def test_spider2_dbt_verify_column_reorder_matches_by_containment(tmp_path): + # Reviewer case: pred has the SAME column data in a DIFFERENT column order. + # Column-containment matches every gold column to SOME pred column, so a + # reordered pred is still a match. Positional row-tuple == would have failed. + spec = EvalSpec(condition_tabs=["t"]) + assert ( + _compare( + tmp_path, + pred={"t": _ints(["b", "a"], [(10, 1), (20, 2)])}, + gold={"t": _ints(["a", "b"], [(1, 10), (2, 20)])}, + spec=spec, + ) + is True + ) + + +def test_spider2_dbt_verify_extra_pred_columns_tolerated(tmp_path): + # Reviewer case: pred carries an EXTRA column not present in gold. Since + # only gold's columns must be contained in pred, the extra is tolerated. + spec = EvalSpec(condition_tabs=["t"]) + assert ( + _compare( + tmp_path, + pred={"t": _ints(["a", "extra"], [(1, 999), (2, 888)])}, + gold={"t": _ints(["a"], [(1,), (2,)])}, + spec=spec, + ) + is True + ) + + +def test_spider2_dbt_verify_condition_cols_restricts_gold(tmp_path): + # condition_cols=[[0]] restricts gold to column 0 only, so a difference in + # gold column 1 is NOT compared -> still a match. + spec = EvalSpec(condition_tabs=["orders"], condition_cols=[[0]]) + assert ( + _compare( + tmp_path, + pred={"orders": _ints(["a"], [(1,), (2,)])}, + gold={"orders": _ints(["a", "b"], [(1, 7), (2, 8)])}, + spec=spec, + ) + is True + ) + + +def test_spider2_dbt_verify_condition_col_diff_detected(tmp_path): + # condition_cols=[[0,1]] -> gold column 1 IS compared; its values are not + # contained in any pred column -> mismatch. Proves subsetting restricts + # without dropping everything. + spec = EvalSpec(condition_tabs=["orders"], condition_cols=[[0, 1]]) + assert ( + _compare( + tmp_path, + pred={"orders": _ints(["a", "b"], [(1, 100), (2, 200)])}, + gold={"orders": _ints(["a", "b"], [(1, 7), (2, 8)])}, + spec=spec, + ) + is False + ) + + +def test_spider2_dbt_verify_row_reorder_matches_when_ignore_orders(tmp_path): + # Per-column ignore_order sorts each column-vector before compare. + spec = EvalSpec(condition_tabs=["orders"], ignore_orders=[True]) + assert ( + _compare( + tmp_path, + pred={"orders": _ints(["a"], [(2,), (1,)])}, + gold={"orders": _ints(["a"], [(1,), (2,)])}, + spec=spec, + ) + is True + ) + + +def test_spider2_dbt_verify_row_reorder_mismatches_when_ordered(tmp_path): + # ignore_orders=False -> per-column positional compare; reordered -> mismatch. + spec = EvalSpec(condition_tabs=["orders"], ignore_orders=[False]) + assert ( + _compare( + tmp_path, + pred={"orders": _ints(["a"], [(2,), (1,)])}, + gold={"orders": _ints(["a"], [(1,), (2,)])}, + spec=spec, + ) + is False + ) + + +def test_spider2_dbt_verify_multi_table_gold_line_mismatch(tmp_path): + # Real multi-table gold line: table 1 matches, table 2's condition column + # genuinely differs -> overall 0.0 (proves the per-table lists are honored, + # not a single flat dict + single bool). + spec_path = tmp_path / "spider2_eval.jsonl" + spec_path.write_text( + json.dumps( + { + "instance_id": "multi-001", + "evaluation": { + "func": "duckdb_match", + "parameters": { + "gold": "gold.duckdb", + "condition_tabs": ["orders", "customers"], + "condition_cols": [[0], [0]], + "ignore_orders": [True, False], + }, + }, + } + ) + + "\n" + ) + spec = load_eval_spec(spec_path) + pred = { + "orders": _ints(["id"], [(2,), (1,)]), # reordered, ignore_orders[0]=True -> ok + "customers": _ints(["id"], [(7,)]), # differs from gold -> table 2 fails + } + gold = { + "orders": _ints(["id"], [(1,), (2,)]), + "customers": _ints(["id"], [(99,)]), + } + assert _compare(tmp_path, pred=pred, gold=gold, spec=spec) is False + + +def test_spider2_dbt_verify_decimal_within_tolerance_matches(tmp_path): + # DuckDB returns decimal.Decimal for DECIMAL/NUMERIC columns; Spider2's pandas + # path converted those to float64 before math.isclose(abs_tol=1e-2). The + # comparator must apply the SAME tolerance to Decimals -> within-tol == match. + spec = EvalSpec(condition_tabs=["m"]) + pred = {"m": (["v"], ["DECIMAL(10,3)"], [(1.005,), (2.000,)])} + gold = {"m": (["v"], ["DECIMAL(10,3)"], [(1.000,), (2.004,)])} + assert _compare(tmp_path, pred=pred, gold=gold, spec=spec) is True + + +def test_spider2_dbt_verify_decimal_beyond_tolerance_mismatches(tmp_path): + # Same DECIMAL path, difference > 1e-2 -> mismatch (proves it's the tolerance + # compare, not a string/exact-equality fallback). + spec = EvalSpec(condition_tabs=["m"]) + pred = {"m": (["v"], ["DECIMAL(10,3)"], [(1.050,)])} + gold = {"m": (["v"], ["DECIMAL(10,3)"], [(1.000,)])} + assert _compare(tmp_path, pred=pred, gold=gold, spec=spec) is False + + +def test_spider2_dbt_verify_condition_tabs_cannot_sql_inject(tmp_path): + # SECURITY: condition_tabs comes from the external eval spec. A name like + # `realt"; select 999 ...; --` breaks out of the quoted identifier; DuckDB + # runs the injected statement and returns ITS rows, so pre-fix both gold and + # pred fetch identical attacker rows -> a forced reward 1.0 over genuinely + # mismatched DBs. The identifier-quoting (doubled ") must keep it a single + # bogus identifier that fails to resolve -> fetch raises -> NEVER a match. + inject = 'realt"; select 999 AS a; --' + spec = EvalSpec(condition_tabs=[inject]) + # Real `realt` tables present (the breakout prefix) but with DIFFERENT data: + # a faithful compare is a mismatch, and the injection must not flip it to 1.0. + pred = {"realt": _ints(["a"], [(1,)])} + gold = {"realt": _ints(["a"], [(2,)])} + with pytest.raises(Exception): + _compare(tmp_path, pred=pred, gold=gold, spec=spec) diff --git a/tests/unit/test_translate_spider2_dbt.py b/tests/unit/test_translate_spider2_dbt.py index 60be44ea..10c50205 100644 --- a/tests/unit/test_translate_spider2_dbt.py +++ b/tests/unit/test_translate_spider2_dbt.py @@ -88,17 +88,43 @@ def test_spider2_dataset_requires_tasks_root(tmp_path, monkeypatch): _LEAKAGE_TERMS = ("gold", "expected", "golden") +def _is_verifier_only(rel: Path) -> bool: + """True for the verify-time-only `tests/` subtree. + + Harbor uploads the task's `tests/` dir into the container ONLY at verify + time and wipes/recreates it around the agent run, so it is never part of + the agent-visible view: + - harbor/verifier/verifier.py:133-138 uploads `tests/` inside `verify()`. + - harbor/trial/trial.py `_verify_step` first `reset_dirs(remove=[tests_dir, + verifier_dir])` then re-uploads — the agent step never sees `/tests`. + The spider2-dbt materializer deliberately stages the gold .duckdb + eval + spec + comparator there (`_ensure_verifier_assets`); those are the verifier's + answer key, NOT agent-visible leakage. Scoping the scan to the agent-visible + portion (everything OUTSIDE `tests/`) keeps real leakage protection intact: + any answer data that survives into an agent-reachable path still trips. + """ + return rel.parts[:1] == ("tests",) + + def _leakage_hits(view: Path) -> list[Path]: - """Pure-Python reimplementation of the rider's `rg -l 'gold|expected|golden'`. + """Pure-Python reimplementation of the rider's `rg -l 'gold|expected|golden'`, + scoped to the AGENT-VISIBLE portion of the view. Walks the view dir and reports any file whose NAME or CONTENT matches the case-sensitive alternation `gold|expected|golden` — matching the rider's - unescaped `rg -l` semantics. `view_manifest.json` is the materializer's - provenance record: it lists the deny globs (`tests/expected/**`, - `**/gold/**`, `**/golden/**`) and the checksums of every excluded source - file by design — an audit trail, NOT leaked answer content. So the scan - excludes the manifest; a hit on any other file means real answer data - survived the deny globs. + unescaped `rg -l` semantics. Two exclusions, both principled: + + - `view_manifest.json` is the materializer's provenance record: it lists the + deny globs (`tests/expected/**`, `**/gold/**`, `**/golden/**`) and the + checksums of every excluded source file by design — an audit trail, NOT + leaked answer content. + - The verify-time-only `tests/` subtree (see `_is_verifier_only`): Harbor + uploads it to the container only at verify time and removes it during the + agent run, so the gold .duckdb / eval spec / comparator the verifier needs + there are never agent-visible. + + A hit on any other file means real answer data survived into an + agent-reachable path. """ hits: list[Path] = [] for path in sorted(view.rglob("*")): @@ -106,6 +132,8 @@ def _leakage_hits(view: Path) -> list[Path]: continue if path.name == "view_manifest.json": continue + if _is_verifier_only(path.relative_to(view)): + continue if any(term in path.name for term in _LEAKAGE_TERMS): hits.append(path) continue @@ -185,6 +213,25 @@ def test_planted_forbidden_files_are_excluded_from_view(tmp_path, monkeypatch): assert hits == [], f"planted leakage survived into {view}: {hits}" +def test_leakage_scan_still_fires_on_agent_visible_answer_content(tmp_path): + """Guard: the `tests/`-scoped exclusion must NOT blanket-disable the scan. + + Plant `gold`-named answer content BOTH in an agent-visible path (view root) + AND under the verify-time-only `tests/` subtree, then assert `_leakage_hits` + reports the agent-visible one and ignores the verifier-only one. This pins + the B1 fix to scoping (exclude only `tests/`), not weakening the scanner. + """ + view = tmp_path / "view" + (view / "models").mkdir(parents=True) + (view / "models" / "answer_gold.sql").write_text("select 'leak';\n") + (view / "tests").mkdir() + (view / "tests" / "gold.duckdb").write_bytes(b"verifier-only gold answer key") + + hits = _leakage_hits(view) + assert view / "models" / "answer_gold.sql" in hits, "agent-visible leak missed" + assert view / "tests" / "gold.duckdb" not in hits, "verifier-only asset flagged" + + def test_exclude_tasks_drops_spider2_source_slug(tmp_path, monkeypatch): sources = sorted(FIXTURE_ROOT.glob("spider2-fixture-*")) assert len(sources) >= 2, "need >1 fixture instance to prove exclusion keeps the other"