Skip to content

Commit f74dc9f

Browse files
authored
Merge pull request #15 from spacedock-dev/spacedock-ensign/spider2-dbt-duckdb-match-verifier
spider2-dbt — duckdb_match verifier emitting binary reward.json
2 parents 9c39af2 + 5cdcea3 commit f74dc9f

16 files changed

Lines changed: 2093 additions & 7 deletions

File tree

docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md

Lines changed: 291 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
1+
# Validation report — spider2-dbt duckdb_match verifier (cycles 1–10, final)
2+
3+
- Entity: `docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md`
4+
- Branch: `spacedock-ensign/spider2-dbt-duckdb-match-verifier`
5+
- Validated HEAD: `965f2ea` (includes cycles 1–10)
6+
- Merge-base with `main`: `9c39af2`
7+
- Method: independent validation of the current worktree HEAD — fresh suite
8+
runs from the clean branch checkout, standalone behavioral probes (not a
9+
re-read of prior reports), and `superpowers:requesting-code-review`.
10+
11+
## Acceptance criteria — PASS/FAIL with command + output
12+
13+
### AC-1 — comparator scores 1.0 on a matching DB, 0.0 on a mismatch — PASS
14+
15+
Verified by the comparator unit suite and a standalone live probe driving
16+
`compare_duckdb` over in-test DuckDB fixtures.
17+
18+
```
19+
$ uv run pytest -k spider2_dbt_verify --ignore=tests/unit/test_task_identity_scoring.py -q
20+
49 passed, 795 deselected in 2.84s
21+
```
22+
23+
Standalone probe drives `compare_duckdb` directly: matching DB → True,
24+
mismatched DB → False; missing predicted table → False; multi-table AND with one
25+
table mismatched → overall False. All PASS.
26+
27+
### AC-2 — column subsetting + ignore_orders honor duckdb_match semantics — PASS
28+
29+
Same probe, exercised live:
30+
31+
- `ignore_orders=True` over a row-reordered table → match (True); same data with
32+
`ignore_orders=False` → mismatch (False).
33+
- A difference in a column NOT in `condition_cols` → still match (True); the same
34+
difference WITH that column in `condition_cols` → mismatch (False).
35+
- Column-containment: pred columns reordered + an extra pred column → still match.
36+
- `math.isclose(abs_tol=1e-2)`: within 1e-2 → match, beyond → mismatch.
37+
- DECIMAL(10,3) within 1e-2 → match (cycle-9 `_normalize` Decimal→float).
38+
- NULL==NULL → match (NA==NA).
39+
40+
All 11 probe cases PASS. Independently re-confirmed against the upstream
41+
`xlang-ai/Spider2` `eval_utils.duckdb_match`/`compare_pandas_table` by the code
42+
reviewer (fetched live during review) — line-accurate port, including the
43+
`(x is None, str(x), isinstance(x,(int,float)))` sort key.
44+
45+
### AC-3 — emitted test.sh writes a Harbor-shaped reward.json — PASS
46+
47+
Verified by the integration suite and a standalone `emit_reward` probe:
48+
49+
- Matching pred/gold → reward.json parses to `{"reward": 1.0}`,
50+
`set(payload) == {"reward"}`, value is a `float`.
51+
- Mismatch → `{"reward": 0.0}`.
52+
- `emit_reward` never crashes-into-pass: garbage (non-JSON) spec over MATCHING
53+
DBs → `{"reward": 0.0}`; empty-`condition_tabs` wrapped spec over matching DBs
54+
`{"reward": 0.0}` (the would-be-1.0 fail-open is closed); missing predicted
55+
DB → `{"reward": 0.0}`.
56+
57+
Integration test (`test_spider2_dbt_verify_test_sh.py`) executes the emitted
58+
`test.sh` end-to-end and asserts the reward.json shape — green within the
59+
`spider2_dbt_verify` 49-passed run above.
60+
61+
## Stage-checklist verification of the dispatch focus areas
62+
63+
- **Comparator faithfulness** — PASS. Column-containment, isclose 1e-2, per-column
64+
sort, condition_cols restriction, multi-table AND, missing-table → 0, NA==NA,
65+
DECIMAL within tolerance all confirmed live (probe) + reviewer's line-by-line
66+
oracle comparison.
67+
- **duckdb-only dependency** — PASS.
68+
`grep -rn "import pandas\|from pandas\|import numpy\|from numpy"
69+
src/razorback/benchmarks/spider2_dbt/` → no matches. Only `duckdb` + stdlib are
70+
imported by the comparator/verifier.
71+
- **Fail-closed guards** — PASS. Live probe: empty file, wrong `evaluation.func`,
72+
empty/missing `condition_tabs`, and a wrapped spec missing `parameters.gold`
73+
each RAISE; the gold-basename allowlist rejects all traversal/metachar cases
74+
(`../…`, `/etc/passwd`, `sub/dir/g.duckdb`, `..`, `.`, `x.duckdb; … #`,
75+
`g.duckdb $(id)`, `g .duckdb`, `g.sqlite`). Missing `tests/gold/`
76+
`_ensure_verifier_assets` raises `FileNotFoundError`.
77+
- **Shell/SQL injection sealed** — PASS. `harbor_view.py` `shlex.quote`s BOTH
78+
`--predicted-db` and `--gold-db` at the single emission point;
79+
`condition_tabs` is identifier-quoted (doubled `"`) in `_fetch_columns`, and a
80+
breakout value (`realt"; select 999 AS a; --`) over genuinely-mismatched DBs
81+
RAISES (cannot force a match) — confirmed live.
82+
- **Link-mode symlink-write-through** — PASS. All 5 verifier-asset copies route
83+
through `_copy_into_view` (unlink-symlink-then-copy); test.sh, Dockerfile, and
84+
preflight writes carry the same guard. Regression
85+
`..._never_mutate_colliding_source_file` is green.
86+
- **AC-3 reward.json shape / never-crash-into-pass** — PASS (see AC-3).
87+
88+
## Suite results (clean branch checkout)
89+
90+
```
91+
$ uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py -q
92+
93 passed, 751 deselected in 5.34s
93+
94+
$ uv run pytest --ignore=tests/unit/test_task_identity_scoring.py -q
95+
4 failed, 828 passed, 12 skipped, 80 warnings in 68.05s
96+
```
97+
98+
The 4 full-suite failures are pre-existing on the merge-base `9c39af2`
99+
(verified by running exactly those 4 node-ids in a throwaway base worktree →
100+
`4 failed`) and live in files UNTOUCHED by this branch
101+
(`git diff 9c39af2..HEAD --name-only` is spider2_dbt scope + docs only).
102+
NOT regressions:
103+
104+
- `test_spacedock_solver_freeze_dir_mechanism.py::test_codex_runtime_dispatch_constructs_inner_agent`
105+
- `test_worktree_teardown_preserves_runs.py::test_worktree_remove_force_does_not_destroy_runs`
106+
- `test_generate_matrix_specs.py::test_matrix_specs_carry_query_mode_batch`
107+
- `test_rk_research_new.py::test_rk_research_new_creates_scaffold_tree`
108+
109+
## Code review findings (superpowers:requesting-code-review)
110+
111+
Reviewer dispatched against `9c39af2..965f2ea`; it independently fetched the
112+
upstream Spider2 oracle and ran the suite.
113+
114+
- **Critical:** none. The reviewer could not construct any external input
115+
(eval-spec JSON, gold basename, condition_tabs, task profile) that forces a
116+
false reward 1.0.
117+
- **Important:** none.
118+
- **Minor (non-blocking):**
119+
1. `eval_spec.py:73``condition_cols` `[[]]`/`[None]` is expanded to
120+
`[[]]*n` for any `n`; this is the *forgiving* direction and mirrors
121+
upstream `compare_multi_pandas_table` defaulting (stricter `duckdb_match`
122+
would assert). No correctness risk for well-formed specs; a clarifying
123+
comment was suggested.
124+
2. The leakage-scan scoping that excludes the verify-only `tests/` subtree
125+
rests on the Harbor "tests/ uploaded only at verify time, reset around the
126+
agent run" lifecycle; a version-pinned comment was suggested so a future
127+
Harbor bump can't silently invalidate it. The guard test proves the change
128+
is scoping not weakening.
129+
3. `harbor_view.py:163` `spec.gold or "gold.duckdb"` fallback is effectively
130+
dead for wrapped specs (they already raise without gold); reachable only
131+
for the unwrapped-fixture path. A clarifying comment was suggested.
132+
133+
All three are clarity/robustness polish, not correctness defects — classified
134+
non-blocking. No production code was changed during validation.
135+
136+
## Gate decision
137+
138+
**PASSED → done.**
139+
140+
Rationale: all three ACs reproduce green from a clean checkout with actual
141+
command output (gating suite 93 passed; acceptance `-k spider2_dbt_verify` 49
142+
passed), every dispatch-named focus area is confirmed by live behavioral probes
143+
(comparator faithfulness, duckdb-only, fail-closed guards, shell+SQL injection
144+
sealed, symlink write-through, reward.json shape), the only full-suite failures
145+
are pre-existing on the merge-base and outside this branch's scope, and the
146+
independent code review found zero blocking issues (3 minor polish notes). No
147+
external input forces a false reward 1.0.
Lines changed: 158 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,158 @@
1+
# ABOUTME: spider2-dbt comparator faithfully reproducing Spider2 eval_utils.duckdb_match.
2+
# ABOUTME: Column-containment over column-vectors, math.isclose(1e-2), AND across tables.
3+
# Depends ONLY on duckdb (the one library the verify-time image is guaranteed to
4+
# have — the build-time preflight already imports it); no pandas, so the emitted
5+
# verifier cannot crash-on-import in a task image that ships dbt-duckdb but not pandas.
6+
from __future__ import annotations
7+
8+
import math
9+
from decimal import Decimal
10+
from pathlib import Path
11+
12+
import duckdb
13+
14+
try:
15+
from razorback.benchmarks.spider2_dbt.eval_spec import EvalSpec
16+
except ModuleNotFoundError: # running flat from /tests in the verifier container
17+
from eval_spec import EvalSpec # type: ignore[no-redef]
18+
19+
# Spider2 eval_utils.compare_pandas_table numeric tolerance.
20+
_TOLERANCE = 1e-2
21+
22+
23+
def _isna(x) -> bool:
24+
"""stdlib stand-in for pandas.isna over scalar cells fetched from DuckDB.
25+
26+
DuckDB returns SQL NULL as Python ``None``; a float ``NaN`` is the only other
27+
NA-like scalar. Mirrors Spider2's ``pd.isna(...)`` NaN==NaN handling.
28+
"""
29+
return x is None or (isinstance(x, float) and math.isnan(x))
30+
31+
32+
def _vectors_match(v1, v2, *, tol: float = _TOLERANCE, ignore_order: bool = False) -> bool:
33+
"""Port of Spider2 eval_utils.compare_pandas_table.vectors_match.
34+
35+
Compares two column-vectors element-wise: NA==NA, numerics via
36+
math.isclose(abs_tol=tol), everything else by ``!=``. When ignore_order
37+
is set, both vectors are sorted first with Spider2's exact key.
38+
"""
39+
try:
40+
if ignore_order:
41+
key = lambda x: (x is None, str(x), isinstance(x, (int, float)))
42+
v1 = sorted(v1, key=key)
43+
v2 = sorted(v2, key=key)
44+
if len(v1) != len(v2):
45+
return False
46+
for a, b in zip(v1, v2):
47+
if _isna(a) and _isna(b):
48+
continue
49+
elif isinstance(a, (int, float)) and isinstance(b, (int, float)):
50+
if not math.isclose(float(a), float(b), abs_tol=tol):
51+
return False
52+
elif a != b:
53+
return False
54+
return True
55+
except Exception:
56+
return False
57+
58+
59+
def _compare_table(
60+
pred_cols: list[list],
61+
gold_cols: list[list],
62+
*,
63+
condition_cols: list[int],
64+
ignore_order: bool,
65+
) -> bool:
66+
"""Port of Spider2 eval_utils.compare_pandas_table.
67+
68+
Column-CONTAINMENT (not positional row equality): restrict gold to
69+
``condition_cols`` (all columns when empty), then require that EVERY gold
70+
column-vector matches SOME pred column-vector. Extra pred columns are
71+
therefore tolerated. Inputs are already transposed to column-vectors.
72+
"""
73+
if condition_cols:
74+
gold_vecs = [gold_cols[c] for c in condition_cols]
75+
else:
76+
gold_vecs = gold_cols
77+
78+
for gold_vec in gold_vecs:
79+
if not any(
80+
_vectors_match(gold_vec, pred_vec, ignore_order=ignore_order)
81+
for pred_vec in pred_cols
82+
):
83+
return False
84+
return True
85+
86+
87+
def _quote_ident(name: str) -> str:
88+
"""Quote a SQL identifier, doubling embedded double-quotes.
89+
90+
``condition_tabs`` table names come from the external eval spec, so a value
91+
like ``realt"; select 999; --`` would otherwise break out of the quoted
92+
identifier and DuckDB would execute the injected statement, letting a hostile
93+
spec rig gold/pred fetches to return identical rows and award reward 1.0.
94+
Doubling keeps the value a single (bogus) identifier; it then fails to
95+
resolve and the fetch raises -> emit_reward scores 0 (fail-closed).
96+
"""
97+
return '"' + name.replace('"', '""') + '"'
98+
99+
100+
def _fetch_columns(con: duckdb.DuckDBPyConnection, table: str) -> list[list]:
101+
"""``SELECT * FROM `table``` as a list of column-vectors (transposed rows).
102+
103+
Replaces Spider2's ``fetchdf().transpose().values.tolist()``: DuckDB's
104+
``.fetchall()`` yields native Python scalars in row order (NULL -> None),
105+
and ``.description`` fixes the column count so a zero-row table still
106+
yields one empty vector per column.
107+
"""
108+
cur = con.execute(f"SELECT * FROM {_quote_ident(table)}")
109+
rows = cur.fetchall()
110+
ncols = len(cur.description)
111+
return [[_normalize(row[j]) for row in rows] for j in range(ncols)]
112+
113+
114+
def _normalize(v):
115+
"""Mirror Spider2's pandas fetchdf() DECIMAL->float64 coercion.
116+
117+
DuckDB native fetchall() returns ``decimal.Decimal`` for DECIMAL/NUMERIC,
118+
but Decimal is ``numbers.Number`` yet NOT ``numbers.Real``, so it would skip
119+
``_vectors_match``'s tolerance branch and a within-1e-2 DECIMAL match would
120+
wrongly score 0. Normalize to float in one place so downstream compares see
121+
the same scalar types Spider2's pandas path produced.
122+
"""
123+
return float(v) if isinstance(v, Decimal) else v
124+
125+
126+
def compare_duckdb(*, predicted_db: Path, gold_db: Path, spec: EvalSpec) -> bool:
127+
"""Faithfully reproduce Spider2 eval_utils.duckdb_match.
128+
129+
For each table in ``spec.condition_tabs``: load gold + predicted via
130+
``SELECT *``, run column-containment with the table's
131+
``spec.condition_cols[i]`` / ``spec.ignore_orders[i]``. Overall match is
132+
the AND across all tables. A missing/unreadable predicted table is a
133+
mismatch (Spider2 wraps the predicted-table fetch in try/except -> 0).
134+
"""
135+
gold_con = duckdb.connect(str(gold_db), read_only=True)
136+
try:
137+
gold_tables = [_fetch_columns(gold_con, t) for t in spec.condition_tabs]
138+
finally:
139+
gold_con.close()
140+
141+
pred_con = duckdb.connect(str(predicted_db), read_only=True)
142+
try:
143+
try:
144+
pred_tables = [_fetch_columns(pred_con, t) for t in spec.condition_tabs]
145+
except Exception:
146+
return False
147+
finally:
148+
pred_con.close()
149+
150+
for i, (gold_cols, pred_cols) in enumerate(zip(gold_tables, pred_tables)):
151+
if not _compare_table(
152+
pred_cols,
153+
gold_cols,
154+
condition_cols=spec.condition_cols[i],
155+
ignore_order=spec.ignore_orders[i],
156+
):
157+
return False
158+
return True

0 commit comments

Comments
 (0)