Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
c72d551
feat(spider2): eval-spec + duckdb_match comparator (AC-1, AC-2)
kentwelcome Jun 18, 2026
06be8be
feat(spider2): verify.py CLI + materialize verifier assets/test.sh (A…
kentwelcome Jun 18, 2026
1a3b612
test(spider2): end-to-end emitted verifier writes harbor reward.json …
kentwelcome Jun 18, 2026
7ab0bf7
docs(spider2): implementation stage report
kentwelcome Jun 18, 2026
637aac2
docs(spider2): validation report — REJECTED (translate leakage regres…
kentwelcome Jun 18, 2026
98604e5
feedback: validation gate rejected (cycle 1) — duckdb_match faithfuln…
kentwelcome Jun 18, 2026
61a1b9c
fix(spider2): B2 — faithful duckdb_match (column-containment + 1e-2) …
kentwelcome Jun 18, 2026
22b9b6c
fix(spider2): B1 — scope leakage scan to agent-visible portion (exclu…
kentwelcome Jun 18, 2026
f878967
docs(spider2): implementation stage report (feedback cycle 1)
kentwelcome Jun 18, 2026
0b64e92
docs(spider2): validation re-review (cycle 2) — PASSED (B2 faithful, …
kentwelcome Jun 18, 2026
f25597c
feedback: validation gate rejected (cycle 2) — fail-closed eval-spec …
kentwelcome Jun 18, 2026
869918d
fix(spider2-dbt): fail closed on empty/malformed gold eval spec
kentwelcome Jun 18, 2026
11c66f4
docs(spider2-dbt): cycle-2 implementation stage report (fail-closed)
kentwelcome Jun 18, 2026
6698e26
docs(spider2): validation re-review (cycle 3) — PASSED (fail-closed r…
kentwelcome Jun 18, 2026
c6749fd
feedback: validation gate rejected (cycle 3) — honor eval-spec gold b…
kentwelcome Jun 18, 2026
a677c36
fix(spider2-dbt): drive gold DB basename from evaluation.parameters.gold
kentwelcome Jun 18, 2026
bac0705
docs: cycle-3 implementation stage report (gold basename fix)
kentwelcome Jun 18, 2026
78b4a52
docs(spider2): validation re-review (cycle-3 fix) — PASSED (gold base…
kentwelcome Jun 18, 2026
bfed74c
fix(spider2-dbt): reject non-basename gold in load_eval_spec (path-tr…
kentwelcome Jun 18, 2026
b774bb7
fix(spider2-dbt): allowlist gold basename to block test.sh shell inje…
kentwelcome Jun 18, 2026
843b4c1
fix(spider2-dbt): shlex.quote both verifier args (predicted-db inject…
kentwelcome Jun 18, 2026
0ce7d51
fix(spider2-dbt): drop pandas from verifier; re-port comparator on du…
kentwelcome Jun 18, 2026
6c76cbe
fix(spider2-dbt): quote spec-supplied table names to block SQL injection
kentwelcome Jun 18, 2026
f35a13b
fix(spider2-dbt): DECIMAL tolerance regression + fail closed on missi…
kentwelcome Jun 18, 2026
965f2ea
fix(spider2-dbt): guard verifier-asset copies against symlink write-t…
kentwelcome Jun 18, 2026
5cdcea3
docs(spider2): final validation (cycles 1-10) — PASSED
kentwelcome Jun 18, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
291 changes: 291 additions & 0 deletions docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,147 @@
# Validation report — spider2-dbt duckdb_match verifier (cycles 1–10, final)

- Entity: `docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md`
- Branch: `spacedock-ensign/spider2-dbt-duckdb-match-verifier`
- Validated HEAD: `965f2ea` (includes cycles 1–10)
- Merge-base with `main`: `9c39af2`
- Method: independent validation of the current worktree HEAD — fresh suite
runs from the clean branch checkout, standalone behavioral probes (not a
re-read of prior reports), and `superpowers:requesting-code-review`.

## Acceptance criteria — PASS/FAIL with command + output

### AC-1 — comparator scores 1.0 on a matching DB, 0.0 on a mismatch — PASS

Verified by the comparator unit suite and a standalone live probe driving
`compare_duckdb` over in-test DuckDB fixtures.

```
$ uv run pytest -k spider2_dbt_verify --ignore=tests/unit/test_task_identity_scoring.py -q
49 passed, 795 deselected in 2.84s
```

Standalone probe drives `compare_duckdb` directly: matching DB → True,
mismatched DB → False; missing predicted table → False; multi-table AND with one
table mismatched → overall False. All PASS.

### AC-2 — column subsetting + ignore_orders honor duckdb_match semantics — PASS

Same probe, exercised live:

- `ignore_orders=True` over a row-reordered table → match (True); same data with
`ignore_orders=False` → mismatch (False).
- A difference in a column NOT in `condition_cols` → still match (True); the same
difference WITH that column in `condition_cols` → mismatch (False).
- Column-containment: pred columns reordered + an extra pred column → still match.
- `math.isclose(abs_tol=1e-2)`: within 1e-2 → match, beyond → mismatch.
- DECIMAL(10,3) within 1e-2 → match (cycle-9 `_normalize` Decimal→float).
- NULL==NULL → match (NA==NA).

All 11 probe cases PASS. Independently re-confirmed against the upstream
`xlang-ai/Spider2` `eval_utils.duckdb_match`/`compare_pandas_table` by the code
reviewer (fetched live during review) — line-accurate port, including the
`(x is None, str(x), isinstance(x,(int,float)))` sort key.

### AC-3 — emitted test.sh writes a Harbor-shaped reward.json — PASS

Verified by the integration suite and a standalone `emit_reward` probe:

- Matching pred/gold → reward.json parses to `{"reward": 1.0}`,
`set(payload) == {"reward"}`, value is a `float`.
- Mismatch → `{"reward": 0.0}`.
- `emit_reward` never crashes-into-pass: garbage (non-JSON) spec over MATCHING
DBs → `{"reward": 0.0}`; empty-`condition_tabs` wrapped spec over matching DBs
→ `{"reward": 0.0}` (the would-be-1.0 fail-open is closed); missing predicted
DB → `{"reward": 0.0}`.

Integration test (`test_spider2_dbt_verify_test_sh.py`) executes the emitted
`test.sh` end-to-end and asserts the reward.json shape — green within the
`spider2_dbt_verify` 49-passed run above.

## Stage-checklist verification of the dispatch focus areas

- **Comparator faithfulness** — PASS. Column-containment, isclose 1e-2, per-column
sort, condition_cols restriction, multi-table AND, missing-table → 0, NA==NA,
DECIMAL within tolerance all confirmed live (probe) + reviewer's line-by-line
oracle comparison.
- **duckdb-only dependency** — PASS.
`grep -rn "import pandas\|from pandas\|import numpy\|from numpy"
src/razorback/benchmarks/spider2_dbt/` → no matches. Only `duckdb` + stdlib are
imported by the comparator/verifier.
- **Fail-closed guards** — PASS. Live probe: empty file, wrong `evaluation.func`,
empty/missing `condition_tabs`, and a wrapped spec missing `parameters.gold`
each RAISE; the gold-basename allowlist rejects all traversal/metachar cases
(`../…`, `/etc/passwd`, `sub/dir/g.duckdb`, `..`, `.`, `x.duckdb; … #`,
`g.duckdb $(id)`, `g .duckdb`, `g.sqlite`). Missing `tests/gold/` →
`_ensure_verifier_assets` raises `FileNotFoundError`.
- **Shell/SQL injection sealed** — PASS. `harbor_view.py` `shlex.quote`s BOTH
`--predicted-db` and `--gold-db` at the single emission point;
`condition_tabs` is identifier-quoted (doubled `"`) in `_fetch_columns`, and a
breakout value (`realt"; select 999 AS a; --`) over genuinely-mismatched DBs
RAISES (cannot force a match) — confirmed live.
- **Link-mode symlink-write-through** — PASS. All 5 verifier-asset copies route
through `_copy_into_view` (unlink-symlink-then-copy); test.sh, Dockerfile, and
preflight writes carry the same guard. Regression
`..._never_mutate_colliding_source_file` is green.
- **AC-3 reward.json shape / never-crash-into-pass** — PASS (see AC-3).

## Suite results (clean branch checkout)

```
$ uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py -q
93 passed, 751 deselected in 5.34s

$ uv run pytest --ignore=tests/unit/test_task_identity_scoring.py -q
4 failed, 828 passed, 12 skipped, 80 warnings in 68.05s
```

The 4 full-suite failures are pre-existing on the merge-base `9c39af2`
(verified by running exactly those 4 node-ids in a throwaway base worktree →
`4 failed`) and live in files UNTOUCHED by this branch
(`git diff 9c39af2..HEAD --name-only` is spider2_dbt scope + docs only).
NOT regressions:

- `test_spacedock_solver_freeze_dir_mechanism.py::test_codex_runtime_dispatch_constructs_inner_agent`
- `test_worktree_teardown_preserves_runs.py::test_worktree_remove_force_does_not_destroy_runs`
- `test_generate_matrix_specs.py::test_matrix_specs_carry_query_mode_batch`
- `test_rk_research_new.py::test_rk_research_new_creates_scaffold_tree`

## Code review findings (superpowers:requesting-code-review)

Reviewer dispatched against `9c39af2..965f2ea`; it independently fetched the
upstream Spider2 oracle and ran the suite.

- **Critical:** none. The reviewer could not construct any external input
(eval-spec JSON, gold basename, condition_tabs, task profile) that forces a
false reward 1.0.
- **Important:** none.
- **Minor (non-blocking):**
1. `eval_spec.py:73` — `condition_cols` `[[]]`/`[None]` is expanded to
`[[]]*n` for any `n`; this is the *forgiving* direction and mirrors
upstream `compare_multi_pandas_table` defaulting (stricter `duckdb_match`
would assert). No correctness risk for well-formed specs; a clarifying
comment was suggested.
2. The leakage-scan scoping that excludes the verify-only `tests/` subtree
rests on the Harbor "tests/ uploaded only at verify time, reset around the
agent run" lifecycle; a version-pinned comment was suggested so a future
Harbor bump can't silently invalidate it. The guard test proves the change
is scoping not weakening.
3. `harbor_view.py:163` `spec.gold or "gold.duckdb"` fallback is effectively
dead for wrapped specs (they already raise without gold); reachable only
for the unwrapped-fixture path. A clarifying comment was suggested.

All three are clarity/robustness polish, not correctness defects — classified
non-blocking. No production code was changed during validation.

## Gate decision

**PASSED → done.**

Rationale: all three ACs reproduce green from a clean checkout with actual
command output (gating suite 93 passed; acceptance `-k spider2_dbt_verify` 49
passed), every dispatch-named focus area is confirmed by live behavioral probes
(comparator faithfulness, duckdb-only, fail-closed guards, shell+SQL injection
sealed, symlink write-through, reward.json shape), the only full-suite failures
are pre-existing on the merge-base and outside this branch's scope, and the
independent code review found zero blocking issues (3 minor polish notes). No
external input forces a false reward 1.0.
158 changes: 158 additions & 0 deletions src/razorback/benchmarks/spider2_dbt/duckdb_match.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
# ABOUTME: spider2-dbt comparator faithfully reproducing Spider2 eval_utils.duckdb_match.
# ABOUTME: Column-containment over column-vectors, math.isclose(1e-2), AND across tables.
# Depends ONLY on duckdb (the one library the verify-time image is guaranteed to
# have — the build-time preflight already imports it); no pandas, so the emitted
# verifier cannot crash-on-import in a task image that ships dbt-duckdb but not pandas.
from __future__ import annotations

import math
from decimal import Decimal
from pathlib import Path

import duckdb

try:
from razorback.benchmarks.spider2_dbt.eval_spec import EvalSpec
except ModuleNotFoundError: # running flat from /tests in the verifier container
from eval_spec import EvalSpec # type: ignore[no-redef]

# Spider2 eval_utils.compare_pandas_table numeric tolerance.
_TOLERANCE = 1e-2


def _isna(x) -> bool:
"""stdlib stand-in for pandas.isna over scalar cells fetched from DuckDB.

DuckDB returns SQL NULL as Python ``None``; a float ``NaN`` is the only other
NA-like scalar. Mirrors Spider2's ``pd.isna(...)`` NaN==NaN handling.
"""
return x is None or (isinstance(x, float) and math.isnan(x))


def _vectors_match(v1, v2, *, tol: float = _TOLERANCE, ignore_order: bool = False) -> bool:
"""Port of Spider2 eval_utils.compare_pandas_table.vectors_match.

Compares two column-vectors element-wise: NA==NA, numerics via
math.isclose(abs_tol=tol), everything else by ``!=``. When ignore_order
is set, both vectors are sorted first with Spider2's exact key.
"""
try:
if ignore_order:
key = lambda x: (x is None, str(x), isinstance(x, (int, float)))
v1 = sorted(v1, key=key)
v2 = sorted(v2, key=key)
if len(v1) != len(v2):
return False
for a, b in zip(v1, v2):
if _isna(a) and _isna(b):
continue
elif isinstance(a, (int, float)) and isinstance(b, (int, float)):
if not math.isclose(float(a), float(b), abs_tol=tol):
return False
elif a != b:
return False
return True
except Exception:
return False


def _compare_table(
pred_cols: list[list],
gold_cols: list[list],
*,
condition_cols: list[int],
ignore_order: bool,
) -> bool:
"""Port of Spider2 eval_utils.compare_pandas_table.

Column-CONTAINMENT (not positional row equality): restrict gold to
``condition_cols`` (all columns when empty), then require that EVERY gold
column-vector matches SOME pred column-vector. Extra pred columns are
therefore tolerated. Inputs are already transposed to column-vectors.
"""
if condition_cols:
gold_vecs = [gold_cols[c] for c in condition_cols]
else:
gold_vecs = gold_cols

for gold_vec in gold_vecs:
if not any(
_vectors_match(gold_vec, pred_vec, ignore_order=ignore_order)
for pred_vec in pred_cols
):
return False
return True


def _quote_ident(name: str) -> str:
"""Quote a SQL identifier, doubling embedded double-quotes.

``condition_tabs`` table names come from the external eval spec, so a value
like ``realt"; select 999; --`` would otherwise break out of the quoted
identifier and DuckDB would execute the injected statement, letting a hostile
spec rig gold/pred fetches to return identical rows and award reward 1.0.
Doubling keeps the value a single (bogus) identifier; it then fails to
resolve and the fetch raises -> emit_reward scores 0 (fail-closed).
"""
return '"' + name.replace('"', '""') + '"'


def _fetch_columns(con: duckdb.DuckDBPyConnection, table: str) -> list[list]:
"""``SELECT * FROM `table``` as a list of column-vectors (transposed rows).

Replaces Spider2's ``fetchdf().transpose().values.tolist()``: DuckDB's
``.fetchall()`` yields native Python scalars in row order (NULL -> None),
and ``.description`` fixes the column count so a zero-row table still
yields one empty vector per column.
"""
cur = con.execute(f"SELECT * FROM {_quote_ident(table)}")
rows = cur.fetchall()
ncols = len(cur.description)
return [[_normalize(row[j]) for row in rows] for j in range(ncols)]


def _normalize(v):
"""Mirror Spider2's pandas fetchdf() DECIMAL->float64 coercion.

DuckDB native fetchall() returns ``decimal.Decimal`` for DECIMAL/NUMERIC,
but Decimal is ``numbers.Number`` yet NOT ``numbers.Real``, so it would skip
``_vectors_match``'s tolerance branch and a within-1e-2 DECIMAL match would
wrongly score 0. Normalize to float in one place so downstream compares see
the same scalar types Spider2's pandas path produced.
"""
return float(v) if isinstance(v, Decimal) else v


def compare_duckdb(*, predicted_db: Path, gold_db: Path, spec: EvalSpec) -> bool:
"""Faithfully reproduce Spider2 eval_utils.duckdb_match.

For each table in ``spec.condition_tabs``: load gold + predicted via
``SELECT *``, run column-containment with the table's
``spec.condition_cols[i]`` / ``spec.ignore_orders[i]``. Overall match is
the AND across all tables. A missing/unreadable predicted table is a
mismatch (Spider2 wraps the predicted-table fetch in try/except -> 0).
"""
gold_con = duckdb.connect(str(gold_db), read_only=True)
try:
gold_tables = [_fetch_columns(gold_con, t) for t in spec.condition_tabs]
finally:
gold_con.close()

pred_con = duckdb.connect(str(predicted_db), read_only=True)
try:
try:
pred_tables = [_fetch_columns(pred_con, t) for t in spec.condition_tabs]
except Exception:
return False
finally:
pred_con.close()

for i, (gold_cols, pred_cols) in enumerate(zip(gold_tables, pred_tables)):
if not _compare_table(
pred_cols,
gold_cols,
condition_cols=spec.condition_cols[i],
ignore_order=spec.ignore_orders[i],
):
return False
return True
Loading