| 1 | +# Validation report — spider2-dbt duckdb_match verifier (cycles 1–10, final) |
| 2 | + |
| 3 | +- Entity: `docs/razorback-implementation/spider2-dbt-duckdb-match-verifier.md` |
| 4 | +- Branch: `spacedock-ensign/spider2-dbt-duckdb-match-verifier` |
| 5 | +- Validated HEAD: `965f2ea` (includes cycles 1–10) |
| 6 | +- Merge-base with `main`: `9c39af2` |
| 7 | +- Method: independent validation of the current worktree HEAD — fresh suite |
| 8 | + runs from the clean branch checkout, standalone behavioral probes (not a |
| 9 | + re-read of prior reports), and `superpowers:requesting-code-review`. |
| 10 | + |
| 11 | +## Acceptance criteria — PASS/FAIL with command + output |
| 12 | + |
| 13 | +### AC-1 — comparator scores 1.0 on a matching DB, 0.0 on a mismatch — PASS |
| 14 | + |
| 15 | +Verified by the comparator unit suite and a standalone live probe driving |
| 16 | +`compare_duckdb` over in-test DuckDB fixtures. |
| 17 | + |
| 18 | +``` |
| 19 | +$ uv run pytest -k spider2_dbt_verify --ignore=tests/unit/test_task_identity_scoring.py -q |
| 20 | +49 passed, 795 deselected in 2.84s |
| 21 | +``` |
| 22 | + |
| 23 | +Standalone probe drives `compare_duckdb` directly: matching DB → True, |
| 24 | +mismatched DB → False; missing predicted table → False; multi-table AND with one |
| 25 | +table mismatched → overall False. All PASS. |
| 26 | + |
| 27 | +### AC-2 — column subsetting + ignore_orders honor duckdb_match semantics — PASS |
| 28 | + |
| 29 | +Same probe, exercised live: |
| 30 | + |
| 31 | +- `ignore_orders=True` over a row-reordered table → match (True); same data with |
| 32 | + `ignore_orders=False` → mismatch (False). |
| 33 | +- A difference in a column NOT in `condition_cols` → still match (True); the same |
| 34 | + difference WITH that column in `condition_cols` → mismatch (False). |
| 35 | +- Column-containment: pred columns reordered + an extra pred column → still match. |
| 36 | +- `math.isclose(abs_tol=1e-2)`: within 1e-2 → match, beyond → mismatch. |
| 37 | +- DECIMAL(10,3) within 1e-2 → match (cycle-9 `_normalize` Decimal→float). |
| 38 | +- NULL==NULL → match (NA==NA). |
| 39 | + |
| 40 | +All 11 probe cases PASS. Independently re-confirmed against the upstream |
| 41 | +`xlang-ai/Spider2` `eval_utils.duckdb_match`/`compare_pandas_table` by the code |
| 42 | +reviewer (fetched live during review) — line-accurate port, including the |
| 43 | +`(x is None, str(x), isinstance(x,(int,float)))` sort key. |
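The semantics exercised by the AC-2 probe can be sketched in plain Python. This is an illustration, not the branch's code: `normalize`, `vectors_match`, and `columns_contained` are hypothetical names, and only the sort key, the `abs_tol=1e-2` tolerance, the Decimal→float step, and NA==NA are taken from the report.

```python
import math
from decimal import Decimal

TOLERANCE = 1e-2  # abs_tol quoted in the report


def normalize(value):
    # Assumed behaviour: Decimal becomes float so the numeric branch applies.
    return float(value) if isinstance(value, Decimal) else value


def sort_key(x):
    # Sort key quoted in the report: None values last, then by string form.
    return (x is None, str(x), isinstance(x, (int, float)))


def vectors_match(gold, pred, ignore_order=False, tol=TOLERANCE):
    """Compare two column vectors element by element."""
    gold = [normalize(v) for v in gold]
    pred = [normalize(v) for v in pred]
    if ignore_order:
        gold, pred = sorted(gold, key=sort_key), sorted(pred, key=sort_key)
    if len(gold) != len(pred):
        return False
    for a, b in zip(gold, pred):
        if a is None and b is None:
            continue  # NA == NA
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            if not math.isclose(float(a), float(b), abs_tol=tol):
                return False
        elif a != b:
            return False
    return True


def columns_contained(gold_cols, pred_cols, ignore_order=False):
    """Column-containment: each gold column must match some predicted column."""
    return all(
        any(vectors_match(g, p, ignore_order) for p in pred_cols)
        for g in gold_cols
    )
```

Under this sketch, reordered or extra predicted columns do not affect the result, because each gold column only needs one matching predicted column.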
| 44 | + |
| 45 | +### AC-3 — emitted test.sh writes a Harbor-shaped reward.json — PASS |
| 46 | + |
| 47 | +Verified by the integration suite and a standalone `emit_reward` probe: |
| 48 | + |
| 49 | +- Matching pred/gold → reward.json parses to `{"reward": 1.0}`, |
| 50 | + `set(payload) == {"reward"}`, value is a `float`. |
| 51 | +- Mismatch → `{"reward": 0.0}`. |
| 52 | +- `emit_reward` never crashes-into-pass: garbage (non-JSON) spec over MATCHING |
| 53 | + DBs → `{"reward": 0.0}`; empty-`condition_tabs` wrapped spec over matching DBs |
| 54 | + → `{"reward": 0.0}` (the would-be-1.0 fail-open is closed); missing predicted |
| 55 | + DB → `{"reward": 0.0}`. |
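The fail-closed property described above can be illustrated with a small sketch. It assumes a hypothetical `emit_reward_sketch` helper that takes a zero-argument `compare` callable; the real `emit_reward` signature is not shown in this report and may differ.

```python
import json
import tempfile
from pathlib import Path


def emit_reward_sketch(compare, reward_path):
    """Write {"reward": 1.0} only when compare() returns exactly True.

    Any exception (bad spec, missing DB, ...) or non-True result yields 0.0,
    so a crash can never turn into a pass.
    """
    reward = 0.0
    try:
        if compare() is True:
            reward = 1.0
    except Exception:
        reward = 0.0
    Path(reward_path).write_text(json.dumps({"reward": reward}))
    return reward


if __name__ == "__main__":
    out = Path(tempfile.mkdtemp()) / "reward.json"
    emit_reward_sketch(lambda: True, out)
    print(out.read_text())
```

The default of `0.0` is set before the comparison runs, so every error path falls through to a failing reward without needing its own handler.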
| 56 | + |
| 57 | +Integration test (`test_spider2_dbt_verify_test_sh.py`) executes the emitted |
| 58 | +`test.sh` end-to-end and asserts the reward.json shape — green within the |
| 59 | +`spider2_dbt_verify` 49-passed run above. |
| 60 | + |
| 61 | +## Stage-checklist verification of the dispatch focus areas |
| 62 | + |
| 63 | +- **Comparator faithfulness** — PASS. Column-containment, isclose 1e-2, per-column |
| 64 | + sort, condition_cols restriction, multi-table AND, missing-table → 0, NA==NA, |
| 65 | + DECIMAL within tolerance all confirmed live (probe) + reviewer's line-by-line |
| 66 | + oracle comparison. |
| 67 | +- **duckdb-only dependency** — PASS. |
| 68 | + `grep -rn "import pandas\|from pandas\|import numpy\|from numpy" |
| 69 | + src/razorback/benchmarks/spider2_dbt/` → no matches. Only `duckdb` + stdlib are |
| 70 | + imported by the comparator/verifier. |
| 71 | +- **Fail-closed guards** — PASS. Live probe: empty file, wrong `evaluation.func`, |
| 72 | + empty/missing `condition_tabs`, and a wrapped spec missing `parameters.gold` |
| 73 | + each RAISE; the gold-basename allowlist rejects all traversal/metachar cases |
| 74 | + (`../…`, `/etc/passwd`, `sub/dir/g.duckdb`, `..`, `.`, `x.duckdb; … #`, |
| 75 | + `g.duckdb $(id)`, `g .duckdb`, `g.sqlite`). Missing `tests/gold/` → |
| 76 | + `_ensure_verifier_assets` raises `FileNotFoundError`. |
| 77 | +- **Shell/SQL injection sealed** — PASS. `harbor_view.py` `shlex.quote`s BOTH |
| 78 | + `--predicted-db` and `--gold-db` at the single emission point; |
| 79 | + `condition_tabs` is identifier-quoted (doubled `"`) in `_fetch_columns`, and a |
| 80 | + breakout value (`realt"; select 999 AS a; --`) over genuinely-mismatched DBs |
| 81 | + RAISES (cannot force a match) — confirmed live. |
| 82 | +- **Link-mode symlink-write-through** — PASS. All 5 verifier-asset copies route |
| 83 | + through `_copy_into_view` (unlink-symlink-then-copy); test.sh, Dockerfile, and |
| 84 | + preflight writes carry the same guard. Regression |
| 85 | + `..._never_mutate_colliding_source_file` is green. |
| 86 | +- **AC-3 reward.json shape / never-crash-into-pass** — PASS (see AC-3). |
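The three input guards named in the checklist (gold-basename allowlist, `shlex.quote` on both DB arguments, doubled-`"` identifier quoting) can be sketched as follows. The regular expression, the function names, and the `python verify.py` command prefix are assumptions made for illustration; only the flag names and the quoting techniques come from the report.

```python
import re
import shlex

# Hypothetical allowlist: a bare file name ending in .duckdb, no path parts.
_GOLD_BASENAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*\.duckdb")


def is_safe_gold_basename(name):
    """Reject traversal, subdirectories, whitespace, and shell metacharacters."""
    return _GOLD_BASENAME.fullmatch(name) is not None


def quote_identifier(name):
    """SQL identifier quoting: wrap in double quotes and double any embedded one."""
    return '"' + name.replace('"', '""') + '"'


def build_verify_command(predicted_db, gold_db):
    """Quote both paths at the single point where the shell command is built."""
    return "python verify.py --predicted-db {} --gold-db {}".format(
        shlex.quote(predicted_db), shlex.quote(gold_db)
    )
```

With identifier quoting, a breakout value such as `realt"; select 999 AS a; --` is treated as a single (nonexistent) table name, which is consistent with the raise observed in the live probe.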
| 87 | + |
| 88 | +## Suite results (clean branch checkout) |
| 89 | + |
| 90 | +``` |
| 91 | +$ uv run pytest -k spider2_dbt --ignore=tests/unit/test_task_identity_scoring.py -q |
| 92 | +93 passed, 751 deselected in 5.34s |
| 93 | + |
| 94 | +$ uv run pytest --ignore=tests/unit/test_task_identity_scoring.py -q |
| 95 | +4 failed, 828 passed, 12 skipped, 80 warnings in 68.05s |
| 96 | +``` |
| 97 | + |
| 98 | +The 4 full-suite failures are pre-existing on the merge-base `9c39af2` |
| 99 | +(verified by running exactly those 4 node-ids in a throwaway base worktree → |
| 100 | +`4 failed`) and live in files UNTOUCHED by this branch |
| 101 | +(`git diff 9c39af2..HEAD --name-only` is spider2_dbt scope + docs only). |
| 102 | +NOT regressions: |
| 103 | + |
| 104 | +- `test_spacedock_solver_freeze_dir_mechanism.py::test_codex_runtime_dispatch_constructs_inner_agent` |
| 105 | +- `test_worktree_teardown_preserves_runs.py::test_worktree_remove_force_does_not_destroy_runs` |
| 106 | +- `test_generate_matrix_specs.py::test_matrix_specs_carry_query_mode_batch` |
| 107 | +- `test_rk_research_new.py::test_rk_research_new_creates_scaffold_tree` |
| 108 | + |
| 109 | +## Code review findings (superpowers:requesting-code-review) |
| 110 | + |
| 111 | +Reviewer dispatched against `9c39af2..965f2ea`; it independently fetched the |
| 112 | +upstream Spider2 oracle and ran the suite. |
| 113 | + |
| 114 | +- **Critical:** none. The reviewer could not construct any external input |
| 115 | + (eval-spec JSON, gold basename, condition_tabs, task profile) that forces a |
| 116 | + false reward 1.0. |
| 117 | +- **Important:** none. |
| 118 | +- **Minor (non-blocking):** |
| 119 | + 1. `eval_spec.py:73` — `condition_cols` `[[]]`/`[None]` is expanded to |
| 120 | + `[[]]*n` for any `n`; this is the *forgiving* direction and mirrors |
| 121 | + upstream `compare_multi_pandas_table` defaulting (stricter `duckdb_match` |
| 122 | + would assert). No correctness risk for well-formed specs; a clarifying |
| 123 | + comment was suggested. |
| 124 | + 2. The leakage-scan scoping that excludes the verify-only `tests/` subtree |
| 125 | + rests on the Harbor "tests/ uploaded only at verify time, reset around the |
| 126 | + agent run" lifecycle; a version-pinned comment was suggested so a future |
| 127 | + Harbor bump can't silently invalidate it. The guard test proves the change |
| 128 | + is scoping, not weakening. |
| 129 | + 3. `harbor_view.py:163` `spec.gold or "gold.duckdb"` fallback is effectively |
| 130 | + dead for wrapped specs (they already raise without gold); reachable only |
| 131 | + for the unwrapped-fixture path. A clarifying comment was suggested. |
| 132 | + |
| 133 | +All three are clarity/robustness polish, not correctness defects — classified |
| 134 | +non-blocking. No production code was changed during validation. |
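For minor finding 1, the forgiving expansion of `condition_cols` might look like the sketch below. `expand_condition_cols` is a hypothetical name, and the length check for the non-placeholder case is an assumption; the report only states that `[[]]` and `[None]` expand to one empty filter per table.

```python
def expand_condition_cols(condition_cols, n_tables):
    """Return one column-filter list per table.

    A placeholder of [[]] or [None] expands to n_tables empty filters
    (assumed to mean "no column restriction"); any other value must
    already carry exactly one entry per table.
    """
    if condition_cols in ([[]], [None]):
        # Build a fresh list per table so entries are never shared.
        return [[] for _ in range(n_tables)]
    if len(condition_cols) != n_tables:
        raise ValueError(
            f"expected {n_tables} condition_cols entries, got {len(condition_cols)}"
        )
    return [list(cols) for cols in condition_cols]
```

A comment of the kind the reviewer suggested would sit on the placeholder branch, noting that the expansion is intentionally lenient.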
| 135 | + |
| 136 | +## Gate decision |
| 137 | + |
| 138 | +**PASSED → done.** |
| 139 | + |
| 140 | +Rationale: all three ACs reproduce green from a clean checkout with actual |
| 141 | +command output (gating suite 93 passed; acceptance `-k spider2_dbt_verify` 49 |
| 142 | +passed), every dispatch-named focus area is confirmed by live behavioral probes |
| 143 | +(comparator faithfulness, duckdb-only, fail-closed guards, shell+SQL injection |
| 144 | +sealed, symlink write-through, reward.json shape), the only full-suite failures |
| 145 | +are pre-existing on the merge-base and outside this branch's scope, and the |
| 146 | +independent code review found zero blocking issues (3 minor polish notes). No |
| 147 | +external input forces a false reward 1.0. |