Commit ba2b352

Merge branch 'feat/conformance-harness'
2 parents ad5a4d9 + 9acd56d commit ba2b352

91 files changed

Lines changed: 5507 additions & 463 deletions

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
# Conformance Harness: Acceptance Criteria & Engine Matrix

Status: Approved for implementation on `feat/conformance-harness`.
Scope: a cross-engine conformance harness that proves every supported execution
backend reproduces the **same** typed-ingest / gold-derivation result as the
established Spark-direct oracle, byte-for-byte, under one shared canonicalization.

This document is the **criteria-first** phase. It defines unambiguous,
machine-checkable acceptance for each engine, the canonicalization contract every
engine MUST share, the fixture corpus + tags (including the cases still to add),
and the matrix assertion the harness enforces. Items marked `(NEW)` do not exist
yet and are the deliverable of the later implementation phases on this branch.

> **Run prefix** (ALL python/pytest/dbt/uv commands):
> `UV_PROJECT_ENVIRONMENT=/tmp/tsvenv JAVA_HOME=/home/linuxbrew/.linuxbrew/opt/openjdk@17 SPARK_LOCAL_IP=127.0.0.1 uv run <cmd>`
> PySpark 4.0 runs ONLY under `JAVA_HOME=openjdk@17` (default JDK 26 crashes in
> `getSubject`). For any `dbt-spark` (session) leg, set an **isolated**
> `spark.sql.warehouse.dir` + metastore dir per case for parallel safety.

---

## 1. The oracle (the "previous implementation")

The single source of truth is the **Spark-direct ingest baseline**:
`tablespec.generate_ingest_sql(umf)` executed on Delta-Spark
(`tests/ingest_parity/test_spark_baseline.py:210`). Its canonicalized output is
committed as the **corpus golden** under
`tests/golden/ingest_parity/<fixture>.spark.expected.json`. The gold-derivation
oracle is `SQLPlanGenerator` / `generate_sql_plan`
(`src/tablespec/schemas/sql_generator.py`), whose golden is the canonicalized
result of executing the generated gold SQL on the oracle engine.

Every engine leg compares its canonicalized output to **that same corpus
golden** (never to itself, never to a freshly-recomputed expectation), AND any
two engines that can both run a given case MUST agree **pairwise**. An engine
that cannot run a tier in this environment is `skipif`-gated with an explicit,
visible reason; it is never silently passed.

---

## 2. Engines × fidelity tier × what-it-compares-to × gate

| Engine | Fidelity tier | Executed here? | Compares to | Skip gate |
| --- | --- | --- | --- | --- |
| **SparkDirect** | Oracle / executed (result-parity) | Yes (Delta-Spark, JVM) | IS the corpus golden (writes it under `--update-golden`); all others compare to it | `spark_only`; skip if no JVM / `JAVA_HOME` not openjdk@17 |
| **DbtDuckDB** | Executed (result-parity) | Yes (in-process DuckDB) | corpus golden + pairwise vs every other available engine | `no_spark`; `importorskip("duckdb")`, `importorskip("dbt")`, skip if `dbt` CLI absent |
| **DbtSparkSession** | Executed (result-parity) | Yes (local embedded `dbt-spark[session]`, `method: session`, embedded Hive/Derby) | corpus golden + pairwise | `slow`; skip if `dbt-spark` adapter missing or JVM unavailable; per-case isolated warehouse/metastore dir |
| **SQLPlanGeneratorGold** | Executed (result-parity), run on BOTH DuckDB AND the Spark session | Yes (both backends, via the dbt-generated gold project so the dialect layer applies) | corpus golden + Spark↔DuckDB equivalence proven pairwise (closes the "gold never run on Spark" gap) | DuckDB leg: `no_spark` + duckdb/dbt present; Spark leg: `slow` + JVM/`dbt-spark` present |
| **DbtDatabricks** | Compile-golden (no cluster) | Compile only | the committed compiled-SQL golden; cast-SQL parity to Spark via the shared renderer | `no_spark`; `dbt compile` only; `dbt run` is `skipif`-gated when no Databricks workspace is available |
| **LDP** | Cast-parity + compile-golden + opt-in e2e | Cast-parity + emit-golden executed; e2e opt-in | (a) cast-parity: emitted cast SQL == Spark cast SQL; (b) compile-golden: emitted project text == `tests/golden/ldp/**`; (c) e2e: corpus golden | `no_spark` for (a)+(b); (c) gated behind opt-in `databricks_e2e` marker (`skipif` no Databricks) |

### 2.1 Tier definitions

- **Oracle / executed (result-parity):** generates SQL, executes it on a real
  engine against real CSV data, canonicalizes the resulting table, and that
  canonical form defines (SparkDirect) or must equal (all others) the corpus
  golden. No mocks for the behavior under test.
- **Compile-golden:** `dbt compile` (or LDP text emission) renders deterministic
  SQL/project text that is byte-compared to a committed golden. Proves the
  emitter, not a live run. Used where no cluster exists here (Databricks; LDP
  Databricks runtime).
- **Cast-parity:** the per-column cast expression the backend emits is executed
  in isolation (or string-compared) and must reproduce the EXACT value/NULL
  behavior of the Spark `try_to_timestamp` + Java-token oracle, including the
  sub-second / width-boundary cases the second-resolution canonical form would
  otherwise hide.

### 2.2 Marker plan `(NEW where noted)`

Reuse existing markers (`slow`, `fast`, `no_spark`, `spark_only`, `acceptance`,
`contract`). Add ONE new marker:

- `databricks_e2e` `(NEW)`: opt-in; `skipif` unless a real Databricks workspace
  is configured. Default-deselected so the green suite never depends on a cluster.

The new marker is registered in `pyproject.toml [tool.pytest.ini_options].markers`
(`--strict-markers` is on, so it must be declared).

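
The skip gate for the opt-in tier can be sketched as a small pure function. This is a minimal illustration, not the harness's actual code: the helper name `databricks_skip_reason` is hypothetical, and the `DATABRICKS_HOST` check is taken from the marker description added to `pyproject.toml` in this commit. Returning the reason string (rather than a bare boolean) keeps the skip explicit and visible.

```python
import os
from typing import Mapping, Optional


def databricks_skip_reason(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a visible skip reason, or None when the e2e tier may run.

    Hypothetical helper: only DATABRICKS_HOST is checked, mirroring the
    marker description in pyproject.toml. A real gate may need more settings.
    """
    env = os.environ if env is None else env
    if not env.get("DATABRICKS_HOST"):
        return "databricks_e2e: DATABRICKS_HOST is not set (no Databricks workspace)"
    return None


# Assumed usage with pytest (not executed here):
#   reason = databricks_skip_reason()
#   gate = pytest.mark.skipif(reason is not None, reason=reason or "")
```

Because the function takes the environment as an argument, the gate itself can be unit-tested without a workspace.
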

---

## 3. Canonicalization contract (NEW: extend `tests/ingest_parity/canonical.py`)

ALL engines MUST canonicalize through the identical `canonical.to_json`. Today
`render_value` pins timestamps to **second** resolution and assumes UTC, which
HIDES sub-second and timezone divergence between engines. The contract is
extended to make that divergence visible while keeping the current goldens
stable: the existing corpus has no sub-second data, and item 4 below states how
its goldens are preserved.

Contract (`canonical.to_json` / `render_value` / `canonical_rows`):

1. **Configurable timestamp precision.** `to_json(..., ts_precision: int = 6)`
   threads through to `render_value(value, *, ts_precision=6)`. A
   `datetime`/timestamp renders as `YYYY-MM-DD HH:MM:SS` when `ts_precision == 0`,
   else `YYYY-MM-DD HH:MM:SS.ffffff` truncated (NOT rounded) to `ts_precision`
   fractional digits. **Default is microsecond (6)** so sub-second divergence is
   visible by default; a case may pin `ts_precision=0` only with an explicit,
   documented reason.
2. **Explicit timezone handling.** TZ rendering is explicit, not implicit-UTC.
   A tz-aware `datetime` is first normalized to UTC then rendered with a trailing
   `Z`; a naive `datetime` renders with NO suffix. The two are therefore NEVER
   byte-equal, so a TZ-aware↔naive divergence cannot silently pass. Every engine
   leg pins its session to UTC (`SET TimeZone='UTC'` / Spark `spark.sql.session.timeZone=UTC`)
   so wall-clock values agree before this rendering step.
3. **Identical for all engines.** SparkDirect, DbtDuckDB, DbtSparkSession,
   SQLPlanGeneratorGold (both backends), and the LDP e2e leg import and call the
   SAME `to_json` with the SAME `ts_precision` and the SAME decimal `scales` map.
   Decimals stay fixed at their declared scale; booleans `true`/`false`; NULL ->
   `"NULL"`; rows sorted by all canonical columns. No per-engine canonicalization.
4. **Backward compatibility (explicit, not hand-waved).** Switching the default
   to `ts_precision=6` is NOT byte-identical to the current second-resolution
   goldens: a whole-second `...:SS` becomes `...:SS.000000`. There are two
   compatible paths; exactly one MUST be chosen at implementation:
   - **(a) corpus default `ts_precision=0`**: the existing 10 fixtures keep
     pinning second resolution (their goldens are unchanged, byte-for-byte), and
     ONLY the NEW sub-second/tz cases opt into `ts_precision=6`. This preserves
     every committed golden with zero regeneration. **This is the recommended
     default**; the `to_json` signature default is `6`, but the ingest corpus
     parametrization passes `ts_precision=0` explicitly except for `tz`-tagged
     cases.
   - **(b) global `ts_precision=6` + one-time golden migration**: regenerate all
     goldens under `--update-golden` so whole seconds carry `.000000`. This is
     compatibility by MIGRATION (a single reviewed golden churn), not byte
     compatibility of the unchanged files.

   The harness records the chosen precision per case so golden + every engine leg
   compare at one precision.

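
The timestamp, boolean, and NULL rules of the contract can be sketched as below. This is a minimal sketch under stated assumptions, not the contents of `canonical.py`: only the `render_value(value, *, ts_precision=6)` signature comes from the contract above; the body, the range check on `ts_precision`, and the `str()` fallback are illustrative, and decimal scale handling is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def render_value(value, *, ts_precision: int = 6) -> str:
    """Sketch of the contract's rendering rules (decimals omitted)."""
    if not 0 <= ts_precision <= 6:
        raise ValueError(f"ts_precision must be in 0..6, got {ts_precision}")
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, datetime):
        suffix = ""
        if value.utcoffset() is not None:
            # tz-aware: normalize to UTC, then mark with a trailing Z.
            value = value.astimezone(timezone.utc)
            suffix = "Z"
        text = value.strftime("%Y-%m-%d %H:%M:%S")
        if ts_precision > 0:
            # Truncate (do not round) the zero-padded microseconds.
            text += "." + f"{value.microsecond:06d}"[:ts_precision]
        return text + suffix
    return str(value)  # assumption: other scalars fall back to str()


naive = datetime(2024, 1, 2, 3, 4, 5, 999999)
aware = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone(timedelta(hours=-5)))
print(render_value(naive, ts_precision=3))  # 2024-01-02 03:04:05.999
print(render_value(aware))                  # 2024-01-02 08:04:05.000000Z
```

Note that the same instant rendered from a naive and a tz-aware value differs by the `Z` suffix, which is what makes a naive/aware divergence fail the byte comparison.
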

---

## 4. Fixture corpus, tags, and cases to add

### 4.1 Existing ingest corpus (`tests/fixtures/ingest/`)

`claims_incremental_pk`, `currency_amounts`, `dates_formats`,
`events_incremental_nopk`, `members_snapshot_pk`, `messy_incremental_pk`,
`nopad_formats`, `parity_hardening`, `provider_snapshot`, `types_basic`.
Two-batch fixtures are tracked by `_TWO_BATCH` in `test_spark_baseline.py`.

### 4.2 Tag taxonomy (NEW: a `tags:` list on each fixture UMF, surfaced as pytest marks/ids)

- `types`: scalar type coverage (passthrough, numeric, boolean).
- `decimal`: decimal precision / scale / overflow boundaries.
- `datetime`: date/timestamp format parsing.
- `tz`: timezone-aware + sub-second timestamp behavior.
- `incremental`: incremental (merge / append) ingestion.
- `snapshot`: full-snapshot ingestion.
- `pk` / `nopk`: has / lacks a primary key (dedup vs blind-append).
- `multibatch`: 3+ batches / out-of-order `_load_ts` / tie-break / tombstone.
- `gold`: cross-table gold derivation (join/pivot/unpivot/window/etc).

### 4.3 Missing cases to add `(NEW)`

Ingest tier:

1. **`decimal_boundaries`** (`decimal`): values at `precision`/`scale` limits,
   rounding at scale boundary, and OVERFLOW inputs that must NULL/error
   identically across engines (largest-representable + just-over-precision).
2. **`tz_subsecond_timestamps`** (`datetime,tz`): tz-aware offsets (`+00:00`,
   `-05:00`, `Z`) AND `.SSS`/`.SSSSSS` fractional seconds; exercises the
   microsecond + explicit-TZ canonicalization so sub-second/TZ divergence is
   visible and must agree.
3. **`multibatch_ooo_tiebreak`** (`incremental,pk,multibatch`): 3+ batches with
   OUT-OF-ORDER `_load_ts`, an exact-tie `_load_ts` requiring a deterministic
   tie-break, and a **tombstone** (delete-marker) row that removes a prior key.

Gold pattern family (`gold`, executed via `generate_sql_plan` on BOTH backends):

4. **`gold_join`**: multi-table sequential join (member×claims). Generator path:
   `_generate_join_step` (direct/sequential join).
5. **`gold_pivot`**: pivot derivation. Generator path: `_generate_pivot_join`.
6. **`gold_unpivot`**: UNPIVOT base-view derivation. Generator path:
   `_generate_unpivot_base_view`.
7. **`gold_window_aggregation`**: window / pre-aggregation view (`ROW_NUMBER` /
   `RANK` / pre-aggregation). Generator path: `_generate_pre_aggregation_views`.
8. **`gold_survivorship_priority`**: survivorship across `union_sources` via the
   priority-sorted `COALESCE` candidate order (the generator's supported
   survivorship mechanism). Generator path: `_generate_member_universe_view` +
   priority `COALESCE`. (Most-recent / longest-value survivorship is NOT a named
   generator strategy and is out of scope for this case.)
9. **`gold_first_record`**: first-record-per-key selection. Generator path:
   `_generate_first_record_join` (`strategy in ("first", "first_record")`).
10. **`gold_fk_integrity`**: referential-integrity coverage. NOTE: orphan-FK
    validation is NOT emitted by `generate_sql_plan` (FK metadata there only
    drives join planning / join type). FK-integrity is therefore tested at the
    **dbt `relationships` schema-test** tier: `generate_dbt_dag_project` emits the
    `relationships` test and `dbt build`/`dbt test` is asserted to PASS on clean
    data and FAIL on an injected orphan row (the explicit negative). The SparkDirect
    gold join result for the clean data is still the corpus golden; the orphan
    negative is a dbt-test assertion, not a canonical-row comparison.

Each new case ships: `<name>.umf.yaml` (with `tags:`), CSV batch(es), and a
committed corpus golden produced by the SparkDirect oracle under `--update-golden`.

---

## 5. The matrix assertion

For the parametrized product **(case × available-engine)** the harness asserts:

- **A. Golden conformance:** `canonical(engine, case) == read(case.golden)`
  byte-identical, using the case's pinned `ts_precision` + decimal `scales`. The
  golden is the SparkDirect oracle output (the previous implementation).
- **B. Pairwise agreement:** for any two engines `e1`, `e2` both available for a
  case, `canonical(e1, case) == canonical(e2, case)`. (Transitively implied by A
  when both pass, but asserted explicitly so a shared-golden-but-divergent-render
  bug is localized to the engine pair.)
- **C. Gold Spark↔DuckDB equivalence:** for every `gold` case, the
  `SQLPlanGeneratorGold` output is executed on BOTH DuckDB and the Spark session
  **via the dbt-generated gold project** (so the dialect layer rewrites
  Spark-flavored constructs like `SELECT * EXCEPT (rn)` / `UNPIVOT EXCLUDE NULLS`
  appropriately per backend) and the two canonical forms MUST be equal (and each
  equal to the golden), explicitly closing the "gold never run on Spark" gap.
- **D. Compile-golden stability:** `DbtDatabricks` `dbt compile` output and LDP
  emitted project text are byte-equal to their committed goldens; LDP cast SQL ==
  Spark cast SQL (cast-parity).
- **E. Skip visibility:** any unavailable (engine, tier) emits a `skip` with an
  explicit reason; the run summary shows skips so a silently-missing engine is
  detectable (never reported as a pass).

Encapsulation (`tests/test_core_encapsulation.py`) and `make check`
(lint + pyright + full suite) MUST stay green; no core→dbt/ldp import is added by
the harness.
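
Assertions A, B, and E for a single case can be sketched as one checking function. This is an illustrative sketch, not the harness implementation: the name `check_case`, the report shape, and the convention that an unavailable engine is represented by `None` are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, List, Optional


def check_case(golden: str, outputs: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Check one case: golden conformance (A), pairwise agreement (B), skips (E).

    `outputs` maps engine name -> canonical JSON text, or None when the
    engine cannot run here (hypothetical convention for this sketch).
    """
    report: Dict[str, List[str]] = {"failures": [], "skips": []}
    available: Dict[str, str] = {}
    for engine, text in sorted(outputs.items()):
        if text is None:
            # E: an unavailable engine is recorded, never counted as a pass.
            report["skips"].append(f"{engine}: unavailable in this environment")
        else:
            available[engine] = text
    for engine, text in available.items():
        if text != golden:  # A: byte-identical to the corpus golden
            report["failures"].append(f"A: {engine} != golden")
    for e1, e2 in combinations(available, 2):
        if available[e1] != available[e2]:  # B: localize to the engine pair
            report["failures"].append(f"B: {e1} != {e2}")
    return report


result = check_case(
    "[]",
    {"SparkDirect": "[]", "DbtDuckDB": "[1]", "DbtSparkSession": None},
)
print(result["failures"])  # ['A: DbtDuckDB != golden', 'B: DbtDuckDB != SparkDirect']
print(result["skips"])     # ['DbtSparkSession: unavailable in this environment']
```

Reporting B separately from A is redundant when everything passes, but on failure it names the diverging pair directly.
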

pyproject.toml

Lines changed: 12 additions & 0 deletions
@@ -50,6 +50,17 @@ dbt = [
     "dbt-duckdb>=1.9.0,<2.0.0",
     "duckdb>=1.5.0,<2.0.0",
 ]
+# Cross-engine conformance harness: drives the SAME ingest artifact through
+# DuckDB, a local Spark session (dbt-spark[session]), and a compile-only
+# Databricks target. dbt-databricks is COMPILE-ONLY here (no cluster); the
+# Databricks SQL dialect equals Spark SQL for our casts.
+conformance = [
+    "dbt-core>=1.9.0,<2.0.0",
+    "dbt-duckdb>=1.9.0,<2.0.0",
+    "dbt-spark[session]>=1.10,<2",
+    "dbt-databricks>=1.9,<2",
+    "duckdb>=1.5.0,<2.0.0",
+]
 tui = [
     "textual>=0.50.0",
 ]
@@ -85,6 +96,7 @@ markers = [
     "fast: marks tests that complete in <100ms with no I/O or external deps",
     "no_spark: marks tests that do not require PySpark",
     "spark_only: marks tests that REQUIRE a JVM-backed Spark session (excluded from the no_spark fast lane)",
+    "databricks_e2e: opt-in tier that deploys + executes on a REAL Databricks workspace (skipped unless DATABRICKS_HOST is set)",
 ]
 filterwarnings = [
     "error",

src/tablespec/casting_utils.py

Lines changed: 16 additions & 3 deletions
@@ -499,7 +499,13 @@ def cast_column_sql(
         format: Optional UMF date/timestamp format (e.g. "YYYYMMDD").
         precision: DECIMAL precision (defaults to 10, matching the runtime caster).
         scale: DECIMAL scale (defaults to 2, matching the runtime caster).
-        dialect: ``"spark"`` (default) or ``"duckdb"``.
+        dialect: ``"spark"`` (default), ``"databricks"``, or ``"duckdb"``.
+            ``"databricks"`` is an explicit, separately-selectable dialect that
+            renders byte-for-byte identical SQL to ``"spark"`` -- Databricks SQL is
+            Spark SQL for our casts (``try_to_timestamp`` + Java date tokens), so a
+            Databricks dbt target reuses the Spark rendering. It exists as a named
+            dialect purely so a Databricks compile/run target can be selected
+            explicitly rather than masquerading as plain Spark.

     Returns:
     -------
@@ -517,9 +523,16 @@ def cast_column_sql(
     "try_cast(nullif(trim(regexp_replace(age, '^\\$', '')), '') as INT)"

     """
-    if dialect not in ("spark", "duckdb"):
-        msg = f"Unsupported dialect: {dialect!r} (expected 'spark' or 'duckdb')"
+    if dialect not in ("spark", "databricks", "duckdb"):
+        msg = (
+            f"Unsupported dialect: {dialect!r} "
+            "(expected 'spark', 'databricks', or 'duckdb')"
+        )
         raise ValueError(msg)
+    # Databricks SQL == Spark SQL for our casts: try_to_timestamp + Java date
+    # tokens. We keep 'databricks' as a distinct, explicitly-selectable named
+    # dialect but render it through the identical Spark code path below, so the two
+    # never drift. Everything past this point only distinguishes duckdb vs not.
     is_duck = dialect == "duckdb"
     t = target_type.upper()

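
The "named alias, shared code path" pattern in this hunk can be shown in isolation. The sketch below is not part of `casting_utils.py`: `SUPPORTED_DIALECTS` and `rendering_family` are hypothetical names, and the function only models the validation plus the `is_duck` split, not the cast rendering itself.

```python
SUPPORTED_DIALECTS = ("spark", "databricks", "duckdb")


def rendering_family(dialect: str) -> str:
    """Validate a dialect name and return the code path that renders it.

    Hypothetical helper mirroring the hunk above: 'databricks' is accepted as
    its own name but resolves to the Spark path, so the two cannot drift.
    """
    if dialect not in SUPPORTED_DIALECTS:
        msg = (
            f"Unsupported dialect: {dialect!r} "
            "(expected 'spark', 'databricks', or 'duckdb')"
        )
        raise ValueError(msg)
    # Past validation, only "duckdb vs not" matters.
    return "duckdb" if dialect == "duckdb" else "spark"


print(rendering_family("databricks"))  # spark
print(rendering_family("duckdb"))      # duckdb
```

Keeping the alias out of every branch except validation means a later change to the Spark rendering applies to Databricks automatically.
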

src/tablespec/dbt/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -17,6 +17,7 @@
 from __future__ import annotations

 from tablespec.dbt.materialization import Materialization, MaterializationPolicy
+from tablespec.dbt.profiles import PROFILE_TARGETS, render_profiles_yml
 from tablespec.dbt.project import DbtProjectError, generate_dbt_dag_project
 from tablespec.dbt.registry import NodeRegistry, NodeRegistryError, ResolvedNode
 from tablespec.dbt.renderer import DbtRefRenderer, UnknownRelationError
@@ -37,6 +38,7 @@

 __all__ = [
     "EMPTY_SELECTION",
+    "PROFILE_TARGETS",
     "DbtProjectError",
     "DbtRefRenderer",
     "Materialization",
@@ -51,6 +53,7 @@
     "emit_seeds",
     "generate_dbt_dag_project",
     "generate_dbt_project",
+    "render_profiles_yml",
     "render_seeds_config",
     "seed_column_types",
     "select_expression",

src/tablespec/dbt/contracts.py

Lines changed: 8 additions & 1 deletion
@@ -67,9 +67,13 @@
     "TIMESTAMP": "TIMESTAMP",
 }

+# Databricks SQL types == Spark SQL types, so the Databricks dialect reuses the
+# Spark contract type map (kept as a distinct, explicitly-selectable key so a
+# Databricks target renders its contract under its own name without drifting).
 _TYPE_BY_DIALECT: dict[str, dict[str, str]] = {
     "duckdb": _DUCKDB_TYPE,
     "spark": _SPARK_TYPE,
+    "databricks": _SPARK_TYPE,
 }


@@ -82,7 +86,10 @@ def contract_sql_type(contract: ColumnContract, *, dialect: str = "duckdb") -> s
     base type.
     """
     if dialect not in _TYPE_BY_DIALECT:
-        msg = f"Unsupported contract dialect: {dialect!r} (expected 'duckdb'/'spark')"
+        msg = (
+            f"Unsupported contract dialect: {dialect!r} "
+            "(expected 'duckdb'/'spark'/'databricks')"
+        )
         raise ValueError(msg)
     table = _TYPE_BY_DIALECT[dialect]
     dt = contract.data_type.upper()
