| 1 | +# ADR-007: Raw-to-Ingest Transforms as Committed SQL Artifacts |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Accepted — default path for all pipelines going forward. |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +The Bronze.Raw -> Bronze.Ingested transform (type casting, format-aware date/timestamp |
| 10 | +parsing, currency/empty-string normalization, dedup, and the write into the typed table) |
| 11 | +has historically been expressed as **PySpark Column-API logic** in `casting_utils.py`, |
| 12 | +applied row-by-row by a downstream orchestrator (pulseflow's Phase 8 `TypeConverter`) and |
| 13 | +mirrored by Great Expectations validation (`ExpectColumnValuesToCastToType`). |
| 14 | + |
| 15 | +This couples every pipeline to: |
| 16 | + |
| 17 | +- a Python/PySpark runtime (the library must be installed and importable on the cluster), |
| 18 | +- the Column-API capability-probing path (Spark classic vs Connect differences), and |
| 19 | +- a transform that is **not reviewable or runnable on its own** — you cannot read a diff of |
| 20 | + "what the ingest step does" without reading Python. |
| 21 | + |
| 22 | +For Databricks pipelines (Git folders + notebooks, DBR 17.x / Spark 4.0) the ingest step is |
| 23 | +naturally pure SQL. We already commit generated SQL for Gold tables (`SQLPlanGenerator`); |
| 24 | +the raw->ingest step should be no different. |
| 25 | + |
| 26 | +## Decision |
| 27 | + |
| 28 | +**The canonical raw->ingest transform is a committed, generated SQL artifact**, produced from |
| 29 | +the UMF spec and run independently of this library. Python's role is to *generate* the |
| 30 | +artifact, not to wrap it at runtime. |
| 31 | + |
| 32 | +- `casting_utils.cast_column_sql()` is the canonical cast expression (plain Spark SQL). It |
| 33 | + shares `convert_umf_format_to_spark()` with the runtime caster `cast_column_with_format()`, |
| 34 | + so date/timestamp formats are guaranteed identical. |
| 35 | +- `schemas/ingest_generator.generate_ingest_sql(umf)` emits the full artifact |
| 36 | + (Databricks/Delta dialect): a `raw_<table>` landing table (all `STRING` + ingest metadata), |
| 37 | + a typed `ingested_<table>` target table, and a cast + write transform that branches on |
| 38 | + `ingestion.mode` and `primary_key`: |
| 39 | + - incremental + primary_key -> dedup-latest then `MERGE` (upsert) |
| 40 | + - incremental, no primary_key -> blind `INSERT INTO` (with a warning comment) |
| 41 | + - snapshot -> `INSERT OVERWRITE` (drop/reload) |
| 42 | +- Surfaced as `tablespec generate <umf> -f ingest` and exported from the public API |
| 43 | + (`tablespec.generate_ingest_sql`). |
| 44 | +- Golden tests under `tests/golden/ingest_sql/` make every change to the transform a |
| 45 | + reviewable diff against checked-in `.expected.sql`. |
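The mode / `primary_key` branching described above can be sketched roughly as follows. This is an illustration only, not the code in `ingest_generator.py`: the function names, the `_ingested_at` ordering column, and the omission of per-column casts are all assumptions made to keep the example short.

```python
from __future__ import annotations


def write_strategy(mode: str, primary_key: list[str]) -> str:
    """Pick the write pattern from ingestion.mode and primary_key (hypothetical helper)."""
    if mode == "snapshot":
        return "overwrite"
    if mode == "incremental":
        return "merge" if primary_key else "insert"
    raise ValueError(f"unsupported ingestion mode: {mode!r}")


def render_write_sql(
    table: str,
    columns: list[str],
    mode: str,
    primary_key: list[str],
    order_col: str = "_ingested_at",  # assumed name of an ingest-metadata column
) -> str:
    """Render the write statement; casts are left out to keep the sketch small."""
    cols = ", ".join(columns)
    source, target = f"raw_{table}", f"ingested_{table}"
    strategy = write_strategy(mode, primary_key)

    if strategy == "overwrite":
        # snapshot: replace the whole target with the current load
        return f"INSERT OVERWRITE {target}\nSELECT {cols} FROM {source};"
    if strategy == "insert":
        # incremental without a key: nothing to match on, so append blindly
        return (
            "-- WARNING: no primary_key; reruns may insert duplicates\n"
            f"INSERT INTO {target}\nSELECT {cols} FROM {source};"
        )

    # incremental with a key: keep the latest row per key, then upsert
    keys = ", ".join(primary_key)
    on = " AND ".join(f"t.{k} = s.{k}" for k in primary_key)
    return (
        f"MERGE INTO {target} AS t\n"
        "USING (\n"
        f"  SELECT {cols} FROM (\n"
        f"    SELECT {cols},\n"
        f"           ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY {order_col} DESC) AS _rn\n"
        f"    FROM {source}\n"
        "  ) AS ranked\n"
        "  WHERE _rn = 1\n"
        ") AS s\n"
        f"ON {on}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *;"
    )


if __name__ == "__main__":
    print(render_write_sql("orders", ["order_id", "amount"], "incremental", ["order_id"]))
```

Selecting the named columns again in the outer query keeps the helper `_rn` column out of the `MERGE` source, so `UPDATE SET *` and `INSERT *` see only target columns.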
| 46 | + |
| 47 | +### Going forward |
| 48 | + |
| 49 | +1. **All pipelines** generate and commit the raw->ingest `.sql` artifact (like Gold), and the |
| 50 | + warehouse runs the SQL. No `pip install tablespec[spark]` on the cluster for ingest. |
| 51 | +2. **pulseflow's `TypeConverter`** is migrated to execute the generated SQL (against a temp |
| 52 | + view / Delta table) instead of applying Column-API casts. |
| 53 | +3. **Single source of truth** is preserved: the runtime caster and GX validation will |
| 54 | +   converge on `cast_column_sql` (e.g. via `selectExpr`), so "validation tests exactly what |
| 55 | +   ingestion does" continues to hold, with the additional guarantee that the committed SQL |
| 56 | +   is exactly what runs. |
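To illustrate why a single SQL cast expression can serve both the committed artifact and a runtime caller, here is a minimal sketch. The helper names (`to_spark_format`, `cast_expr`) and the UMF-style format tokens (`YYYY`, `DD`, `HH24`, `MI`, `SS`) are assumptions for the example and are not the signatures of `convert_umf_format_to_spark()` or `cast_column_sql()`.

```python
from __future__ import annotations


def to_spark_format(fmt: str) -> str:
    """Translate assumed spec-style tokens into Spark datetime pattern letters."""
    # Hypothetical token set; the real UMF format vocabulary may differ.
    for src, dst in (("YYYY", "yyyy"), ("DD", "dd"), ("HH24", "HH"), ("MI", "mm"), ("SS", "ss")):
        fmt = fmt.replace(src, dst)
    return fmt


def cast_expr(column: str, target_type: str, fmt: str | None = None) -> str:
    """Build one Spark SQL expression that normalizes and casts a raw STRING column."""
    # Treat empty or whitespace-only strings as NULL before casting.
    cleaned = f"NULLIF(TRIM({column}), '')"
    spark_type = target_type.upper()
    if spark_type == "DATE" and fmt:
        expr = f"to_date({cleaned}, '{to_spark_format(fmt)}')"
    elif spark_type == "TIMESTAMP" and fmt:
        expr = f"to_timestamp({cleaned}, '{to_spark_format(fmt)}')"
    else:
        expr = f"CAST({cleaned} AS {spark_type})"
    return f"{expr} AS {column}"


if __name__ == "__main__":
    expr = cast_expr("order_date", "date", "YYYY-MM-DD")
    # The same string can be embedded in a generated SELECT ...
    print(f"SELECT {expr} FROM raw_orders")
    # ... or handed to a DataFrame at runtime, e.g. df.selectExpr(expr),
    # so both paths evaluate an identical expression.
```

Because both consumers receive the same string, any change to the cast shows up once, in the generated SQL diff.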
| 57 | + |
| 58 | +## Consequences |
| 59 | + |
| 60 | +- **Positive:** ingest logic is reviewable and independently runnable; no library runtime on |
| 61 | + the cluster; one canonical cast; diff-based review via golden files; Databricks-native. |
| 62 | +- **Negative / follow-up:** `cast_column_sql` currently covers only the common cast paths |
| 63 | +  (those of `cast_column_with_format`); the richer fallback casters (flexible-format |
| 64 | +  coalesce, epoch-ms, Excel-serial) and snapshot "latest file" filtering are follow-ups. |
| 65 | +  The runtime caster has not yet been refactored to consume `cast_column_sql`; until it is, |
| 66 | +  tests keep the two in parity. |
| 67 | +- **Type fidelity:** the typed target DDL uses Spark-correct types (e.g. `DATETIME -> |
| 68 | + TIMESTAMP`), unlike `generate_sql_ddl` which emits a literal `DATETIME`. |
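The type-fidelity point can be pictured with a small sketch of a typed-target DDL builder. Only the `DATETIME -> TIMESTAMP` mapping comes from this ADR; the other map entries, the function names, and the `USING DELTA` clause are assumptions for illustration.

```python
from __future__ import annotations

# Hypothetical spec-type to Spark-type map; only DATETIME -> TIMESTAMP is stated in the ADR.
SPARK_TYPES = {
    "STRING": "STRING",
    "INTEGER": "INT",
    "DATE": "DATE",
    "DATETIME": "TIMESTAMP",
}


def spark_type(spec_type: str) -> str:
    """Resolve a spec type to a Spark SQL type, failing loudly on unknown types."""
    key = spec_type.upper()
    if key not in SPARK_TYPES:
        raise ValueError(f"no Spark mapping for type {spec_type!r}")
    return SPARK_TYPES[key]


def typed_table_ddl(table: str, columns: dict[str, str]) -> str:
    """Render a CREATE TABLE statement for the typed ingested_<table> target."""
    body = ",\n".join(f"  {name} {spark_type(t)}" for name, t in columns.items())
    return f"CREATE TABLE IF NOT EXISTS ingested_{table} (\n{body}\n) USING DELTA;"


if __name__ == "__main__":
    print(typed_table_ddl("orders", {"order_id": "INTEGER", "created_at": "DATETIME"}))
```

Raising on an unmapped type, rather than passing the literal through, is one way to avoid emitting a type name Spark does not accept.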
| 69 | + |
| 70 | +## Related |
| 71 | + |
| 72 | +- ADR-002 (GX 1.6 format), ADR-005 (unified expectation model — Bronze.Raw/Ingested stages). |
| 73 | +- `src/tablespec/schemas/ingest_generator.py`, `src/tablespec/casting_utils.py`. |