Commit e9f171b

easel and claude committed
ADR-007: raw->ingest as committed SQL artifact (default path)
Records the decision that the Bronze.Raw -> Bronze.Ingested transform is a generated, committed, independently-runnable SQL artifact (Databricks/Delta) -- the default path for all pipelines going forward -- with cast_column_sql as the canonical cast and golden tests for diff-based review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 844043d commit e9f171b

1 file changed

Lines changed: 73 additions & 0 deletions

@@ -0,0 +1,73 @@
# ADR-007: Raw-to-Ingest Transforms as Committed SQL Artifacts

## Status

Accepted: default path for all pipelines going forward.

## Context

The Bronze.Raw -> Bronze.Ingested transform (type casting, format-aware date/timestamp
parsing, currency/empty-string normalization, dedup, and the write into the typed table)
has historically been expressed as **PySpark Column-API logic** in `casting_utils.py`,
applied row-by-row by a downstream orchestrator (pulseflow's Phase 8 `TypeConverter`) and
mirrored by Great Expectations validation (`ExpectColumnValuesToCastToType`).

This couples every pipeline to:

- a Python/PySpark runtime (the library must be installed and importable on the cluster),
- the Column-API capability-probing path (Spark classic vs Connect differences), and
- a transform that is **not reviewable or runnable on its own**: you cannot read a diff of
  "what the ingest step does" without reading Python.

For Databricks pipelines (Git folders + notebooks, DBR 17.x / Spark 4.0) the ingest step is
naturally pure SQL. We already commit generated SQL for Gold tables (`SQLPlanGenerator`);
the raw->ingest step should be no different.

## Decision

**The canonical raw->ingest transform is a committed, generated SQL artifact**, produced from
the UMF spec and run independently of this library. Python's role is to *generate* the
artifact, not to wrap it at runtime.

- `casting_utils.cast_column_sql()` is the canonical cast expression (plain Spark SQL). It
  shares `convert_umf_format_to_spark()` with the runtime caster `cast_column_with_format()`,
  so date/timestamp formats are guaranteed identical.
- `schemas/ingest_generator.generate_ingest_sql(umf)` emits the full artifact
  (Databricks/Delta dialect): a `raw_<table>` landing table (all `STRING` + ingest metadata),
  a typed `ingested_<table>` target table, and a cast + write transform that branches on
  `ingestion.mode` and `primary_key`:
  - incremental + primary_key -> dedup-latest then `MERGE` (upsert)
  - incremental, no primary_key -> blind `INSERT INTO` (with a warning comment)
  - snapshot -> `INSERT OVERWRITE` (drop/reload)
- Surfaced as `tablespec generate <umf> -f ingest` and exported from the public API
  (`tablespec.generate_ingest_sql`).
- Golden tests under `tests/golden/ingest_sql/` make every change to the transform a
  reviewable diff against checked-in `.expected.sql`.
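
The three-way branch on `ingestion.mode` and `primary_key` can be pictured with a small sketch. This is not the code in `ingest_generator.py`: the function name `sketch_write_statement`, its parameters, and the `_ingested_at` ordering column are assumptions made for illustration, and the cast `SELECT` passed in is assumed to carry that ordering column through.

```python
from typing import Optional, Sequence


def sketch_write_statement(
    table: str,
    select_sql: str,
    mode: str,
    primary_key: Optional[Sequence[str]] = None,
    order_col: str = "_ingested_at",  # hypothetical ingest-metadata column
) -> str:
    """Return the write statement for one table (illustrative only)."""
    target = f"ingested_{table}"
    if mode == "snapshot":
        # Drop/reload: replace the whole target with the freshly cast rows.
        return f"INSERT OVERWRITE TABLE {target}\n{select_sql}"
    if mode != "incremental":
        raise ValueError(f"unsupported ingestion mode: {mode!r}")
    if not primary_key:
        # No key to match on, so rows are appended blindly; flag it in the SQL.
        return (
            "-- WARNING: no primary_key; rows are appended without dedup\n"
            f"INSERT INTO {target}\n{select_sql}"
        )
    keys = ", ".join(primary_key)
    on_clause = " AND ".join(f"tgt.{k} = src.{k}" for k in primary_key)
    # Keep only the latest row per key, then upsert it into the typed table.
    return (
        f"MERGE INTO {target} AS tgt\n"
        "USING (\n"
        "  SELECT * EXCEPT (_rn) FROM (\n"
        "    SELECT s.*, ROW_NUMBER() OVER (\n"
        f"      PARTITION BY {keys} ORDER BY {order_col} DESC\n"
        "    ) AS _rn\n"
        f"    FROM ({select_sql}) AS s\n"
        "  ) AS ranked WHERE _rn = 1\n"
        ") AS src\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


cast_select = "SELECT CAST(id AS BIGINT) AS id, _ingested_at FROM raw_orders"
merge_sql = sketch_write_statement("orders", cast_select, "incremental", ["id"])
```

Because the function only returns a string, its output can be committed as a `.sql` file and read in review without running any Python.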

### Going forward

1. **All pipelines** generate and commit the raw->ingest `.sql` artifact (like Gold), and the
   warehouse runs the SQL. No `pip install tablespec[spark]` on the cluster for ingest.
2. **pulseflow's `TypeConverter`** is migrated to execute the generated SQL (against a temp
   view / Delta table) instead of applying Column-API casts.
3. **Single source of truth** is preserved: the runtime caster and GX validation converge on
   `cast_column_sql` (e.g. via `selectExpr`), so "validation tests exactly what ingestion
   does" continues to hold, now with the additional guarantee that the committed SQL is
   exactly what runs.
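
To make item 3 concrete, here is a rough sketch of what a per-column SQL cast expression might look like when built as a plain string. The real `cast_column_sql` signature is not shown in this ADR, so `sketch_cast_expr`, its parameters, the set of type names, and the `DECIMAL(18,2)` precision are all assumptions. The format argument is assumed to already be a Spark datetime pattern.

```python
from typing import Optional


def sketch_cast_expr(column: str, umf_type: str, fmt: Optional[str] = None) -> str:
    """Build one Spark SQL cast expression for a STRING source column (illustrative)."""
    col = f"`{column}`"
    # Empty or whitespace-only strings become NULL before any cast.
    cleaned = f"NULLIF(TRIM({col}), '')"
    kind = umf_type.upper()
    if kind == "DATE":
        expr = f"to_date({cleaned}, '{fmt}')" if fmt else f"CAST({cleaned} AS DATE)"
    elif kind in ("DATETIME", "TIMESTAMP"):
        expr = (
            f"to_timestamp({cleaned}, '{fmt}')"
            if fmt
            else f"CAST({cleaned} AS TIMESTAMP)"
        )
    elif kind == "DECIMAL":
        # Strip currency symbols and thousands separators, e.g. "$1,234.50".
        expr = f"CAST(regexp_replace({cleaned}, '[$,]', '') AS DECIMAL(18,2))"
    elif kind == "INTEGER":
        expr = f"CAST({cleaned} AS INT)"
    else:
        # Unknown types stay as normalized strings in this sketch.
        expr = cleaned
    return f"{expr} AS {col}"


exprs = [
    sketch_cast_expr("order_date", "DATE", "yyyy-MM-dd"),
    sketch_cast_expr("amount", "DECIMAL"),
]
# The same strings can be embedded in a committed SELECT ...
select_sql = "SELECT " + ", ".join(exprs) + " FROM raw_orders"
```

The same expression strings could also be handed to PySpark's `DataFrame.selectExpr(*exprs)`, which is one way a runtime caster or validation step could evaluate the identical cast.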

## Consequences

- **Positive:** ingest logic is reviewable and independently runnable; no library runtime on
  the cluster; one canonical cast; diff-based review via golden files; Databricks-native.
- **Negative / follow-up:** `cast_column_sql` currently covers the common cast paths
  (`cast_column_with_format`); the richer fallback casters (flexible-format coalesce,
  epoch-ms, Excel-serial) and snapshot "latest file" filtering are follow-ups. The runtime
  caster has not yet been refactored to consume `cast_column_sql`; until it is, the two are
  kept in parity by tests.
- **Type fidelity:** the typed target DDL uses Spark-correct types (e.g. `DATETIME ->
  TIMESTAMP`), unlike `generate_sql_ddl`, which emits a literal `DATETIME`.
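
The diff-based review listed among the positive consequences comes down to comparing freshly generated SQL with a checked-in golden file. A minimal sketch using only the standard library follows. The helper name `golden_diff` is hypothetical; the repository's actual golden tests may be structured differently.

```python
import difflib
from pathlib import Path


def golden_diff(generated_sql: str, expected_path: Path) -> str:
    """Return a unified diff between the golden file and freshly generated SQL.

    An empty string means the generated artifact matches the checked-in file.
    """
    expected = expected_path.read_text(encoding="utf-8")
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        generated_sql.splitlines(keepends=True),
        fromfile=str(expected_path),
        tofile="generated",
    )
    return "".join(diff)
```

A test could then assert `golden_diff(sql, path) == ""` and print the returned diff on failure, so an unintended change to the transform surfaces as a readable SQL diff.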

## Related

- ADR-002 (GX 1.6 format), ADR-005 (unified expectation model: Bronze.Raw/Ingested stages).
- `src/tablespec/schemas/ingest_generator.py`, `src/tablespec/casting_utils.py`.
