Commit 7b63a3e

feat: bundle databricks delta workflow (#19)
* feat: bundle databricks delta workflow
* fix: use direct engine for delta bundle validation
1 parent 3d4ccf9 commit 7b63a3e

21 files changed

Lines changed: 395 additions & 32 deletions

README.md

Lines changed: 17 additions & 9 deletions
@@ -8,7 +8,7 @@
 Convex CDC sync engine with two supported target families:

 - `S3/export`: append-only raw parquet -> current-state staging parquet -> S3 publish
-- `Databricks/native`: bronze Delta CDC landing -> Lakeflow `AUTO CDC` -> silver current-state Delta tables
+- `Databricks Delta`: bronze Delta CDC landing -> Lakeflow `AUTO CDC` -> silver current-state Delta tables

 The source-side behavior intentionally stays close to the public Convex/Fivetran
 extraction model:
@@ -29,7 +29,7 @@ flowchart TD
 S3[crates/convex-export-s3]
 AWS[platform/aws]
 DBS3[platform/databricks/s3]
-DBN[platform/databricks/native]
+DBN[platform/databricks/delta]
 Root --> Inspect
 Root --> CLI
 Root --> Core
@@ -48,7 +48,7 @@ Read the repo by layer:
 - [`platform/aws/README.md`](platform/aws/README.md): AWS assets for publishing and downstream readers
 - [`platform/databricks/README.md`](platform/databricks/README.md): Databricks target family overview
 - [`platform/databricks/s3/README.md`](platform/databricks/s3/README.md): Databricks consuming the S3 export path
-- [`platform/databricks/native/README.md`](platform/databricks/native/README.md): Databricks-native bronze/silver landing
+- [`platform/databricks/delta/README.md`](platform/databricks/delta/README.md): Databricks Delta bronze/silver landing

 ## Install

@@ -85,7 +85,7 @@ flowchart LR
 C[Convex]
 E[shared sync semantics]
 S3[S3 export path]
-DBN[Databricks native path]
+DBN[Databricks Delta path]
 DBS3[Databricks over S3 path]
 C --> E
 E --> S3
@@ -126,20 +126,28 @@ Or via `just`:
 - `just publish-s3 --bucket your-bucket`
 - `just run --bucket your-bucket`

-### `Databricks/native`
+### `Databricks Delta`

-Checked-in Databricks-native assets:
+Checked-in Databricks Delta assets:

-- `platform/databricks/native/extractor/convex_cdc_job.py`
-- `platform/databricks/native/sql/bootstrap/`
-- `platform/databricks/native/lakeflow/bronze_to_silver_template.sql`
+- `platform/databricks/delta/databricks.yml`
+- `platform/databricks/delta/resources/convex_delta_extract.job.yml`
+- `platform/databricks/delta/extractor/convex_cdc_job.py`
+- `platform/databricks/delta/sql/bootstrap/`
+- `platform/databricks/delta/lakeflow/bronze_to_silver_template.sql`

 Runtime split:

 1. a Databricks job runs the extractor and appends bronze CDC rows
 2. checkpoint rows land in the control schema
 3. Lakeflow `AUTO CDC` materializes silver current-state tables

+Packaged entrypoints:
+
+- `just databricks-delta-deploy`
+- `just databricks-delta-run`
+- `just databricks-delta-smoke <warehouse_id>`
+
 ### `Databricks over S3`

 This variation keeps the existing Rust exporter and S3 publish loop, then adds:

docs/architecture.md

Lines changed: 6 additions & 6 deletions
@@ -6,7 +6,7 @@ This repo has three explicit layers:

 - `core`: Convex extraction and checkpoint semantics
 - `target-s3`: raw parquet, staging parquet, and S3 publish
-- `platform/databricks`: Databricks assets for both S3-backed consumption and Databricks-native landing
+- `platform/databricks`: Databricks assets for both S3-backed consumption and Databricks Delta landing

 The shared extraction logic stays target-agnostic. Targets decide how event
 batches are durably written and what downstream shape they expose.
@@ -67,7 +67,7 @@ Use it when you want:
 - a target-agnostic export contract
 - another platform to consume S3 directly

-## `Databricks/native`
+## `Databricks/delta`

 ```text
 Convex
@@ -87,7 +87,7 @@ Owned pieces:
 Use it when:

 - Databricks is the primary serving layer
-- you want Databricks-native CDC reconstruction
+- you want Databricks Delta CDC reconstruction
 - downstream consumers can read Unity Catalog tables directly

 ## Data Shapes
@@ -121,7 +121,7 @@ Append-only CDC landing in Delta.
 Current-state Delta tables derived from bronze CDC.

 - one current row per source key
-- resolved with Databricks-native CDC semantics
+- resolved with Databricks Delta CDC semantics

 ## Checkpoints

@@ -140,7 +140,7 @@ Rules:
 Target storage differs:

 - `S3/export`: file-backed JSON
-- `Databricks/native`: Delta control table
+- `Databricks/delta`: Delta control table

 ## Boundary

@@ -149,7 +149,7 @@ This repo should own:
 - Convex extraction
 - checkpoint semantics
 - S3/export target code
-- Databricks-native landing assets
+- Databricks Delta landing assets

 Downstream systems should own:

docs/public-reference-map.md

Lines changed: 1 addition & 1 deletion
@@ -119,4 +119,4 @@ Today we intentionally differ from the Fivetran connector in these ways:
 - we support target-owned landing contracts instead of one destination runtime
 - the S3/export path writes Parquet datasets instead of pushing rows into a Fivetran destination
 - the S3/export path materializes local `staging` tables instead of relying on warehouse-native managed tables
-- the Databricks-native path lands bronze CDC rows directly and relies on Lakeflow `AUTO CDC` for current-state resolution
+- the Databricks Delta path lands bronze CDC rows directly and relies on Lakeflow `AUTO CDC` for current-state resolution
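
The "current-state resolution" that this path delegates to Lakeflow `AUTO CDC` can be illustrated in plain Python. This is only a sketch of the general idea (keep the latest change per key, then drop keys whose latest change is a delete). It is not Lakeflow's implementation, and the field names `key`, `seq`, `op`, and `row` are hypothetical rather than taken from the repo's bronze schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass(frozen=True)
class CdcEvent:
    """One bronze CDC row. All field names are assumptions for illustration."""
    key: str              # source key that identifies the document
    seq: int              # ordering column; a larger value means a newer change
    op: str               # "upsert" or "delete"
    row: Dict[str, Any]   # payload as of this change


def resolve_current_state(events: Iterable[CdcEvent]) -> Dict[str, Dict[str, Any]]:
    """Collapse an append-only change log into one current row per key."""
    latest: Dict[str, CdcEvent] = {}
    for event in events:
        seen = latest.get(event.key)
        # Strictly newer wins, so arrival order does not matter for distinct seq values.
        if seen is None or event.seq > seen.seq:
            latest[event.key] = event
    # A key whose newest change is a delete has no current row.
    return {key: event.row for key, event in latest.items() if event.op != "delete"}


bronze = [
    CdcEvent("a", 1, "upsert", {"name": "first"}),
    CdcEvent("b", 2, "upsert", {"name": "other"}),
    CdcEvent("a", 4, "delete", {}),
    CdcEvent("b", 3, "upsert", {"name": "other-v2"}),
]
silver = resolve_current_state(bronze)
```

Because only the ordering column decides which change wins, replaying the same bronze rows in a different order yields the same silver state, which is the property an append-only landing layer relies on.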

docs/release-artifacts.md

Lines changed: 6 additions & 4 deletions
@@ -39,9 +39,11 @@ as separate binary assets:

 - `platform/aws/`
 - `platform/databricks/s3/`
-- `platform/databricks/native/extractor/convex_cdc_job.py`
-- `platform/databricks/native/sql/bootstrap/`
-- `platform/databricks/native/lakeflow/`
+- `platform/databricks/delta/databricks.yml`
+- `platform/databricks/delta/resources/`
+- `platform/databricks/delta/extractor/convex_cdc_job.py`
+- `platform/databricks/delta/sql/bootstrap/`
+- `platform/databricks/delta/lakeflow/`

 ## Explicit Non-Artifacts

@@ -59,4 +61,4 @@ Not in the first release slice:

 - wider platform matrix beyond `linux-amd64`
 - container images for the S3/export runtime
-- separate packaging for Databricks-native assets
+- separate packaging for Databricks Delta assets

justfile

Lines changed: 9 additions & 0 deletions
@@ -75,6 +75,15 @@ databricks-sync-staging-views *args:
 databricks-apply-sql-dir profile warehouse_id sql_dir:
     ./scripts/apply-databricks-sql-dir.sh {{profile}} {{warehouse_id}} {{sql_dir}}

+databricks-delta-deploy profile="DEFAULT" target="dev":
+    ./scripts/deploy-databricks-delta.sh {{profile}} {{target}}
+
+databricks-delta-run profile="DEFAULT" target="dev" job_key="convex_delta_extract":
+    ./scripts/run-databricks-delta-job.sh {{profile}} {{target}} {{job_key}}
+
+databricks-delta-smoke warehouse_id profile="DEFAULT" target="dev":
+    ./scripts/run-databricks-delta-smoke.sh {{profile}} {{target}} {{warehouse_id}}
+
 # Install
 install-dev:
     ./install.sh --mode dev --force

platform/databricks/README.md

Lines changed: 10 additions & 9 deletions
@@ -3,16 +3,16 @@
 Databricks assets grouped by target family:

 - `s3/`: Databricks consuming the existing S3 export path
-- `native/`: Databricks-first assets where Convex changes land directly in
+- `delta/`: Databricks-first assets where Convex changes land directly in
 Unity Catalog Delta tables

 ```mermaid
 flowchart LR
 S3[S3-backed path]
-Native[Databricks-native path]
+Delta[Databricks Delta path]
 S3 --> Views[Unity Catalog views over S3 parquet]
-Native --> Bronze[bronze CDC Delta]
-Native --> Silver[silver current-state Delta]
+Delta --> Bronze[bronze CDC Delta]
+Delta --> Silver[silver current-state Delta]
 ```

 Use these assets the same way as the AWS templates:
@@ -42,17 +42,18 @@ external location before `read_files(...)` views are applied.

 Read more: [`platform/databricks/s3/README.md`](s3/README.md)

-## `native/`
+## `delta/`

 The Databricks-first assets are the starting point for direct Delta landing.

-- `native/extractor/`: a Databricks job entrypoint that mirrors the current
+- `delta/extractor/`: a Databricks job entrypoint that mirrors the current
 Convex snapshot/delta checkpoint logic and writes bronze CDC tables
-- `native/lakeflow/`: Lakeflow SQL templates that turn bronze CDC tables into
+- `delta/lakeflow/`: Lakeflow SQL templates that turn bronze CDC tables into
 silver current-state tables
-- `native/sql/`: bootstrap DDL for schemas and control tables
+- `delta/sql/`: bootstrap DDL for schemas and control tables
+- `delta/resources/`: Databricks bundle job definitions

 The provider is configured from `~/.databrickscfg` by default, typically using
 the `DEFAULT` profile.

-Read more: [`platform/databricks/native/README.md`](native/README.md)
+Read more: [`platform/databricks/delta/README.md`](delta/README.md)
Lines changed: 36 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Databricks-native Target
+# Databricks Delta Target

 Databricks as the primary landing plane:

@@ -27,13 +27,47 @@ cannot overwrite key, ordering, or delete semantics.

 ## Layout

+- `databricks.yml`: Databricks bundle entrypoint
+- `resources/`: Databricks bundle resources for the extractor job
 - `extractor/convex_cdc_job.py`: Databricks job entrypoint
 - `sql/bootstrap/`: ordered bootstrap DDL for configurable control/bronze/silver schemas and checkpoint table
 - `lakeflow/bronze_to_silver_template.sql`: per-table Lakeflow template

-Apply the bootstrap directory with:
+## Deploy Surface
+
+Bundle lifecycle:
+
+- `scripts/deploy-databricks-delta.sh <profile> <target>`
+- `scripts/run-databricks-delta-job.sh <profile> <target> [job_key]`
+- `scripts/run-databricks-delta-smoke.sh <profile> <target> <warehouse_id>`
+
+These scripts default `DATABRICKS_BUNDLE_ENGINE=direct` so deployment does not
+depend on Terraform downloads.
+
+Bootstrap SQL can still be applied directly with:

 - `scripts/apply-databricks-sql-dir.sh <profile> <warehouse_id> <rendered_sql_dir>`

 The extractor mirrors the Rust source/checkpoint logic and does not depend on
 the local parquet/S3 path.
+
+## Typical Flow
+
+```mermaid
+flowchart LR
+B[bundle validate/deploy]
+J[convex_delta_extract job]
+C[checkpoint table]
+T[bronze CDC tables]
+B --> J
+J --> C
+J --> T
+```
+
+Recommended operator entrypoints:
+
+```bash
+just databricks-delta-deploy
+just databricks-delta-run
+just databricks-delta-smoke <warehouse_id>
+```
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+bundle:
+  name: convex-streaming-olap-export-delta
+  uuid: 2d5d8d8e-cdb2-41e4-bc7e-7ffae07f628d
+
+include:
+  - resources/*.yml
+
+variables:
+  convex_deployment_url:
+    description: Convex deployment root URL.
+  convex_deploy_key:
+    description: Convex deploy key used by the extractor.
+  source_id:
+    description: Source identifier stored in the checkpoint table.
+  table_name:
+    description: Optional table filter for smoke/debug runs.
+  catalog:
+    description: Unity Catalog catalog that owns control and bronze tables.
+  control_schema:
+    description: Schema that owns the checkpoint table.
+  bronze_schema:
+    description: Schema that owns bronze CDC tables.
+  checkpoint_table:
+    description: Checkpoint table name.
+
+targets:
+  dev:
+    mode: development
+    default: true
+    workspace:
+      profile: DEFAULT
+    variables:
+      source_id: convex-streaming-olap-export-dev
+      table_name: ""
+      catalog: workspace
+      control_schema: convex_streaming_olap_export_control
+      bronze_schema: convex_streaming_olap_export_bronze
+      checkpoint_table: connector_checkpoint
+  prod:
+    mode: production
+    workspace:
+      profile: DEFAULT
+    variables:
+      source_id: convex-streaming-olap-export
+      table_name: ""
+      catalog: workspace
+      control_schema: convex_streaming_olap_export_control
+      bronze_schema: convex_streaming_olap_export_bronze
+      checkpoint_table: connector_checkpoint
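
The bundle above declares `convex_deployment_url` and `convex_deploy_key` without a per-target value, so they presumably have to be supplied at deploy time. The following is a minimal sketch of how a wrapper might assemble the `databricks bundle` invocation. The helper names are hypothetical and the checked-in `scripts/*.sh` wrappers may work differently. It relies on the CLI's `--target` and `--profile` flags and on `BUNDLE_VAR_<name>` environment variables for bundle variables; `DATABRICKS_BUNDLE_ENGINE=direct` is the default described in this commit's README.

```python
import subprocess
from typing import Dict, List, Mapping, Optional


def bundle_command(action: str, target: str = "dev", profile: str = "DEFAULT",
                   job_key: Optional[str] = None) -> List[str]:
    """Build the argv for `databricks bundle <action>` (hypothetical helper)."""
    if action not in {"validate", "deploy", "run"}:
        raise ValueError(f"unsupported bundle action: {action}")
    cmd = ["databricks", "bundle", action, "--target", target, "--profile", profile]
    if action == "run":
        if not job_key:
            raise ValueError("`bundle run` needs a resource key")
        cmd.append(job_key)
    return cmd


def bundle_env(base_env: Mapping[str, str], deployment_url: str,
               deploy_key: str) -> Dict[str, str]:
    """Copy the environment and add the bundle variables that have no target default."""
    env = dict(base_env)
    # Respect an engine the operator already chose; otherwise default to direct.
    env.setdefault("DATABRICKS_BUNDLE_ENGINE", "direct")
    env["BUNDLE_VAR_convex_deployment_url"] = deployment_url
    env["BUNDLE_VAR_convex_deploy_key"] = deploy_key
    return env


def validate_and_deploy(base_env: Mapping[str, str], deployment_url: str,
                        deploy_key: str, target: str = "dev") -> None:
    """Run validate, then deploy. Not called here; it needs the Databricks CLI."""
    env = bundle_env(base_env, deployment_url, deploy_key)
    for action in ("validate", "deploy"):
        subprocess.run(bundle_command(action, target=target), env=env, check=True)
```

Passing the deploy key through the environment rather than a `--var` argument keeps it out of the process argument list, which is one reason a wrapper might prefer this shape.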

platform/databricks/native/extractor/README.md renamed to platform/databricks/delta/extractor/README.md

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Convex CDC Job

-`convex_cdc_job.py` is the Databricks job entrypoint for the Databricks-native
+`convex_cdc_job.py` is the Databricks job entrypoint for the Databricks Delta
 target family.

 It mirrors the current Rust source/checkpoint behavior:
@@ -24,3 +24,6 @@ It mirrors the current Rust source/checkpoint behavior:
 - `DATABRICKS_CONTROL_SCHEMA`: defaults to `control`
 - `DATABRICKS_BRONZE_SCHEMA`: defaults to `bronze`
 - `DATABRICKS_CHECKPOINT_TABLE`: defaults to `connector_checkpoint`
+
+In the bundled Databricks Delta path, these are usually passed as task
+parameters rather than exported manually.
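
The defaults listed in this hunk, and the note that task parameters usually take their place, suggest a simple precedence: task parameter, then environment variable, then built-in default. The sketch below shows that precedence under stated assumptions. `TargetConfig`, `load_target_config`, and the shape of `task_params` are hypothetical, and the real `convex_cdc_job.py` may resolve its settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Environment variable names and defaults as listed in the extractor README.
DEFAULTS = {
    "DATABRICKS_CONTROL_SCHEMA": "control",
    "DATABRICKS_BRONZE_SCHEMA": "bronze",
    "DATABRICKS_CHECKPOINT_TABLE": "connector_checkpoint",
}


@dataclass(frozen=True)
class TargetConfig:
    control_schema: str
    bronze_schema: str
    checkpoint_table: str


def load_target_config(task_params: Optional[Mapping[str, str]] = None,
                       env: Optional[Mapping[str, str]] = None) -> TargetConfig:
    """Resolve settings: task parameter, then environment, then default."""
    params = task_params or {}
    environ = os.environ if env is None else env

    def pick(name: str) -> str:
        # Empty strings are treated as unset (an assumption of this sketch).
        return params.get(name) or environ.get(name) or DEFAULTS[name]

    return TargetConfig(
        control_schema=pick("DATABRICKS_CONTROL_SCHEMA"),
        bronze_schema=pick("DATABRICKS_BRONZE_SCHEMA"),
        checkpoint_table=pick("DATABRICKS_CHECKPOINT_TABLE"),
    )


# Example: the bundle's dev target overrides the schema names but not the table.
config = load_target_config(
    task_params={
        "DATABRICKS_CONTROL_SCHEMA": "convex_streaming_olap_export_control",
        "DATABRICKS_BRONZE_SCHEMA": "convex_streaming_olap_export_bronze",
    },
    env={},
)
```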

platform/databricks/native/extractor/convex_cdc_job.py renamed to platform/databricks/delta/extractor/convex_cdc_job.py

File renamed without changes.
