|
1 | 1 | # convex-sync-kit |
2 | 2 |
|
3 | | -- `Language`:  |
4 | | -- `Source`:  |
5 | | -- `Targets`:   |
6 | | -- `Infra`:  |
| 3 | +Recurring Convex export pipelines for local analytics, Databricks, and downstream systems like Palantir Foundry. |
7 | 4 |
|
8 | | -Convex CDC sync engine with two supported target families: |
| 5 | +[Ask DeepWiki](https://deepwiki.com/shpitdev/convex-sync-kit) |
| 6 | +[Releases](https://github.com/shpitdev/convex-sync-kit/releases) |
| 7 | +[License: MIT](LICENSE) |
9 | 8 |
|
10 | | -- `S3/export`: append-only raw parquet -> current-state staging parquet -> S3 publish |
11 | | -- `Databricks Delta`: bronze Delta CDC landing -> Lakeflow `AUTO CDC` -> silver current-state Delta tables |
| 9 | + |
| 10 | + |
| 11 | + |
| 12 | + |
| 13 | + |
12 | 14 |
|
13 | | -The source-side behavior intentionally stays close to the public Convex/Fivetran |
14 | | -extraction model: |
15 | | - |
16 | | -- bootstrap with `list_snapshot` |
17 | | -- resume incomplete snapshots from checkpoint |
18 | | -- continue with `document_deltas` |
19 | | -- advance checkpoints only after durable writes succeed |
20 | | - |
21 | | -## Repo Map |
| 15 | +## Choose Your Path |
22 | 16 |
|
23 | 17 | ```mermaid |
24 | 18 | flowchart TD |
25 | | - Root[convex-sync-kit] |
26 | | - Inspect[apps/convex-inspect] |
27 | | - CLI[apps/convex-sync] |
28 | | - Core[crates/convex-sync-core] |
29 | | - S3[crates/convex-export-s3] |
30 | | - AWS[platform/aws] |
31 | | - DBS3[platform/databricks/s3] |
32 | | - DBN[platform/databricks/delta] |
33 | | - Root --> Inspect |
34 | | - Root --> CLI |
35 | | - Root --> Core |
36 | | - Root --> S3 |
37 | | - Root --> AWS |
38 | | - Root --> DBS3 |
39 | | - Root --> DBN |
| 19 | + A[What are you trying to do?] |
| 20 | + A --> B[One-off manual export] |
| 21 | + A --> C[Recurring local analysis] |
| 22 | + A --> D[Databricks] |
| 23 | + A --> E[Palantir Foundry] |
| 24 | +
|
| 25 | + B --> B1[Use the official Convex export docs] |
| 26 | + C --> C1[Run convex-sync locally and query parquet with DuckDB or Polars] |
| 27 | + D --> D1[Recommended: Databricks Delta] |
| 28 | + D --> D2[Reference path: S3 backed views] |
| 29 | + E --> E1[Recommended: Databricks Delta -> Unity Catalog -> Foundry virtual tables] |
| 30 | + E --> E2[Fallback: S3 snapshots -> Foundry S3 virtual tables] |
40 | 31 | ``` |
41 | 32 |
|
42 | | -Read the repo by layer: |
| 33 | +### 1. One-off manual export |
43 | 34 |
|
44 | | -- [`apps/convex-inspect/README.md`](apps/convex-inspect/README.md): direct source inspection commands |
45 | | -- [`apps/convex-sync/README.md`](apps/convex-sync/README.md): CLI surface and S3/export runtime commands |
46 | | -- `crates/convex-sync-core/`: shared Convex client, checkpoint FSM, event normalization, sync engine |
47 | | -- `crates/convex-export-s3/`: raw parquet sink, staging materialization, S3 publish flow |
48 | | -- [`platform/aws/README.md`](platform/aws/README.md): AWS assets for publishing and downstream readers |
49 | | -- [`platform/databricks/README.md`](platform/databricks/README.md): Databricks target family overview |
50 | | -- [`platform/databricks/s3/README.md`](platform/databricks/s3/README.md): Databricks consuming the S3 export path |
51 | | -- [`platform/databricks/delta/README.md`](platform/databricks/delta/README.md): Databricks Delta bronze/silver landing |
52 | | -- [`sources/README.md`](sources/README.md): source-specific defaults layered on top of the shared engine |
| 35 | +If you only need a one-time export or ad hoc backfill, use the official Convex tooling directly. This repo is aimed at recurring pipelines, not the simplest possible one-shot export. |
53 | 36 |
|
54 | | -## Install |
| 37 | +- [Convex streaming import/export](https://docs.convex.dev/production/integrations/streaming-import-export) |
| 38 | +- [Convex streaming export API](https://docs.convex.dev/streaming-export-api) |
55 | 39 |
|
56 | | -Release install: |
| 40 | +### 2. Recurring local analysis |
57 | 41 |
|
58 | | -```bash |
59 | | -curl -fsSL https://raw.githubusercontent.com/shpitdev/convex-sync-kit/main/install.sh | bash |
60 | | -``` |
61 | | - |
62 | | -Local checkout dev install: |
| 42 | +If you want recurring exports but do not want a warehouse yet, run the S3/export engine locally and point the outputs wherever you want. `.memory/` is only this repo's default. Every path-bearing command can be overridden. |
63 | 43 |
|
64 | 44 | ```bash |
65 | | -./install.sh --mode dev --force |
66 | | -convex-sync-dev --help |
67 | | -``` |
| 45 | +mkdir -p /tmp/convex-sync-kit-demo |
68 | 46 |
|
69 | | -Current release coverage: |
| 47 | +convex-sync sync-once \ |
| 48 | + --output /tmp/convex-sync-kit-demo/raw_change_log \ |
| 49 | + --checkpoint-path /tmp/convex-sync-kit-demo/raw_change_log.checkpoint.json |
70 | 50 |
|
71 | | -- stable and prerelease archives target `linux-amd64` |
72 | | -- `convex-sync-dev` is checkout-linked and rebuilds incrementally via Cargo |
73 | | -- release installs go to `~/.local/share/convex-sync/<version>/convex-sync` |
74 | | -- command symlinks go in `~/.local/bin` |
75 | | -- `convex-inspect` is repo-local today and not part of the release artifact |
| 51 | +convex-sync materialize-staging \ |
| 52 | + --raw-change-log /tmp/convex-sync-kit-demo/raw_change_log \ |
| 53 | + --output /tmp/convex-sync-kit-demo/staging \ |
| 54 | + --incremental |
76 | 55 |
|
77 | | -## Source Configs |
78 | | - |
79 | | -The repo name stays generic. Source-specific defaults live under `sources/`. |
| 56 | +duckdb -c "select * from read_parquet('/tmp/convex-sync-kit-demo/staging/**/*.parquet') limit 20" |
| 57 | +``` |
80 | 58 |
|
81 | | -- current source profile: `sources/meshix-api/env.sh` |
82 | | -- activate a source by setting `CONVEX_SYNC_SOURCE=<slug>` |
83 | | -- explicit env vars still win over source defaults |
| 59 | +### 3. Using Databricks |
84 | 60 |
|
85 | | -This is the intended scaling model for running the same engine against many |
86 | | -Convex projects without forking the repo or renaming the binaries. |
| 61 | +There are two supported Databricks paths: |
87 | 62 |
|
88 | | -## Operator Binaries |
| 63 | +| Path | What it creates | When to use it | Recommendation | |
| 64 | +|---|---|---|---| |
| 65 | +| Databricks Delta | Unity Catalog control, bronze, and silver schemas | Primary production path | Recommended | |
| 66 | +| Databricks over S3 | Unity Catalog views over published Parquet snapshots | Reference example, simpler bridge from the Rust exporter | Supported, but secondary | |
89 | 67 |
|
90 | | -- `convex-inspect`: inspect Convex schemas, snapshot pages, and delta pages directly |
91 | | -- `convex-sync`: run the maintained parquet -> staging -> S3 export workflow |
| 68 | +Recommended Databricks Delta flow: |
92 | 69 |
|
93 | | -## Supported Variations |
| 70 | +```bash |
| 71 | +export CONVEX_SYNC_SOURCE=meshix-api |
94 | 72 |
|
95 | | -```mermaid |
96 | | -flowchart LR |
97 | | - C[Convex] |
98 | | - E[shared sync semantics] |
99 | | - S3[S3 export path] |
100 | | - DBN[Databricks Delta path] |
101 | | - DBS3[Databricks over S3 path] |
102 | | - C --> E |
103 | | - E --> S3 |
104 | | - E -. mirrored extractor semantics .-> DBN |
105 | | - S3 --> DBS3 |
| 73 | +just databricks-delta-bootstrap 63d28889f3eb3c4b |
| 74 | +just databricks-delta-sync-secret DEFAULT |
| 75 | +just databricks-delta-deploy DEFAULT prod |
| 76 | +just databricks-delta-run DEFAULT prod |
106 | 77 | ``` |
107 | 78 |
|
108 | | -### `S3/export` |
109 | | - |
110 | | -The maintained Rust runtime path: |
111 | | - |
112 | | -1. `sync-once` writes append-only parquet batches under `.memory/raw_change_log/` |
113 | | -2. `materialize-staging --incremental` builds `.memory/staging/` |
114 | | -3. `publish-s3` uploads `staging/current/...` plus versioned manifests |
115 | | -4. `run` loops those steps on a poll interval |
116 | | - |
117 | | -CLI: |
| 79 | +Reference Databricks over S3 flow: |
118 | 80 |
|
119 | | -- `cargo run -p convex-sync -- sync-once` |
120 | | -- `cargo run -p convex-sync -- materialize-staging` |
121 | | -- `cargo run -p convex-sync -- publish-s3 --bucket your-bucket` |
122 | | -- `cargo run -p convex-sync -- run --bucket your-bucket` |
123 | | - |
124 | | -Inspection: |
125 | | - |
126 | | -- `cargo run -p convex-inspect -- schemas` |
127 | | -- `cargo run -p convex-inspect -- snapshot --table-name users` |
128 | | -- `cargo run -p convex-inspect -- deltas --cursor 0` |
| 81 | +```bash |
| 82 | +export CONVEX_SYNC_SOURCE=meshix-api |
129 | 83 |
|
130 | | -Or via `just`: |
| 84 | +just run --bucket your-bucket --prefix prod |
| 85 | +just databricks-sync-staging-views --warehouse-id 63d28889f3eb3c4b --bucket your-bucket --prefix prod |
| 86 | +``` |
131 | 87 |
|
132 | | -- `just dev-cli --help` |
133 | | -- `just schemas` |
134 | | -- `just snapshot --table-name users` |
135 | | -- `just deltas --cursor 0` |
136 | | -- `just sync-once` |
137 | | -- `just materialize-staging` |
138 | | -- `just publish-s3 --bucket your-bucket` |
139 | | -- `just run --bucket your-bucket` |
| 88 | +### 4. Using Palantir Foundry |
140 | 89 |
|
141 | | -### `Databricks Delta` |
| 90 | +If you are already on Databricks, the recommended Foundry path is: |
142 | 91 |
|
143 | | -Checked-in Databricks Delta assets: |
| 92 | +```text |
| 93 | +Convex -> convex-sync-kit Databricks Delta -> Unity Catalog -> Foundry Databricks source -> virtual tables |
| 94 | +``` |
144 | 95 |
|
145 | | -- `platform/databricks/delta/databricks.yml` |
146 | | -- `platform/databricks/delta/resources/convex_delta_extract.job.yml` |
147 | | -- `platform/databricks/delta/extractor/convex_cdc_job.py` |
148 | | -- `platform/databricks/delta/sql/bootstrap/` |
149 | | -- `platform/databricks/delta/lakeflow/bronze_to_silver_template.sql` |
| 96 | +That path is the best fit because Foundry's Databricks connector supports virtual tables over Unity Catalog, including richer Delta and Iceberg behavior when external access is enabled. |
150 | 97 |
|
151 | | -Runtime split: |
| 98 | +Fallback path: |
152 | 99 |
|
153 | | -1. a Databricks job runs the extractor and appends bronze CDC rows |
154 | | -2. checkpoint rows land in the control schema |
155 | | -3. Lakeflow `AUTO CDC` materializes silver current-state tables |
| 100 | +```text |
| 101 | +Convex -> convex-sync-kit S3 snapshots -> Foundry S3 source -> Parquet virtual tables or dataset sync |
| 102 | +``` |
156 | 103 |
|
157 | | -Packaged entrypoints: |
| 104 | +That works, but it is a simpler and more limited path. Palantir's S3 connector supports Parquet virtual tables, but they rely on schema inference, while the Databricks connector gives you a cleaner Unity Catalog table surface. |
158 | 105 |
|
159 | | -- `just databricks-delta-sync-secret` |
160 | | -- `just databricks-delta-bootstrap <warehouse_id>` |
161 | | -- `just databricks-delta-deploy` |
162 | | -- `just databricks-delta-run` |
163 | | -- `just databricks-delta-smoke <warehouse_id>` |
| 106 | +Relevant Foundry docs: |
164 | 107 |
|
165 | | -Recommended production naming: |
| 108 | +- [Databricks connector](https://www.palantir.com/docs/foundry/available-connectors/databricks/) |
| 109 | +- [Amazon S3 connector](https://www.palantir.com/docs/foundry/available-connectors/amazon-s3/) |
| 110 | +- [Virtual tables](https://www.palantir.com/docs/foundry/data-integration/virtual-tables/index.html) |
166 | 111 |
|
167 | | -- S3-backed Databricks schema: `convex_sync_kit_<source>_s3` |
168 | | -- Delta control schema: `convex_sync_kit_<source>_delta_control` |
169 | | -- Delta bronze schema: `convex_sync_kit_<source>_delta_bronze` |
170 | | -- Delta silver schema: `convex_sync_kit_<source>_delta_silver` |
| 112 | +## What This Repo Produces |
171 | 113 |
|
172 | | -### `Databricks over S3` |
| 114 | +| Path | Core artifacts | Current recommended naming | |
| 115 | +|---|---|---| |
| 116 | +| Local recurring analysis | raw change log parquet, staging parquet | user-defined paths | |
| 117 | +| S3/export | `staging/current`, manifests, versioned snapshots | bucket and prefix chosen by operator | |
| 118 | +| Databricks over S3 | Unity Catalog views over published parquet snapshots | `convex_sync_kit_<source>_s3` | |
| 119 | +| Databricks Delta | checkpoint table, bronze CDC tables, silver current-state tables | `convex_sync_kit_<source>_delta_{control,bronze,silver}` | |
173 | 120 |
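The naming column above can be sketched as a small shell helper. This is an illustration only: `schema_names` is a hypothetical function, not part of this repo, and it assumes that hyphens in the source slug become underscores, as the `meshix-api` / `convex_sync_kit_meshix_api_*` names elsewhere in this README suggest.

```bash
# Hypothetical helper (not shipped with the repo): print the Unity Catalog
# schema names the naming table implies for a given source slug.
# Assumption: hyphens in the slug map to underscores (meshix-api -> meshix_api).
schema_names() {
  slug=$(printf '%s' "$1" | tr '-' '_')
  for suffix in s3 delta_control delta_bronze delta_silver; do
    printf 'convex_sync_kit_%s_%s\n' "$slug" "$suffix"
  done
}

# Falls back to the checked-in source profile when CONVEX_SYNC_SOURCE is unset.
schema_names "${CONVEX_SYNC_SOURCE:-meshix-api}"
```

Running it for a new slug before onboarding a source gives a quick check that the expected schemas will not collide with existing ones.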
|
174 | | -This variation keeps the existing Rust exporter and S3 publish loop, then adds: |
| 121 | +The current checked-in source profile is [`sources/meshix-api/env.sh`](sources/meshix-api/env.sh). That is only one source profile, not a repo identity. Add more source directories as you onboard more Convex projects. |
175 | 122 |
|
176 | | -1. Unity Catalog external location coverage over `staging/current` |
177 | | -2. stable SQL views from `platform/databricks/s3/sql/register_staging_views.sql.tmpl` |
178 | | -3. Databricks consumers reading the published parquet snapshots directly |
| 123 | +## Output Paths And Defaults |
179 | 124 |
|
180 | | -## Platform Assets |
| 125 | +Examples in this repo often use `.memory/` because that is convenient for local development here. It is not a required location. |
181 | 126 |
|
182 | | -Snapshot templates into `.memory/` before running Terraform: |
| 127 | +| Command | Default | How to override | |
| 128 | +|---|---|---| |
| 129 | +| `convex-sync sync-once` | `.memory/raw_change_log` | `--output`, `--checkpoint-path` | |
| 130 | +| `convex-sync materialize-staging` | `.memory/staging` | `--raw-change-log`, `--output`, `--state-path` | |
| 131 | +| `convex-sync publish-s3` | `.memory/staging` | `--staging-dir`, `--bucket`, `--prefix` | |
| 132 | +| `convex-sync run` | `.memory/raw_change_log`, `.memory/staging` | `--output`, `--checkpoint-path`, `--staging-dir`, `--staging-state-path`, `--bucket`, `--prefix` | |
| 133 | +| `convex-inspect` commands | stdout unless set | `--output`, `--output-format` | |
183 | 134 |
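As a rough sketch of how the overrides in the table combine, the helper below prints (but does not run) the two local pipeline commands for one base directory. `local_pipeline_cmds` and the `CONVEX_SYNC_BASE_DIR` variable are hypothetical names used for illustration; only the `convex-sync` subcommands and flags come from this README.

```bash
# Hypothetical dry-run helper: echo the local pipeline commands for a base
# directory so the resolved paths can be reviewed before anything is executed.
local_pipeline_cmds() {
  base="$1"
  echo "convex-sync sync-once --output $base/raw_change_log --checkpoint-path $base/raw_change_log.checkpoint.json"
  echo "convex-sync materialize-staging --raw-change-log $base/raw_change_log --output $base/staging --incremental"
}

# CONVEX_SYNC_BASE_DIR is an assumed variable name, not one the CLI reads.
local_pipeline_cmds "${CONVEX_SYNC_BASE_DIR:-/tmp/convex-sync-kit-demo}"
```

Keeping every path under a single base directory makes it easy to move the whole working set off `.memory/` in one place.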
|
184 | | -- `just aws-template-snapshot` |
185 | | -- `just databricks-template-snapshot` |
| 135 | +## Docs By Audience |
186 | 136 |
|
187 | | -The S3-backed Databricks landing sync remains supported: |
| 137 | +| Audience | Start here | Why | |
| 138 | +|---|---|---| |
| 139 | +| End users / operators | [`platform/databricks/delta/README.md`](platform/databricks/delta/README.md), [`platform/databricks/s3/README.md`](platform/databricks/s3/README.md), [`sources/README.md`](sources/README.md) | Platform-specific deployment and source defaults | |
| 140 | +| CLI users | [`apps/convex-sync/README.md`](apps/convex-sync/README.md), [`apps/convex-inspect/README.md`](apps/convex-inspect/README.md) | Command reference and CLI help | |
| 141 | +| Contributors | [`platform/databricks/README.md`](platform/databricks/README.md), [`platform/aws/README.md`](platform/aws/README.md), [docs/architecture.md](docs/architecture.md) | Code ownership and platform layout | |
188 | 142 |
|
189 | | -- `just databricks-sync-staging-views --warehouse-id <warehouse-id> --bucket <bucket> --prefix <prefix>` |
| 143 | +## Testing And CI |
190 | 144 |
|
191 | | -That script renders SQL from |
192 | | -`platform/databricks/s3/sql/register_staging_views.sql.tmpl` and applies stable |
193 | | -views over the published S3 parquet files. |
| 145 | +| Layer | Present | Tooling | Runs in CI | |
| 146 | +|---|---|---|---| |
| 147 | +| unit | yes | `cargo test --workspace` | yes | |
| 148 | +| integration | no | `none` | no | |
| 149 | +| e2e api | no | `none` | no | |
| 150 | +| e2e web | no | `none` | no | |
194 | 151 |
|
195 | | -## Verification |
| 152 | +Remote automation: |
196 | 153 |
|
197 | | -Local: |
| 154 | +```bash |
| 155 | +depot ci run --workflow .depot/workflows/ci.yml |
| 156 | +``` |
198 | 157 |
|
199 | | -- `just install-hooks` configures a repo-local pre-commit hook |
200 | | -- the hook runs `just verify` |
| 158 | +Release automation: |
201 | 159 |
|
202 | | -Remote: |
| 160 | +```bash |
| 161 | +depot ci run --workflow .depot/workflows/release.yml |
| 162 | +depot ci run --workflow .depot/workflows/release-rc.yml |
| 163 | +``` |
203 | 164 |
|
204 | | -- `.depot/workflows/ci.yml` runs: |
205 | | - - `02-rustfmt` |
206 | | - - `01-changed-paths` |
207 | | - - `03-clippy-inspect` |
208 | | - - `04-test-inspect` |
209 | | - - `05-clippy-sync` |
210 | | - - `06-test-sync` |
211 | | -- `.depot/workflows/release.yml` creates stable release PRs and publishes CLI archives |
212 | | -- `.depot/workflows/release-rc.yml` publishes numbered prerelease archives from `main` |
213 | | -- `.github/workflows/semantic-pr.yml` enforces conventional PR titles so stable releases can be created automatically from merged PRs |
214 | | -- `.github/workflows/semgrep.yml` runs the lightweight security scan |
| 165 | +## Suggested Screenshots |
215 | 166 |
|
216 | | -## Release Source Of Truth |
| 167 | +If you want to show this repo working in a talk or video, start with: |
217 | 168 |
|
218 | | -Stable releases are driven by merged PR titles on `main`. |
| 169 | +1. The decision tree above, so viewers understand when this repo is the right tool. |
| 170 | +2. Databricks Jobs showing `convex-sync-kit-meshix-api-prod-delta-extract` succeeding. |
| 171 | +3. Unity Catalog showing all four schemas: |
| 172 | + - `convex_sync_kit_meshix_api_s3` |
| 173 | + - `convex_sync_kit_meshix_api_delta_control` |
| 174 | + - `convex_sync_kit_meshix_api_delta_bronze` |
| 175 | + - `convex_sync_kit_meshix_api_delta_silver` |
| 176 | +4. A query result from `connector_checkpoint_latest` showing `meshix-api / delta_tail`. |
| 177 | +5. A `SHOW TABLES` result for the bronze schema showing many `_cdc` tables. |
| 178 | +6. The S3-backed `__source_map` view so people can see the reference path is real too. |
219 | 179 |
|
220 | | -- use conventional PR titles such as `feat: ...`, `fix: ...`, or `deps: ...` |
221 | | -- `release-please` now starts release history from commit `0cf9f47` |
222 | | -- merge to `main` opens or advances the stable release PR automatically when a releasable PR lands anywhere the repo-wide release config considers in scope |
223 | | -- both release workflows also support manual `workflow_dispatch`, so there is always a button path in GitHub Actions |
| 180 | +There is a more detailed capture list in [docs/demo-storyboard.md](docs/demo-storyboard.md). |
224 | 181 |
|
225 | 182 | ## References |
226 | 183 |
|
| 184 | +- [Ask DeepWiki about this repo](https://deepwiki.com/shpitdev/convex-sync-kit) |
227 | 185 | - [docs/architecture.md](docs/architecture.md) |
228 | 186 | - [docs/public-reference-map.md](docs/public-reference-map.md) |
229 | 187 | - [docs/release-artifacts.md](docs/release-artifacts.md) |
230 | | -- [Convex streaming export docs](https://docs.convex.dev/production/integrations/streaming-import-export) |
231 | | -- [Convex streaming export API](https://docs.convex.dev/streaming-export-api) |
| 188 | +- [docs/demo-storyboard.md](docs/demo-storyboard.md) |
232 | 189 | - [Upstream Convex `fivetran_source` crate](https://github.com/get-convex/convex-backend/tree/main/crates/fivetran_source) |
233 | 190 | - [Databricks `AUTO CDC` docs](https://docs.databricks.com/aws/en/ldp/cdc) |
| 191 | + |
| 192 | +## License |
| 193 | + |
| 194 | +[MIT](LICENSE) |