Commit 879cccb

docs: clarify user paths and badge surface (#23)
1 parent 6df4cf0 commit 879cccb

4 files changed

Lines changed: 240 additions & 189 deletions

File tree

README.md

Lines changed: 130 additions & 169 deletions
@@ -1,233 +1,194 @@
11
# convex-sync-kit
22

3-
- `Language`: ![Rust](https://img.shields.io/badge/Rust-000000?logo=rust&logoColor=white)
4-
- `Source`: ![Convex](https://img.shields.io/badge/Convex-EE342F?logo=convex&logoColor=white)
5-
- `Targets`: ![Amazon S3](https://img.shields.io/badge/Amazon%20S3-569A31?logo=amazons3&logoColor=white) ![Databricks](https://img.shields.io/badge/Databricks-FF3621?logo=databricks&logoColor=white)
6-
- `Infra`: ![Terraform](https://img.shields.io/badge/Terraform-844FBA?logo=terraform&logoColor=white)
3+
Recurring Convex export pipelines for local analytics, Databricks, and downstream systems like Palantir Foundry.
74

8-
Convex CDC sync engine with two supported target families:
5+
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/shpitdev/convex-sync-kit)
6+
[![Release](https://img.shields.io/github/v/release/shpitdev/convex-sync-kit?display_name=tag)](https://github.com/shpitdev/convex-sync-kit/releases)
7+
[![License: MIT](https://img.shields.io/badge/license-MIT-2ea44f)](LICENSE)
98

10-
- `S3/export`: append-only raw parquet -> current-state staging parquet -> S3 publish
11-
- `Databricks Delta`: bronze Delta CDC landing -> Lakeflow `AUTO CDC` -> silver current-state Delta tables
9+
![Rust](https://img.shields.io/badge/Rust-000000?logo=rust&logoColor=white)
10+
![Convex](https://img.shields.io/badge/Convex-EE342F?logo=convex&logoColor=white)
11+
![Amazon S3](https://img.shields.io/badge/Amazon%20S3-569A31?logo=amazons3&logoColor=white)
12+
![Databricks](https://img.shields.io/badge/Databricks-FF3621?logo=databricks&logoColor=white)
13+
![Palantir Foundry](https://img.shields.io/badge/Palantir%20Foundry-virtual%20tables-101828)
1214

13-
The source-side behavior intentionally stays close to the public Convex/Fivetran
14-
extraction model:
15-
16-
- bootstrap with `list_snapshot`
17-
- resume incomplete snapshots from checkpoint
18-
- continue with `document_deltas`
19-
- advance checkpoints only after durable writes succeed
20-
21-
## Repo Map
15+
## Choose Your Path
2216

2317
```mermaid
2418
flowchart TD
25-
Root[convex-sync-kit]
26-
Inspect[apps/convex-inspect]
27-
CLI[apps/convex-sync]
28-
Core[crates/convex-sync-core]
29-
S3[crates/convex-export-s3]
30-
AWS[platform/aws]
31-
DBS3[platform/databricks/s3]
32-
DBN[platform/databricks/delta]
33-
Root --> Inspect
34-
Root --> CLI
35-
Root --> Core
36-
Root --> S3
37-
Root --> AWS
38-
Root --> DBS3
39-
Root --> DBN
19+
A[What are you trying to do?]
20+
A --> B[One-off manual export]
21+
A --> C[Recurring local analysis]
22+
A --> D[Databricks]
23+
A --> E[Palantir Foundry]
24+
25+
B --> B1[Use the official Convex export docs]
26+
C --> C1[Run convex-sync locally and query parquet with DuckDB or Polars]
27+
D --> D1[Recommended: Databricks Delta]
28+
D --> D2[Reference path: S3 backed views]
29+
E --> E1[Recommended: Databricks Delta -> Unity Catalog -> Foundry virtual tables]
30+
E --> E2[Fallback: S3 snapshots -> Foundry S3 virtual tables]
4031
```
4132

42-
Read the repo by layer:
33+
### 1. One-off manual export
4334

44-
- [`apps/convex-inspect/README.md`](apps/convex-inspect/README.md): direct source inspection commands
45-
- [`apps/convex-sync/README.md`](apps/convex-sync/README.md): CLI surface and S3/export runtime commands
46-
- `crates/convex-sync-core/`: shared Convex client, checkpoint FSM, event normalization, sync engine
47-
- `crates/convex-export-s3/`: raw parquet sink, staging materialization, S3 publish flow
48-
- [`platform/aws/README.md`](platform/aws/README.md): AWS assets for publishing and downstream readers
49-
- [`platform/databricks/README.md`](platform/databricks/README.md): Databricks target family overview
50-
- [`platform/databricks/s3/README.md`](platform/databricks/s3/README.md): Databricks consuming the S3 export path
51-
- [`platform/databricks/delta/README.md`](platform/databricks/delta/README.md): Databricks Delta bronze/silver landing
52-
- [`sources/README.md`](sources/README.md): source-specific defaults layered on top of the shared engine
35+
If you only need a one-time export or ad hoc backfill, use the official Convex tooling directly. This repo is aimed at recurring pipelines, not the simplest possible one-shot export.
5336

54-
## Install
37+
- [Convex streaming import/export](https://docs.convex.dev/production/integrations/streaming-import-export)
38+
- [Convex streaming export API](https://docs.convex.dev/streaming-export-api)
5539

56-
Release install:
40+
### 2. Recurring local analysis
5741

58-
```bash
59-
curl -fsSL https://raw.githubusercontent.com/shpitdev/convex-sync-kit/main/install.sh | bash
60-
```
61-
62-
Local checkout dev install:
42+
If you want recurring exports but do not want a warehouse yet, run the S3/export engine locally and point the outputs wherever you want. `.memory/` is only this repo's default; every command that takes a path accepts an override flag.
6343

6444
```bash
65-
./install.sh --mode dev --force
66-
convex-sync-dev --help
67-
```
45+
mkdir -p /tmp/convex-sync-kit-demo
6846

69-
Current release coverage:
47+
convex-sync sync-once \
48+
--output /tmp/convex-sync-kit-demo/raw_change_log \
49+
--checkpoint-path /tmp/convex-sync-kit-demo/raw_change_log.checkpoint.json
7050

71-
- stable and prerelease archives target `linux-amd64`
72-
- `convex-sync-dev` is checkout-linked and rebuilds incrementally via Cargo
73-
- release installs go to `~/.local/share/convex-sync/<version>/convex-sync`
74-
- command symlinks go in `~/.local/bin`
75-
- `convex-inspect` is repo-local today and not part of the release artifact
51+
convex-sync materialize-staging \
52+
--raw-change-log /tmp/convex-sync-kit-demo/raw_change_log \
53+
--output /tmp/convex-sync-kit-demo/staging \
54+
--incremental
7655

77-
## Source Configs
78-
79-
The repo name stays generic. Source-specific defaults live under `sources/`.
56+
duckdb -c "select * from read_parquet('/tmp/convex-sync-kit-demo/staging/**/*.parquet') limit 20"
57+
```
8058

81-
- current source profile: `sources/meshix-api/env.sh`
82-
- activate a source by setting `CONVEX_SYNC_SOURCE=<slug>`
83-
- explicit env vars still win over source defaults
59+
### 3. Using Databricks
8460

85-
This is the intended scaling model for running the same engine against many
86-
Convex projects without forking the repo or renaming the binaries.
61+
There are two supported Databricks paths:
8762

88-
## Operator Binaries
63+
| Path | What it creates | When to use it | Recommendation |
64+
|---|---|---|---|
65+
| Databricks Delta | Unity Catalog control, bronze, and silver schemas | Primary production path | Recommended |
66+
| Databricks over S3 | Unity Catalog views over published Parquet snapshots | Reference example, simpler bridge from the Rust exporter | Supported, but secondary |
8967

90-
- `convex-inspect`: inspect Convex schemas, snapshot pages, and delta pages directly
91-
- `convex-sync`: run the maintained parquet -> staging -> S3 export workflow
68+
Recommended Databricks Delta flow:
9269

93-
## Supported Variations
70+
```bash
71+
export CONVEX_SYNC_SOURCE=meshix-api
9472

95-
```mermaid
96-
flowchart LR
97-
C[Convex]
98-
E[shared sync semantics]
99-
S3[S3 export path]
100-
DBN[Databricks Delta path]
101-
DBS3[Databricks over S3 path]
102-
C --> E
103-
E --> S3
104-
E -. mirrored extractor semantics .-> DBN
105-
S3 --> DBS3
73+
just databricks-delta-bootstrap 63d28889f3eb3c4b
74+
just databricks-delta-sync-secret DEFAULT
75+
just databricks-delta-deploy DEFAULT prod
76+
just databricks-delta-run DEFAULT prod
10677
```
10778

108-
### `S3/export`
109-
110-
The maintained Rust runtime path:
111-
112-
1. `sync-once` writes append-only parquet batches under `.memory/raw_change_log/`
113-
2. `materialize-staging --incremental` builds `.memory/staging/`
114-
3. `publish-s3` uploads `staging/current/...` plus versioned manifests
115-
4. `run` loops those steps on a poll interval
116-
117-
CLI:
79+
Reference Databricks over S3 flow:
11880

119-
- `cargo run -p convex-sync -- sync-once`
120-
- `cargo run -p convex-sync -- materialize-staging`
121-
- `cargo run -p convex-sync -- publish-s3 --bucket your-bucket`
122-
- `cargo run -p convex-sync -- run --bucket your-bucket`
123-
124-
Inspection:
125-
126-
- `cargo run -p convex-inspect -- schemas`
127-
- `cargo run -p convex-inspect -- snapshot --table-name users`
128-
- `cargo run -p convex-inspect -- deltas --cursor 0`
81+
```bash
82+
export CONVEX_SYNC_SOURCE=meshix-api
12983

130-
Or via `just`:
84+
just run --bucket your-bucket --prefix prod
85+
just databricks-sync-staging-views --warehouse-id 63d28889f3eb3c4b --bucket your-bucket --prefix prod
86+
```
13187

132-
- `just dev-cli --help`
133-
- `just schemas`
134-
- `just snapshot --table-name users`
135-
- `just deltas --cursor 0`
136-
- `just sync-once`
137-
- `just materialize-staging`
138-
- `just publish-s3 --bucket your-bucket`
139-
- `just run --bucket your-bucket`
88+
### 4. Using Palantir Foundry
14089

141-
### `Databricks Delta`
90+
If you are already on Databricks, the recommended Foundry path is:
14291

143-
Checked-in Databricks Delta assets:
92+
```text
93+
Convex -> convex-sync-kit Databricks Delta -> Unity Catalog -> Foundry Databricks source -> virtual tables
94+
```
14495

145-
- `platform/databricks/delta/databricks.yml`
146-
- `platform/databricks/delta/resources/convex_delta_extract.job.yml`
147-
- `platform/databricks/delta/extractor/convex_cdc_job.py`
148-
- `platform/databricks/delta/sql/bootstrap/`
149-
- `platform/databricks/delta/lakeflow/bronze_to_silver_template.sql`
96+
That path is the best fit because Foundry's Databricks connector supports virtual tables over Unity Catalog, including richer Delta and Iceberg behavior when external access is enabled.
15097

151-
Runtime split:
98+
Fallback path:
15299

153-
1. a Databricks job runs the extractor and appends bronze CDC rows
154-
2. checkpoint rows land in the control schema
155-
3. Lakeflow `AUTO CDC` materializes silver current-state tables
100+
```text
101+
Convex -> convex-sync-kit S3 snapshots -> Foundry S3 source -> Parquet virtual tables or dataset sync
102+
```
156103

157-
Packaged entrypoints:
104+
That works, but it is a simpler and more limited path. Palantir's S3 connector supports Parquet virtual tables, though they rely on schema inference, whereas the Databricks connector gives you a cleaner Unity Catalog table surface.
158105

159-
- `just databricks-delta-sync-secret`
160-
- `just databricks-delta-bootstrap <warehouse_id>`
161-
- `just databricks-delta-deploy`
162-
- `just databricks-delta-run`
163-
- `just databricks-delta-smoke <warehouse_id>`
106+
Relevant Foundry docs:
164107

165-
Recommended production naming:
108+
- [Databricks connector](https://www.palantir.com/docs/foundry/available-connectors/databricks/)
109+
- [Amazon S3 connector](https://www.palantir.com/docs/foundry/available-connectors/amazon-s3/)
110+
- [Virtual tables](https://www.palantir.com/docs/foundry/data-integration/virtual-tables/index.html)
166111

167-
- S3-backed Databricks schema: `convex_sync_kit_<source>_s3`
168-
- Delta control schema: `convex_sync_kit_<source>_delta_control`
169-
- Delta bronze schema: `convex_sync_kit_<source>_delta_bronze`
170-
- Delta silver schema: `convex_sync_kit_<source>_delta_silver`
112+
## What This Repo Produces
171113

172-
### `Databricks over S3`
114+
| Path | Core artifacts | Current recommended naming |
115+
|---|---|---|
116+
| Local recurring analysis | raw change log parquet, staging parquet | user-defined paths |
117+
| S3/export | `staging/current`, manifests, versioned snapshots | bucket and prefix chosen by operator |
118+
| Databricks over S3 | Unity Catalog views over published parquet snapshots | `convex_sync_kit_<source>_s3` |
119+
| Databricks Delta | checkpoint table, bronze CDC tables, silver current-state tables | `convex_sync_kit_<source>_delta_{control,bronze,silver}` |
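
The naming column above can be sketched as a small helper. This is a hypothetical illustration, not code from the repo: the function name `schema_names` is invented here, and the hyphen-to-underscore normalization is an assumption inferred from the `meshix-api` source appearing as `convex_sync_kit_meshix_api_*` elsewhere in this README.

```python
def schema_names(source: str) -> dict[str, str]:
    """Derive the recommended Unity Catalog schema names for a source slug.

    Hypothetical helper: the repo may compute these names differently.
    """
    # Assumption: hyphens in the source slug become underscores,
    # e.g. "meshix-api" -> "meshix_api".
    slug = source.replace("-", "_")
    prefix = f"convex_sync_kit_{slug}"

    # One schema for the S3-backed views, three for the Delta path.
    names = {"s3": f"{prefix}_s3"}
    for layer in ("control", "bronze", "silver"):
        names[f"delta_{layer}"] = f"{prefix}_delta_{layer}"
    return names


if __name__ == "__main__":
    for role, name in schema_names("meshix-api").items():
        print(f"{role}: {name}")
```

Keeping the derivation in one place would make it harder for the S3 and Delta schema names to drift apart as more sources are onboarded.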
173120

174-
This variation keeps the existing Rust exporter and S3 publish loop, then adds:
121+
The current checked-in source profile is [`sources/meshix-api/env.sh`](sources/meshix-api/env.sh). That is only one source profile, not a repo identity. Add more source directories as you onboard more Convex projects.
175122

176-
1. Unity Catalog external location coverage over `staging/current`
177-
2. stable SQL views from `platform/databricks/s3/sql/register_staging_views.sql.tmpl`
178-
3. Databricks consumers reading the published parquet snapshots directly
123+
## Output Paths And Defaults
179124

180-
## Platform Assets
125+
Examples in this repo often use `.memory/` because that is convenient for local development here. It is not a required location.
181126

182-
Snapshot templates into `.memory/` before running Terraform:
127+
| Command | Default | How to override |
128+
|---|---|---|
129+
| `convex-sync sync-once` | `.memory/raw_change_log` | `--output`, `--checkpoint-path` |
130+
| `convex-sync materialize-staging` | `.memory/staging` | `--raw-change-log`, `--output`, `--state-path` |
131+
| `convex-sync publish-s3` | `.memory/staging` | `--staging-dir`, `--bucket`, `--prefix` |
132+
| `convex-sync run` | `.memory/raw_change_log`, `.memory/staging` | `--output`, `--checkpoint-path`, `--staging-dir`, `--staging-state-path`, `--bucket`, `--prefix` |
133+
| `convex-inspect` commands | stdout unless set | `--output`, `--output-format` |
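
As a rough illustration of the override flags in the table above, the sketch below assembles `sync-once` and `materialize-staging` invocations that write under a caller-chosen directory instead of `.memory/`. The helper name `local_pipeline_commands` is hypothetical, and the directory layout simply mirrors the `/tmp/convex-sync-kit-demo` example earlier in this README. The code only builds argument lists; it does not run `convex-sync`.

```python
import shlex
from pathlib import PurePosixPath


def local_pipeline_commands(base_dir: str) -> list[list[str]]:
    """Build argv lists for a local sync + staging run under base_dir.

    Hypothetical helper; only flags listed in this README are used.
    """
    base = PurePosixPath(base_dir)
    raw = base / "raw_change_log"
    checkpoint = base / "raw_change_log.checkpoint.json"
    staging = base / "staging"

    return [
        # Step 1: append raw change-log parquet and track the checkpoint.
        ["convex-sync", "sync-once",
         "--output", str(raw),
         "--checkpoint-path", str(checkpoint)],
        # Step 2: build current-state staging parquet from the raw log.
        ["convex-sync", "materialize-staging",
         "--raw-change-log", str(raw),
         "--output", str(staging),
         "--incremental"],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands for inspection or copy-paste.
    for argv in local_pipeline_commands("/tmp/convex-sync-kit-demo"):
        print(shlex.join(argv))
```

Each list could be handed to `subprocess.run` once `convex-sync` is on `PATH`; printing the commands first makes it easy to confirm the paths before anything is written.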
183134

184-
- `just aws-template-snapshot`
185-
- `just databricks-template-snapshot`
135+
## Docs By Audience
186136

187-
The S3-backed Databricks landing sync remains supported:
137+
| Audience | Start here | Why |
138+
|---|---|---|
139+
| End users / operators | [`platform/databricks/delta/README.md`](platform/databricks/delta/README.md), [`platform/databricks/s3/README.md`](platform/databricks/s3/README.md), [`sources/README.md`](sources/README.md) | Platform-specific deployment and source defaults |
140+
| CLI users | [`apps/convex-sync/README.md`](apps/convex-sync/README.md), [`apps/convex-inspect/README.md`](apps/convex-inspect/README.md) | Command reference and CLI help |
141+
| Contributors | [`platform/databricks/README.md`](platform/databricks/README.md), [`platform/aws/README.md`](platform/aws/README.md), [docs/architecture.md](docs/architecture.md) | Code ownership and platform layout |
188142

189-
- `just databricks-sync-staging-views --warehouse-id <warehouse-id> --bucket <bucket> --prefix <prefix>`
143+
## Testing And CI
190144

191-
That script renders SQL from
192-
`platform/databricks/s3/sql/register_staging_views.sql.tmpl` and applies stable
193-
views over the published S3 parquet files.
145+
| Layer | Present | Tooling | Runs in CI |
146+
|---|---|---|---|
147+
| unit | yes | `cargo test --workspace` | yes |
148+
| integration | no | `none` | no |
149+
| e2e api | no | `none` | no |
150+
| e2e web | no | `none` | no |
194151

195-
## Verification
152+
Remote automation:
196153

197-
Local:
154+
```bash
155+
depot ci run --workflow .depot/workflows/ci.yml
156+
```
198157

199-
- `just install-hooks` configures a repo-local pre-commit hook
200-
- the hook runs `just verify`
158+
Release automation:
201159

202-
Remote:
160+
```bash
161+
depot ci run --workflow .depot/workflows/release.yml
162+
depot ci run --workflow .depot/workflows/release-rc.yml
163+
```
203164

204-
- `.depot/workflows/ci.yml` runs:
205-
- `02-rustfmt`
206-
- `01-changed-paths`
207-
- `03-clippy-inspect`
208-
- `04-test-inspect`
209-
- `05-clippy-sync`
210-
- `06-test-sync`
211-
- `.depot/workflows/release.yml` creates stable release PRs and publishes CLI archives
212-
- `.depot/workflows/release-rc.yml` publishes numbered prerelease archives from `main`
213-
- `.github/workflows/semantic-pr.yml` enforces conventional PR titles so stable releases can be created automatically from merged PRs
214-
- `.github/workflows/semgrep.yml` runs the lightweight security scan
165+
## Suggested Screenshots
215166

216-
## Release Source Of Truth
167+
If you want to show this repo working in a talk or video, start with:
217168

218-
Stable releases are driven by merged PR titles on `main`.
169+
1. The decision tree above, so viewers understand when this repo is the right tool.
170+
2. Databricks Jobs showing `convex-sync-kit-meshix-api-prod-delta-extract` succeeding.
171+
3. Unity Catalog showing all four schemas:
172+
- `convex_sync_kit_meshix_api_s3`
173+
- `convex_sync_kit_meshix_api_delta_control`
174+
- `convex_sync_kit_meshix_api_delta_bronze`
175+
- `convex_sync_kit_meshix_api_delta_silver`
176+
4. A query result from `connector_checkpoint_latest` showing `meshix-api / delta_tail`.
177+
5. A `SHOW TABLES` result for the bronze schema showing many `_cdc` tables.
178+
6. The S3-backed `__source_map` view so people can see the reference path is real too.
219179

220-
- use conventional PR titles such as `feat: ...`, `fix: ...`, or `deps: ...`
221-
- `release-please` now starts release history from commit `0cf9f47`
222-
- merge to `main` opens or advances the stable release PR automatically when a releasable PR lands anywhere the repo-wide release config considers in scope
223-
- both release workflows also support manual `workflow_dispatch`, so there is always a button path in GitHub Actions
180+
There is a more detailed capture list in [docs/demo-storyboard.md](docs/demo-storyboard.md).
224181

225182
## References
226183

184+
- [Ask DeepWiki about this repo](https://deepwiki.com/shpitdev/convex-sync-kit)
227185
- [docs/architecture.md](docs/architecture.md)
228186
- [docs/public-reference-map.md](docs/public-reference-map.md)
229187
- [docs/release-artifacts.md](docs/release-artifacts.md)
230-
- [Convex streaming export docs](https://docs.convex.dev/production/integrations/streaming-import-export)
231-
- [Convex streaming export API](https://docs.convex.dev/streaming-export-api)
188+
- [docs/demo-storyboard.md](docs/demo-storyboard.md)
232189
- [Upstream Convex `fivetran_source` crate](https://github.com/get-convex/convex-backend/tree/main/crates/fivetran_source)
233190
- [Databricks `AUTO CDC` docs](https://docs.databricks.com/aws/en/ldp/cdc)
191+
192+
## License
193+
194+
[MIT](LICENSE)
