
Commit dda3480

feat: implement from-proto-generate-csv (#10)
* feat: initial work on from-proto-export
* feat: fix binary type handling in from-proto-export
* fix: num args
* chore: renamed from from-proto-export to from-proto-generate-csv
* feat: proper shutdown handling
* feat: align cursor behaviour of from-proto-generate-csv
* feat: tighter alignment between from-proto and from-proto-generate-csv
* feat: improve semantic type docs
* feat: improve semantic type docs
1 parent 30744a6 commit dda3480

12 files changed

Lines changed: 3278 additions & 41 deletions

README.md

Lines changed: 21 additions & 0 deletions
@@ -258,6 +258,7 @@ substreams-sink-sql generate-csv "risingwave://root:@localhost:4566/dev?schema=p
> RisingWave's streaming architecture makes it particularly well-suited for high-throughput injection scenarios. Its append-optimized design can handle large CSV imports efficiently while maintaining real-time query performance.

> [!NOTE]
> We are using 14490000 as our stop block; pick a stop block close to the chain's HEAD, or a smaller one like ours to run an experiment, and adjust to your needs.

This will generate block-segmented CSV files for each table in your schema inside the folder `./data/tables`. The next step is to inject those CSV files into your database; you can use `psql` to load them directly, for example as sketched below.
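For illustration, a minimal `psql` injection sketch; the connection string, table name, and segment file name are hypothetical placeholders, since the files produced under `./data/tables` follow your own schema and block ranges:

```bash
# Hypothetical example: load one generated CSV segment into its matching table.
# Adjust the connection string, table name and file name to your schema and range;
# drop HEADER if the generated files do not contain a header row.
psql "postgresql://user:password@localhost:5432/dev" \
  -c "\copy blocks FROM './data/tables/blocks/0000000000-0000010000.csv' WITH CSV HEADER"
```
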
@@ -312,3 +313,23 @@ When choosing this value you should consider 2 things:
- Amount of RAM you want to allocate.

Let's take a container that is going to have 8 GiB of RAM. We suggest leaving 512 MiB for the other parts of the `generate-csv` task, which means we can dedicate 7.488 GiB to buffering. If your schema has 10 tables, you should use `--buffer-max-size=785173709` (`7.488 GiB / 10 = 748.8 MiB = 785173709` bytes).
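The same sizing rule, sketched as a quick shell calculation (whether you count megabytes in decimal or binary units shifts the exact byte count slightly, so this lands close to, rather than exactly on, the value above):

```bash
# Rule of thumb: (container RAM - reserve for the rest of generate-csv) / number of tables.
# The values mirror the example above; adjust them to your own container and schema.
RAM_MB=8000        # ~8 GB container
RESERVE_MB=512     # kept aside for the non-buffering parts of generate-csv
TABLES=10
PER_TABLE_MB=$(( (RAM_MB - RESERVE_MB) / TABLES ))          # 748 MB per table
echo "--buffer-max-size=$(( PER_TABLE_MB * 1024 * 1024 ))"  # 784334848 bytes
```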
### Ingestion Modes
This sink supports two primary ingestion modes, tailored to different data contracts and operational needs (a minimal invocation sketch follows this list):
- `run`: Consumes DatabaseChanges and applies CRUD operations against an existing schema (created via `setup`). Uses system tables `cursors` and `substreams_history` for cursoring and optional reorg handling. Best when your module emits DatabaseChanges and you want tight DB control. See flags `--batch-block-flush-interval`, `--batch-row-flush-interval`, `--live-block-flush-interval`.
- `from-proto`: Consumes a typed protobuf message (from your output module), derives and manages the schema automatically, and inserts entities into relational tables. Uses `_cursor_`, `_blocks_`, and `_sink_info_`. Great for greenfield ingestion or when you want the schema derived from your protos.
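For illustration only, a sketch of how each mode might be invoked; the DSN, manifest path, and flag value below are placeholders, and the exact argument shape should be confirmed with `substreams-sink-sql <command> --help`:

```bash
# run mode: create the schema beforehand with `setup`, then apply DatabaseChanges to it.
substreams-sink-sql setup "<dsn>" ./substreams.yaml
substreams-sink-sql run "<dsn>" ./substreams.yaml --batch-block-flush-interval=1000

# from-proto mode: the schema is derived from the module's typed protobuf output.
substreams-sink-sql from-proto "<dsn>" ./substreams.yaml
```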
Key behaviors and recommendations:
- Finalization (`from-proto`): any outstanding partial batch is flushed when the requested range completes, and the final cursor is stored.
- Live + constraints: Live streaming works with or without constraints. For heavy backfills, prefer `--no-constraints` for speed, then use constraints for live integrity if needed.
- Reorg handling (`run`): `--undo-buffer-size` controls the strategy. A non-zero value buffers blocks (disabling DB-level reorg handling near the head); zero enables DB-level reorgs (where supported). See the sketch after this list.
- Mode handoff: `run` and `from-proto` maintain different system tables by design. If you backfilled with `from-proto`, continue live with `from-proto` to reuse `_cursor_`/`_blocks_`. Switching to `run` directly will not reuse the same cursor tables and may require a migration.
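As a sketch of the reorg trade-off above (the flag value and argument shape are illustrative, not prescriptive):

```bash
# Buffer 12 blocks near the head: reorgs are absorbed in memory rather than in the database.
substreams-sink-sql run "<dsn>" ./substreams.yaml --undo-buffer-size=12

# No buffering: rely on DB-level reorg handling, where the target database supports it.
substreams-sink-sql run "<dsn>" ./substreams.yaml --undo-buffer-size=0
```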
See also:
- docs/FROM_PROTO.md — full from-proto reference (flags, schema, live/reorgs).
- docs/FROM_PROTO_GENERATE_CSV_README.md — CSV backfill aligned to from-proto.
