feat(sqlite): add streaming SqliteSidecarBuilder keyed by a column#24
Merged
Conversation
open_or_build opens parquet files and synthesises monotonic 0..N keys. Add SqliteSidecarBuilder: an incremental begin/push_batch/finish builder that consumes RecordBatches and reads each row's key from a designated column, so a caller can build the sidecar from any batch source (e.g. a storage engine's native rowid) in a single bounded-memory pass without materialising an intermediate parquet file. Shares CREATE/INSERT DDL and row-param construction with the parquet path via new ddl()/row_to_params() helpers; build_table behaviour is unchanged. One transaction wraps the build; an abandoned builder rolls back on drop. Keys are stored as INTEGER PRIMARY KEY, preserving the B-tree point-lookup performance.
Comment on lines
+362
to
+368
| let ncols = batch.num_columns(); | ||
| if self.key_col_index >= ncols { | ||
| return Err(DataFusionError::Execution(format!( | ||
| "key_col_index {} out of range for batch with {ncols} columns", | ||
| self.key_col_index | ||
| ))); | ||
| } |
There was a problem hiding this comment.
nit: key_col_index is bounds-checked here but value_col_indices is not — if any entry is >= batch.num_columns(), row_to_params will panic on batch.column(ci) instead of returning a DataFusionError. Worth validating both for consistency so a bad caller index always surfaces as a clean error. (not blocking)
push_batch validated key_col_index but not value_col_indices; an out-of-range entry would panic in row_to_params on batch.column(ci) rather than returning a clean DataFusionError. Validate both, and add a test for the out-of-range value-index case. Addresses PR review nit.
Collaborator
Author
|
Addressed the nit in a289ae4: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
SqliteSidecarBuilder— an incrementalbegin() → push_batch() → finish()builder for the SQLite point-lookup sidecar, fed from aRecordBatchstream instead of parquet files on disk.Unlike
SqliteLookupProvider::open_or_build(opens parquet files itself and synthesises monotonic0..Nkeys), this builder takes batches one at a time and reads each row's key from a designated column — e.g. a storage engine's nativerowid. That lets a caller drive a single pass over a source (fanning each batch out to both a USearch index and this sidecar) without first materialising an intermediate parquet file.Why
A consumer building a vector index over a non-parquet source (a DuckLake snapshot, addressed by its native
rowid) has no single parquet file to hand toopen_or_build, and its rows aren't keyed by a dense0..Nordinal. Rather than make the consumer stage the whole snapshot to a temp parquet (an extra full serialize + re-decode), this entrypoint consumes the batch stream the consumer already has.Details
push_batchinserts a batch's rows and returns, accumulating nothing. Dropping the builder beforefinish()rolls the transaction back, so a half-built table is never persisted.INTEGER PRIMARY KEY(the rowid-alias B-tree) — point-lookup performance is unchanged; sparse/non-monotonic keys are fine.CREATE/INSERTDDL and row-param construction with the parquetbuild_tablevia newddl()/row_to_params()helpers;build_tablebehaviour is unchanged.Tests
4 new tests (sparse-rowid round-trip, abandon-on-drop rollback, UInt64 keys, validation errors) + the existing 7
sqlite_providertests; full suite (--all-features) green,fmt+clippy --all-targets -D warningsclean.