Commit 285e87c
authored
feat: add ParquetLookupProvider and SqliteLookupProvider behind feature gates (#3)
* feat: add ParquetLookupProvider and SqliteLookupProvider behind feature gates
Ports the two storage backends proven in the df-vector-search benchmark POC
into the library as optional, feature-gated providers.
- `parquet-provider` feature: ParquetLookupProvider — concurrent row-group
reads from any ObjectStore (S3, local FS) with pre-cached parquet footers
and optional RowSelection for page-skip optimisation
- `sqlite-provider` feature: SqliteLookupProvider — B-tree point lookups via
a WAL-mode connection pool; builds from parquet on first run
- `keys` module (always compiled): pack_key / unpack_key / DatasetLayout —
shared key encoding utilities extracted from indexing.rs
- Integration tests for both providers (9 tests total)
Breaking change in Cargo.toml: tokio gains the "sync" feature unconditionally
(needed by SqliteLookupProvider's Semaphore; tokio was already a hard dep).
* fix: address P1/P2 review comments
P1 — sqlite_provider: add ConnGuard to return connection to pool on
panic, preventing permanent pool shrinkage and cascading failures.
P1 — parquet_provider: replace unchecked file_keys[file_idx] /
metadata_cache[file_idx] with bounds-checked .get() calls that return
DataFusionError instead of panicking on stale or mismatched keys.
P2 — parquet_provider: replace .expect("column mismatch") in projection
mapping with a proper ? returning DataFusionError::Execution.
P2 — sqlite_provider: replace println! with tracing::info! throughout;
library code should not write to stdout directly.
* fix: address second round of P1/P2 review comments
P1 — keys: add debug_assert! on pack_key inputs to catch out-of-range
file_idx/rg_idx (>= 65536) or local_offset (>= 2^32) in debug builds
instead of silently producing a wrong key.
P1 — sqlite_provider: add quote_ident() helper that doubles embedded
double-quotes before interpolating table_name into SQL, preventing SQL
injection via a crafted table name. Applied to all four SQL format
strings (SELECT COUNT, SELECT *, CREATE TABLE, INSERT INTO).
P2 — sqlite_provider: validate pool_size >= 1 at construction time and
return DataFusionError instead of allowing an unusable provider.
Replace pool.lock().unwrap().pop().unwrap() with explicit map_err /
ok_or_else so mutex poison and empty-pool are surfaced as errors.
* fix: quote column names in DDL, guard empty col_bufs
* fix: address bot review comments — debug_assert overflow, projection reorder, UInt64 note
- keys.rs: use u32::MAX as usize instead of (1 << 32) to avoid
overflow on 32-bit targets
- parquet_provider.rs: reorder filtered columns from parquet-schema
order back to idxs order after ProjectionMask::roots
- sqlite_provider.rs: add comment documenting UInt64 > i64::MAX
wrap-around behaviour when storing to SQLite INTEGER
* fix(scan): return NotImplemented instead of silent empty MemTable
Both providers scan() returned empty MemTable, causing full-table SQL
queries to silently produce zero rows. Return NotImplemented so the
limitation is explicit — these providers are fetch_by_keys only.
Removes the now-unused MemTable import from both files.
* test: add regression tests for bugs found in PR review
- keys.rs: unit tests for pack/unpack roundtrip, boundary values,
and debug_assert panics for out-of-range file_idx and local_offset
- parquet: test projection with non-monotonic column order (the P1 bug
where columns were silently swapped without the reorder fix)
- parquet: test stale file_idx returns Err not panic
- parquet + sqlite: test scan() returns NotImplemented (not silent empty)
- sqlite: test table name with spaces works via quote_ident
* fix(keys): promote debug_assert to assert in pack_key
All three range checks (file_idx, rg_idx, local_offset) now fire in
release builds. overflow silently produces a key pointing to the wrong
row with no downstream error — a hard panic is the right behaviour.
Also removes the now-incorrect #[cfg(debug_assertions)] guards from
the corresponding should_panic tests.
* fix(sqlite): atomic build, Float32/64 read, SELECT projection
- Wrap CREATE TABLE inside the INSERT transaction so a mid-build crash
cannot leave an empty table that open_or_build silently accepts on
the next startup.
- Add missing Float32 and Float64 arms to sql_values_to_arrow; these
types were written correctly by arrow_cell_to_sql but read back via
the unsupported-type error path, breaking any schema with float cols.
- Replace SELECT * with an explicit column list built from out_schema
so only projected columns are fetched from SQLite. Also removes the
now-dead projected_indices helper.
* feat(keys): add key_prefix param to DatasetLayout::from_files
The hardcoded "parquet/" prefix created silent coupling between
from_files and the ObjectStore root — a mismatch caused all lookups
to 404 with no clear error.
key_prefix is now explicit: callers pass "" for a flat layout,
"parquet/" for a parquet/ subdirectory, or any S3 prefix needed to
match their object store configuration.
* ci: trigger re-review1 parent 4d685a8 commit 285e87c
8 files changed
Lines changed: 1943 additions & 104 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
5 | 5 | | |
6 | 6 | | |
7 | 7 | | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
8 | 12 | | |
9 | | - | |
10 | | - | |
11 | | - | |
12 | | - | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
13 | 17 | | |
14 | | - | |
15 | | - | |
16 | | - | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
17 | 31 | | |
18 | 32 | | |
19 | | - | |
| 33 | + | |
| 34 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
56 | 56 | | |
57 | 57 | | |
58 | 58 | | |
| 59 | + | |
59 | 60 | | |
60 | 61 | | |
61 | 62 | | |
| |||
64 | 65 | | |
65 | 66 | | |
66 | 67 | | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
67 | 74 | | |
68 | 75 | | |
69 | 76 | | |
| |||
72 | 79 | | |
73 | 80 | | |
74 | 81 | | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
75 | 87 | | |
76 | 88 | | |
77 | 89 | | |
| |||
0 commit comments