fix(sqlite): reconstruct all basic payload types + validate at build time#27
Conversation
There was a problem hiding this comment.
Symmetric and correct: the new Utf8View arms in arrow_cell_to_sql, serialize_list, and sql_values_to_arrow plus the missing LargeUtf8 read-back arm close the round-trip gap, matching the write side which already maps both to TEXT. Test covers value preservation and null survival.
There was a problem hiding this comment.
Clean, well-tested fix. The new LargeUtf8/Utf8View scalar arms and the List<Utf8View> arm in sql_values_to_arrow() mirror the existing read-back patterns and pair the correct builder with the item field. Write and read paths are symmetric, and the regression tests cover null preservation across all three cases.
There was a problem hiding this comment.
Read/write paths are now symmetric across Utf8/LargeUtf8/Utf8View and List/LargeList, the json_text_to_list<O> refactor removes duplication and fixes the latent List<LargeUtf8> builder-mismatch panic, and the new tests cover all variants including nulls. LGTM.
There was a problem hiding this comment.
Clean, well-scoped fix. The generic json_text_to_list<O> unifies List/LargeList reconstruction and incidentally fixes the latent List<LargeUtf8> builder-mismatch panic. Write/read paths stay symmetric and tests cover the scalar, List, and LargeList variants plus null/empty edge cases.
PR #27 closed the reachable LargeUtf8/Utf8View/list read-back gaps, but the underlying defect class remained: the write side accepted any Arrow type (silent TEXT/NULL fallback) while the read side reconstructed only a subset, so an unsupported payload column built a "successful" sidecar that then failed on the first query (runtimedb#631). Close the class structurally: - Add `payload_type_supported` as the single source of truth for which Arrow types round-trip, and `validate_payload_schema` to gate every build entry point (`SqliteSidecarBuilder::begin`, `open_or_build`). An unsupported column now fails at index-build time with the column named, instead of at query time — which is where runtimedb's index-create step will surface it. - Widen actual support to the common parquet/DuckDB scalar types that previously fell through: Boolean, Date32, Date64, and Timestamp (all units, time zone preserved on read-back). Tests: reject a Decimal128 payload at build time; round-trip Boolean, Date32, and a tz-carrying Timestamp.
| value_col_indices.len() | ||
| ))); | ||
| } | ||
| validate_payload_schema(&schema, key_col_index)?; |
There was a problem hiding this comment.
Blocking: key_col_index is the wrong index to pass here. key_col_index indexes into the input batch passed to push_batch, but validate_payload_schema skips that index within the output schema — whose key column is always field 0 (see the begin docstring and finish()'s self.schema.field(0)).
When a caller passes key_col_index != 0 (a perfectly valid config — e.g. a storage engine whose rowid is not the first batch column), this:
- skips the schema field at position
key_col_index, which is a payload value column — leaving it unvalidated, so an unsupported type there slips straight through to the query-time failure this PR is meant to eliminate; and - validates schema field 0 (the actual key) as if it were a payload.
open_or_build correctly passes 0. begin should too, since the schema's key is always field 0:
| validate_payload_schema(&schema, key_col_index)?; | |
| validate_payload_schema(&schema, 0)?; |
There was a problem hiding this comment.
Review
Blocking Issues
src/sqlite_provider.rs:340—SqliteSidecarBuilder::begincallsvalidate_payload_schema(&schema, key_col_index), butkey_col_indexindexes the input batch whilevalidate_payload_schemaskips that position in the output schema (whose key is always field 0). Whenkey_col_index != 0, a payload value column is skipped (unvalidated) and the build can still produce a "successful" sidecar that fails at query time — the exact defect class this PR closes.open_or_buildpasses0correctly;beginshould pass0as well.
Action Required
- Change the
beginvalidation call tovalidate_payload_schema(&schema, 0). - Consider adding a regression test that drives
beginwithkey_col_index != 0and an unsupported payload column to lock in the fix.
Round out the supported set with the remaining primitive Arrow types so a parquet/DuckDB column of any basic type round-trips instead of being rejected at build time: - small integers: Int8, Int16, UInt8, UInt16 (→ INTEGER) - time-of-day: Time32 (Second/Millisecond), Time64 (Microsecond/Nanosecond) (→ INTEGER, unit restored from the schema on read-back) - binary: Binary, LargeBinary (→ BLOB) These add match arms only; each column hits a single arm, so there is no runtime cost for types a given sidecar doesn't use. Float16 (needs the `half` crate) and Decimal128/256 (need lossless text encoding) remain out of scope and are still rejected early by validate_payload_schema. Test: round-trip all eight new types with null cells, asserting both the reconstructed Arrow type and the values.
| value_col_indices.len() | ||
| ))); | ||
| } | ||
| validate_payload_schema(&schema, key_col_index)?; |
There was a problem hiding this comment.
Blocking (unresolved from cycle 1): this still passes key_col_index, which is the wrong index. key_col_index indexes into the input batch (see the begin docstring, lines 316-319), but validate_payload_schema skips that index within the output schema — whose key column is always field 0 (docstring line 314-315, and finish()'s self.schema.field(0) at line 408).
When a caller passes key_col_index != 0 (valid config — e.g. a storage engine whose key is not the first batch column), this:
- skips output schema field at position
key_col_index, which is a payload value column — leaving it unvalidated, so an unsupported type there slips through to the exact query-time failure this PR is meant to eliminate; and - validates output schema field 0 (the actual key) as if it were a payload.
open_or_build correctly passes 0. begin should too:
| validate_payload_schema(&schema, key_col_index)?; | |
| validate_payload_schema(&schema, 0)?; |
There was a problem hiding this comment.
Review
Blocking Issues
src/sqlite_provider.rs:340—begin()passeskey_col_indextovalidate_payload_schema, but that function skips an index into the output schema (key always at field 0), whilekey_col_indexindexes the input batch. This is the same blocking issue raised in cycle 1; it has not been fixed in the code. For anykey_col_index != 0, a payload column goes unvalidated (defeating the PR purpose) and the key column is validated as a payload instead.
Action Required
Change line 340 to validate_payload_schema(&schema, 0)?; to match open_or_build, since the output schema key column is always field 0.
`validate_payload_schema` was passed `begin`'s `key_col_index`, but that indexes the input batch while the function skips that position in the *output* schema — whose key is always field 0. For any `key_col_index != 0` this skipped a real payload column (leaving an unsupported type to fail at query time, the exact defect this guards against) and validated the key column instead. Drop the parameter and always skip output field 0, matching how `finish()` and `open_or_build` derive the key from `schema.field(0)`. Add a regression test driving `begin` with a non-zero `key_col_index` and an unsupported payload column.
There was a problem hiding this comment.
Cycle 3: prior blocking issue resolved — validate_payload_schema now drops the key_col_index parameter and always skips output-schema field 0, with both begin and open_or_build updated. Write/affinity/read paths are consistent with payload_type_supported. Approving.
Fixes hotdata-dev/runtimedb#631.
Goal
Fix vector search failing at query time with
SqliteLookupProvider: unsupported Arrow type LargeUtf8— the index built fine, only read-back failed. DuckDB/parquet readers emitLargeUtf8for strings, which the provider's write side accepted but the read-back path (sql_values_to_arrow) had no arm for.Root cause: a write/read type asymmetry
The write side always accepted more types than the read side could reconstruct — a long-standing, structural asymmetry, not a one-off. The write path has permissive catch-alls (
arrow_type_to_sql→TEXT,arrow_cell_to_sql→NULL,serialize_list→null) so it silently accepts almost anything and builds a "successful" sidecar; the read path then errors on the first query. The failure is deferred from build time to query time, which is why #631 looked like a stale-type mystery rather than a missing code path.This PR closes the class in three layers.
1. Reconstruct every type the write side accepts
Made the read-back path total for the existing string/list types, fixing the reported gap and the rest of the same bug class:
LargeUtf8andUtf8Viewhad no read-back armList<Utf8View>had no reconstruction armList<LargeUtf8>was routed through aUtf8StringBuilder, panicking in Arrow 58 on the field/child type mismatchLargeList<_>was serialized on write but never reconstructed on readList reconstruction is refactored into a generic
json_text_to_list::<O: OffsetSizeTrait>soListandLargeListshare one path.2. Fail fast at build time (so this can't recur silently)
payload_type_supported()is now the single source of truth for which Arrow types round-trip; the write arms, column affinity, and read arms all track it.validate_payload_schema()runs at every build entry point (SqliteSidecarBuilder::begin,open_or_build). An unsupported payload column now fails at index-build time with the column named, instead of building a sidecar that explodes on the first query. In runtimedb this surfaces at index-create, not in/v1/query.3. Support all basic scalar types
Widened actual support to every primitive Arrow type DuckDB/parquet commonly emit:
These are additional
matcharms only — each column dispatches to exactly one, so there is no runtime cost for types a given sidecar doesn't use.Tests
Decimal128payload column at build time, asserting the error names the columnOut of scope (now rejected early instead of silently broken)
Decimal128/256— needs a lossless encoding (a 128-bit value doesn't fit SQLite INTEGER); deferred deliberately.Float16— would add thehalfcrate dependency for a type parquet/DuckDB rarely emit.Struct,Dictionary,FixedSizeList, nested lists).runtimedb, not this crate.