fix(sqlite): reconstruct all basic payload types + validate at build time by zfarrell · Pull Request #27 · hotdata-dev/datafusion-vector-search-ext

zfarrell · 2026-06-09T23:01:05Z

Fixes hotdata-dev/runtimedb#631.

Goal

Fix vector search failing at query time with SqliteLookupProvider: unsupported Arrow type LargeUtf8 — the index built fine, only read-back failed. DuckDB/parquet readers emit LargeUtf8 for strings, which the provider's write side accepted but the read-back path (sql_values_to_arrow) had no arm for.

Root cause: a write/read type asymmetry

The write side always accepted more types than the read side could reconstruct — a long-standing, structural asymmetry, not a one-off. The write path has permissive catch-alls (arrow_type_to_sql → TEXT, arrow_cell_to_sql → NULL, serialize_list → null) so it silently accepts almost anything and builds a "successful" sidecar; the read path then errors on the first query. The failure is deferred from build time to query time, which is why #631 looked like a stale-type mystery rather than a missing code path.

This PR closes the class in three layers.

1. Reconstruct every type the write side accepts

Made the read-back path total for the existing string/list types, fixing the reported gap and the rest of the same bug class:

scalar LargeUtf8 and Utf8View had no read-back arm
List<Utf8View> had no reconstruction arm
List<LargeUtf8> was routed through a Utf8 StringBuilder, panicking in Arrow 58 on the field/child type mismatch
LargeList<_> was serialized on write but never reconstructed on read

List reconstruction is refactored into a generic json_text_to_list::<O: OffsetSizeTrait> so List and LargeList share one path.

2. Fail fast at build time (so this can't recur silently)

payload_type_supported() is now the single source of truth for which Arrow types round-trip; the write arms, column affinity, and read arms all track it.
validate_payload_schema() runs at every build entry point (SqliteSidecarBuilder::begin, open_or_build). An unsupported payload column now fails at index-build time with the column named, instead of building a sidecar that explodes on the first query. In runtimedb this surfaces at index-create, not in /v1/query.

3. Support all basic scalar types

Widened actual support to every primitive Arrow type DuckDB/parquet commonly emit:

Category	Types	Stored as
Integers	Int8/16/32/64, UInt8/16/32/64	INTEGER
Floats	Float32/64	REAL
Boolean	Boolean	INTEGER (0/1)
Strings	Utf8, LargeUtf8, Utf8View	TEXT
Binary	Binary, LargeBinary	BLOB
Temporal	Date32/64, Time32, Time64, Timestamp (all units, time zone preserved)	INTEGER
Lists	List/LargeList of {Utf8, LargeUtf8, Utf8View, Int32, Int64}	TEXT (JSON)

These are additional match arms only — each column dispatches to exactly one, so there is no runtime cost for types a given sidecar doesn't use.

Tests

round-trip Boolean / Date32 / tz-carrying Timestamp
round-trip all eight small-int / time / binary types with null cells, asserting both the reconstructed Arrow type and the values
reject a Decimal128 payload column at build time, asserting the error names the column
plus the existing list null/empty-cell coverage

Out of scope (now rejected early instead of silently broken)

Decimal128/256 — needs a lossless encoding (a 128-bit value doesn't fit SQLite INTEGER); deferred deliberately.
Float16 — would add the half crate dependency for a type parquet/DuckDB rarely emit.
Composites (Struct, Dictionary, FixedSizeList, nested lists).
The stale-sidecar-on-dataset-recreate item from the issue lives in runtimedb, not this crate.

claude

Symmetric and correct: the new Utf8View arms in arrow_cell_to_sql, serialize_list, and sql_values_to_arrow plus the missing LargeUtf8 read-back arm close the round-trip gap, matching the write side which already maps both to TEXT. Test covers value preservation and null survival.

claude

Clean, well-tested fix. The new LargeUtf8/Utf8View scalar arms and the List<Utf8View> arm in sql_values_to_arrow() mirror the existing read-back patterns and pair the correct builder with the item field. Write and read paths are symmetric, and the regression tests cover null preservation across all three cases.

claude

Symmetric, well-tested fix. Read-side reconstruction now matches the write side for LargeUtf8/Utf8View (scalar and list), and the list arms correctly avoid the StringBuilder type-mismatch panic. Tests cover types, values, and null round-tripping.

claude

Read/write paths are now symmetric across Utf8/LargeUtf8/Utf8View and List/LargeList, the json_text_to_list<O> refactor removes duplication and fixes the latent List<LargeUtf8> builder-mismatch panic, and the new tests cover all variants including nulls. LGTM.

claude

Clean, well-scoped fix. The generic json_text_to_list<O> unifies List/LargeList reconstruction and incidentally fixes the latent List<LargeUtf8> builder-mismatch panic. Write/read paths stay symmetric and tests cover the scalar, List, and LargeList variants plus null/empty edge cases.

PR #27 closed the reachable LargeUtf8/Utf8View/list read-back gaps, but the underlying defect class remained: the write side accepted any Arrow type (silent TEXT/NULL fallback) while the read side reconstructed only a subset, so an unsupported payload column built a "successful" sidecar that then failed on the first query (runtimedb#631). Close the class structurally: - Add `payload_type_supported` as the single source of truth for which Arrow types round-trip, and `validate_payload_schema` to gate every build entry point (`SqliteSidecarBuilder::begin`, `open_or_build`). An unsupported column now fails at index-build time with the column named, instead of at query time — which is where runtimedb's index-create step will surface it. - Widen actual support to the common parquet/DuckDB scalar types that previously fell through: Boolean, Date32, Date64, and Timestamp (all units, time zone preserved on read-back). Tests: reject a Decimal128 payload at build time; round-trip Boolean, Date32, and a tz-carrying Timestamp.

claude · 2026-06-10T07:19:50Z

                value_col_indices.len()
            )));
        }
+        validate_payload_schema(&schema, key_col_index)?;


Blocking: key_col_index is the wrong index to pass here. key_col_index indexes into the input batch passed to push_batch, but validate_payload_schema skips that index within the output schema — whose key column is always field 0 (see the begin docstring and finish()'s self.schema.field(0)).

When a caller passes key_col_index != 0 (a perfectly valid config — e.g. a storage engine whose rowid is not the first batch column), this:

skips the schema field at position key_col_index, which is a payload value column — leaving it unvalidated, so an unsupported type there slips straight through to the query-time failure this PR is meant to eliminate; and

validates schema field 0 (the actual key) as if it were a payload.

open_or_build correctly passes 0. begin should too, since the schema's key is always field 0:

Suggested change

validate_payload_schema(&schema, key_col_index)?;

validate_payload_schema(&schema, 0)?;

claude

Review

Blocking Issues

src/sqlite_provider.rs:340 — SqliteSidecarBuilder::begin calls validate_payload_schema(&schema, key_col_index), but key_col_index indexes the input batch while validate_payload_schema skips that position in the output schema (whose key is always field 0). When key_col_index != 0, a payload value column is skipped (unvalidated) and the build can still produce a "successful" sidecar that fails at query time — the exact defect class this PR closes. open_or_build passes 0 correctly; begin should pass 0 as well.

Action Required

Change the begin validation call to validate_payload_schema(&schema, 0).
Consider adding a regression test that drives begin with key_col_index != 0 and an unsupported payload column to lock in the fix.

Round out the supported set with the remaining primitive Arrow types so a parquet/DuckDB column of any basic type round-trips instead of being rejected at build time: - small integers: Int8, Int16, UInt8, UInt16 (→ INTEGER) - time-of-day: Time32 (Second/Millisecond), Time64 (Microsecond/Nanosecond) (→ INTEGER, unit restored from the schema on read-back) - binary: Binary, LargeBinary (→ BLOB) These add match arms only; each column hits a single arm, so there is no runtime cost for types a given sidecar doesn't use. Float16 (needs the `half` crate) and Decimal128/256 (need lossless text encoding) remain out of scope and are still rejected early by validate_payload_schema. Test: round-trip all eight new types with null cells, asserting both the reconstructed Arrow type and the values.

claude · 2026-06-10T07:25:09Z

                value_col_indices.len()
            )));
        }
+        validate_payload_schema(&schema, key_col_index)?;


Blocking (unresolved from cycle 1): this still passes key_col_index, which is the wrong index. key_col_index indexes into the input batch (see the begin docstring, lines 316-319), but validate_payload_schema skips that index within the output schema — whose key column is always field 0 (docstring line 314-315, and finish()'s self.schema.field(0) at line 408).

When a caller passes key_col_index != 0 (valid config — e.g. a storage engine whose key is not the first batch column), this:

skips output schema field at position key_col_index, which is a payload value column — leaving it unvalidated, so an unsupported type there slips through to the exact query-time failure this PR is meant to eliminate; and

validates output schema field 0 (the actual key) as if it were a payload.

open_or_build correctly passes 0. begin should too:

Suggested change

validate_payload_schema(&schema, key_col_index)?;

validate_payload_schema(&schema, 0)?;

claude

Review

Blocking Issues

src/sqlite_provider.rs:340 — begin() passes key_col_index to validate_payload_schema, but that function skips an index into the output schema (key always at field 0), while key_col_index indexes the input batch. This is the same blocking issue raised in cycle 1; it has not been fixed in the code. For any key_col_index != 0, a payload column goes unvalidated (defeating the PR purpose) and the key column is validated as a payload instead.

Action Required

Change line 340 to validate_payload_schema(&schema, 0)?; to match open_or_build, since the output schema key column is always field 0.

`validate_payload_schema` was passed `begin`'s `key_col_index`, but that indexes the input batch while the function skips that position in the *output* schema — whose key is always field 0. For any `key_col_index != 0` this skipped a real payload column (leaving an unsupported type to fail at query time, the exact defect this guards against) and validated the key column instead. Drop the parameter and always skip output field 0, matching how `finish()` and `open_or_build` derive the key from `schema.field(0)`. Add a regression test driving `begin` with a non-zero `key_col_index` and an unsupported payload column.

claude

Cycle 3: prior blocking issue resolved — validate_payload_schema now drops the key_col_index parameter and always skips output-schema field 0, with both begin and open_or_build updated. Write/affinity/read paths are consistent with payload_type_supported. Approving.

fix(sqlite): reconstruct LargeUtf8 and Utf8View on read-back

7393253