Skip to content

fix(sqlite): reconstruct all basic payload types + validate at build time#27

Merged
anoop-narang merged 8 commits into
mainfrom
fix/sqlite-largeutf8-readback
Jun 10, 2026
Merged

fix(sqlite): reconstruct all basic payload types + validate at build time#27
anoop-narang merged 8 commits into
mainfrom
fix/sqlite-largeutf8-readback

Conversation

@zfarrell

@zfarrell zfarrell commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Fixes hotdata-dev/runtimedb#631.

Goal

Fix vector search failing at query time with SqliteLookupProvider: unsupported Arrow type LargeUtf8 — the index built fine, only read-back failed. DuckDB/parquet readers emit LargeUtf8 for strings, which the provider's write side accepted but the read-back path (sql_values_to_arrow) had no arm for.

Root cause: a write/read type asymmetry

The write side always accepted more types than the read side could reconstruct — a long-standing, structural asymmetry, not a one-off. The write path has permissive catch-alls (arrow_type_to_sqlTEXT, arrow_cell_to_sqlNULL, serialize_listnull) so it silently accepts almost anything and builds a "successful" sidecar; the read path then errors on the first query. The failure is deferred from build time to query time, which is why #631 looked like a stale-type mystery rather than a missing code path.

This PR closes the class in three layers.

1. Reconstruct every type the write side accepts

Made the read-back path total for the existing string/list types, fixing the reported gap and the rest of the same bug class:

  • scalar LargeUtf8 and Utf8View had no read-back arm
  • List<Utf8View> had no reconstruction arm
  • List<LargeUtf8> was routed through a Utf8 StringBuilder, panicking in Arrow 58 on the field/child type mismatch
  • LargeList<_> was serialized on write but never reconstructed on read

List reconstruction is refactored into a generic json_text_to_list::<O: OffsetSizeTrait> so List and LargeList share one path.

2. Fail fast at build time (so this can't recur silently)

  • payload_type_supported() is now the single source of truth for which Arrow types round-trip; the write arms, column affinity, and read arms all track it.
  • validate_payload_schema() runs at every build entry point (SqliteSidecarBuilder::begin, open_or_build). An unsupported payload column now fails at index-build time with the column named, instead of building a sidecar that explodes on the first query. In runtimedb this surfaces at index-create, not in /v1/query.

3. Support all basic scalar types

Widened actual support to every primitive Arrow type DuckDB/parquet commonly emit:

Category Types Stored as
Integers Int8/16/32/64, UInt8/16/32/64 INTEGER
Floats Float32/64 REAL
Boolean Boolean INTEGER (0/1)
Strings Utf8, LargeUtf8, Utf8View TEXT
Binary Binary, LargeBinary BLOB
Temporal Date32/64, Time32, Time64, Timestamp (all units, time zone preserved) INTEGER
Lists List/LargeList of {Utf8, LargeUtf8, Utf8View, Int32, Int64} TEXT (JSON)

These are additional match arms only — each column dispatches to exactly one, so there is no runtime cost for types a given sidecar doesn't use.

Tests

  • round-trip Boolean / Date32 / tz-carrying Timestamp
  • round-trip all eight small-int / time / binary types with null cells, asserting both the reconstructed Arrow type and the values
  • reject a Decimal128 payload column at build time, asserting the error names the column
  • plus the existing list null/empty-cell coverage

Out of scope (now rejected early instead of silently broken)

  • Decimal128/256 — needs a lossless encoding (a 128-bit value doesn't fit SQLite INTEGER); deferred deliberately.
  • Float16 — would add the half crate dependency for a type parquet/DuckDB rarely emit.
  • Composites (Struct, Dictionary, FixedSizeList, nested lists).
  • The stale-sidecar-on-dataset-recreate item from the issue lives in runtimedb, not this crate.

claude[bot]
claude Bot previously approved these changes Jun 9, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symmetric and correct: the new Utf8View arms in arrow_cell_to_sql, serialize_list, and sql_values_to_arrow plus the missing LargeUtf8 read-back arm close the round-trip gap, matching the write side which already maps both to TEXT. Test covers value preservation and null survival.

claude[bot]
claude Bot previously approved these changes Jun 9, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-tested fix. The new LargeUtf8/Utf8View scalar arms and the List<Utf8View> arm in sql_values_to_arrow() mirror the existing read-back patterns and pair the correct builder with the item field. Write and read paths are symmetric, and the regression tests cover null preservation across all three cases.

claude[bot]
claude Bot previously approved these changes Jun 9, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symmetric, well-tested fix. Read-side reconstruction now matches the write side for LargeUtf8/Utf8View (scalar and list), and the list arms correctly avoid the StringBuilder type-mismatch panic. Tests cover types, values, and null round-tripping.

claude[bot]
claude Bot previously approved these changes Jun 9, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read/write paths are now symmetric across Utf8/LargeUtf8/Utf8View and List/LargeList, the json_text_to_list<O> refactor removes duplication and fixes the latent List<LargeUtf8> builder-mismatch panic, and the new tests cover all variants including nulls. LGTM.

claude[bot]
claude Bot previously approved these changes Jun 9, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-scoped fix. The generic json_text_to_list<O> unifies List/LargeList reconstruction and incidentally fixes the latent List<LargeUtf8> builder-mismatch panic. Write/read paths stay symmetric and tests cover the scalar, List, and LargeList variants plus null/empty edge cases.

PR #27 closed the reachable LargeUtf8/Utf8View/list read-back gaps, but
the underlying defect class remained: the write side accepted any Arrow
type (silent TEXT/NULL fallback) while the read side reconstructed only a
subset, so an unsupported payload column built a "successful" sidecar that
then failed on the first query (runtimedb#631).

Close the class structurally:

- Add `payload_type_supported` as the single source of truth for which
  Arrow types round-trip, and `validate_payload_schema` to gate every
  build entry point (`SqliteSidecarBuilder::begin`, `open_or_build`). An
  unsupported column now fails at index-build time with the column named,
  instead of at query time — which is where runtimedb's index-create step
  will surface it.
- Widen actual support to the common parquet/DuckDB scalar types that
  previously fell through: Boolean, Date32, Date64, and Timestamp (all
  units, time zone preserved on read-back).

Tests: reject a Decimal128 payload at build time; round-trip Boolean,
Date32, and a tz-carrying Timestamp.
Comment thread src/sqlite_provider.rs Outdated
value_col_indices.len()
)));
}
validate_payload_schema(&schema, key_col_index)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: key_col_index is the wrong index to pass here. key_col_index indexes into the input batch passed to push_batch, but validate_payload_schema skips that index within the output schema — whose key column is always field 0 (see the begin docstring and finish()'s self.schema.field(0)).

When a caller passes key_col_index != 0 (a perfectly valid config — e.g. a storage engine whose rowid is not the first batch column), this:

  • skips the schema field at position key_col_index, which is a payload value column — leaving it unvalidated, so an unsupported type there slips straight through to the query-time failure this PR is meant to eliminate; and
  • validates schema field 0 (the actual key) as if it were a payload.

open_or_build correctly passes 0. begin should too, since the schema's key is always field 0:

Suggested change
validate_payload_schema(&schema, key_col_index)?;
validate_payload_schema(&schema, 0)?;

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Blocking Issues

  • src/sqlite_provider.rs:340SqliteSidecarBuilder::begin calls validate_payload_schema(&schema, key_col_index), but key_col_index indexes the input batch while validate_payload_schema skips that position in the output schema (whose key is always field 0). When key_col_index != 0, a payload value column is skipped (unvalidated) and the build can still produce a "successful" sidecar that fails at query time — the exact defect class this PR closes. open_or_build passes 0 correctly; begin should pass 0 as well.

Action Required

  • Change the begin validation call to validate_payload_schema(&schema, 0).
  • Consider adding a regression test that drives begin with key_col_index != 0 and an unsupported payload column to lock in the fix.

Round out the supported set with the remaining primitive Arrow types so a
parquet/DuckDB column of any basic type round-trips instead of being
rejected at build time:

- small integers: Int8, Int16, UInt8, UInt16 (→ INTEGER)
- time-of-day: Time32 (Second/Millisecond), Time64 (Microsecond/Nanosecond)
  (→ INTEGER, unit restored from the schema on read-back)
- binary: Binary, LargeBinary (→ BLOB)

These add match arms only; each column hits a single arm, so there is no
runtime cost for types a given sidecar doesn't use. Float16 (needs the
`half` crate) and Decimal128/256 (need lossless text encoding) remain out
of scope and are still rejected early by validate_payload_schema.

Test: round-trip all eight new types with null cells, asserting both the
reconstructed Arrow type and the values.
Comment thread src/sqlite_provider.rs Outdated
value_col_indices.len()
)));
}
validate_payload_schema(&schema, key_col_index)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking (unresolved from cycle 1): this still passes key_col_index, which is the wrong index. key_col_index indexes into the input batch (see the begin docstring, lines 316-319), but validate_payload_schema skips that index within the output schema — whose key column is always field 0 (docstring line 314-315, and finish()'s self.schema.field(0) at line 408).

When a caller passes key_col_index != 0 (valid config — e.g. a storage engine whose key is not the first batch column), this:

  • skips output schema field at position key_col_index, which is a payload value column — leaving it unvalidated, so an unsupported type there slips through to the exact query-time failure this PR is meant to eliminate; and
  • validates output schema field 0 (the actual key) as if it were a payload.

open_or_build correctly passes 0. begin should too:

Suggested change
validate_payload_schema(&schema, key_col_index)?;
validate_payload_schema(&schema, 0)?;

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Blocking Issues

  • src/sqlite_provider.rs:340begin() passes key_col_index to validate_payload_schema, but that function skips an index into the output schema (key always at field 0), while key_col_index indexes the input batch. This is the same blocking issue raised in cycle 1; it has not been fixed in the code. For any key_col_index != 0, a payload column goes unvalidated (defeating the PR purpose) and the key column is validated as a payload instead.

Action Required

Change line 340 to validate_payload_schema(&schema, 0)?; to match open_or_build, since the output schema key column is always field 0.

@anoop-narang anoop-narang changed the title fix(sqlite): reconstruct LargeUtf8 and Utf8View on read-back fix(sqlite): reconstruct all basic payload types + validate at build time Jun 10, 2026
`validate_payload_schema` was passed `begin`'s `key_col_index`, but that
indexes the input batch while the function skips that position in the
*output* schema — whose key is always field 0. For any `key_col_index != 0`
this skipped a real payload column (leaving an unsupported type to fail at
query time, the exact defect this guards against) and validated the key
column instead.

Drop the parameter and always skip output field 0, matching how `finish()`
and `open_or_build` derive the key from `schema.field(0)`. Add a regression
test driving `begin` with a non-zero `key_col_index` and an unsupported
payload column.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cycle 3: prior blocking issue resolved — validate_payload_schema now drops the key_col_index parameter and always skips output-schema field 0, with both begin and open_or_build updated. Write/affinity/read paths are consistent with payload_type_supported. Approving.

@anoop-narang anoop-narang merged commit a0d4293 into main Jun 10, 2026
6 checks passed
@anoop-narang anoop-narang deleted the fix/sqlite-largeutf8-readback branch June 10, 2026 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants