refactor(registry): split provider into scan_provider + lookup_provider#6
Conversation
Separate RegisteredTable.provider into scan_provider (TableProvider for WHERE evaluation) and lookup_provider (PointLookupProvider for key-based fetch), preparing for Parquet-native adaptive filtering.
| task_ctx: Arc<TaskContext>, | ||
| ) -> Result<Vec<RecordBatch>> { | ||
| let provider_schema = registered.provider.schema(); | ||
| let provider_schema = registered.scan_provider.schema(); |
There was a problem hiding this comment.
P1 — key_col_idx from lookup_provider is applied to scan_provider batches
provider_schema is correctly taken from scan_provider here, but the very next line:
let key_col_idx = provider_key_col_idx(registered)?;calls lookup_provider.schema().index_of(key_col). That index is then used at the scan-phase loop to read batch.column(key_col_idx) where batch comes from scan_plan (i.e. scan_provider). When the two providers have different schemas — the exact split this PR is designed to enable — the column positions can differ, silently reading the wrong column or panicking.
The tests don't catch this because both test registrations pass provider.clone() for both arguments, making the schemas identical.
Fix: derive two separate indices:
// for scan batches
let scan_key_col_idx = provider_schema
.index_of(®istered.key_col)
.map_err(|_| DataFusionError::Execution(format!(
"key column '{}' not found in scan provider schema", registered.key_col
)))?;
// for lookup batches passed to attach_distances
let lookup_key_col_idx = provider_key_col_idx(registered)?;Use scan_key_col_idx in the scan loop and lookup_key_col_idx at the two attach_distances(…) call sites.
There was a problem hiding this comment.
Review
Issues
P1 — wrong key column index applied to scan batches src/planner.rs
In adaptive_filtered_execute, key_col_idx is derived from lookup_provider.schema() (via provider_key_col_idx) but is immediately used to read batch.column(key_col_idx) where batch comes from iterating the scan_plan built from scan_provider. When the two providers have different schemas — the exact scenario this PR enables (Parquet scan + SQLite lookup) — the key column can be at a different index in each schema, causing the scan phase to silently read the wrong column or panic.
The tests pass because both make_exec_ctx and make_ctx register provider.clone() for both arguments, making the schemas identical and masking the divergence.
The same key_col_idx is also passed to attach_distances further down, where it operates on batches from lookup_provider.fetch_by_keys — that usage is correct.
Action Required
Split into two indices before the scan loop:
// Index in scan_provider schema — used when reading scan batches.
let scan_key_col_idx = provider_schema
.index_of(®istered.key_col)
.map_err(|_| DataFusionError::Execution(format!(
"USearchExecPlanner: key column '{}' not found in scan provider schema",
registered.key_col
)))?;
// Index in lookup_provider schema — used by attach_distances on fetched rows.
let lookup_key_col_idx = provider_key_col_idx(registered)?;Use scan_key_col_idx in the scan-phase loop (batch.column(scan_key_col_idx)) and lookup_key_col_idx at both attach_distances call sites.
…emas The scan_provider and lookup_provider may have different schemas, so the key column can be at different indices. Use scan_key_col_idx when reading scan batches and lookup_key_col_idx for attach_distances on fetched rows.
| reg_key, | ||
| make_populated_index(), | ||
| provider.clone(), | ||
| provider.clone(), |
There was a problem hiding this comment.
Both scan_provider and lookup_provider are the same HashKeyProvider here (and in optimizer_rule.rs), so the tests never exercise the case where the two schemas differ — the exact scenario this PR was designed to enable. Consider adding a test that uses a minimal scan_provider schema (id + vector columns) alongside a lookup_provider schema that has additional non-vector columns, to verify scan_key_col_idx and lookup_key_col_idx are resolved independently and correctly under divergent schemas.
| index.change_expansion_search(config.expansion_search); | ||
|
|
||
| let data_schema = provider.schema(); | ||
| let data_schema = lookup_provider.schema(); |
There was a problem hiding this comment.
P1 — Missing validation of key_col in scan_provider.schema()
add_with_config checks that key_col exists in lookup_provider.schema() (lines 279–283), but never checks scan_provider.schema(). In adaptive_filtered_execute, the planner calls scan_provider.schema().index_of(®istered.key_col) at query execution time — so a scan_provider that lacks the key column silently passes registration and only fails on the first filtered query.
The whole motivation of this PR is to allow different providers for scan vs lookup; the tests mask the gap because they pass the same HashKeyProvider for both.
Fix: add an equivalent guard for scan_provider immediately after the existing lookup_provider check:
scan_provider
.schema()
.index_of(key_col)
.map_err(|_| {
DataFusionError::Execution(format!(
"USearchRegistry: key column '{key_col}' not found in scan provider schema for table '{name}'"
))
})?;There was a problem hiding this comment.
Review
Issues
P1 — scan_provider key column not validated at registration time
src/registry.rs, add_with_config (~line 277)
add_with_config validates that key_col exists in lookup_provider.schema() at registration time, but applies no equivalent check to scan_provider.schema(). In adaptive_filtered_execute (planner.rs:357), the key column is looked up in the scan provider schema at query execution time. A misconfigured registration — passing a scan_provider that lacks the key column — passes silently and only produces a runtime error on the first filtered query.
This gap is invisible in the current tests because both test callsites pass the same HashKeyProvider for scan_provider and lookup_provider. The explicit purpose of this PR is to enable different objects for the two roles, so this is exactly the case where the bug will surface in practice.
Action Required
Add a scan_provider.schema().index_of(key_col) guard in add_with_config immediately after the existing lookup_provider check (see inline comment). Also add a test that registers with a scan_provider missing the key column and asserts the registration returns an error.
…ation Add scan_provider.schema().index_of(key_col) guard in add_with_config to catch misconfigured registrations early instead of at query execution time.
Summary
RegisteredTable.providerintoscan_provider(TableProvider) andlookup_provider(PointLookupProvider)TableProvidersupertrait fromPointLookupProvider, addschema()directly to the traitscan_providerfor pre-scan andlookup_providerfor key-based fetchContext
Step 2 of the Parquet-native adaptive filter plan. Prepares the ext crate interface so runtimedb can pass a Parquet
TableProviderfor WHERE evaluation and a SQLitePointLookupProviderfor key-based fetch independently.Test plan
cargo fmt --checkcargo clippy -- -D warningscargo test --features sqlite-provider— all 37 tests pass