refactor(registry): split provider into scan_provider + lookup_provider by anoop-narang · Pull Request #6 · hotdata-dev/datafusion-vector-search-ext

anoop-narang · 2026-03-17T18:01:38Z

Summary

Split RegisteredTable.provider into scan_provider (TableProvider) and lookup_provider (PointLookupProvider)
Remove TableProvider supertrait from PointLookupProvider, add schema() directly to the trait
Update planner to use scan_provider for pre-scan and lookup_provider for key-based fetch

Context

Step 2 of the Parquet-native adaptive filter plan. Prepares the ext crate interface so runtimedb can pass a Parquet TableProvider for WHERE evaluation and a SQLite PointLookupProvider for key-based fetch independently.

Test plan

cargo fmt --check
cargo clippy -- -D warnings
cargo test --features sqlite-provider — all 37 tests pass

Separate RegisteredTable.provider into scan_provider (TableProvider for WHERE evaluation) and lookup_provider (PointLookupProvider for key-based fetch), preparing for Parquet-native adaptive filtering.

claude · 2026-03-17T18:03:55Z

    task_ctx: Arc<TaskContext>,
 ) -> Result<Vec<RecordBatch>> {
-    let provider_schema = registered.provider.schema();
+    let provider_schema = registered.scan_provider.schema();


P1 — key_col_idx from lookup_provider is applied to scan_provider batches

provider_schema is correctly taken from scan_provider here, but the very next line:

let key_col_idx = provider_key_col_idx(registered)?;

calls lookup_provider.schema().index_of(key_col). That index is then used at the scan-phase loop to read batch.column(key_col_idx) where batch comes from scan_plan (i.e. scan_provider). When the two providers have different schemas — the exact split this PR is designed to enable — the column positions can differ, silently reading the wrong column or panicking.

The tests don't catch this because both test registrations pass provider.clone() for both arguments, making the schemas identical.

Fix: derive two separate indices:

// for scan batches let scan_key_col_idx = provider_schema .index_of(&registered.key_col) .map_err(|_| DataFusionError::Execution(format!( "key column '{}' not found in scan provider schema", registered.key_col )))?; // for lookup batches passed to attach_distances let lookup_key_col_idx = provider_key_col_idx(registered)?;

Use scan_key_col_idx in the scan loop and lookup_key_col_idx at the two attach_distances(…) call sites.

claude

Review

Issues

P1 — wrong key column index applied to scan batches src/planner.rs

In adaptive_filtered_execute, key_col_idx is derived from lookup_provider.schema() (via provider_key_col_idx) but is immediately used to read batch.column(key_col_idx) where batch comes from iterating the scan_plan built from scan_provider. When the two providers have different schemas — the exact scenario this PR enables (Parquet scan + SQLite lookup) — the key column can be at a different index in each schema, causing the scan phase to silently read the wrong column or panic.

The tests pass because both make_exec_ctx and make_ctx register provider.clone() for both arguments, making the schemas identical and masking the divergence.

The same key_col_idx is also passed to attach_distances further down, where it operates on batches from lookup_provider.fetch_by_keys — that usage is correct.

Action Required

Split into two indices before the scan loop:

// Index in scan_provider schema — used when reading scan batches.
let scan_key_col_idx = provider_schema
    .index_of(&registered.key_col)
    .map_err(|_| DataFusionError::Execution(format!(
        "USearchExecPlanner: key column '{}' not found in scan provider schema",
        registered.key_col
    )))?;

// Index in lookup_provider schema — used by attach_distances on fetched rows.
let lookup_key_col_idx = provider_key_col_idx(registered)?;

Use scan_key_col_idx in the scan-phase loop (batch.column(scan_key_col_idx)) and lookup_key_col_idx at both attach_distances call sites.

…emas The scan_provider and lookup_provider may have different schemas, so the key column can be at different indices. Use scan_key_col_idx when reading scan batches and lookup_key_col_idx for attach_distances on fetched rows.

claude · 2026-03-17T18:08:21Z

        reg_key,
        make_populated_index(),
        provider.clone(),
+        provider.clone(),


Both scan_provider and lookup_provider are the same HashKeyProvider here (and in optimizer_rule.rs), so the tests never exercise the case where the two schemas differ — the exact scenario this PR was designed to enable. Consider adding a test that uses a minimal scan_provider schema (id + vector columns) alongside a lookup_provider schema that has additional non-vector columns, to verify scan_key_col_idx and lookup_key_col_idx are resolved independently and correctly under divergent schemas.

claude · 2026-03-17T18:12:56Z

        index.change_expansion_search(config.expansion_search);

-        let data_schema = provider.schema();
+        let data_schema = lookup_provider.schema();


P1 — Missing validation of key_col in scan_provider.schema()

add_with_config checks that key_col exists in lookup_provider.schema() (lines 279–283), but never checks scan_provider.schema(). In adaptive_filtered_execute, the planner calls scan_provider.schema().index_of(&registered.key_col) at query execution time — so a scan_provider that lacks the key column silently passes registration and only fails on the first filtered query.

The whole motivation of this PR is to allow different providers for scan vs lookup; the tests mask the gap because they pass the same HashKeyProvider for both.

Fix: add an equivalent guard for scan_provider immediately after the existing lookup_provider check:

scan_provider .schema() .index_of(key_col) .map_err(|_| { DataFusionError::Execution(format!( "USearchRegistry: key column '{key_col}' not found in scan provider schema for table '{name}'" )) })?;

claude

Review

Issues

P1 — scan_provider key column not validated at registration time
src/registry.rs, add_with_config (~line 277)

add_with_config validates that key_col exists in lookup_provider.schema() at registration time, but applies no equivalent check to scan_provider.schema(). In adaptive_filtered_execute (planner.rs:357), the key column is looked up in the scan provider schema at query execution time. A misconfigured registration — passing a scan_provider that lacks the key column — passes silently and only produces a runtime error on the first filtered query.

This gap is invisible in the current tests because both test callsites pass the same HashKeyProvider for scan_provider and lookup_provider. The explicit purpose of this PR is to enable different objects for the two roles, so this is exactly the case where the bug will surface in practice.

Action Required

Add a scan_provider.schema().index_of(key_col) guard in add_with_config immediately after the existing lookup_provider check (see inline comment). Also add a test that registers with a scan_provider missing the key column and asserts the registration returns an error.

…ation Add scan_provider.schema().index_of(key_col) guard in add_with_config to catch misconfigured registrations early instead of at query execution time.

refactor(registry): split provider into scan_provider + lookup_provider

7079e80

Separate RegisteredTable.provider into scan_provider (TableProvider for WHERE evaluation) and lookup_provider (PointLookupProvider for key-based fetch), preparing for Parquet-native adaptive filtering.

claude Bot reviewed Mar 17, 2026

View reviewed changes

claude Bot requested changes Mar 17, 2026

View reviewed changes

claude Bot previously approved these changes Mar 17, 2026

View reviewed changes

claude Bot reviewed Mar 17, 2026

View reviewed changes

style(planner): fix rustfmt formatting

d474463

anoop-narang dismissed claude[bot]’s stale review via d474463 March 17, 2026 18:11

claude Bot reviewed Mar 17, 2026

View reviewed changes

claude Bot requested changes Mar 17, 2026

View reviewed changes

fix(registry): validate key column in scan_provider schema at registr…

8d12143

…ation Add scan_provider.schema().index_of(key_col) guard in add_with_config to catch misconfigured registrations early instead of at query execution time.

claude Bot approved these changes Mar 17, 2026

View reviewed changes

anoop-narang merged commit 4605f0e into main Mar 17, 2026
5 checks passed

anoop-narang deleted the feat/split-scan-lookup-providers branch March 17, 2026 18:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(registry): split provider into scan_provider + lookup_provider#6

refactor(registry): split provider into scan_provider + lookup_provider#6
anoop-narang merged 4 commits into
mainfrom
feat/split-scan-lookup-providers

anoop-narang commented Mar 17, 2026

Uh oh!

claude Bot Mar 17, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot Mar 17, 2026

Uh oh!

claude Bot Mar 17, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

anoop-narang commented Mar 17, 2026

Summary

Context

Test plan

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review

Issues

Action Required

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review

Issues

Action Required

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant