Skip to content

refactor(registry): split provider into scan_provider + lookup_provider#6

Merged
anoop-narang merged 4 commits into
mainfrom
feat/split-scan-lookup-providers
Mar 17, 2026
Merged

refactor(registry): split provider into scan_provider + lookup_provider#6
anoop-narang merged 4 commits into
mainfrom
feat/split-scan-lookup-providers

Conversation

@anoop-narang
Copy link
Copy Markdown
Collaborator

Summary

  • Split RegisteredTable.provider into scan_provider (TableProvider) and lookup_provider (PointLookupProvider)
  • Remove TableProvider supertrait from PointLookupProvider, add schema() directly to the trait
  • Update planner to use scan_provider for pre-scan and lookup_provider for key-based fetch

Context

Step 2 of the Parquet-native adaptive filter plan. Prepares the ext crate interface so runtimedb can pass a Parquet TableProvider for WHERE evaluation and a SQLite PointLookupProvider for key-based fetch independently.

Test plan

  • cargo fmt --check
  • cargo clippy -- -D warnings
  • cargo test --features sqlite-provider — all 37 tests pass

Separate RegisteredTable.provider into scan_provider (TableProvider for
WHERE evaluation) and lookup_provider (PointLookupProvider for key-based
fetch), preparing for Parquet-native adaptive filtering.
Comment thread src/planner.rs
task_ctx: Arc<TaskContext>,
) -> Result<Vec<RecordBatch>> {
let provider_schema = registered.provider.schema();
let provider_schema = registered.scan_provider.schema();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — key_col_idx from lookup_provider is applied to scan_provider batches

provider_schema is correctly taken from scan_provider here, but the very next line:

let key_col_idx = provider_key_col_idx(registered)?;

calls lookup_provider.schema().index_of(key_col). That index is then used at the scan-phase loop to read batch.column(key_col_idx) where batch comes from scan_plan (i.e. scan_provider). When the two providers have different schemas — the exact split this PR is designed to enable — the column positions can differ, silently reading the wrong column or panicking.

The tests don't catch this because both test registrations pass provider.clone() for both arguments, making the schemas identical.

Fix: derive two separate indices:

// for scan batches
let scan_key_col_idx = provider_schema
    .index_of(&registered.key_col)
    .map_err(|_| DataFusionError::Execution(format!(
        "key column '{}' not found in scan provider schema", registered.key_col
    )))?;

// for lookup batches passed to attach_distances
let lookup_key_col_idx = provider_key_col_idx(registered)?;

Use scan_key_col_idx in the scan loop and lookup_key_col_idx at the two attach_distances(…) call sites.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Issues

P1 — wrong key column index applied to scan batches src/planner.rs

In adaptive_filtered_execute, key_col_idx is derived from lookup_provider.schema() (via provider_key_col_idx) but is immediately used to read batch.column(key_col_idx) where batch comes from iterating the scan_plan built from scan_provider. When the two providers have different schemas — the exact scenario this PR enables (Parquet scan + SQLite lookup) — the key column can be at a different index in each schema, causing the scan phase to silently read the wrong column or panic.

The tests pass because both make_exec_ctx and make_ctx register provider.clone() for both arguments, making the schemas identical and masking the divergence.

The same key_col_idx is also passed to attach_distances further down, where it operates on batches from lookup_provider.fetch_by_keys — that usage is correct.

Action Required

Split into two indices before the scan loop:

// Index in scan_provider schema — used when reading scan batches.
let scan_key_col_idx = provider_schema
    .index_of(&registered.key_col)
    .map_err(|_| DataFusionError::Execution(format!(
        "USearchExecPlanner: key column '{}' not found in scan provider schema",
        registered.key_col
    )))?;

// Index in lookup_provider schema — used by attach_distances on fetched rows.
let lookup_key_col_idx = provider_key_col_idx(registered)?;

Use scan_key_col_idx in the scan-phase loop (batch.column(scan_key_col_idx)) and lookup_key_col_idx at both attach_distances call sites.

…emas

The scan_provider and lookup_provider may have different schemas, so the
key column can be at different indices. Use scan_key_col_idx when reading
scan batches and lookup_key_col_idx for attach_distances on fetched rows.
claude[bot]
claude Bot previously approved these changes Mar 17, 2026
Comment thread tests/execution.rs
reg_key,
make_populated_index(),
provider.clone(),
provider.clone(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both scan_provider and lookup_provider are the same HashKeyProvider here (and in optimizer_rule.rs), so the tests never exercise the case where the two schemas differ — the exact scenario this PR was designed to enable. Consider adding a test that uses a minimal scan_provider schema (id + vector columns) alongside a lookup_provider schema that has additional non-vector columns, to verify scan_key_col_idx and lookup_key_col_idx are resolved independently and correctly under divergent schemas.

Comment thread src/registry.rs
index.change_expansion_search(config.expansion_search);

let data_schema = provider.schema();
let data_schema = lookup_provider.schema();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — Missing validation of key_col in scan_provider.schema()

add_with_config checks that key_col exists in lookup_provider.schema() (lines 279–283), but never checks scan_provider.schema(). In adaptive_filtered_execute, the planner calls scan_provider.schema().index_of(&registered.key_col) at query execution time — so a scan_provider that lacks the key column silently passes registration and only fails on the first filtered query.

The whole motivation of this PR is to allow different providers for scan vs lookup; the tests mask the gap because they pass the same HashKeyProvider for both.

Fix: add an equivalent guard for scan_provider immediately after the existing lookup_provider check:

scan_provider
    .schema()
    .index_of(key_col)
    .map_err(|_| {
        DataFusionError::Execution(format!(
            "USearchRegistry: key column '{key_col}' not found in scan provider schema for table '{name}'"
        ))
    })?;

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Issues

P1 — scan_provider key column not validated at registration time
src/registry.rs, add_with_config (~line 277)

add_with_config validates that key_col exists in lookup_provider.schema() at registration time, but applies no equivalent check to scan_provider.schema(). In adaptive_filtered_execute (planner.rs:357), the key column is looked up in the scan provider schema at query execution time. A misconfigured registration — passing a scan_provider that lacks the key column — passes silently and only produces a runtime error on the first filtered query.

This gap is invisible in the current tests because both test callsites pass the same HashKeyProvider for scan_provider and lookup_provider. The explicit purpose of this PR is to enable different objects for the two roles, so this is exactly the case where the bug will surface in practice.

Action Required

Add a scan_provider.schema().index_of(key_col) guard in add_with_config immediately after the existing lookup_provider check (see inline comment). Also add a test that registers with a scan_provider missing the key column and asserts the registration returns an error.

…ation

Add scan_provider.schema().index_of(key_col) guard in add_with_config to
catch misconfigured registrations early instead of at query execution time.
@anoop-narang anoop-narang merged commit 4605f0e into main Mar 17, 2026
5 checks passed
@anoop-narang anoop-narang deleted the feat/split-scan-lookup-providers branch March 17, 2026 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant