feat: split VectorIndexResolver into peek/resolve/ensure_loaded by anoop-narang · Pull Request #18 · hotdata-dev/datafusion-vector-search-ext

anoop-narang · 2026-04-04T05:04:52Z

Summary

Split the VectorIndexResolver trait into three methods to match DataFusion's sync/async pipeline:

peek(key) — sync, cheap: returns VectorIndexMeta (metric, scalar_kind, schema, key_col, config). Used by the optimizer rule to validate ANN rewrites without needing loaded index objects.
resolve(key) — sync: returns fully loaded Arc<RegisteredTable> from cache. Used by the executor.
ensure_loaded(key) — async: loads the index from catalog/storage on cache miss. Called by the physical planner before building execution plans.

USearchRegistry implements all three (always cached, ensure_loaded is a no-op). Production systems can implement catalog-backed loading in ensure_loaded.

Changes

registry.rs: Add VectorIndexMeta struct, split trait, implement for USearchRegistry
rule.rs: Use peek() instead of resolve() for optimizer validation
planner.rs: Call ensure_loaded() (async) before resolve() in physical planner
lib.rs: Export new types

Backward Compatibility

USearchRegistry implements the full trait — existing users just need to update the trait import
resolve() still works as before for the executor path

Replace direct USearchRegistry dependency with a VectorIndexResolver trait throughout the optimizer rule, physical planner, executor, and UDTF. USearchRegistry implements the trait (backward compatible). This enables production systems to implement a catalog-backed resolver that treats the in-memory index as a cache rather than the source of truth, following the same publication model as sorted/BM25 indexes. Key changes: - Add VectorIndexResolver trait with resolve(name) -> Option<Arc<RegisteredTable>> - USearchRegistry implements VectorIndexResolver - All consumers now accept Arc<dyn VectorIndexResolver> - Add into_resolver() convenience method on USearchRegistry

Split the resolver trait to match DataFusion's sync/async pipeline: - peek(key) — sync, cheap: returns VectorIndexMeta (metric, schema) for the optimizer rule to validate ANN rewrites without loading - resolve(key) — sync: returns loaded RegisteredTable from cache - ensure_loaded(key) — async: loads index on cache miss, called by the physical planner before building execution plans The optimizer rule uses peek (sync, no I/O). The planner calls ensure_loaded (async) then resolve. The executor uses resolve (sync, cache only). This enables catalog-backed resolution where ensure_loaded downloads and loads index files on demand, while the optimizer stays sync and fast.

claude · 2026-04-04T05:07:25Z

        };

-        let registered = self.registry.get(&table_name).ok_or_else(|| {
+        let registered = self.registry.resolve(&table_name).ok_or_else(|| {


P1 — UDTF bypasses ensure_loaded, breaking catalog-backed implementations.

TableFunctionImpl::call() is sync, so it cannot call async fn ensure_loaded(). For USearchRegistry this is fine (always cached), but for any production VectorIndexResolver where ensure_loaded actually does I/O, vector_usearch() will always return "not registered" for tables that haven't been pre-loaded — with no indication of why.

The PR introduces VectorIndexResolver as an extension point for production catalog-backed systems, but the UDTF silently doesn't support that path.

Two options:

Option A — Document the limitation so implementors know to pre-load before calling vector_usearch():

Suggested change

let registered = self.registry.resolve(&table_name).ok_or_else(|| {

// NOTE: The UDTF path is sync and cannot call ensure_loaded().

// Production VectorIndexResolver implementations must pre-load

// any table before it can be queried via vector_usearch().

let registered = self.registry.resolve(&table_name).ok_or_else(|| {

Option B — Add a sync peek_resolve() / try_resolve_blocking() method to the trait that implementations can override to do a synchronous cache lookup + blocking load, falling back to resolve() for the base case. This is more invasive but makes the contract explicit for implementors.

claude

Review

Issues

[P1] src/udtf.rs:88 — USearchUDTF::call() calls resolve() without calling ensure_loaded() first. Since call() is sync, it can't call async fn ensure_loaded(). For USearchRegistry this is harmless (always cached), but for any production VectorIndexResolver where ensure_loaded does real I/O, every call to vector_usearch() will fail with "not registered" for tables that haven't been pre-loaded. No error or warning signals why.

This is the exact scenario the PR is designed to enable, and the UDTF silently breaks it.

Action Required

Either document the limitation on USearchUDTF (tables must be pre-loaded before use with vector_usearch() for catalog-backed implementations), or add a mechanism for sync callers to trigger loading. Option A from the inline comment is the minimal fix.

anoop-narang · 2026-04-04T05:20:21Z

Addressed in 649cc86. This keeps vector_usearch() as-is architecturally, but makes the current limitation explicit: USearchUDTF::call() is synchronous, does not call ensure_loaded(), and is therefore cache-only for async-backed VectorIndexResolver implementations. I also updated the cache-miss error to say the table is not loaded locally and to direct users to the optimizer/planner vector query path or pre-loading first.

claude

Review

Issues

planner.rs:365 — usearch_execute calls resolve() without calling ensure_loaded() first. The comment claims eviction is "handled correctly", but for any VectorIndexResolver implementation that supports cache eviction (which is the stated purpose of this trait split), the entry won't be reloaded — it just fails with a confusing error. The index will be evicted between plan and execute time, resolve() returns None, and the query dies with "not in registry at execute time".

This is a direct contract violation: the trait docstring for ensure_loaded says it loads on cache miss, but that guarantee is only upheld in plan_extension, not in usearch_execute.

Action Required

In usearch_execute (planner.rs ~line 364), add ensure_loaded before resolve:

// Re-fetch at execute time; reload if evicted between plan and execute.
params.registry.ensure_loaded(&params.table_name).await?;
let registered = params.registry.resolve(&params.table_name).ok_or_else(|| {
    DataFusionError::Execution(format!(
        "USearchExec: table '{}' not in registry at execute time",
        params.table_name
    ))
})?;

Also update the comment — the current "so cache eviction between plan and execute is handled correctly" is factually wrong without the ensure_loaded call. For USearchRegistry (always cached) this is harmless, but the PR's stated goal is enabling production catalog-backed resolvers where eviction is real and this path will silently break.

claude · 2026-04-04T05:21:31Z

 ) -> Result<Vec<RecordBatch>> {
    // Re-fetch at execute time so cache eviction between plan and execute is handled correctly.
-    let registered = params.registry.get(&params.table_name).ok_or_else(|| {
+    let registered = params.registry.resolve(&params.table_name).ok_or_else(|| {


resolve() is called without a preceding ensure_loaded(). For async-backed resolvers that support cache eviction, the index will be gone here and the query fails — eviction is not actually handled, the error message is just friendlier.

Fix: call ensure_loaded first:

params.registry.ensure_loaded(&params.table_name).await?; let registered = params.registry.resolve(&params.table_name).ok_or_else(|| { ... })?;

Also update the comment on line 364 — "handled correctly" implies the index is reloaded on eviction, which it is not without this call.

claude · 2026-04-04T05:27:51Z

+/// Contains enough info for the optimizer to validate the ANN rewrite
+/// and build the logical plan node, without needing the loaded index objects.
+#[derive(Debug, Clone)]
+pub struct VectorIndexMeta {


scalar_kind, key_col, and config are not read by the only caller of peek() — the optimizer rule uses only metric and schema. Carrying three extra fields in VectorIndexMeta increases implementor burden: a catalog-backed resolver must provide these in the sync peek() path even though they're only needed after ensure_loaded. Consider trimming VectorIndexMeta to { metric, schema } and letting callers get the rest via resolve() once the index is loaded. If you want to keep them for forward-compat, a doc note explaining which fields are actually required by the optimizer would help implementors.

claude · 2026-04-04T05:27:54Z

+    /// On cache miss or stale entry, downloads files and loads the index.
+    /// On cache hit with current generation, returns immediately.
+    /// Called from the async physical planner before building the execution plan.
+    async fn ensure_loaded(&self, name: &str) -> datafusion::common::Result<()>;


The trait contract between ensure_loaded and resolve has a latent TOCTOU window: a concurrent eviction between the two calls produces an error rather than a retry. For USearchRegistry this is fine (never evicts), but production async-backed resolvers need to be aware. Consider documenting that implementations should guarantee the entry stays resident for at least one resolve() call after ensure_loaded returns successfully (e.g. via a short-lived pin/refcount), or explicitly note that callers must handle the None case with a retry.

anoop-narang added 2 commits April 4, 2026 00:28

claude Bot reviewed Apr 4, 2026

View reviewed changes

claude Bot requested changes Apr 4, 2026

View reviewed changes

Clarify vector_usearch cache-only behavior

649cc86

claude Bot requested changes Apr 4, 2026

View reviewed changes

claude Bot reviewed Apr 4, 2026

View reviewed changes

Reload vector indexes at execute time

5d23a84

claude Bot reviewed Apr 4, 2026

View reviewed changes

claude Bot approved these changes Apr 4, 2026

View reviewed changes

anoop-narang merged commit f9b4610 into main Apr 4, 2026
5 checks passed

anoop-narang deleted the feat/resolver-trait branch April 4, 2026 05:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: split VectorIndexResolver into peek/resolve/ensure_loaded#18

feat: split VectorIndexResolver into peek/resolve/ensure_loaded#18
anoop-narang merged 4 commits into
mainfrom
feat/resolver-trait

anoop-narang commented Apr 4, 2026

Uh oh!

claude Bot Apr 4, 2026

Uh oh!

claude Bot left a comment

Uh oh!

anoop-narang commented Apr 4, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot Apr 4, 2026

Uh oh!

claude Bot Apr 4, 2026

Uh oh!

claude Bot Apr 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-        let registered = self.registry.resolve(&table_name).ok_or_else(|| {
+        // NOTE: The UDTF path is sync and cannot call ensure_loaded().
+        // Production VectorIndexResolver implementations must pre-load
+        // any table before it can be queried via vector_usearch().
+        let registered = self.registry.resolve(&table_name).ok_or_else(|| {

Conversation

anoop-narang commented Apr 4, 2026

Summary

Changes

Backward Compatibility

Uh oh!

claude Bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review

Issues

Action Required

Uh oh!

anoop-narang commented Apr 4, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Review

Issues

Action Required

Uh oh!

claude Bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant