feat: split VectorIndexResolver into peek/resolve/ensure_loaded#18
Conversation
Replace direct USearchRegistry dependency with a VectorIndexResolver trait throughout the optimizer rule, physical planner, executor, and UDTF. USearchRegistry implements the trait (backward compatible). This enables production systems to implement a catalog-backed resolver that treats the in-memory index as a cache rather than the source of truth, following the same publication model as sorted/BM25 indexes. Key changes: - Add VectorIndexResolver trait with resolve(name) -> Option<Arc<RegisteredTable>> - USearchRegistry implements VectorIndexResolver - All consumers now accept Arc<dyn VectorIndexResolver> - Add into_resolver() convenience method on USearchRegistry
Split the resolver trait to match DataFusion's sync/async pipeline: - peek(key) — sync, cheap: returns VectorIndexMeta (metric, schema) for the optimizer rule to validate ANN rewrites without loading - resolve(key) — sync: returns loaded RegisteredTable from cache - ensure_loaded(key) — async: loads index on cache miss, called by the physical planner before building execution plans The optimizer rule uses peek (sync, no I/O). The planner calls ensure_loaded (async) then resolve. The executor uses resolve (sync, cache only). This enables catalog-backed resolution where ensure_loaded downloads and loads index files on demand, while the optimizer stays sync and fast.
| }; | ||
|
|
||
| let registered = self.registry.get(&table_name).ok_or_else(|| { | ||
| let registered = self.registry.resolve(&table_name).ok_or_else(|| { |
There was a problem hiding this comment.
P1 — UDTF bypasses ensure_loaded, breaking catalog-backed implementations.
TableFunctionImpl::call() is sync, so it cannot call async fn ensure_loaded(). For USearchRegistry this is fine (always cached), but for any production VectorIndexResolver where ensure_loaded actually does I/O, vector_usearch() will always return "not registered" for tables that haven't been pre-loaded — with no indication of why.
The PR introduces VectorIndexResolver as an extension point for production catalog-backed systems, but the UDTF silently doesn't support that path.
Two options:
Option A — Document the limitation so implementors know to pre-load before calling vector_usearch():
| let registered = self.registry.resolve(&table_name).ok_or_else(|| { | |
| // NOTE: The UDTF path is sync and cannot call ensure_loaded(). | |
| // Production VectorIndexResolver implementations must pre-load | |
| // any table before it can be queried via vector_usearch(). | |
| let registered = self.registry.resolve(&table_name).ok_or_else(|| { |
Option B — Add a sync peek_resolve() / try_resolve_blocking() method to the trait that implementations can override to do a synchronous cache lookup + blocking load, falling back to resolve() for the base case. This is more invasive but makes the contract explicit for implementors.
There was a problem hiding this comment.
Review
Issues
[P1] src/udtf.rs:88 — USearchUDTF::call() calls resolve() without calling ensure_loaded() first. Since call() is sync, it can't call async fn ensure_loaded(). For USearchRegistry this is harmless (always cached), but for any production VectorIndexResolver where ensure_loaded does real I/O, every call to vector_usearch() will fail with "not registered" for tables that haven't been pre-loaded. No error or warning signals why.
This is the exact scenario the PR is designed to enable, and the UDTF silently breaks it.
Action Required
Either document the limitation on USearchUDTF (tables must be pre-loaded before use with vector_usearch() for catalog-backed implementations), or add a mechanism for sync callers to trigger loading. Option A from the inline comment is the minimal fix.
|
Addressed in 649cc86. This keeps vector_usearch() as-is architecturally, but makes the current limitation explicit: USearchUDTF::call() is synchronous, does not call ensure_loaded(), and is therefore cache-only for async-backed VectorIndexResolver implementations. I also updated the cache-miss error to say the table is not loaded locally and to direct users to the optimizer/planner vector query path or pre-loading first. |
There was a problem hiding this comment.
Review
Issues
planner.rs:365 — usearch_execute calls resolve() without calling ensure_loaded() first. The comment claims eviction is "handled correctly", but for any VectorIndexResolver implementation that supports cache eviction (which is the stated purpose of this trait split), the entry won't be reloaded — it just fails with a confusing error. The index will be evicted between plan and execute time, resolve() returns None, and the query dies with "not in registry at execute time".
This is a direct contract violation: the trait docstring for ensure_loaded says it loads on cache miss, but that guarantee is only upheld in plan_extension, not in usearch_execute.
Action Required
In usearch_execute (planner.rs ~line 364), add ensure_loaded before resolve:
// Re-fetch at execute time; reload if evicted between plan and execute.
params.registry.ensure_loaded(¶ms.table_name).await?;
let registered = params.registry.resolve(¶ms.table_name).ok_or_else(|| {
DataFusionError::Execution(format!(
"USearchExec: table '{}' not in registry at execute time",
params.table_name
))
})?;Also update the comment — the current "so cache eviction between plan and execute is handled correctly" is factually wrong without the ensure_loaded call. For USearchRegistry (always cached) this is harmless, but the PR's stated goal is enabling production catalog-backed resolvers where eviction is real and this path will silently break.
| ) -> Result<Vec<RecordBatch>> { | ||
| // Re-fetch at execute time so cache eviction between plan and execute is handled correctly. | ||
| let registered = params.registry.get(¶ms.table_name).ok_or_else(|| { | ||
| let registered = params.registry.resolve(¶ms.table_name).ok_or_else(|| { |
There was a problem hiding this comment.
resolve() is called without a preceding ensure_loaded(). For async-backed resolvers that support cache eviction, the index will be gone here and the query fails — eviction is not actually handled, the error message is just friendlier.
Fix: call ensure_loaded first:
params.registry.ensure_loaded(¶ms.table_name).await?;
let registered = params.registry.resolve(¶ms.table_name).ok_or_else(|| { ... })?;Also update the comment on line 364 — "handled correctly" implies the index is reloaded on eviction, which it is not without this call.
| /// Contains enough info for the optimizer to validate the ANN rewrite | ||
| /// and build the logical plan node, without needing the loaded index objects. | ||
| #[derive(Debug, Clone)] | ||
| pub struct VectorIndexMeta { |
There was a problem hiding this comment.
scalar_kind, key_col, and config are not read by the only caller of peek() — the optimizer rule uses only metric and schema. Carrying three extra fields in VectorIndexMeta increases implementor burden: a catalog-backed resolver must provide these in the sync peek() path even though they're only needed after ensure_loaded. Consider trimming VectorIndexMeta to { metric, schema } and letting callers get the rest via resolve() once the index is loaded. If you want to keep them for forward-compat, a doc note explaining which fields are actually required by the optimizer would help implementors.
| /// On cache miss or stale entry, downloads files and loads the index. | ||
| /// On cache hit with current generation, returns immediately. | ||
| /// Called from the async physical planner before building the execution plan. | ||
| async fn ensure_loaded(&self, name: &str) -> datafusion::common::Result<()>; |
There was a problem hiding this comment.
The trait contract between ensure_loaded and resolve has a latent TOCTOU window: a concurrent eviction between the two calls produces an error rather than a retry. For USearchRegistry this is fine (never evicts), but production async-backed resolvers need to be aware. Consider documenting that implementations should guarantee the entry stays resident for at least one resolve() call after ensure_loaded returns successfully (e.g. via a short-lived pin/refcount), or explicitly note that callers must handle the None case with a retry.
Summary
Split the
VectorIndexResolvertrait into three methods to match DataFusion's sync/async pipeline:peek(key)— sync, cheap: returnsVectorIndexMeta(metric, scalar_kind, schema, key_col, config). Used by the optimizer rule to validate ANN rewrites without needing loaded index objects.resolve(key)— sync: returns fully loadedArc<RegisteredTable>from cache. Used by the executor.ensure_loaded(key)— async: loads the index from catalog/storage on cache miss. Called by the physical planner before building execution plans.USearchRegistryimplements all three (always cached,ensure_loadedis a no-op). Production systems can implement catalog-backed loading inensure_loaded.Changes
registry.rs: AddVectorIndexMetastruct, split trait, implement forUSearchRegistryrule.rs: Usepeek()instead ofresolve()for optimizer validationplanner.rs: Callensure_loaded()(async) beforeresolve()in physical plannerlib.rs: Export new typesBackward Compatibility
USearchRegistryimplements the full trait — existing users just need to update the trait importresolve()still works as before for the executor path