feat: distributed vector search via index segment selection #24
Switches lance / lance-core / lance-index / lance-io / lance-linalg from crates.io 3.0.1 to a git+rev pin at lance commit d630106d (release tag v5.0.0-beta.5). Adds lance-datagen / lance-file / lance-table dev deps from the same rev. Beta-5 introduces the segment-model APIs (Scanner::with_index_segments, commit_existing_index_segments) that subsequent commits expose through the C ABI for distributed vector search. The DatasetIndexExt trait moved from lance_index to lance::index; src/index.rs adjusts the import. arrow stays on 57.0.0 (matches beta-5). Adds uuid 1.x for the upcoming UUID-based segment API.
Adds lance_dataset_index_segment_count(name) and lance_dataset_index_segments(name, out_uuids) to enumerate the physical segments of a logical vector index. Each segment is identified by its 16-byte UUID (RFC 4122 layout, written as raw bytes into a caller-allocated buffer of len*16 bytes). The header also declares lance_scanner_set_index_segments (impl in the next commit). Both pieces together let a distributed query engine like Velox shard a single k-NN query across workers.
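To make the buffer layout concrete, here is a minimal caller-side sketch of decoding a flat `len*16`-byte buffer back into individual UUIDs. The helper names `split_uuids` and `format_uuid` are hypothetical and not part of this PR; the sketch only assumes the raw big-endian RFC 4122 byte order described above.

```rust
/// Split a flat buffer of `n * 16` bytes into `n` 16-byte UUIDs.
/// Returns `None` if the buffer length is not a multiple of 16.
/// (Hypothetical helper, for illustration only.)
pub fn split_uuids(buf: &[u8]) -> Option<Vec<[u8; 16]>> {
    if buf.len() % 16 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(buf.len() / 16);
    for chunk in buf.chunks_exact(16) {
        let mut id = [0u8; 16];
        id.copy_from_slice(chunk);
        out.push(id);
    }
    Some(out)
}

/// Render one UUID in the canonical 8-4-4-4-12 lowercase hex form.
/// RFC 4122 stores fields in network byte order, so the bytes are
/// printed in the order they appear in the buffer.
pub fn format_uuid(id: &[u8; 16]) -> String {
    let hex: String = id.iter().map(|b| format!("{:02x}", b)).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```

A worker that receives such a buffer over the wire could log each segment with `format_uuid` before passing the raw bytes to the scanner setter.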
Five tests covering the new segment APIs:
- index_segment_count + listing UUIDs from a freshly built IVF index
- segment_count returns NotFound for an unknown index name
- end-to-end nearest scoped to listed segment UUIDs returns k results
- unknown UUID surfaces an error at scan materialize time (not at setter time)
- NULL safety for the scanner setter (NULL scanner; NULL ptr with non-zero len; NULL ptr with len=0 clears successfully)
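The NULL-safety cases listed above can be sketched as follows. This is not the PR's implementation: `ScannerState`, `set_index_segments`, and the status-code values are hypothetical stand-ins, and the sketch assumes that a NULL scanner and a NULL pointer with non-zero length are both rejected as invalid arguments.

```rust
// Hypothetical status codes; the real values live in the crate's C header.
pub const OK: i32 = 0;
pub const ERR_INVALID_ARGUMENT: i32 = 1;

/// Stand-in for the scanner state that the setter mutates.
#[derive(Default)]
pub struct ScannerState {
    pub index_segments: Option<Vec<[u8; 16]>>,
}

/// # Safety
/// `scanner` must be null or point to a valid `ScannerState`. When `len > 0`,
/// `uuids` must be null or point to `len * 16` readable bytes.
pub unsafe fn set_index_segments(
    scanner: *mut ScannerState,
    uuids: *const u8,
    len: usize,
) -> i32 {
    let s = match unsafe { scanner.as_mut() } {
        Some(s) => s,
        None => return ERR_INVALID_ARGUMENT, // NULL scanner
    };
    if len == 0 {
        // Any pointer (including NULL) with len == 0 clears the restriction.
        s.index_segments = None;
        return OK;
    }
    if uuids.is_null() {
        return ERR_INVALID_ARGUMENT; // NULL pointer with non-zero len
    }
    let n_bytes = match len.checked_mul(16) {
        Some(n) => n,
        None => return ERR_INVALID_ARGUMENT, // length overflow
    };
    let bytes = unsafe { std::slice::from_raw_parts(uuids, n_bytes) };
    let mut segments = Vec::with_capacity(len);
    for chunk in bytes.chunks_exact(16) {
        let mut id = [0u8; 16];
        id.copy_from_slice(chunk);
        segments.push(id);
    }
    s.index_segments = Some(segments);
    OK
}
```

Note that the setter only copies bytes; whether a UUID names a real segment is left to scan materialize time, matching the fourth test above.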
Dataset::index_segment_count(name) and Dataset::index_segments(name) return std::vector<std::array<uint8_t, 16>>. Scanner::index_segments takes either the typed vector or a raw byte pointer + length. C++ smoke test verifies the wrappers compile and link (no real distributed dataset to exercise against in the test fixture).
const LanceDataset* dataset,
const char* index_name,
uint8_t* out_uuids
);
The buffer is sized by the caller using a separate lance_dataset_index_segment_count() call. The implementation reloads the snapshot independently in each call (snap.load_indices() is invoked twice), and the inner loop writes count * 16 bytes without any capacity check.
The C++ wrapper makes this two-call pattern explicit:
uint64_t count = index_segment_count(index_name); // call #1: snapshot load
std::vector<std::array<uint8_t, 16>> out(count);
... lance_dataset_index_segments(...) // call #2: snapshot load
Between call #1 and call #2, a concurrent writer could commit a new segment for the same logical index, which is exactly the distributed-build use case mentioned in the follow-ups section of the PR description. The second snapshot would then return more segments than the first, and the inner loop at src/index.rs:255–260 would overrun the caller's buffer:
for (i, seg) in segments.iter().enumerate() {
let bytes = seg.uuid.as_bytes();
unsafe {
std::ptr::copy_nonoverlapping(bytes.as_ptr(), out_uuids.add(i * 16), 16);
}
}
There is no SAFETY: comment justifying why out_uuids is large enough.
Possible Fixes
Adopt the well-established "capacity in, count out" FFI pattern (commonly seen in raw C APIs that fill caller-provided buffers):
int32_t lance_dataset_index_segments(
const LanceDataset* dataset,
const char* index_name,
uint8_t* out_uuids,
size_t capacity, /* bytes available in out_uuids */
uint64_t* out_count /* how many UUIDs were actually written */
);
Reuse LANCE_ERR_INVALID_ARGUMENT (or introduce a new sentinel; the codebase currently has 8: LANCE_ERR_INVALID_ARGUMENT, LANCE_ERR_IO, LANCE_ERR_NOT_FOUND, LANCE_ERR_DATASET_ALREADY_EXISTS, LANCE_ERR_INDEX, LANCE_ERR_INTERNAL, LANCE_ERR_NOT_SUPPORTED, LANCE_ERR_COMMIT_CONFLICT) when capacity < segments.len() * 16. This also lets callers do single-shot retrieval with a guess and re-allocate if needed, removing the two-snapshot anti-pattern entirely.
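A minimal sketch of the copy loop under the "capacity in, count out" contract is shown below. The function name `write_segment_uuids` and the status-code values are hypothetical. The sketch also makes one assumption beyond the signature above: on a too-small buffer it still stores the required count in `out_count` so the caller can resize and retry.

```rust
// Hypothetical status codes; the real values live in the crate's C header.
pub const OK: i32 = 0;
pub const ERR_INVALID_ARGUMENT: i32 = 1;

/// Copy `segments` into `out_uuids`, refusing to write past `capacity` bytes.
///
/// # Safety
/// `out_count` must be null or valid for writes. `out_uuids` must be null or
/// valid for writes of `capacity` bytes.
pub unsafe fn write_segment_uuids(
    segments: &[[u8; 16]],
    out_uuids: *mut u8,
    capacity: usize,
    out_count: *mut u64,
) -> i32 {
    if out_count.is_null() {
        return ERR_INVALID_ARGUMENT;
    }
    // Assumption: report the required count even on failure, so the caller
    // can re-allocate and call again.
    unsafe { *out_count = segments.len() as u64 };

    let needed = match segments.len().checked_mul(16) {
        Some(n) => n,
        None => return ERR_INVALID_ARGUMENT,
    };
    if needed == 0 {
        return OK; // nothing to write
    }
    if out_uuids.is_null() || capacity < needed {
        return ERR_INVALID_ARGUMENT;
    }
    for (i, seg) in segments.iter().enumerate() {
        // SAFETY: i * 16 + 16 <= needed <= capacity, checked above.
        unsafe {
            std::ptr::copy_nonoverlapping(seg.as_ptr(), out_uuids.add(i * 16), 16);
        }
    }
    OK
}
```

Because the count and the bytes come from the same `segments` slice, a single snapshot load backs both, which is the property the two-call pattern lacks.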
Lighter-weight alternative: have a single Rust call return the count and a heap-allocated buffer with the segments. The codebase already exposes lance_free_string for CString-style strings; an analogous lance_free_uuid_buffer (or generic lance_free_bytes) would be a small, well-scoped addition. This eliminates caller-side sizing altogether at the cost of an extra allocation.
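The heap-allocated alternative could look roughly like this. Both function names are hypothetical (the comment above only suggests `lance_free_uuid_buffer` as a possible name), and the sketch assumes the caller passes the returned count back to the free function so the allocation size can be reconstructed.

```rust
/// Copy all UUIDs into one heap allocation and hand ownership to the caller.
/// The caller must release it with `free_uuid_buffer` and the same count.
/// (Hypothetical helper, for illustration only.)
pub fn alloc_uuid_buffer(segments: &[[u8; 16]], out_count: &mut u64) -> *mut u8 {
    let flat: Vec<u8> = segments.iter().flatten().copied().collect();
    *out_count = segments.len() as u64;
    // A boxed slice has exactly `len` bytes of capacity, so the free side
    // can rebuild it from (ptr, len) alone.
    Box::into_raw(flat.into_boxed_slice()) as *mut u8
}

/// Release a buffer returned by `alloc_uuid_buffer`.
///
/// # Safety
/// `ptr` and `count` must come from one `alloc_uuid_buffer` call, and the
/// buffer must not be freed more than once.
pub unsafe fn free_uuid_buffer(ptr: *mut u8, count: u64) {
    if ptr.is_null() {
        return;
    }
    let len = count as usize * 16;
    // SAFETY: reconstructs the exact Box<[u8]> that was leaked above.
    drop(unsafe { Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len)) });
}
```

The trade-off is that the caller must remember the count until the buffer is freed; a variant that stores the length in a small header struct would avoid that at the cost of a slightly larger ABI surface.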
scanner.prefilter(true);
}
if let Some(segments) = &self.index_segments {
scanner.with_index_segments(segments.clone())?;
The with_index_segments(...) call is placed inside if let Some(n) = &self.nearest { ... }:
if let Some(n) = &self.nearest {
scanner.nearest(&n.column, n.query.as_ref(), n.k as usize)?;
...
if let Some(segments) = &self.index_segments {
scanner.with_index_segments(segments.clone())?;
}
}
If a caller invokes lance_scanner_set_index_segments(...) but never calls lance_scanner_nearest(...), the segment restriction is silently ignored: no error, no warning. For a distributed-query worker scanning the wrong segments, this is a correctness footgun.
Recommended fix. Either:
1. Validate at materialize time: return an error if index_segments.is_some() && nearest.is_none(), with a message such as "index_segments requires nearest() to be configured".
2. Validate at setter time: in lance_scanner_set_index_segments, reject if s.nearest.is_none() and document the ordering requirement (consistent with the project's existing fail-fast guards such as the if k == 0 { ... } check inside scanner_nearest_inner).
Option 1 is more flexible (allows a builder to set segments before nearest); option 2 fails earlier and is closer to the rest of the file's style.
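The materialize-time check could be as small as the following sketch. `ScannerConfig`, `Nearest`, and `validate` are hypothetical stand-ins for the scanner builder's real fields, and the error type is simplified to a `String`.

```rust
/// Stand-in for the stored nearest-neighbour parameters.
pub struct Nearest {
    pub column: String,
    pub k: usize,
}

/// Stand-in for the scanner builder state.
#[derive(Default)]
pub struct ScannerConfig {
    pub nearest: Option<Nearest>,
    pub index_segments: Option<Vec<[u8; 16]>>,
}

impl ScannerConfig {
    /// Reject a segment restriction that has no vector query to apply to.
    /// Intended to run once, just before the scan is materialized.
    pub fn validate(&self) -> Result<(), String> {
        if self.index_segments.is_some() && self.nearest.is_none() {
            return Err("index_segments requires nearest() to be configured".to_string());
        }
        Ok(())
    }
}
```

Running the check at materialize time keeps the setters order-independent, which is the flexibility argument made above.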
let snap = ds.snapshot();
match block_on(snap.load_indices()) {
Ok(indices) => {
let count = indices.iter().filter(|i| i.name == name).count();
lance_dataset_index_count excludes system indexes (!is_system_index). The new lance_dataset_index_segment_count does not. If a system index ever shares a name with a user-visible index, the count silently includes it and lance_dataset_index_segments emits its UUIDs, which a worker may then attempt to query.
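One way to keep the two functions consistent is to route both through a shared filter. The sketch below uses a hypothetical `IndexMeta` struct with an `is_system` flag as a stand-in for whatever predicate lance_dataset_index_count already applies; it is not the crate's real metadata type.

```rust
/// Simplified stand-in for an index metadata entry.
pub struct IndexMeta {
    pub name: String,
    pub uuid: [u8; 16],
    /// Stand-in for the result of the existing system-index predicate.
    pub is_system: bool,
}

/// Segments of the user-visible logical index `name`, skipping system indexes.
/// Using this one helper for both the count and the listing keeps them aligned.
pub fn user_segments<'a>(indices: &'a [IndexMeta], name: &str) -> Vec<&'a IndexMeta> {
    indices
        .iter()
        .filter(|i| !i.is_system && i.name == name)
        .collect()
}
```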
})?;
let snap = ds.snapshot();
let indices = block_on(snap.load_indices())?;
let segments: Vec<_> = indices.iter().filter(|i| i.name == name).collect();
@LuciferYang Thanks for the review; I'll address the comments when I have time. This PR is still in progress and needs the stable 5.0.0 release so we can adopt the segment model for distributed index search.
Exposes Lance's segment-model APIs through the C ABI so a distributed query engine (Velox, Presto worker, etc.) can fan a single k-NN query out across workers, each scanning a slice of the logical index's physical segments. Tracks lance#6309.
Distributed query pattern
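As a rough illustration of the fan-out, a coordinator could split the listed segment UUIDs across workers and merge their partial top-k results. Everything in this sketch is an assumption rather than part of the PR: the helper names, the round-robin assignment, the `(row_id, distance)` result shape, and the convention that a smaller distance is a better match.

```rust
/// Assign segment UUIDs to `workers` shards round-robin.
/// Each shard is what one worker would pass to the scanner's segment setter.
pub fn partition_segments(uuids: &[[u8; 16]], workers: usize) -> Vec<Vec<[u8; 16]>> {
    if workers == 0 {
        return Vec::new();
    }
    let mut shards: Vec<Vec<[u8; 16]>> = vec![Vec::new(); workers];
    for (i, id) in uuids.iter().enumerate() {
        shards[i % workers].push(*id);
    }
    shards
}

/// Merge per-worker `(row_id, distance)` results into a global top-k.
/// Assumes smaller distance means a closer match.
pub fn merge_top_k(per_worker: Vec<Vec<(u64, f32)>>, k: usize) -> Vec<(u64, f32)> {
    let mut all: Vec<(u64, f32)> = per_worker.into_iter().flatten().collect();
    all.sort_by(|a, b| a.1.total_cmp(&b.1));
    all.truncate(k);
    all
}
```

Each worker would run the same k-NN query with its own shard of UUIDs and return up to k rows; the coordinator then keeps the k best across all workers.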
Summary
- lance_dataset_index_segment_count(ds, name): number of physical segments in a logical vector index. Returns 0 + LANCE_ERR_NOT_FOUND for an unknown name.
- lance_dataset_index_segments(ds, name, out_uuids): fills a caller-allocated buffer (count * 16 bytes) with each segment's 16-byte UUID (RFC 4122).
- lance_scanner_set_index_segments(scanner, segment_uuids, len): restricts the next lance_scanner_nearest() query to a subset of segments. len=0 (any pointer) clears the restriction.

C++ wrappers:
- Dataset::index_segment_count(name) → uint64_t
- Dataset::index_segments(name) → std::vector<std::array<uint8_t, 16>>
- Scanner::index_segments(uuids) (typed vector overload + raw uint8_t* + len overload), fluent

Lance dep bump
To get Scanner::with_index_segments() (merged in lance #6376) we bump from crates.io lance = "3.0.1" to a git+rev pin at lance commit d630106d (release tag v5.0.0-beta.5). beta-5 keeps arrow on 57.0.0, so there is no transitive arrow churn. The DatasetIndexExt trait moved from lance_index to lance::index; one import path adjusted in src/index.rs.

When lance publishes 5.0.0 stable, the git+rev can be replaced with the version pin.
Test plan
- cargo fmt: clean
- cargo clippy --all-targets -- -D warnings: clean
- cargo test: 75 passed (70 from main + 5 new)
- cargo test --test compile_and_run_test -- --ignored: 2 passed (C + C++ smoke)

New tests:
- test_index_segment_count_and_list: build IVF index, count = 1, list returns a non-zero UUID.
- test_index_segment_count_unknown_index: unknown name → NotFound.
- test_scanner_set_index_segments_with_listed_uuids: end-to-end k=5 nearest restricted to listed segment UUID, returns 5 results.
- test_scanner_set_index_segments_unknown_uuid: bogus UUID is accepted at setter time, surfaces as an error at scan materialize time with a message containing "segment".
- test_scanner_set_index_segments_null_safety: NULL scanner / NULL pointer with len>0 / NULL with len=0 (clears).

Follow-ups (not in this PR)
- commit_existing_index_segments() and merge_existing_index_segments() exist upstream; they'd let workers each train one segment and the coordinator commit them atomically.

🤖 Generated with Claude Code