feat: distributed vector search via index segment selection#24
Open
Switches lance / lance-core / lance-index / lance-io / lance-linalg from crates.io 3.0.1 to a git+rev pin at lance commit d630106d (release tag v5.0.0-beta.5). Adds lance-datagen / lance-file / lance-table dev deps from the same rev. Beta-5 introduces the segment-model APIs (Scanner::with_index_segments, commit_existing_index_segments) that subsequent commits expose through the C ABI for distributed vector search. The DatasetIndexExt trait moved from lance_index to lance::index; src/index.rs adjusts the import. arrow stays on 57.0.0 (matches beta-5). Adds uuid 1.x for the upcoming UUID-based segment API.
Adds lance_dataset_index_segment_count(name) and lance_dataset_index_segments(name, out_uuids) to enumerate the physical segments of a logical vector index. Each segment is identified by its 16-byte UUID (RFC 4122 layout, written as raw bytes into a caller-allocated buffer of len*16 bytes). The header also declares lance_scanner_set_index_segments (impl in the next commit). Both pieces together let a distributed query engine like Velox shard a single k-NN query across workers.
Five tests covering the new segment APIs:
- index_segment_count + listing UUIDs from a freshly built IVF index
- segment_count returns NotFound for an unknown index name
- end-to-end nearest scoped to listed segment UUIDs returns k results
- unknown UUID surfaces an error at scan materialize time (not at setter time)
- NULL safety for the scanner setter (NULL scanner; NULL ptr with non-zero len; NULL ptr with len=0 clears successfully)
Dataset::index_segment_count(name) and Dataset::index_segments(name) return std::vector<std::array<uint8_t, 16>>. Scanner::index_segments takes either the typed vector or a raw byte pointer + length. C++ smoke test verifies the wrappers compile and link (no real distributed dataset to exercise against in the test fixture).
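The commits above describe two views of the same data: a raw buffer of len*16 bytes on the C side and std::vector<std::array<uint8_t, 16>> on the C++ side. A minimal sketch of converting between them, assuming nothing beyond that layout, might look like the following. The names SegmentUuid, unpack_uuids, pack_uuids, and uuid_to_string are hypothetical helpers for illustration and are not part of this PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical alias: one segment UUID as 16 raw bytes.
using SegmentUuid = std::array<std::uint8_t, 16>;

// Split a raw buffer of `count * 16` bytes into typed UUIDs.
std::vector<SegmentUuid> unpack_uuids(const std::uint8_t* raw, std::size_t count) {
    assert(raw != nullptr || count == 0);
    std::vector<SegmentUuid> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::memcpy(out[i].data(), raw + i * 16, 16);
    }
    return out;
}

// Flatten typed UUIDs back into one contiguous buffer of size() * 16 bytes,
// the shape a raw pointer + length overload would expect.
std::vector<std::uint8_t> pack_uuids(const std::vector<SegmentUuid>& uuids) {
    std::vector<std::uint8_t> raw;
    raw.reserve(uuids.size() * 16);
    for (const auto& u : uuids) {
        raw.insert(raw.end(), u.begin(), u.end());
    }
    return raw;
}

// Render the 16 bytes in the canonical 8-4-4-4-12 hex form, e.g. for logging.
std::string uuid_to_string(const SegmentUuid& u) {
    static const char* hex = "0123456789abcdef";
    std::string s;
    s.reserve(36);
    for (std::size_t i = 0; i < 16; ++i) {
        if (i == 4 || i == 6 || i == 8 || i == 10) s.push_back('-');
        s.push_back(hex[u[i] >> 4]);
        s.push_back(hex[u[i] & 0x0f]);
    }
    return s;
}
```

Copying element by element avoids relying on how the vector of arrays is laid out in memory, at the cost of one extra pass over the buffer.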
Exposes Lance's segment-model APIs through the C ABI so a distributed query engine (Velox, Presto worker, etc.) can fan a single k-NN query out across workers, each scanning a slice of the logical index's physical segments. Tracks lance#6309.
Distributed query pattern
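As a rough illustration of the coordinator side of this pattern, the sketch below deals the listed segment UUIDs out to workers round-robin and then merges the per-worker results into a global top-k. It is a sketch under stated assumptions, not code from this PR: the Hit struct, assign_segments, and merge_top_k are hypothetical, round-robin is just one possible assignment policy, and the merge assumes that a smaller distance means a better match.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical alias: one segment UUID as 16 raw bytes.
using SegmentUuid = std::array<std::uint8_t, 16>;

// Hypothetical shape of one k-NN hit reported by a worker.
struct Hit {
    std::uint64_t row_id;
    float distance;  // assumption: smaller is better
};

// Coordinator step 1: deal the segment UUIDs out to workers round-robin.
std::vector<std::vector<SegmentUuid>> assign_segments(
    const std::vector<SegmentUuid>& segments, std::size_t worker_count) {
    assert(worker_count > 0);
    std::vector<std::vector<SegmentUuid>> plan(worker_count);
    for (std::size_t i = 0; i < segments.size(); ++i) {
        plan[i % worker_count].push_back(segments[i]);
    }
    return plan;
}

// Coordinator step 2: merge each worker's local top-k into a global top-k.
std::vector<Hit> merge_top_k(const std::vector<std::vector<Hit>>& per_worker,
                             std::size_t k) {
    std::vector<Hit> all;
    for (const auto& hits : per_worker) {
        all.insert(all.end(), hits.begin(), hits.end());
    }
    const std::size_t keep = std::min(k, all.size());
    std::partial_sort(all.begin(),
                      all.begin() + static_cast<std::ptrdiff_t>(keep),
                      all.end(),
                      [](const Hit& a, const Hit& b) { return a.distance < b.distance; });
    all.resize(keep);
    return all;
}
```

One detail to keep in mind with any assignment policy: since len=0 clears the segment restriction rather than selecting nothing, a worker whose assignment is empty should skip the query instead of calling the setter with an empty list.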
Summary
- lance_dataset_index_segment_count(ds, name): number of physical segments in a logical vector index. Returns 0 with LANCE_ERR_NOT_FOUND for an unknown name.
- lance_dataset_index_segments(ds, name, out_uuids): fills a caller-allocated buffer (count * 16 bytes) with each segment's 16-byte UUID (RFC 4122).
- lance_scanner_set_index_segments(scanner, segment_uuids, len): restricts the next lance_scanner_nearest() query to a subset of segments. len=0 (any pointer) clears the restriction.

C++ wrappers:
- Dataset::index_segment_count(name) → uint64_t
- Dataset::index_segments(name) → std::vector<std::array<uint8_t, 16>>
- Scanner::index_segments(uuids) (typed vector overload + raw uint8_t* + len overload), fluent

Lance dep bump
To get Scanner::with_index_segments() (merged in lance #6376) we bump from crates.io lance = "3.0.1" to a git+rev pin at lance commit d630106d (release tag v5.0.0-beta.5). beta-5 keeps arrow on 57.0.0, so there is no transitive arrow churn. The DatasetIndexExt trait moved from lance_index to lance::index; one import path adjusted in src/index.rs.

When lance publishes 5.0.0 stable, the git+rev can be replaced with the version pin.
Test plan
- cargo fmt clean
- cargo clippy --all-targets -- -D warnings clean
- cargo test: 75 passed (70 from main + 5 new)
- cargo test --test compile_and_run_test -- --ignored: 2 passed (C + C++ smoke)

New tests:
- test_index_segment_count_and_list: build IVF index, count = 1, list returns a non-zero UUID.
- test_index_segment_count_unknown_index: unknown name → NotFound.
- test_scanner_set_index_segments_with_listed_uuids: end-to-end k=5 nearest restricted to listed segment UUID, returns 5 results.
- test_scanner_set_index_segments_unknown_uuid: bogus UUID is accepted at setter time, surfaces as an error at scan materialize time with a message containing "segment".
- test_scanner_set_index_segments_null_safety: NULL scanner / NULL pointer with len>0 / NULL with len=0 (clears).

Follow-ups (not in this PR)
- commit_existing_index_segments() and merge_existing_index_segments() exist upstream; they'd let workers each train one segment and the coordinator commit them atomically.

🤖 Generated with Claude Code