
feat: distributed vector search via index segment selection #24

Open
jja725 wants to merge 4 commits into main from feat/distributed-vector-search

jja725 (Collaborator) commented Apr 24, 2026

Exposes Lance's segment-model APIs through the C ABI so a distributed query engine (Velox, Presto worker, etc.) can fan a single k-NN query out across workers, each scanning a slice of the logical index's physical segments. Tracks lance#6309.

Distributed query pattern

Coordinator                                  Worker(s)
─────────────────                            ───────────────
open dataset
list segments  ──────── slice ──────────►   open same dataset
                                             scanner.nearest(q, k)
                                             scanner.index_segments(my_slice)
                                             return partial top-k stream
heap-merge partial top-k  ◄─────────────────  (Velox top-k operator handles this)
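
As a rough illustration of the coordinator's two jobs in the diagram above, the sketch below deals segment UUIDs out to workers round-robin and then merges the per-worker partial top-k lists. It is not part of this PR: `SegmentId`, `Hit`, `slice_segments`, and `merge_top_k` are hypothetical names, the round-robin policy is only one possible slicing strategy, and the merge uses `std::partial_sort` over the concatenated results rather than a streaming heap merge (in the intended deployment the Velox top-k operator does this step).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical alias: one raw 16-byte segment UUID.
using SegmentId = std::array<std::uint8_t, 16>;

// Hypothetical result row: a row id and its distance to the query vector.
struct Hit {
    std::uint64_t row_id;
    float distance;
};

// Deal segments out round-robin so every worker gets a near-equal share.
std::vector<std::vector<SegmentId>> slice_segments(
    const std::vector<SegmentId>& segments, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<std::vector<SegmentId>> slices(num_workers);
    for (std::size_t i = 0; i < segments.size(); ++i) {
        slices[i % num_workers].push_back(segments[i]);
    }
    return slices;
}

// Merge per-worker partial top-k lists into a global top-k,
// ordered by ascending distance (smaller distance = closer).
std::vector<Hit> merge_top_k(const std::vector<std::vector<Hit>>& partials,
                             std::size_t k) {
    std::vector<Hit> all;
    for (const auto& partial : partials) {
        all.insert(all.end(), partial.begin(), partial.end());
    }
    const std::size_t keep = std::min(k, all.size());
    std::partial_sort(all.begin(), all.begin() + keep, all.end(),
                      [](const Hit& a, const Hit& b) {
                          return a.distance < b.distance;
                      });
    all.resize(keep);
    return all;
}
```

Each worker only needs its own slice and the shared query; because every worker returns at most k candidates, the coordinator merges at most `num_workers * k` rows.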

Summary

lance_dataset_index_segment_count(ds, name): number of physical segments in a logical vector index. For an unknown name it returns 0 and reports LANCE_ERR_NOT_FOUND.

lance_dataset_index_segments(ds, name, out_uuids): fills a caller-allocated buffer (count * 16 bytes) with each segment's 16-byte UUID (RFC 4122).

lance_scanner_set_index_segments(scanner, segment_uuids, len): restricts the next lance_scanner_nearest() query to a subset of segments. Passing len=0 clears the restriction, whatever the pointer value.
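
To make the buffer layout concrete, here is a small sketch of how a caller might consume the `count * 16` bytes written by `lance_dataset_index_segments`. It does not call the Lance C API; `SegmentId`, `unpack_segment_uuids`, and `uuid_to_string` are hypothetical helpers, and the string form assumes the bytes are stored in the RFC 4122 big-endian field order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical alias: one raw 16-byte segment UUID.
using SegmentId = std::array<std::uint8_t, 16>;

// Split a flat buffer of count * 16 bytes into one UUID per segment.
std::vector<SegmentId> unpack_segment_uuids(const std::uint8_t* buf,
                                            std::size_t count) {
    assert(buf != nullptr || count == 0);
    std::vector<SegmentId> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::memcpy(out[i].data(), buf + i * 16, 16);
    }
    return out;
}

// Render a UUID in the canonical 8-4-4-4-12 lowercase hex form,
// which is handy when logging which segments a worker was given.
std::string uuid_to_string(const SegmentId& id) {
    static const char* hex = "0123456789abcdef";
    std::string s;
    s.reserve(36);
    for (std::size_t i = 0; i < id.size(); ++i) {
        if (i == 4 || i == 6 || i == 8 || i == 10) s.push_back('-');
        s.push_back(hex[id[i] >> 4]);
        s.push_back(hex[id[i] & 0x0F]);
    }
    return s;
}
```

In real use the buffer would be sized from `lance_dataset_index_segment_count` before the fill call; the exact return types and error reporting of those functions are not shown here because the description above does not spell them out.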

C++ wrappers:

  • Dataset::index_segment_count(name) → uint64_t
  • Dataset::index_segments(name) → std::vector<std::array<uint8_t, 16>>
  • Scanner::index_segments(uuids): fluent setter with two overloads, one taking the typed vector and one taking a raw uint8_t* plus len
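
The two `Scanner::index_segments` overloads differ only in how the UUIDs are laid out in memory. The sketch below shows one plausible way to flatten the typed vector into the contiguous bytes the raw overload expects. `SegmentId` and `pack_segment_uuids` are hypothetical names, and it assumes `len` counts UUIDs rather than bytes, which the description does not state explicitly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical alias: one raw 16-byte segment UUID.
using SegmentId = std::array<std::uint8_t, 16>;

// Flatten a typed UUID list into contiguous bytes, 16 per segment.
// Under the stated assumption, the matching `len` is ids.size().
std::vector<std::uint8_t> pack_segment_uuids(const std::vector<SegmentId>& ids) {
    std::vector<std::uint8_t> flat;
    flat.reserve(ids.size() * 16);
    for (const SegmentId& id : ids) {
        flat.insert(flat.end(), id.begin(), id.end());
    }
    assert(flat.size() == ids.size() * 16);
    return flat;
}
```

An empty input yields an empty buffer, which corresponds to the len=0 case that clears the restriction.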

Lance dep bump

To get Scanner::with_index_segments() (merged in lance #6376), we bump from the crates.io pin lance = "3.0.1" to a git+rev pin at lance commit d630106d (release tag v5.0.0-beta.5). beta-5 keeps arrow on 57.0.0, so there is no transitive arrow churn. The DatasetIndexExt trait moved from lance_index to lance::index, so one import path is adjusted in src/index.rs.

When lance publishes 5.0.0 stable, the git+rev can be replaced with the version pin.

Test plan

  • cargo fmt: clean
  • cargo clippy --all-targets -- -D warnings: clean
  • cargo test: 75 passed (70 from main + 5 new)
  • cargo test --test compile_and_run_test -- --ignored: 2 passed (C + C++ smoke)

New tests:

  • test_index_segment_count_and_list — build IVF index, count = 1, list returns a non-zero UUID.
  • test_index_segment_count_unknown_index — unknown name → NotFound.
  • test_scanner_set_index_segments_with_listed_uuids — end-to-end k=5 nearest restricted to listed segment UUID, returns 5 results.
  • test_scanner_set_index_segments_unknown_uuid — bogus UUID is accepted at setter time, surfaces as an error at scan materialize time with a message containing "segment".
  • test_scanner_set_index_segments_null_safety — NULL scanner / NULL pointer with len>0 / NULL with len=0 (clears).

Follow-ups (not in this PR)

  • Per-segment metadata: today we only expose UUID. A future pass could add fragment_bitmap / dataset_version / num_indexed_rows so coordinators can balance work by segment size.
  • Distributed build: commit_existing_index_segments() and merge_existing_index_segments() exist upstream — they'd let workers each train one segment and the coordinator commit them atomically.
  • Once lance publishes 5.0.0 stable, replace the git+rev pin with a version pin.

🤖 Generated with Claude Code

jja725 added 4 commits April 24, 2026 16:03
Switches lance / lance-core / lance-index / lance-io / lance-linalg
from crates.io 3.0.1 to a git+rev pin at lance commit d630106d
(release tag v5.0.0-beta.5). Adds lance-datagen / lance-file /
lance-table dev deps from the same rev.

Beta-5 introduces the segment-model APIs (Scanner::with_index_segments,
commit_existing_index_segments) that subsequent commits expose through
the C ABI for distributed vector search.

The DatasetIndexExt trait moved from lance_index to lance::index;
src/index.rs adjusts the import. arrow stays on 57.0.0 (matches beta-5).
Adds uuid 1.x for the upcoming UUID-based segment API.

Adds lance_dataset_index_segment_count(name) and
lance_dataset_index_segments(name, out_uuids) to enumerate the
physical segments of a logical vector index. Each segment is
identified by its 16-byte UUID (RFC 4122 layout, written as raw
bytes into a caller-allocated buffer of len*16 bytes).

The header also declares lance_scanner_set_index_segments (impl
in the next commit). Both pieces together let a distributed query
engine like Velox shard a single k-NN query across workers.

Five tests covering the new segment APIs:
- index_segment_count + listing UUIDs from a freshly built IVF index
- segment_count returns NotFound for an unknown index name
- end-to-end nearest scoped to listed segment UUIDs returns k results
- unknown UUID surfaces an error at scan materialize time (not at
  setter time)
- NULL safety for the scanner setter (NULL scanner; NULL ptr with
  non-zero len; NULL ptr with len=0 clears successfully)

Dataset::index_segment_count(name) returns uint64_t, and
Dataset::index_segments(name) returns
std::vector<std::array<uint8_t, 16>>. Scanner::index_segments
takes either the typed vector or a raw byte pointer + length.

C++ smoke test verifies the wrappers compile and link (no real
distributed dataset to exercise against in the test fixture).