Commit 1ef6310

refactor!: align distributed index build around segments (#6313)
This refactor removes the old `IndexSegment` / segment-builder workflow from distributed vector indexing and aligns the public flow around `IndexMetadata` returned from fragment-level builds.

Worker builds now produce uncommitted index metadata, callers optionally merge caller-defined metadata groups, and the final metadata set is committed as one logical index.

```python
# Build uncommitted index metadata
segment_0 = dataset.create_index_uncommitted(...)
segment_1 = dataset.create_index_uncommitted(...)

# Merge segments if needed
merged_segment = dataset.merge_existing_index_segments([
    segment_0,
    segment_1,
])

# Commit segments
dataset.commit_existing_index_segments("vector_idx", "vector", [merged_segment])
```
1 parent 526a72f commit 1ef6310

21 files changed

Lines changed: 556 additions & 1180 deletions

docs/src/guide/distributed_indexing.md

Lines changed: 17 additions & 25 deletions
@@ -93,37 +93,29 @@ First, multiple workers build segments in parallel:
    or Python `create_index_uncommitted(..., fragment_ids=...)`
 2. each worker writes one segment under `indices/<segment_uuid>/`
 
-### Segment Build
+### Segment Merge
 
-Then the caller turns those existing segments into one or more physical
-segments:
+Then the caller decides whether those existing segments should be committed as-is
+or merged into larger segments:
 
-1. create a builder with `create_index_segment_builder()`
-2. provide segment metadata with `with_segments(...)`
-3. optionally choose a grouping policy with `with_target_segment_bytes(...)`
-4. call `plan()` to get `Vec<IndexSegmentPlan>`
-
-At that point the caller has two execution choices:
-
-- call `build(plan)` for each plan and run those builds in parallel
-- call `build_all()` to let Lance build every planned segment on the current node
-
-After the physical segments are built, publish them with
-`commit_existing_index_segments(...)`.
+1. keep the worker outputs as-is and commit them directly with
+   `commit_existing_index_segments(...)`, or
+2. group one or more existing segments and call
+   `merge_existing_index_segments(...)` for each caller-defined group
+3. commit the final segment list with `commit_existing_index_segments(...)`
 
 Within a single commit, built segments must have disjoint fragment coverage.
 
-## Internal Segmented Finalize Model
+## Internal Finalize Model
 
 Internally, Lance models distributed vector segment build as:
 
-1. **plan** which input segments should become each physical segment
-2. **build** each segment from its selected input segments
-3. **commit** the resulting physical segments as one logical index
+1. **build** one uncommitted segment per worker
+2. **optionally merge** caller-defined groups of existing segments
+3. **commit** the resulting segments as one logical index
 
-The plan step is driven by the segment metadata returned from
-`execute_uncommitted()` and any additional inputs requested by the segment
-build APIs.
+The merge step is driven directly by the `IndexMetadata` returned from
+`execute_uncommitted()`.
 
 This is intentionally a storage-level model:
 
@@ -133,10 +125,10 @@ This is intentionally a storage-level model:
 
 ## Segment Grouping
 
-When Lance builds segments from existing inputs, it may either:
+The caller chooses the final segment grouping:
 
-- keep segment boundaries, so each input segment becomes one physical segment
-- group multiple input segments into a larger physical segment
+- keep segment boundaries, so each worker output is committed directly
+- merge multiple existing segments into a larger segment before commit
 
 The grouping decision is separate from worker build. Workers only build
 segments; Lance applies the segment build policy when it plans
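
The documented rule that built segments within a single commit must have disjoint fragment coverage, combined with caller-defined grouping, can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Segment`, `merge_group`, and `check_disjoint` are hypothetical names that neither call nor mirror Lance's API, and fragment coverage is modeled as a plain set of fragment IDs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for one segment's metadata (not a Lance type)."""
    name: str
    fragment_ids: frozenset


def merge_group(name, group):
    """Model a merge: the merged segment covers the union of its inputs."""
    covered = frozenset().union(*(seg.fragment_ids for seg in group))
    return Segment(name, covered)


def check_disjoint(segments):
    """Raise ValueError if any two segments cover the same fragment."""
    for a, b in combinations(segments, 2):
        overlap = a.fragment_ids & b.fragment_ids
        if overlap:
            raise ValueError(
                f"{a.name} and {b.name} both cover fragments {sorted(overlap)}"
            )


# Three worker outputs, each built over its own fragments.
seg0 = Segment("seg0", frozenset({0, 1}))
seg1 = Segment("seg1", frozenset({2, 3}))
seg2 = Segment("seg2", frozenset({4}))

# Caller-defined grouping: merge seg0 and seg1, keep seg2 as-is.
merged = merge_group("merged01", [seg0, seg1])
final_segments = [merged, seg2]

# Passes: the final list covers {0, 1, 2, 3} and {4} with no overlap.
check_disjoint(final_segments)
```

In this model, a final list containing both `merged` and `seg0` would be rejected, because both cover fragments 0 and 1.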

docs/src/images/distributed_vector_segment_build.svg

Lines changed: 12 additions & 12 deletions