feat: support distributed IVF vector index builds by jiaoew1991 · Pull Request #479 · lance-format/lance-spark

jiaoew1991 · 2026-04-25T01:34:27Z

Summary

Adds ivf_flat, ivf_pq, ivf_sq as CREATE INDEX methods, using lance-core's multi-segment commit API (commitExistingIndexSegments) to publish per-fragment segments atomically under one logical index name.

User-facing

ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
  WITH (num_partitions=256, num_sub_vectors=16, num_bits=8, metric_type='l2');

Grammar is untouched; IndexUtils.buildIndexType learns three new cases. Replace-on-recreate is preserved via dropIndex(name) before commit (the segment commit API is additive by default).

Design

Driver-side training: lance-core's distributed build (rust/lance/src/index/vector.rs:514) rejects per-fragment-trained centroids — it requires pre-computed IVF centroids (and a PQ codebook for IVF_PQ). The driver calls VectorTrainer.trainIvfCentroids / trainPqCodebook once; every per-fragment task uses the same artifacts. This also keeps all segments in the same query-time compatibility group.
Per-fragment tasks: each calls createIndex(IndexOptions.builder(...).withFragmentIds([fid]).build()), returns an uncommitted Index.
Executor → driver handoff: org.lance.index.Index is not Serializable, so a small Scala case class LanceIndexHandle carries its fields as primitives and the driver rebuilds Index via the builder.
Commit: driver pre-drops any same-name index, then calls commitExistingIndexSegments(name, column, segments).
Scalar path unchanged: FragmentBasedIndexJob / RangeBasedBTreeIndexJob still use mergeIndexMetadata + manual CreateIndex transaction. In v6.0.0-beta.2 scalar per-fragment createIndex produces partial files that require the merge finalize step — the segment commit API is effectively vector-only for this release.

Known follow-ups (not in this PR)

IVF_HNSW_FLAT / IVF_HNSW_SQ / IVF_HNSW_PQ (supported by lance-core, code add here is small)
IVF_RQ (same)
Expose more WITH-args: sample_rate, max_iters, hnsw_m, hnsw_ef_construction, etc.
OPTIMIZE INDEX for incremental builds over new fragments
Size-aware fragment grouping for unbalanced datasets

References

lance-core distributed build refactor: refactor!: align distributed index build around segments lance#6313
Multi-segment design doc: Lance Vector Index: Multi-Segment Final State lance#6189
Index segment commit API: feat: add index segment commit API lance#6209

Test plan

All 16 AddIndexTest cases green on lance-spark-3.5_2.12 (11 existing scalar + 5 new vector)
All 16 AddIndexTest cases green on lance-spark-4.0_2.13
Manual smoke test on a non-local dataset

…/ IVF_SQ) Adds a VectorIndexJob path in AddIndexExec that uses lance-core's multi-segment commit API: the driver pre-trains IVF centroids (and PQ codebook for IVF_PQ) once via VectorTrainer, then each Spark task calls createIndex(withFragmentIds([fid])) with those shared artifacts. The driver collects the uncommitted per-fragment segments and publishes them atomically under one logical index name via commitExistingIndexSegments. Pre-trained centroids are required: lance-core's distributed build path rejects per-fragment-trained centroids (rust/lance/src/index/vector.rs "missing precomputed IVF centroids"). Sharing centroids across segments also keeps them in the same query-time compatibility group, avoiding per-segment routing overhead. Grammar is unchanged — users write: ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding) WITH (num_partitions=256, num_sub_vectors=16, metric_type='l2'); Scalar BTREE / INVERTED paths are untouched; those still produce partial per-fragment files that require mergeIndexMetadata and cannot use the segment commit API in v6.0.0-beta.2. Replace-on-recreate is preserved via dropIndex(name) before commit, since commitExistingIndexSegments is additive by default.

sezruby · 2026-04-30T21:27:41Z

+ * v1 trains centroids independently per fragment (each segment is its own compatibility
+ * group at query time). Shared-centroid broadcast is a follow-up.


This doc contradicts the implementation, on a correctness-relevant property.

runAndCommit calls trainArtifactsOnDriver once (line 578) and ships centroids+codebook to every task via trainedSpec — and the inline comment at lines 574–577 correctly states the opposite of this paragraph:

"Train once here, broadcast to executors, and every per-fragment segment shares the same centroids/codebook so segments land in the same compatibility group at query time."

The PR description matches the inline comment too. This class doc looks like a stale draft from before driver-side training was added. Could you remove or rewrite it before merge? A future maintainer trusting "each segment is its own compatibility group" could file a bogus bug — or "fix" the implementation to match the doc and break the compatibility-group guarantee.

sezruby · 2026-04-30T21:28:38Z

+  private def trainArtifactsOnDriver(column: String): (Array[Float], Array[Float]) = {
+    val dataset = Utils.openDatasetBuilder(readOptions).build()
+    try {
+      val ivfBuilder = new IvfBuildParams.Builder().setNumPartitions(spec.numPartitions)
+      spec.sampleRate.foreach(ivfBuilder.setSampleRate)
+      spec.maxIters.foreach(ivfBuilder.setMaxIters)
+      val centroids = VectorTrainer.trainIvfCentroids(dataset, column, ivfBuilder.build())
+
+      val codebook: Array[Float] = indexType match {
+        case IndexType.IVF_PQ =>
+          val pqBuilder = new PQBuildParams.Builder()
+          spec.numSubVectors.foreach(pqBuilder.setNumSubVectors)
+          spec.pqNumBits.foreach(pqBuilder.setNumBits)
+          spec.pqMaxIters.foreach(pqBuilder.setMaxIters)
+          spec.sampleRate.foreach(pqBuilder.setSampleRate)
+          VectorTrainer.trainPqCodebook(dataset, column, pqBuilder.build())
+        case _ => null


Question: should useResidual be applied here too?

The executor-side toVectorIndexParams (line 714) sets useResidual on the IVF build params used to build the per-fragment segment, but neither IvfBuildParams.Builder nor PQBuildParams.Builder constructed in this method see it. If useResidual=true is supposed to influence either centroid training or — more likely — PQ codebook training (residual-PQ wants the codebook trained on vector − centroid, not raw vectors), the codebook would be shaped for raw-vector chunks while per-fragment build encodes residual chunks against it. The index would still build, but recall would be silently degraded.

Does lance-core's VectorTrainer.trainPqCodebook derive the residual mode from somewhere else, or does it need to be told via IvfBuildParams.setUseResidual / PQBuildParams? If the latter, this is a bug. If lance-core auto-derives it, ignore — and possibly worth a one-line comment here noting that the omission is deliberate.

sezruby · 2026-04-30T21:30:28Z

+      val sameNameExists = dataset.getIndexes.asScala.exists(_.name() == indexName)
+      if (sameNameExists) {
+        dataset.dropIndex(indexName)
+      }
+      dataset.commitExistingIndexSegments(indexName, column, segments)


FYI / minor — dropIndex and commitExistingIndexSegments each perform their own Lance dataset commit (manifest version bump), so they're two separate atomic steps, not one. If the second commit fails (version conflict, transient I/O error, driver crash between calls), the table is left at the intermediate state — old index dropped, new index not yet installed — until the user reruns CREATE INDEX. The existing scalar CreateIndex path is single-commit, so this is a small behavioral regression specifically for the replace-on-recreate case.

The inline comment above this block correctly notes why the pre-drop is needed (the segment commit API is additive), but doesn't acknowledge the failure window. Two suggestions, in increasing order of effort:

Add a one-liner comment here noting that the replace is non-atomic by design (so future readers / debuggers don't have to re-derive that).

Longer term: consider a replace flag on commitExistingIndexSegments upstream in lance-core so the drop+commit can fold into a single manifest commit. Probably a separate issue, not this PR.

Not a blocker for me — just want to make sure the trade-off is on the record.

sezruby · 2026-05-03T05:23:38Z

+      val ivfBuilder = new IvfBuildParams.Builder().setNumPartitions(spec.numPartitions)
+      spec.sampleRate.foreach(ivfBuilder.setSampleRate)
+      spec.maxIters.foreach(ivfBuilder.setMaxIters)
+      val centroids = VectorTrainer.trainIvfCentroids(dataset, column, ivfBuilder.build())


Question: should metric_type be applied to driver-side training too?

spec.metricType is parsed correctly and threaded into toVectorIndexParams on the executor side (line 719: setDistanceType(dt)), so each per-fragment build encodes against the user-requested metric. But the IvfBuildParams.Builder here on line 623 — and the PQBuildParams.Builder on line 630 — never see it. Worse, IvfBuildParams.Builder and PQBuildParams.Builder in lance@6.0.0-beta.2 (the version pinned in pom.xml) don't actually expose a setDistanceType / setMetricType setter, so even if you wanted to pass it from here, you can't.

Digging into the JNI bridge: java/lance-jni/src/vector_trainer.rs in both inner_train_ivf_centroids and inner_train_pq_codebook hardcodes let metric_type = MetricType::L2; with the comment "For now we default to L2 metric; tests and Java bindings currently use L2." This hardcode is still present on main and on v7.0.0-beta.1. So today, VectorTrainer.trainIvfCentroids / trainPqCodebook always cluster with L2 regardless of what the user asked for.

Concretely, lance-core's build_ivf_model (rust/lance/src/index/vector/ivf.rs) L2-normalizes training data when metric is cosine, and train_ivf_model passes the metric into KMeansParams::new(..., metric_type). So for metric_type='cosine', 'dot', or 'hamming', this PR builds centroids on un-normalized L2 geometry, then per-fragment encoders quantize against those centroids using the user's actual metric — silently mismatched index, degraded recall. BaseAddIndexTest#testCreateIvfSqIndex (line 549) which uses metric_type='cosine' only asserts segment count, so the regression is invisible to CI.

Two things to consider before merge:

Until lance-core exposes a metric setter on IvfBuildParams.Builder / PQBuildParams.Builder (and the JNI stops hardcoding L2), should this code path reject non-L2 metric_type with a clear error so users don't get silent recall loss? Or fall back to the existing single-segment CreateIndex path for non-L2?

Same shape as my line-636 comment about useResidual — the JNI's build_ivf_params_from_java doesn't read getUseResidual either, so even though IvfBuildParams.Builder has a setUseResidual setter, training currently ignores it. Both of these need an upstream lance-core fix to actually be honored.

If lance-core changes are in flight that I missed, ignore — but please link them here so future debuggers don't have to re-derive the L2 hardcode.

github-actions Bot added the enhancement New feature or request label Apr 25, 2026

jiaoew1991 requested a review from jackye1995 April 25, 2026 16:30

eddyxu requested review from BubbleCal and hamersaw April 29, 2026 23:35

sezruby reviewed Apr 30, 2026

View reviewed changes

sezruby reviewed May 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support distributed IVF vector index builds#479

feat: support distributed IVF vector index builds#479
jiaoew1991 wants to merge 1 commit intolance-format:mainfrom
jiaoew1991:feat/vector-index-support

jiaoew1991 commented Apr 25, 2026

Uh oh!

sezruby Apr 30, 2026

Uh oh!

sezruby Apr 30, 2026

Uh oh!

sezruby Apr 30, 2026

Uh oh!

sezruby May 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		* v1 trains centroids independently per fragment (each segment is its own compatibility
		* group at query time). Shared-centroid broadcast is a follow-up.

Conversation

jiaoew1991 commented Apr 25, 2026

Summary

User-facing

Design

Known follow-ups (not in this PR)

References

Test plan

Uh oh!

sezruby Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

sezruby Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

sezruby Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

sezruby May 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants