Background
AddIndexExec's FragmentBasedIndexJob spawns one Spark task per fragment, all sharing a single index UUID via IndexOptions.withIndexUUID(uuid). On the lance-core side this maps each task's per-fragment build to the same indices_dir().join(uuid) directory (rust/lance/src/index/scalar.rs:287 — LanceIndexStore::from_dataset_for_new(dataset, uuid)).
For index types that emit a single artifact per build, all N executors race to write the same object-store path. For ZONEMAP that artifact is zonemap.lance (rust/lance-index/src/scalar/zonemap.rs:50,745); whichever task commits last wins, and the read path (getZonemapStats) returns only one fragment's worth of zone stats.
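The last-writer-wins behavior described above can be illustrated with a toy model. This is a minimal sketch, not lance-spark or lance-core code: the object store is assumed to behave like a path-to-payload map, and `SharedUuidRace`, `survivingFragments`, and the `perTaskUuid` flag are hypothetical names introduced only for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

// Hypothetical model: the object store is a map from path to payload.
class SharedUuidRace {
    // Every "task" writes its fragment ID to <uuid>/zonemap.lance.
    // Returns the fragment IDs still readable once all tasks have finished.
    static Set<Integer> survivingFragments(int fragmentCount, boolean perTaskUuid) {
        Map<String, Integer> objectStore = new ConcurrentHashMap<>();
        IntStream.range(0, fragmentCount).parallel().forEach(fragmentId -> {
            // Shared UUID: all tasks target one path. Per-task UUID: one path each.
            String uuid = perTaskUuid ? "uuid-" + fragmentId : "shared-uuid";
            objectStore.put(uuid + "/zonemap.lance", fragmentId);
        });
        return new TreeSet<>(objectStore.values());
    }

    public static void main(String[] args) {
        System.out.println(survivingFragments(20, false).size()); // prints 1
        System.out.println(survivingFragments(20, true).size());  // prints 20
    }
}
```

Which fragment survives in the shared-UUID case depends on scheduling. Only the count is deterministic, which matches the "expected: 20, actual: 1" symptom in the reproduction.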
Reproduction
AddIndexTest.testCreateZonemapIndex on a 20-fragment table (Spark local[10], 20 single-row inserts):
Set<Integer> fragmentIdsWithStats =
    zonemapStats("id").stream()
        .map(ZoneStats::getFragmentId)
        .collect(Collectors.toSet());
// expected size: 20, actual: 1
This surfaced during review of #513 and confirmed the race empirically. The current PR keeps the loose assertFalse(zonemapStats("id").isEmpty()) and documents the limitation.
Why this is also a BTree concern
The same code path is used for USING btree (default fragment build_mode). The race is functionally invisible there because btree's user-visible behavior is correctness-preserving: a fragment without an index entry simply gets a full scan — slower, never wrong. That's why the existing testCreateIndexDistributed has not caught it. The race is real, just silent.
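To show why a result-based test cannot see the btree race, here is a small sketch of a lookup that falls back to a full scan for fragments without an index entry. It is a hypothetical model rather than Lance's query path: `SilentFallback` and `fragmentsContaining` are invented names, and the "index" is just a value set standing in for a btree lookup.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of a per-fragment index with scan fallback.
class SilentFallback {
    // Returns the IDs of fragments that contain `target`. Fragments listed in
    // `indexedFragments` are answered from a value set (standing in for a btree
    // lookup); every other fragment is scanned row by row.
    static Set<Integer> fragmentsContaining(
            Map<Integer, List<Integer>> rowsByFragment,
            Set<Integer> indexedFragments,
            int target) {
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<Integer, List<Integer>> e : rowsByFragment.entrySet()) {
            boolean found;
            if (indexedFragments.contains(e.getKey())) {
                found = new HashSet<>(e.getValue()).contains(target); // "index" lookup
            } else {
                found = false;
                for (int v : e.getValue()) { // full-scan fallback
                    if (v == target) {
                        found = true;
                        break;
                    }
                }
            }
            if (found) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> rows = new TreeMap<>();
        for (int f = 0; f < 20; f++) {
            rows.put(f, List.of(f)); // 20 single-row fragments
        }
        Set<Integer> fullCoverage = fragmentsContaining(rows, rows.keySet(), 7);
        Set<Integer> oneFragmentIndexed = fragmentsContaining(rows, Set.of(19), 7);
        System.out.println(fullCoverage.equals(oneFragmentIndexed)); // prints true
    }
}
```

Both calls return the same rows, so an assertion on query results passes regardless of how many fragments actually have an index entry. Detecting the race requires asserting on index coverage directly.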
Fix options
(A) Single-call driver-side build for affected types. Have AddIndexExec call dataset.createIndex(opts) once on the driver with all fragment IDs. build_scalar_index then loads training data across every fragment and produces a single consistent artifact. Loses per-fragment parallelism (acceptable for ZONEMAP — small payload — possibly not for large btree builds).
(B) Scalar-index segment commit path. Mirror lance-core's vector segment-commit pattern (commit_existing_index_segments, rust/lance/src/dataset/index.rs:230) for scalar indexes. Each task gets a fresh UUID and writes to its own directory; the driver collects N IndexMetadata segments and commits them as separate entries with the same name. getZonemapStats already supports multi-segment reads (java/lance-jni/src/blocking_dataset.rs:3437–3470). Requires a lance-core API extension.
(C) Per-task UUID with manifest-side join. Have lance-spark commit N IndexMetadata entries (different UUIDs, same name) directly via AddIndexOperation.withNewIndices(List). The lance-core describe_indices already groups by name (rust/lance/src/index.rs:957–962), so consumers see one logical index. Of the three options, this one keeps the most code local to lance-spark.
I attempted (A) first but ran into manifest-visibility issues that need deeper Lance-runtime debugging: the driver-side createIndex returns successfully, but getIndexes then reports zero entries (possibly because of session-cache opacity). (C) is the most attractive given existing lance-core infrastructure.
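The shape of option (C) can be sketched as follows. This is an illustration under stated assumptions, not the lance-spark implementation: `SegmentMeta`, `buildSegment`, `collectSegments`, and `groupByName` are hypothetical stand-ins for the per-task build, the driver-side collection, and the name-based grouping that describe_indices is described as doing above. No real Lance API is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical stand-ins; these are not lance-core or lance-spark types.
class PerTaskUuidCommit {
    static final class SegmentMeta {
        final String name;
        final UUID uuid;
        final int fragmentId;

        SegmentMeta(String name, UUID uuid, int fragmentId) {
            this.name = name;
            this.uuid = uuid;
            this.fragmentId = fragmentId;
        }
    }

    // Executor side: each task builds under its own fresh UUID, so no two
    // tasks ever target the same directory.
    static SegmentMeta buildSegment(String indexName, int fragmentId) {
        return new SegmentMeta(indexName, UUID.randomUUID(), fragmentId);
    }

    // Driver side: gather one metadata entry per fragment for a single commit.
    static List<SegmentMeta> collectSegments(String indexName, int fragmentCount) {
        List<SegmentMeta> segments = new ArrayList<>();
        for (int fragmentId = 0; fragmentId < fragmentCount; fragmentId++) {
            segments.add(buildSegment(indexName, fragmentId));
        }
        return segments;
    }

    // Reader side: group entries by name so N segments read as one logical index.
    static Map<String, List<SegmentMeta>> groupByName(List<SegmentMeta> segments) {
        Map<String, List<SegmentMeta>> byName = new TreeMap<>();
        for (SegmentMeta s : segments) {
            byName.computeIfAbsent(s.name, k -> new ArrayList<>()).add(s);
        }
        return byName;
    }

    public static void main(String[] args) {
        List<SegmentMeta> segments = collectSegments("id_zonemap", 20);
        Map<String, List<SegmentMeta>> byName = groupByName(segments);
        System.out.println(byName.size());                   // prints 1
        System.out.println(byName.get("id_zonemap").size()); // prints 20
    }
}
```

The index name "id_zonemap" is made up for the example. The point is only that distinct UUIDs remove the shared write path while a shared name preserves the single logical index.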
Scope of this issue
- Investigate (A) vs (C); pick the smaller-blast-radius fix.
- Apply to ZONEMAP first (visible symptom), then BTree fragment-mode (silent symptom).
- Add a strict per-fragment-coverage test once the fix lands:
assertEquals(fragmentsIndexed, fragmentIdsWithStats.size()).
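One possible shape for the strict check, sketched without a test framework so it runs on its own. `CoverageCheck`, `missingFragments`, and the nested `ZoneStats` class are hypothetical stand-ins for the types used in the reproduction. Comparing sets rather than sizes is a design choice in this sketch: a failure then names the uncovered fragments instead of reporting only a count mismatch.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class CoverageCheck {
    // Hypothetical stand-in for the ZoneStats type used in the reproduction.
    static final class ZoneStats {
        private final int fragmentId;

        ZoneStats(int fragmentId) {
            this.fragmentId = fragmentId;
        }

        int getFragmentId() {
            return fragmentId;
        }
    }

    // Returns the expected fragment IDs that have no zone stats at all.
    static Set<Integer> missingFragments(Set<Integer> expected, List<ZoneStats> stats) {
        Set<Integer> covered =
            stats.stream().map(ZoneStats::getFragmentId).collect(Collectors.toSet());
        Set<Integer> missing = new TreeSet<>(expected);
        missing.removeAll(covered);
        return missing;
    }

    public static void main(String[] args) {
        Set<Integer> expected = Set.of(0, 1, 2);
        // Only fragment 1 has stats, as after a lost race.
        List<ZoneStats> stats = List.of(new ZoneStats(1));
        System.out.println(missingFragments(expected, stats)); // prints [0, 2]
    }
}
```

In the real test, the equivalent assertion would be that the missing set is empty for the table's fragment IDs.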
Related
- USING zonemap SQL support; documents this limitation in the PR but does not fix.
- getZonemapStats to feed Spark CBO column statistics; the race directly bounds the achievable coverage on un-ANALYZEd tables.