Skip to content

Distributed scalar-index build races on per-fragment zonemap.lance #514

@LuciferYang

Description

@LuciferYang

Background

AddIndexExec's FragmentBasedIndexJob spawns one Spark task per fragment, all sharing a single index UUID via IndexOptions.withIndexUUID(uuid). On the lance-core side this maps each task's per-fragment build to the same indices_dir().join(uuid) directory (rust/lance/src/index/scalar.rs:287LanceIndexStore::from_dataset_for_new(dataset, uuid)).

For index types that emit a single artifact per build, all N executors race to write the same object-store path. For ZONEMAP that artifact is zonemap.lance (rust/lance-index/src/scalar/zonemap.rs:50,745); whichever task commits last wins, and the read path (getZonemapStats) returns only one fragment's worth of zone stats.

Reproduction

AddIndexTest.testCreateZonemapIndex on a 20-fragment table (Spark local[10], 20 single-row inserts):

Set<Integer> fragmentIdsWithStats =
    zonemapStats("id").stream()
        .map(ZoneStats::getFragmentId)
        .collect(Collectors.toSet());
// expected: 20, actual: 1

This was the empirical confirmation that surfaced during review of #513. The current PR keeps the loose assertFalse(zonemapStats("id").isEmpty()) and documents the limitation.

Why this is also a BTree concern

The same code path is used for USING btree (default fragment build_mode). The race is functionally invisible there because btree's user-visible behavior is correctness-preserving: a fragment without an index entry simply gets a full scan — slower, never wrong. That's why the existing testCreateIndexDistributed has not caught it. The race is real, just silent.

Fix options

(A) Single-call driver-side build for affected types. Have AddIndexExec call dataset.createIndex(opts) once on the driver with all fragment IDs. build_scalar_index then loads training data across every fragment and produces a single consistent artifact. Loses per-fragment parallelism (acceptable for ZONEMAP — small payload — possibly not for large btree builds).

(B) Scalar-index segment commit path. Mirror lance-core's vector segment-commit pattern (commit_existing_index_segments, rust/lance/src/dataset/index.rs:230) for scalar indexes. Each task gets a fresh UUID and writes to its own directory; the driver collects N IndexMetadata segments and commits them as separate entries with the same name. getZonemapStats already supports multi-segment reads (java/lance-jni/src/blocking_dataset.rs:3437–3470). Requires a lance-core API extension.

(C) Per-task UUID with manifest-side join. Have lance-spark commit N IndexMetadata entries (different UUIDs, same name) directly via AddIndexOperation.withNewIndices(List). The lance-core describe_indices already groups by name (rust/lance/src/index.rs:957–962), so consumers see one logical index. Most code-local to lance-spark.

I attempted (A) first but ran into manifest-visibility issues that need deeper Lance-runtime debugging (driver-side createIndex returns successfully but getIndexes reports zero entries — possibly session-cache opacity). (C) is the most attractive given existing lance-core infrastructure.

Scope of this issue

  • Investigate (A) vs (C); pick the smaller-blast-radius fix.
  • Apply to ZONEMAP first (visible symptom), then BTree fragment-mode (silent symptom).
  • Add a strict per-fragment-coverage test once the fix lands: assertEquals(fragmentsIndexed, fragmentIdsWithStats.size()).

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions