Recognize USING zonemap as a CREATE INDEX method #512

@LuciferYang

Background

ALTER TABLE ... CREATE INDEX ... USING <method> currently accepts btree and fts only (see IndexUtils.buildIndexType in AddIndexExec.scala). Lance's underlying IndexType enum already has a ZONEMAP value, but there is no way to request a zonemap index from SQL today.

This matters because btree does NOT produce a zonemap as a side-effect. Verified in lance-core:

  • BTreeIndexPlugin::index_type() returns IndexType::BTree (rust/lance-index/src/scalar/btree.rs:1474).
  • BTree training writes only page_lookup.lance + page_data.lance (btree.rs:71-72, all new_index_file(...) call sites at btree.rs:1893,1942,1947,2206); it never writes zonemap.lance.
  • Dataset.getZonemapStats(col) filters by desc.index_type().to_lowercase().contains("zonemap") (java/lance-jni/src/blocking_dataset.rs:3437–3439) and then opens zonemap.lance directly — so a btree-typed entry does not satisfy the filter, and even if it did, the file lookup would fail.

Net: until this issue is resolved, the only ways to get a zonemap on a Lance dataset are (a) call the lower-level Lance Java SDK directly, or (b) skip zonemap entirely and accept that lance-spark's CBO column-statistics feature has no source for per-column min/max/nullCount on tables created via SQL.

Why expose it

A zonemap-only index is meaningfully cheaper than btree:

  • Smaller on-disk footprint (one zonemap.lance file with per-zone summaries; no per-page lookup tree).
  • Faster build time.
  • Sufficient for: range / equality fragment pruning at scan time, runtime / dynamic-file-pruning filters, and column-statistics reporting to Spark's CBO.

When the workload doesn't need fast single-row point lookups, btree's overhead is wasted. Even when it does need them, btree alone does not enable CBO column statistics, which require a zonemap.

Proposal

Recognize zonemap as a method name in both directions of the IndexUtils mapping:

def buildIndexType(method: String): IndexType = method.toLowerCase match {
  case "btree"   => IndexType.BTREE
  case "fts"     => IndexType.INVERTED
  case "zonemap" => IndexType.ZONEMAP   // <-- added
  case other     => throw new UnsupportedOperationException(...)
}

def buildScalarIndexParamType(method: String): String = method.toLowerCase match {
  case "btree"   => "btree"
  case "fts"     => "inverted"
  case "zonemap" => "zonemap"           // <-- added
  case other     => throw new UnsupportedOperationException(...)
}

The PR also needs to skip dataset.mergeIndexMetadata(...) for IndexType.ZONEMAP, because lance-core's merge_index_metadata has no ZoneMap arm and would throw Unsupported index type at runtime; per-fragment zonemap.lance files are independently sufficient for getZonemapStats.
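One way to express the skip is a small predicate consulted before the merge call. The following is a minimal sketch, not the PR's actual code: the `IndexType` values are stand-ins for Lance's enum, `requiresMetadataMerge` and `finishIndexBuild` are hypothetical names, and the `merge` callback stands in for dataset.mergeIndexMetadata(...), whose real signature is not shown in this issue. It also assumes that btree and fts keep merging as they do today.

```scala
object ZonemapMergeSketch {
  // Stand-in for Lance's IndexType enum; only the values named in this issue.
  sealed trait IndexType
  case object BTREE extends IndexType
  case object INVERTED extends IndexType
  case object ZONEMAP extends IndexType

  // Hypothetical helper: true when the build should run the metadata merge.
  // ZONEMAP is excluded because merge_index_metadata has no ZoneMap arm.
  def requiresMetadataMerge(indexType: IndexType): Boolean = indexType match {
    case ZONEMAP => false
    case _       => true
  }

  // Hypothetical driver: `merge` stands in for dataset.mergeIndexMetadata(...).
  // Returns whether the merge was invoked.
  def finishIndexBuild(indexType: IndexType)(merge: () => Unit): Boolean = {
    val shouldMerge = requiresMetadataMerge(indexType)
    if (shouldMerge) merge()
    shouldMerge
  }

  def main(args: Array[String]): Unit = {
    var mergeCalls = 0
    finishIndexBuild(BTREE)(() => mergeCalls += 1)
    finishIndexBuild(ZONEMAP)(() => mergeCalls += 1)
    println(s"merge calls: $mergeCalls") // only the BTREE build merges
  }
}
```

Keeping the decision in one predicate means a future index type that also lacks a merge arm could be added in a single place.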

Test plan

  • New IndexUtilsTest in lance-spark-base_2.12/src/test/java/...:
    • All three method names map to the correct IndexType.
    • Lookup is case-insensitive (already the existing contract).
    • Reverse lookup buildScalarIndexParamType round-trips for all three.
    • Unknown methods still throw UnsupportedOperationException.
  • Integration tests in BaseAddIndexTest covering:
    • zonemap creation succeeds end-to-end.
    • getZonemapStats is empty after a btree-only build.
    • Same-name USING btree followed by USING zonemap is last-write-wins.
    • The reverse order (USING zonemap followed by USING btree) is also last-write-wins.
    • Indexes with different names coexist.
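The unit-level checks in the test plan can be sketched without a Lance dependency. This is an illustration under assumptions, not the planned IndexUtilsTest: the `IndexType` values are stand-ins for Lance's enum, the object name `IndexUtilsSketch` is hypothetical, and the exception message is invented because the issue elides it.

```scala
import scala.util.Try

object IndexUtilsSketch {
  // Stand-in for Lance's IndexType enum; only the values named in this issue.
  sealed trait IndexType
  case object BTREE extends IndexType
  case object INVERTED extends IndexType
  case object ZONEMAP extends IndexType

  // Mirrors the proposed mapping; the message text is an assumption.
  def buildIndexType(method: String): IndexType = method.toLowerCase match {
    case "btree"   => BTREE
    case "fts"     => INVERTED
    case "zonemap" => ZONEMAP
    case other     => throw new UnsupportedOperationException(s"Unsupported index method: $other")
  }

  def buildScalarIndexParamType(method: String): String = method.toLowerCase match {
    case "btree"   => "btree"
    case "fts"     => "inverted"
    case "zonemap" => "zonemap"
    case other     => throw new UnsupportedOperationException(s"Unsupported index method: $other")
  }

  // True when the call fails with UnsupportedOperationException specifically.
  def throwsUnsupported(body: => Any): Boolean =
    Try(body).failed.toOption.exists(_.isInstanceOf[UnsupportedOperationException])

  def main(args: Array[String]): Unit = {
    // All three method names map to the expected type.
    assert(buildIndexType("btree") == BTREE)
    assert(buildIndexType("fts") == INVERTED)
    assert(buildIndexType("zonemap") == ZONEMAP)
    // Lookup is case-insensitive.
    assert(buildIndexType("ZoneMap") == ZONEMAP)
    // Parameter-type names for all three methods.
    assert(buildScalarIndexParamType("BTREE") == "btree")
    assert(buildScalarIndexParamType("fts") == "inverted")
    assert(buildScalarIndexParamType("zonemap") == "zonemap")
    // Unknown methods still throw.
    assert(throwsUnsupported(buildIndexType("hash")))
    assert(throwsUnsupported(buildScalarIndexParamType("hash")))
    println("all checks passed")
  }
}
```

The real test would exercise IndexUtils directly and compare against Lance's own IndexType values rather than these stand-ins.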

Out of scope

This issue is scope-limited to recognizing the keyword and making the build pipeline tolerant of IndexType.ZONEMAP. Tooling discussion (Spark UI / SHOW INDEXES rendering, default index choice for CBO column statistics) belongs in separate issues.

PR

#513.
