Background
ALTER TABLE ... CREATE INDEX ... USING <method> currently accepts btree and fts only (see IndexUtils.buildIndexType in AddIndexExec.scala). Lance's underlying IndexType enum already has a ZONEMAP value, but there is no way to request a zonemap index from SQL today.
This matters because btree does NOT produce a zonemap as a side-effect. Verified in lance-core:
- BTreeIndexPlugin::index_type() → IndexType::BTree (rust/lance-index/src/scalar/btree.rs:1474).
- BTree training writes only page_lookup.lance + page_data.lance (btree.rs:71-72, all new_index_file(...) call sites at btree.rs:1893,1942,1947,2206); it never writes zonemap.lance.
- Dataset.getZonemapStats(col) filters by desc.index_type().to_lowercase().contains("zonemap") (java/lance-jni/src/blocking_dataset.rs:3437–3439) and then opens zonemap.lance directly, so a btree-typed entry does not satisfy the filter, and even if it did, the file lookup would fail.
Net: until this issue is resolved, the only ways to get a zonemap on a Lance dataset are (a) call the lower-level Lance Java SDK directly, or (b) skip zonemap entirely and accept that lance-spark's CBO column-statistics feature has no source for per-column min/max/nullCount on tables created via SQL.
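To make the filter behaviour concrete, here is a minimal Scala sketch of the same substring check. It is not the lance-jni code: IndexDesc, isZonemap and zonemapCandidates are hypothetical stand-ins, and the type strings "BTree" and "ZoneMap" are illustrative assumptions about what index_type() reports.

```scala
import java.util.Locale

object ZonemapFilterSketch {
  // Hypothetical stand-in for an index descriptor; only the type string matters here.
  final case class IndexDesc(name: String, indexType: String)

  // Same shape as the quoted check: lowercase the type string, then look for "zonemap".
  def isZonemap(desc: IndexDesc): Boolean =
    desc.indexType.toLowerCase(Locale.ROOT).contains("zonemap")

  // Keeps only the descriptors a zonemap-stats lookup would consider.
  def zonemapCandidates(descs: Seq[IndexDesc]): Seq[IndexDesc] =
    descs.filter(isZonemap)
}
```

Under these assumptions, a dataset whose only index is btree-typed yields no candidates, so there is nothing for a stats lookup to open.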
Why expose it
A zonemap-only index is meaningfully cheaper than btree:
- Smaller on-disk footprint (one zonemap.lance file with per-zone summaries; no per-page lookup tree).
- Faster build time.
- Sufficient for: range / equality fragment pruning at scan time, runtime / dynamic-file-pruning filters, and column-statistics reporting to Spark's CBO.
When the workload doesn't need fast single-row point lookups, btree's overhead is wasted. Even when it does need them, btree alone does not enable CBO column statistics, which require a zonemap.
Proposal
Recognize zonemap as a method name in both directions of the IndexUtils mapping:
def buildIndexType(method: String): IndexType = method.toLowerCase match {
  case "btree" => IndexType.BTREE
  case "fts" => IndexType.INVERTED
  case "zonemap" => IndexType.ZONEMAP // <-- added
  case other => throw new UnsupportedOperationException(...)
}

def buildScalarIndexParamType(method: String): String = method.toLowerCase match {
  case "btree" => "btree"
  case "fts" => "inverted"
  case "zonemap" => "zonemap" // <-- added
  case other => throw new UnsupportedOperationException(...)
}
The PR also needs to skip dataset.mergeIndexMetadata(...) for IndexType.ZONEMAP, because lance-core's merge_index_metadata has no ZoneMap arm and would throw Unsupported index type at runtime; per-fragment zonemap.lance files are independently sufficient for getZonemapStats.
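One way the merge skip could look, as a rough sketch only. The IndexType trait below is a local stand-in for Lance's enum, needsMetadataMerge and finishBuild are hypothetical names, and the real dataset.mergeIndexMetadata(...) call (whose arguments are not shown in this issue) is represented by an opaque callback.

```scala
object MergeSkipSketch {
  // Local stand-in for Lance's IndexType enum, limited to the values discussed here.
  sealed trait IndexType
  object IndexType {
    case object BTREE extends IndexType
    case object INVERTED extends IndexType
    case object ZONEMAP extends IndexType
  }

  // Hypothetical predicate: zonemap builds skip the merge step, other types keep it.
  def needsMetadataMerge(indexType: IndexType): Boolean = indexType match {
    case IndexType.ZONEMAP => false
    case _                 => true
  }

  // Hypothetical wrapper: runs the merge callback only when needed.
  // Returns true if the merge ran, false if it was skipped.
  def finishBuild(indexType: IndexType)(merge: () => Unit): Boolean =
    if (needsMetadataMerge(indexType)) { merge(); true } else false
}
```

Keeping the decision in a single predicate would let the unit tests cover the skip without touching a real dataset.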
Test plan
- New IndexUtilsTest in lance-spark-base_2.12/src/test/java/...:
  - All three method names map to the correct IndexType.
  - Lookup is case-insensitive (already the existing contract).
  - Reverse lookup buildScalarIndexParamType round-trips for all three.
  - Unknown methods still throw UnsupportedOperationException.
- Integration tests in BaseAddIndexTest covering: zonemap creation succeeds end-to-end; getZonemapStats is empty after a btree-only build; same-name USING btree then USING zonemap is last-write-wins; reverse direction; different names coexist.
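The unit-level expectations above can be sketched against a self-contained copy of the two mappings. This is an illustration, not the actual IndexUtils: the IndexType trait is a local stand-in for Lance's enum, the exception message is made up for the example, and Locale.ROOT is an added assumption to keep lowercasing locale-independent.

```scala
import java.util.Locale

object IndexUtilsSketch {
  // Local stand-in for Lance's IndexType enum.
  sealed trait IndexType
  object IndexType {
    case object BTREE extends IndexType
    case object INVERTED extends IndexType
    case object ZONEMAP extends IndexType
  }

  // SQL method name -> index type, case-insensitive.
  def buildIndexType(method: String): IndexType =
    method.toLowerCase(Locale.ROOT) match {
      case "btree"   => IndexType.BTREE
      case "fts"     => IndexType.INVERTED
      case "zonemap" => IndexType.ZONEMAP
      // Illustrative message; the real one is elided in the proposal.
      case other => throw new UnsupportedOperationException(s"Unsupported index method: $other")
    }

  // SQL method name -> scalar index parameter type string.
  def buildScalarIndexParamType(method: String): String =
    method.toLowerCase(Locale.ROOT) match {
      case "btree"   => "btree"
      case "fts"     => "inverted"
      case "zonemap" => "zonemap"
      case other => throw new UnsupportedOperationException(s"Unsupported index method: $other")
    }
}
```

For example, IndexUtilsSketch.buildIndexType("ZoneMap") resolves to IndexType.ZONEMAP, while an unrecognized name such as "bitmap" throws UnsupportedOperationException.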
Out of scope
This issue is scope-limited to recognizing the keyword and making the build pipeline tolerant of IndexType.ZONEMAP. Tooling discussion (Spark UI / SHOW INDEXES rendering, default index choice for CBO column statistics) belongs in separate issues.
PR
#513.