Commit 2959c55

feat(zonemap): add build-time consolidated build path
Wire lance-core's computeZonemapBatch + writeZonemapIndexFromBatches APIs into AddIndexExec. When spark.lance.zonemap.consolidate.enabled=true, the consumer routes through runZonemapConsolidated:

- executors call dataset.computeZonemapBatch on their fragment and return per-zone min/max stats as Arrow-IPC-encoded bytes
- driver decodes every batch into VectorSchemaRoots and calls dataset.writeZonemapIndexFromBatches once, producing a single <uuid>/zonemap.lance file covering the union of all fragments
- driver commits exactly one IndexMetadata entry via the same AddIndexOperation path used by runZonemapDistributed

Default off: preserves the multi-segment distributed shape the read path has served for the entire history of this code.

sf=100 store_sales A/B (ss_sold_date_sk, local[*], Spark 4.0):

| metric              | distributed | consolidated |
|---------------------|-------------|--------------|
| wall-clock          | 15.0 s      | 28.1 s       |
| index segments      | 234         | 1            |
| manifest-referenced | 1,099,920 B | 137,835 B    |

The 8x footprint shrink comes from amortising Lance file overhead (header + footer + schema metadata) across one consolidated file instead of paying it 234 times. Wall-clock regression is the expected trade-off: parallel per-fragment writes become a single driver-side write. At larger scales and on object stores with high per-PUT latency, manifest- and listing-cost wins on the read side should pay this back.

Depends on the new lance-core APIs landing upstream (see lance-format/lance#6779 and #6780).
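The consolidated flow described above can be pictured with a small standalone sketch. It is not the lance-core or AddIndexExec implementation: `ZoneStat`, `compute_zone_stats`, `consolidate`, `zones_to_scan`, and the `rows_per_zone` parameter are hypothetical names chosen for illustration, plain Python lists stand in for the Arrow-IPC-encoded batches, and the real zone schema may differ. It only shows the shape of the idea: each fragment yields per-zone min/max stats independently, and a single merge step builds one index covering all fragments.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ZoneStat:
    """Hypothetical per-zone record; the real lance-core schema may differ."""
    fragment_id: int
    zone_start: int    # row offset of the zone within its fragment
    zone_length: int   # number of rows in the zone
    min_value: int
    max_value: int


def compute_zone_stats(fragment_id: int, values: Sequence[int],
                       rows_per_zone: int) -> List[ZoneStat]:
    """Executor-side step (illustrative): min/max per fixed-size zone."""
    stats = []
    for start in range(0, len(values), rows_per_zone):
        zone = values[start:start + rows_per_zone]
        stats.append(ZoneStat(fragment_id, start, len(zone), min(zone), max(zone)))
    return stats


def consolidate(per_fragment: Sequence[List[ZoneStat]]) -> List[ZoneStat]:
    """Driver-side step (illustrative): merge every batch into one index."""
    merged = [stat for batch in per_fragment for stat in batch]
    merged.sort(key=lambda s: (s.fragment_id, s.zone_start))
    return merged


def zones_to_scan(index: Sequence[ZoneStat], lo: int, hi: int) -> List[ZoneStat]:
    """Keep only zones whose [min, max] range overlaps the query range [lo, hi]."""
    return [s for s in index if s.max_value >= lo and s.min_value <= hi]


# Two toy fragments, processed independently and then merged once.
fragments = {0: [1, 5, 3, 9, 7], 1: [20, 22, 21]}
batches = [compute_zone_stats(fid, vals, rows_per_zone=2)
           for fid, vals in fragments.items()]
index = consolidate(batches)
```

In this toy version the merge is a plain list concatenation. In the commit, that single driver-side step is the one write that replaces 234 parallel per-fragment writes, which matches the wall-clock trade-off in the table.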
1 parent 8ca3232 commit 2959c55

2 files changed

Lines changed: 1119 additions & 11 deletions
