Commit 2959c55


feat(zonemap): add build-time consolidated build path

Wire lance-core's computeZonemapBatch + writeZonemapIndexFromBatches APIs into AddIndexExec. When spark.lance.zonemap.consolidate.enabled=true, the consumer routes through runZonemapConsolidated:

- executors call dataset.computeZonemapBatch on their fragment and return per-zone min/max stats as Arrow-IPC-encoded bytes
- driver decodes every batch into VectorSchemaRoots and calls dataset.writeZonemapIndexFromBatches once, producing a single <uuid>/zonemap.lance file covering the union of all fragments
- driver commits exactly one IndexMetadata entry via the same AddIndexOperation path used by runZonemapDistributed

Default off: preserves the multi-segment distributed shape the read path has served for the entire history of this code.

sf=100 store_sales A/B (ss_sold_date_sk, local[*], Spark 4.0):

| metric              | distributed | consolidated |
|---------------------|-------------|--------------|
| wall-clock          | 15.0 s      | 28.1 s       |
| index segments      | 234         | 1            |
| manifest-referenced | 1,099,920 B | 137,835 B    |

The 8x footprint shrink comes from amortising Lance file overhead (header + footer + schema metadata) across one consolidated file instead of paying it 234 times. Wall-clock regression is the expected trade-off: parallel per-fragment writes become a single driver-side write. At larger scales and on object stores with high per-PUT latency, manifest- and listing-cost wins on the read side should pay this back.

Depends on the new lance-core APIs landing upstream (see lance-format/lance#6779 and #6780).
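For readers unfamiliar with the gather-then-write-once shape described in the message, the following is a minimal stdlib-only sketch of it, not the code in AddIndexExec. Every name in it (`ConsolidatedZonemapSketch`, `ZoneStats`, `computeAndEncode`, `decodeAll`) is hypothetical. It assumes a plain `DataOutputStream` encoding as a stand-in for Arrow IPC and an in-memory list as a stand-in for the single `<uuid>/zonemap.lance` file. It uses no lance-core or Spark APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only: none of these names exist in lance-core or lance-spark.
final class ConsolidatedZonemapSketch {

    /** Hypothetical per-zone row; the real path ships Arrow record batches. */
    static final class ZoneStats {
        final int fragmentId;
        final long zoneStart;
        final long zoneLength;
        final long min;
        final long max;

        ZoneStats(int fragmentId, long zoneStart, long zoneLength, long min, long max) {
            this.fragmentId = fragmentId;
            this.zoneStart = zoneStart;
            this.zoneLength = zoneLength;
            this.min = min;
            this.max = max;
        }
    }

    /** "Executor" side: compute min/max per zone of one fragment and encode the result as bytes. */
    static byte[] computeAndEncode(int fragmentId, long[] values, int rowsPerZone) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            int zones = (values.length + rowsPerZone - 1) / rowsPerZone; // ceiling division
            out.writeInt(zones);
            for (int z = 0; z < zones; z++) {
                int start = z * rowsPerZone;
                int end = Math.min(start + rowsPerZone, values.length);
                long min = Long.MAX_VALUE;
                long max = Long.MIN_VALUE;
                for (int i = start; i < end; i++) {
                    min = Math.min(min, values[i]);
                    max = Math.max(max, values[i]);
                }
                out.writeInt(fragmentId);
                out.writeLong(start);
                out.writeLong(end - start);
                out.writeLong(min);
                out.writeLong(max);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** "Driver" side: decode every payload into one combined list, written once. */
    static List<ZoneStats> decodeAll(List<byte[]> payloads) {
        List<ZoneStats> all = new ArrayList<>();
        for (byte[] payload : payloads) {
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
                int zones = in.readInt();
                for (int z = 0; z < zones; z++) {
                    all.add(new ZoneStats(in.readInt(), in.readLong(), in.readLong(),
                                          in.readLong(), in.readLong()));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        // Two toy fragments, two rows per zone.
        List<byte[]> payloads = Arrays.asList(
            computeAndEncode(0, new long[] {5, 1, 9, 7}, 2),
            computeAndEncode(1, new long[] {3, 8, 2}, 2));
        for (ZoneStats s : decodeAll(payloads)) {
            System.out.println("fragment=" + s.fragmentId + " start=" + s.zoneStart
                + " len=" + s.zoneLength + " min=" + s.min + " max=" + s.max);
        }
    }
}
```

The point of the shape is that only small stats payloads cross the executor/driver boundary, and the per-file overhead is paid once on the driver rather than once per fragment. That is also why the write step is no longer parallel.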

1 parent 8ca3232 commit 2959c55

2 files changed

lance-spark-base_2.12/src
- main/scala/org/apache/spark/sql/execution/datasources/v2
  - AddIndexExec.scala
- test/java/org/lance/spark/update
  - BaseAddIndexTest.java
