Commit 2959c55
feat(zonemap): add build-time consolidated build path
Wire lance-core's computeZonemapBatch + writeZonemapIndexFromBatches
APIs into AddIndexExec. When spark.lance.zonemap.consolidate.enabled=true,
the consumer routes through runZonemapConsolidated:
- executors call dataset.computeZonemapBatch on their fragment and
  return per-zone min/max stats as Arrow-IPC-encoded bytes
- driver decodes every batch into VectorSchemaRoots and calls
  dataset.writeZonemapIndexFromBatches once, producing a single
  <uuid>/zonemap.lance file covering the union of all fragments
- driver commits exactly one IndexMetadata entry via the same
  AddIndexOperation path used by runZonemapDistributed
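The three steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in this commit: ZoneStat, computeOnExecutor and consolidateOnDriver are hypothetical names, a fixed-width ByteBuffer layout stands in for Arrow IPC, and an in-memory list stands in for the single zonemap file. The real lance-core signatures of computeZonemapBatch and writeZonemapIndexFromBatches are not reproduced here.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from lance-core or lance-spark.
final class ZonemapConsolidationSketch {

  // Encoded size of one zone: fragmentId (int) + zoneStart, min, max (longs).
  static final int ZONE_BYTES = Integer.BYTES + 3 * Long.BYTES;

  /** Min/max statistics for one zone of one fragment. */
  static final class ZoneStat {
    final int fragmentId;
    final long zoneStart;
    final long min;
    final long max;

    ZoneStat(int fragmentId, long zoneStart, long min, long max) {
      this.fragmentId = fragmentId;
      this.zoneStart = zoneStart;
      this.min = min;
      this.max = max;
    }
  }

  /** Executor side: compute per-zone min/max for one fragment and encode it. */
  static byte[] computeOnExecutor(int fragmentId, long[] values, int zoneSize) {
    int zones = (values.length + zoneSize - 1) / zoneSize;
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + zones * ZONE_BYTES);
    buf.putInt(zones);
    for (int start = 0; start < values.length; start += zoneSize) {
      int end = Math.min(start + zoneSize, values.length);
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (int i = start; i < end; i++) {
        min = Math.min(min, values[i]);
        max = Math.max(max, values[i]);
      }
      buf.putInt(fragmentId).putLong(start).putLong(min).putLong(max);
    }
    return buf.array();
  }

  /** Driver side: decode every payload and merge the zones into one list. */
  static List<ZoneStat> consolidateOnDriver(List<byte[]> payloads) {
    List<ZoneStat> all = new ArrayList<>();
    for (byte[] payload : payloads) {
      ByteBuffer buf = ByteBuffer.wrap(payload);
      int zones = buf.getInt();
      for (int z = 0; z < zones; z++) {
        int fragmentId = buf.getInt();
        long zoneStart = buf.getLong();
        long min = buf.getLong();
        long max = buf.getLong();
        all.add(new ZoneStat(fragmentId, zoneStart, min, max));
      }
    }
    return all;
  }

  public static void main(String[] args) {
    // Two fragments, zone size 2: each "executor" returns one payload.
    List<byte[]> payloads = new ArrayList<>();
    payloads.add(computeOnExecutor(0, new long[] {5, 1, 9, 7}, 2));
    payloads.add(computeOnExecutor(1, new long[] {3, 8, 2}, 2));
    List<ZoneStat> merged = consolidateOnDriver(payloads);
    System.out.println("zones in consolidated index: " + merged.size());
  }
}
```

The point of the shape is that only small stats payloads cross the executor/driver boundary, and the driver performs a single write and a single commit regardless of fragment count.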
Default off: preserves the multi-segment distributed shape the read
path has served for the entire history of this code.
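The opt-in routing can be pictured as below. Only the key spark.lance.zonemap.consolidate.enabled comes from this commit; the class, the enum, and reading the setting from a plain Map are assumptions made for illustration, and AddIndexExec may resolve the flag differently.

```java
import java.util.Map;

// Hypothetical sketch of the flag check; not the actual AddIndexExec code.
final class ZonemapBuildPathSketch {

  static final String CONSOLIDATE_KEY = "spark.lance.zonemap.consolidate.enabled";

  /** Mirrors the two paths named in the message (runZonemapDistributed / runZonemapConsolidated). */
  enum BuildPath { DISTRIBUTED, CONSOLIDATED }

  /** An absent or non-"true" value keeps the existing distributed path (default off). */
  static BuildPath choosePath(Map<String, String> conf) {
    boolean consolidate = Boolean.parseBoolean(conf.getOrDefault(CONSOLIDATE_KEY, "false"));
    return consolidate ? BuildPath.CONSOLIDATED : BuildPath.DISTRIBUTED;
  }

  public static void main(String[] args) {
    System.out.println(choosePath(Map.of()));
    System.out.println(choosePath(Map.of(CONSOLIDATE_KEY, "true")));
  }
}
```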
sf=100 store_sales A/B (ss_sold_date_sk, local[*], Spark 4.0):
| metric              | distributed | consolidated |
|---------------------|-------------|--------------|
| wall-clock          | 15.0 s      | 28.1 s       |
| index segments      | 234         | 1            |
| manifest-referenced | 1,099,920 B | 137,835 B    |
The roughly 8x footprint shrink (1,099,920 B down to 137,835 B) comes
from amortising Lance file overhead (header + footer + schema metadata)
across one consolidated file instead of paying it 234 times. The
roughly 1.9x wall-clock regression (15.0 s to 28.1 s) is the expected
trade-off: parallel per-fragment writes become a single driver-side
write. At larger scales and on object stores with high per-PUT
latency, manifest- and listing-cost wins on the read side should
pay this back.
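The ratios quoted above follow directly from the table. The snippet below only redoes that arithmetic; the constants are copied from the A/B table, and the per-segment average is a derived figure, not a measured per-file overhead.

```java
// Recomputes the A/B ratios from the table values; class and method names are illustrative.
final class ZonemapAbNumbers {

  static final double DISTRIBUTED_BYTES = 1_099_920;
  static final double CONSOLIDATED_BYTES = 137_835;
  static final double DISTRIBUTED_SEGMENTS = 234;
  static final double DISTRIBUTED_SECONDS = 15.0;
  static final double CONSOLIDATED_SECONDS = 28.1;

  /** How many times smaller the consolidated index is. */
  static double footprintRatio() {
    return DISTRIBUTED_BYTES / CONSOLIDATED_BYTES;
  }

  /** Average manifest-referenced bytes per segment on the distributed path. */
  static double avgBytesPerDistributedSegment() {
    return DISTRIBUTED_BYTES / DISTRIBUTED_SEGMENTS;
  }

  /** How many times slower the consolidated build was in this run. */
  static double wallClockRatio() {
    return CONSOLIDATED_SECONDS / DISTRIBUTED_SECONDS;
  }

  public static void main(String[] args) {
    System.out.printf("footprint shrink:      %.2fx%n", footprintRatio());
    System.out.printf("avg bytes per segment: %.1f%n", avgBytesPerDistributedSegment());
    System.out.printf("wall-clock slowdown:   %.2fx%n", wallClockRatio());
  }
}
```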
Depends on the new lance-core APIs landing upstream (see
lance-format/lance#6779 and #6780).
1 parent 8ca3232 commit 2959c55
2 files changed
Lines changed: 1119 additions & 11 deletions
File tree
- lance-spark-base_2.12/src
- main/scala/org/apache/spark/sql/execution/datasources/v2
- test/java/org/lance/spark/update