What is the problem the feature request solves?
Comet has native scan support for formats such as Iceberg, but Lance tables planned by Spark through Lance Spark currently execute through Spark's Lance reader. A native Lance path would let Comet read ordinary Lance table scans directly in Rust while preserving Spark/Lance Spark as the planning contract.
The native Lance path should be optional, and default Comet builds should not take on any Lance dependency.
Describe the potential solution
Add Lance as an experimental, opt-in Comet contrib reader:
- Keep Spark planning Lance tables through Lance Spark.
- Detect Lance V2 BatchScanExec plans by reflection in Comet core.
- Extract a stable native-read descriptor exposed by Lance Spark.
- Add a build-gated contrib-lance profile / Cargo feature so default builds do not depend on Lance.
- Add a typed lance_scan = 118 native proto payload with common scan invariants and per-partition split descriptors.
- Execute the assigned Lance fragments through the native Rust Lance API.
- Gate runtime activation behind spark.comet.scan.lanceNative.enabled, which defaults to false.
Minimal v1 should target ordinary Lance table reads only: local/direct object-store storage options, projection, filter SQL, limit/offset, batch size, fragment splits, and Comet-supported Spark types. Search, hybrid search, index-backed planning, namespace-backed credential refresh, metadata/version columns, and aggregation pushdown should be added in later phases after separate semantic review.
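The v1 boundary above amounts to an allow-list: a scan is eligible for the native path only if everything it needs is in scope, and anything else produces a fallback reason. A minimal Rust sketch of that check follows. The `ScanCapability` enum and the `fallback_reasons` helper are hypothetical names introduced for illustration; they are not part of Comet or Lance Spark.

```rust
use std::collections::BTreeSet;

/// Capabilities a planned Lance scan might require.
/// Hypothetical enum: the variants mirror the v1 scope list in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ScanCapability {
    Projection,
    FilterSql,
    LimitOffset,
    BatchSize,
    FragmentSplits,
    Search,
    HybridSearch,
    IndexBackedPlanning,
    NamespaceCredentialRefresh,
    MetadataVersionColumns,
    AggregationPushdown,
}

/// True for capabilities that belong to the minimal v1 scope.
fn in_v1_scope(cap: ScanCapability) -> bool {
    use ScanCapability::*;
    matches!(
        cap,
        Projection | FilterSql | LimitOffset | BatchSize | FragmentSplits
    )
}

/// Returns one fallback reason per out-of-scope capability.
/// An empty result means the scan is a candidate for the native path.
fn fallback_reasons(required: &BTreeSet<ScanCapability>) -> Vec<String> {
    required
        .iter()
        .filter(|cap| !in_v1_scope(**cap))
        .map(|cap| format!("{:?} is not supported by the v1 native Lance scan", cap))
        .collect()
}
```

A scan that needs only `Projection` and `FilterSql` yields no reasons, while adding `HybridSearch` yields exactly one. Keeping the later-phase features as explicit variants makes each one a visible, reviewable change when it moves into scope.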
Architecture sketch
The key contract is that Lance Spark keeps ownership of Spark planning and scan semantics, while Comet only consumes an explicit native-read descriptor and executes that descriptor with Rust Lance when the optional contrib is present and enabled.
Spark SQL / DataFrame read.format("lance")
                |
                v
+--------------------------------+
| Spark planner + lance-spark    |
|                                |
| org.lance.spark.read.LanceScan |
|   - resolves dataset version   |
|   - applies projection/filter  |
|   - computes partition splits  |
|   - owns Lance Spark semantics |
+---------------+----------------+
                |
                | BatchScanExec(scan = LanceScan)
                | LanceScan.nativeScanPlan()
                v
+--------------------------------+  absent / disabled / unsupported
| Comet core                     |--------------------------------------+
| CometScanRule                  |                                      |
| LanceIntegration               |                                      |
|   - reflection only            |                                      |
|   - no default Lance dep       |                                      |
|   - checks config/build gate   |                                      |
+---------------+----------------+                                      |
                |                                                       |
                | contrib-lance present and enabled                     |
                v                                                       |
+--------------------------------+                                      |
| Comet contrib-lance (Scala)    |                                      |
| CometLanceSupport              |                                      |
| CometLanceNativeScan           |                                      |
| CometLanceNativeScanExec       |                                      |
|   - validates Comet schema     |                                      |
|   - serializes descriptor      |                                      |
|   - injects split payloads     |                                      |
+---------------+----------------+                                      |
                |                                                       |
                | native proto: lance_scan = 118                        |
                |   LanceScanCommon                                     |
                |   LanceSplit(fragment_ids)                            |
                v                                                       |
+--------------------------------+                                      |
| Comet native Rust              |                                      |
| LanceScanExec                  |                                      |
|   - opens DatasetBuilder       |                                      |
|   - pins resolved version      |                                      |
|   - applies storage options    |                                      |
|   - Scanner::with_fragments    |                                      |
|   - projection/filter/limit    |                                      |
+---------------+----------------+                                      |
                |                                                       |
                v                                                       |
+--------------------------------+                                      |
| Rust Lance crate + storage     |                                      |
| local/object store dataset     |                                      |
| Arrow RecordBatch stream       |                                      |
+---------------+----------------+                                      |
                |                                                       |
                v                                                       |
+--------------------------------+                                      |
| Comet columnar execution       |                                      |
| returns Spark-compatible rows  |                                      |
+--------------------------------+                                      |
                                                                        |
                +-------------------------------------------------------+
                |
                | fallback path
                v
Spark executes the original Lance BatchScanExec through lance-spark.
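The branch in the diagram reduces to three ordered checks: is the contrib built in, is the config flag on, and did the descriptor come back without fallback reasons. The Rust sketch below models only that decision. `ScanPath`, `GateInputs`, and `choose_scan_path` are hypothetical names; in the proposed design this check would live on the JVM side in Comet core, so the code illustrates the logic and is not a proposed implementation.

```rust
/// Which engine ends up executing the scan (illustrative names).
#[derive(Debug, PartialEq, Eq)]
enum ScanPath {
    /// Comet executes the descriptor through native Rust Lance.
    NativeLance,
    /// Spark executes the original BatchScanExec through lance-spark.
    /// The string records why the native path was not taken.
    SparkFallback(String),
}

/// Inputs to the gate, one per check shown in the diagram.
struct GateInputs<'a> {
    /// Whether the optional contrib-lance module is present in this build.
    contrib_present: bool,
    /// Value of spark.comet.scan.lanceNative.enabled; None when unset.
    native_enabled: Option<bool>,
    /// Fallback reasons reported by the descriptor; empty when supported.
    descriptor_fallback_reasons: &'a [String],
}

/// Applies the checks in order and falls back on the first failure.
fn choose_scan_path(inputs: &GateInputs) -> ScanPath {
    if !inputs.contrib_present {
        return ScanPath::SparkFallback(
            "contrib-lance is absent from this build".to_string(),
        );
    }
    // The flag defaults to false, so an unset value also falls back.
    if !inputs.native_enabled.unwrap_or(false) {
        return ScanPath::SparkFallback(
            "spark.comet.scan.lanceNative.enabled is false".to_string(),
        );
    }
    if !inputs.descriptor_fallback_reasons.is_empty() {
        return ScanPath::SparkFallback(inputs.descriptor_fallback_reasons.join("; "));
    }
    ScanPath::NativeLance
}
```

Every failing check returns the same kind of result, so the original Lance Spark plan stays the default outcome and the native path is taken only when all three checks pass.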
Descriptor boundary:
lance-spark PR
  LanceScan.nativeScanPlan(): Optional[LanceNativeScanPlan]
    - dataset URI
    - resolved version
    - Spark/projected schema JSON
    - projection and pushed filter SQL
    - limit/offset and batch size
    - storage options
    - per-partition fragment IDs
    - fallback reasons
Comet PR
  Reflection extracts LanceNativeScanPlan
  Scala serde converts it to lance_scan proto
  Rust LanceScanExec executes exactly those fragments/version/options
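To make the boundary concrete, the sketch below models the descriptor as a plain Rust struct and splits it into the shared part and the per-partition part named in the diagram. All field names and types are assumptions for illustration (including `u64` for fragment IDs and versions); the actual shapes would be defined by the Lance Spark descriptor and the `lance_scan` proto, neither of which is specified here. The sketch also does not call the Lance crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical mirror of the descriptor fields listed above.
#[derive(Debug, Clone, Default)]
struct LanceNativeScanPlan {
    dataset_uri: String,
    resolved_version: u64,
    schema_json: String,
    projection: Vec<String>,
    filter_sql: Option<String>,
    limit: Option<u64>,
    offset: Option<u64>,
    batch_size: u64,
    storage_options: BTreeMap<String, String>,
    /// One inner list of fragment IDs per Spark partition.
    partition_fragment_ids: Vec<Vec<u64>>,
    fallback_reasons: Vec<String>,
}

/// Scan invariants shared by every partition (assumed field set).
#[derive(Debug, Clone, PartialEq)]
struct LanceScanCommon {
    dataset_uri: String,
    resolved_version: u64,
    schema_json: String,
    projection: Vec<String>,
    filter_sql: Option<String>,
    limit: Option<u64>,
    offset: Option<u64>,
    batch_size: u64,
    storage_options: BTreeMap<String, String>,
}

/// The fragments assigned to a single partition.
#[derive(Debug, Clone, PartialEq)]
struct LanceSplit {
    fragment_ids: Vec<u64>,
}

/// Splits a descriptor into the common payload plus one split per partition.
/// A descriptor that carries fallback reasons is rejected instead of converted.
fn to_scan_payload(
    plan: &LanceNativeScanPlan,
) -> Result<(LanceScanCommon, Vec<LanceSplit>), String> {
    if !plan.fallback_reasons.is_empty() {
        return Err(plan.fallback_reasons.join("; "));
    }
    let common = LanceScanCommon {
        dataset_uri: plan.dataset_uri.clone(),
        resolved_version: plan.resolved_version,
        schema_json: plan.schema_json.clone(),
        projection: plan.projection.clone(),
        filter_sql: plan.filter_sql.clone(),
        limit: plan.limit,
        offset: plan.offset,
        batch_size: plan.batch_size,
        storage_options: plan.storage_options.clone(),
    };
    let splits = plan
        .partition_fragment_ids
        .iter()
        .map(|ids| LanceSplit { fragment_ids: ids.clone() })
        .collect();
    Ok((common, splits))
}
```

Keeping the resolved version and storage options in the common part means every partition reads the same pinned snapshot, while each split carries only the fragment IDs Lance Spark assigned to it.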
Additional context
The phased roadmap is:
- Lance Spark native descriptor for ordinary reads.
- Comet contrib scaffold and reflection bridge.
- Minimal native Rust Lance scan v1.
- V1 hardening and CI parity tests.
- Advanced table-read parity and metrics.
- Lance index/search read support.
- Remote namespace and credential refresh support.
Known prototype blocker to resolve before this can be considered merge-ready: packaged Comet currently contains org.apache.arrow.c classes rewritten against Comet's shaded Arrow allocator, while Lance Spark expects the normal Arrow C Data ABI. An end-to-end packaged smoke test with both jars exposes this classpath conflict. The final design needs an explicit Arrow C Data packaging/classloader strategy and CI coverage for Comet + Lance Spark together.