Background
Comet has a CodegenDispatchFallback mechanism for expression serdes that have a native implementation which is Incompatible with Spark for some (or all) inputs. When such an expression reports Incompatible and the user has not enabled that expression's allowIncompatible, it routes through the JVM codegen dispatcher (Spark's own doGenCode compiled into an Arrow-direct batch kernel) instead of falling the operator back to Spark. The dispatcher is on by default (spark.comet.exec.scalaUDF.codegen.enabled=true).
A consequence of this design: for these expressions the incompatible native code is only reachable when allowIncompatible=true. By default everyone already runs through the (correct) codegen dispatcher. So for the "always incompatible" expressions, the native impl exists purely to give opt-in users a faster-but-wrong result instead of a correct dispatcher result that is already near-native speed.
This issue is to discuss whether we should remove some of these incompatible native implementations entirely and rely on codegen dispatch.
Survey
There are currently 14 serdes mixing in CodegenDispatchFallback. They split into two groups.
Group A: always Incompatible (native path is opt-in / wrong-by-default)
getSupportLevel returns Incompatible for every non-error input. Default users always route through codegen dispatch; the native impl only runs under allowIncompatible.
| Expression |
File |
Reason for incompatibility |
CometFromUTCTimestamp |
serde/datetime.scala |
DataFusion timezone-string parsing diverges from Spark (e.g. GMT+1, PST throw a native parse error) — #2013 |
CometToUTCTimestamp |
serde/datetime.scala |
same timezone-parse divergence — #2013 |
CometConvertTimezone |
serde/datetime.scala |
same timezone-parse divergence — #2013 |
CometFromUnixTime |
serde/unixtime.scala |
only supports the default pattern; DataFusion's valid timestamp range differs from Spark — apache/datafusion#16594 |
CometArrayExcept |
serde/arrays.scala |
null handling and ordering may differ from Spark |
CometArrayJoin |
serde/arrays.scala |
null handling may differ from Spark |
CometArrayIntersect |
serde/arrays.scala |
result element order may differ when the right array is longer (DataFusion probes the longer side) |
Group B: conditionally Incompatible (native path is Compatible for the common case)
The native impl is correct and fast for the majority case; only a narrow edge case reports Incompatible and routes to dispatch. The native code is not separable from a "compatible impl" — it is the same code serving both.
| Expression |
File |
Compatible case / incompatible edge |
CometHour |
serde/datetime.scala |
compatible for all types except TimestampNTZ — #3180 |
CometMinute |
serde/datetime.scala |
compatible for all types except TimestampNTZ — #3180 |
CometSecond |
serde/datetime.scala |
compatible for all types except TimestampNTZ — #3180 |
CometTruncDate |
serde/datetime.scala |
compatible for literal supported formats; incompatible only for non-literal format strings |
CometTruncTimestamp |
serde/datetime.scala |
compatible for UTC + literal format; incompatible for non-UTC timezones (#2649) or non-literal format strings |
CometReverse |
serde/collectionOperations.scala |
compatible except for non-default string collation — #2190 |
CometMapFromEntries |
serde/maps.scala |
compatible except for BinaryType keys/values |
Discussion
For Group B, there is nothing meaningful to remove: the native code is the compatible fast path for the dominant case, and the Incompatible branch only routes a rare edge to the dispatcher, which is the desired behavior.
For Group A, removing the incompatible native impls (converting them to plain codegen dispatch) would:
- change nothing for default users (already dispatching),
- remove a correctness footgun, since
allowIncompatible would no longer silently produce wrong timezone/array results,
- allow deleting the dedicated native code where it is not shared with other expressions.
Costs / things to weigh:
- With the dispatcher disabled (
spark.comet.exec.scalaUDF.codegen.enabled=false), these would fall all the way back to Spark instead of running the opt-in native path.
- Some Group A natives are shared DataFusion/Comet scalar functions (
array_except, array_intersect, array_to_string, from_utc_timestamp, to_utc_timestamp) used elsewhere, so removal saves serde code but not necessarily much Rust.
Questions for the community
- Should we remove the Group A incompatible native impls and rely on codegen dispatch?
- Is there a real-world use case for keeping the opt-in incompatible native path (faster-but-wrong via
allowIncompatible) for any of these?
- Should we keep Group B exactly as-is?
Background
Comet has a
CodegenDispatchFallbackmechanism for expression serdes that have a native implementation which isIncompatiblewith Spark for some (or all) inputs. When such an expression reportsIncompatibleand the user has not enabled that expression'sallowIncompatible, it routes through the JVM codegen dispatcher (Spark's owndoGenCodecompiled into an Arrow-direct batch kernel) instead of falling the operator back to Spark. The dispatcher is on by default (spark.comet.exec.scalaUDF.codegen.enabled=true).A consequence of this design: for these expressions the incompatible native code is only reachable when
allowIncompatible=true. By default everyone already runs through the (correct) codegen dispatcher. So for the "always incompatible" expressions, the native impl exists purely to give opt-in users a faster-but-wrong result instead of a correct dispatcher result that is already near-native speed.This issue is to discuss whether we should remove some of these incompatible native implementations entirely and rely on codegen dispatch.
Survey
There are currently 14 serdes mixing in
CodegenDispatchFallback. They split into two groups.Group A: always
Incompatible(native path is opt-in / wrong-by-default)getSupportLevelreturnsIncompatiblefor every non-error input. Default users always route through codegen dispatch; the native impl only runs underallowIncompatible.CometFromUTCTimestampserde/datetime.scalaGMT+1,PSTthrow a native parse error) — #2013CometToUTCTimestampserde/datetime.scalaCometConvertTimezoneserde/datetime.scalaCometFromUnixTimeserde/unixtime.scalaCometArrayExceptserde/arrays.scalaCometArrayJoinserde/arrays.scalaCometArrayIntersectserde/arrays.scalaGroup B: conditionally
Incompatible(native path isCompatiblefor the common case)The native impl is correct and fast for the majority case; only a narrow edge case reports
Incompatibleand routes to dispatch. The native code is not separable from a "compatible impl" — it is the same code serving both.CometHourserde/datetime.scalaTimestampNTZ— #3180CometMinuteserde/datetime.scalaTimestampNTZ— #3180CometSecondserde/datetime.scalaTimestampNTZ— #3180CometTruncDateserde/datetime.scalaCometTruncTimestampserde/datetime.scalaCometReverseserde/collectionOperations.scalaCometMapFromEntriesserde/maps.scalaBinaryTypekeys/valuesDiscussion
For Group B, there is nothing meaningful to remove: the native code is the compatible fast path for the dominant case, and the
Incompatiblebranch only routes a rare edge to the dispatcher, which is the desired behavior.For Group A, removing the incompatible native impls (converting them to plain codegen dispatch) would:
allowIncompatiblewould no longer silently produce wrong timezone/array results,Costs / things to weigh:
spark.comet.exec.scalaUDF.codegen.enabled=false), these would fall all the way back to Spark instead of running the opt-in native path.array_except,array_intersect,array_to_string,from_utc_timestamp,to_utc_timestamp) used elsewhere, so removal saves serde code but not necessarily much Rust.Questions for the community
allowIncompatible) for any of these?