Skip to content

Discussion: remove incompatible native implementations that have a codegen dispatch fallback? #4654

@andygrove

Description

@andygrove

Background

Comet has a CodegenDispatchFallback mechanism for expression serdes that have a native implementation which is Incompatible with Spark for some (or all) inputs. When such an expression reports Incompatible and the user has not enabled that expression's allowIncompatible, it routes through the JVM codegen dispatcher (Spark's own doGenCode compiled into an Arrow-direct batch kernel) instead of falling the operator back to Spark. The dispatcher is on by default (spark.comet.exec.scalaUDF.codegen.enabled=true).

A consequence of this design: for these expressions the incompatible native code is only reachable when allowIncompatible=true. By default everyone already runs through the (correct) codegen dispatcher. So for the "always incompatible" expressions, the native impl exists purely to give opt-in users a faster-but-wrong result instead of a correct dispatcher result that is already near-native speed.

This issue is to discuss whether we should remove some of these incompatible native implementations entirely and rely on codegen dispatch.

Survey

There are currently 14 serdes mixing in CodegenDispatchFallback. They split into two groups.

Group A: always Incompatible (native path is opt-in / wrong-by-default)

getSupportLevel returns Incompatible for every non-error input. Default users always route through codegen dispatch; the native impl only runs under allowIncompatible.

Expression File Reason for incompatibility
CometFromUTCTimestamp serde/datetime.scala DataFusion timezone-string parsing diverges from Spark (e.g. GMT+1, PST throw a native parse error) — #2013
CometToUTCTimestamp serde/datetime.scala same timezone-parse divergence — #2013
CometConvertTimezone serde/datetime.scala same timezone-parse divergence — #2013
CometFromUnixTime serde/unixtime.scala only supports the default pattern; DataFusion's valid timestamp range differs from Spark — apache/datafusion#16594
CometArrayExcept serde/arrays.scala null handling and ordering may differ from Spark
CometArrayJoin serde/arrays.scala null handling may differ from Spark
CometArrayIntersect serde/arrays.scala result element order may differ when the right array is longer (DataFusion probes the longer side)

Group B: conditionally Incompatible (native path is Compatible for the common case)

The native impl is correct and fast for the majority case; only a narrow edge case reports Incompatible and routes to dispatch. The native code is not separable from a "compatible impl" — it is the same code serving both.

Expression File Compatible case / incompatible edge
CometHour serde/datetime.scala compatible for all types except TimestampNTZ#3180
CometMinute serde/datetime.scala compatible for all types except TimestampNTZ#3180
CometSecond serde/datetime.scala compatible for all types except TimestampNTZ#3180
CometTruncDate serde/datetime.scala compatible for literal supported formats; incompatible only for non-literal format strings
CometTruncTimestamp serde/datetime.scala compatible for UTC + literal format; incompatible for non-UTC timezones (#2649) or non-literal format strings
CometReverse serde/collectionOperations.scala compatible except for non-default string collation — #2190
CometMapFromEntries serde/maps.scala compatible except for BinaryType keys/values

Discussion

For Group B, there is nothing meaningful to remove: the native code is the compatible fast path for the dominant case, and the Incompatible branch only routes a rare edge to the dispatcher, which is the desired behavior.

For Group A, removing the incompatible native impls (converting them to plain codegen dispatch) would:

  • change nothing for default users (already dispatching),
  • remove a correctness footgun, since allowIncompatible would no longer silently produce wrong timezone/array results,
  • allow deleting the dedicated native code where it is not shared with other expressions.

Costs / things to weigh:

  • With the dispatcher disabled (spark.comet.exec.scalaUDF.codegen.enabled=false), these would fall all the way back to Spark instead of running the opt-in native path.
  • Some Group A natives are shared DataFusion/Comet scalar functions (array_except, array_intersect, array_to_string, from_utc_timestamp, to_utc_timestamp) used elsewhere, so removal saves serde code but not necessarily much Rust.

Questions for the community

  1. Should we remove the Group A incompatible native impls and rely on codegen dispatch?
  2. Is there a real-world use case for keeping the opt-in incompatible native path (faster-but-wrong via allowIncompatible) for any of these?
  3. Should we keep Group B exactly as-is?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions