Discussion: remove incompatible native implementations that have a codegen dispatch fallback?

## Background

Comet has a `CodegenDispatchFallback` mechanism for expression serdes that have a native implementation which is `Incompatible` with Spark for some (or all) inputs. When such an expression reports `Incompatible` and the user has not enabled that expression's `allowIncompatible`, it routes through the JVM codegen dispatcher (Spark's own `doGenCode` compiled into an Arrow-direct batch kernel) instead of falling the operator back to Spark. The dispatcher is on by default (`spark.comet.exec.scalaUDF.codegen.enabled=true`).

A consequence of this design: for these expressions the incompatible native code is **only reachable when `allowIncompatible=true`**. By default everyone already runs through the (correct) codegen dispatcher. So for the "always incompatible" expressions, the native impl exists purely to give opt-in users a faster-but-wrong result instead of a correct dispatcher result that is already near-native speed.

This issue is to discuss whether we should remove some of these incompatible native implementations entirely and rely on codegen dispatch.

## Survey

There are currently 14 serdes mixing in `CodegenDispatchFallback`. They split into two groups.

### Group A: always `Incompatible` (native path is opt-in / wrong-by-default)

`getSupportLevel` returns `Incompatible` for every non-error input. Default users always route through codegen dispatch; the native impl only runs under `allowIncompatible`.

| Expression | File | Reason for incompatibility |
|---|---|---|
| `CometFromUTCTimestamp` | `serde/datetime.scala` | DataFusion timezone-string parsing diverges from Spark (e.g. `GMT+1`, `PST` throw a native parse error) — #2013 |
| `CometToUTCTimestamp` | `serde/datetime.scala` | same timezone-parse divergence — #2013 |
| `CometConvertTimezone` | `serde/datetime.scala` | same timezone-parse divergence — #2013 |
| `CometFromUnixTime` | `serde/unixtime.scala` | only supports the default pattern; DataFusion's valid timestamp range differs from Spark — apache/datafusion#16594 |
| `CometArrayExcept` | `serde/arrays.scala` | null handling and ordering may differ from Spark |
| `CometArrayJoin` | `serde/arrays.scala` | null handling may differ from Spark |
| `CometArrayIntersect` | `serde/arrays.scala` | result element order may differ when the right array is longer (DataFusion probes the longer side) |

### Group B: conditionally `Incompatible` (native path is `Compatible` for the common case)

The native impl is correct and fast for the majority case; only a narrow edge case reports `Incompatible` and routes to dispatch. The native code is not separable from a "compatible impl" — it is the same code serving both.

| Expression | File | Compatible case / incompatible edge |
|---|---|---|
| `CometHour` | `serde/datetime.scala` | compatible for all types except `TimestampNTZ` — #3180 |
| `CometMinute` | `serde/datetime.scala` | compatible for all types except `TimestampNTZ` — #3180 |
| `CometSecond` | `serde/datetime.scala` | compatible for all types except `TimestampNTZ` — #3180 |
| `CometTruncDate` | `serde/datetime.scala` | compatible for literal supported formats; incompatible only for non-literal format strings |
| `CometTruncTimestamp` | `serde/datetime.scala` | compatible for UTC + literal format; incompatible for non-UTC timezones (#2649) or non-literal format strings |
| `CometReverse` | `serde/collectionOperations.scala` | compatible except for non-default string collation — #2190 |
| `CometMapFromEntries` | `serde/maps.scala` | compatible except for `BinaryType` keys/values |

## Discussion

For **Group B**, there is nothing meaningful to remove: the native code is the compatible fast path for the dominant case, and the `Incompatible` branch only routes a rare edge to the dispatcher, which is the desired behavior.

For **Group A**, removing the incompatible native impls (converting them to plain codegen dispatch) would:

- change nothing for default users (already dispatching),
- remove a correctness footgun, since `allowIncompatible` would no longer silently produce wrong timezone/array results,
- allow deleting the dedicated native code where it is not shared with other expressions.

Costs / things to weigh:

- With the dispatcher disabled (`spark.comet.exec.scalaUDF.codegen.enabled=false`), these would fall all the way back to Spark instead of running the opt-in native path.
- Some Group A natives are shared DataFusion/Comet scalar functions (`array_except`, `array_intersect`, `array_to_string`, `from_utc_timestamp`, `to_utc_timestamp`) used elsewhere, so removal saves serde code but not necessarily much Rust.

### Questions for the community

1. Should we remove the Group A incompatible native impls and rely on codegen dispatch?
2. Is there a real-world use case for keeping the opt-in incompatible native path (faster-but-wrong via `allowIncompatible`) for any of these?
3. Should we keep Group B exactly as-is?


Expression	File	Reason for incompatibility
`CometFromUTCTimestamp`	`serde/datetime.scala`	DataFusion timezone-string parsing diverges from Spark (e.g. `GMT+1`, `PST` throw a native parse error) — #2013
`CometToUTCTimestamp`	`serde/datetime.scala`	same timezone-parse divergence — #2013
`CometConvertTimezone`	`serde/datetime.scala`	same timezone-parse divergence — #2013
`CometFromUnixTime`	`serde/unixtime.scala`	only supports the default pattern; DataFusion's valid timestamp range differs from Spark — apache/datafusion#16594
`CometArrayExcept`	`serde/arrays.scala`	null handling and ordering may differ from Spark
`CometArrayJoin`	`serde/arrays.scala`	null handling may differ from Spark
`CometArrayIntersect`	`serde/arrays.scala`	result element order may differ when the right array is longer (DataFusion probes the longer side)

Expression	File	Compatible case / incompatible edge
`CometHour`	`serde/datetime.scala`	compatible for all types except `TimestampNTZ` — #3180
`CometMinute`	`serde/datetime.scala`	compatible for all types except `TimestampNTZ` — #3180
`CometSecond`	`serde/datetime.scala`	compatible for all types except `TimestampNTZ` — #3180
`CometTruncDate`	`serde/datetime.scala`	compatible for literal supported formats; incompatible only for non-literal format strings
`CometTruncTimestamp`	`serde/datetime.scala`	compatible for UTC + literal format; incompatible for non-UTC timezones (#2649) or non-literal format strings
`CometReverse`	`serde/collectionOperations.scala`	compatible except for non-default string collation — #2190
`CometMapFromEntries`	`serde/maps.scala`	compatible except for `BinaryType` keys/values

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: remove incompatible native implementations that have a codegen dispatch fallback? #4654

Background

Survey

Group A: always `Incompatible` (native path is opt-in / wrong-by-default)

Group B: conditionally `Incompatible` (native path is `Compatible` for the common case)

Discussion

Questions for the community

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Discussion: remove incompatible native implementations that have a codegen dispatch fallback? #4654

Description

Background

Survey

Group A: always Incompatible (native path is opt-in / wrong-by-default)

Group B: conditionally Incompatible (native path is Compatible for the common case)

Discussion

Questions for the community

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Group A: always `Incompatible` (native path is opt-in / wrong-by-default)

Group B: conditionally `Incompatible` (native path is `Compatible` for the common case)