feat: enable JVM Scala UDF codegen dispatch by default#4514
Conversation
Flip `spark.comet.exec.scalaUDF.codegen.enabled` to default `true` so that eligible Spark `ScalaUDF` expressions are routed through Comet's Arrow-direct codegen dispatcher without requiring opt-in. The feature is no longer marked experimental. Update the Scala/Java UDF and Iceberg user guides to reflect that the dispatcher is on by default and document how to disable it.
87fbb79 to
1f5de3e
Compare
|
|
Disable UDF codegen dispatch within the ArrayInsertUnsupportedArgs test so the UDF-derived position stays non-convertible and continues to exercise the ArrayInsert fallback path. With codegen now enabled by default, the UDF was converted to proto, making ArrayInsert fully native and removing the expected fallback reason. Drop the per-compile log in CometBatchKernelCodegen from info to debug now that codegen dispatch is on by default and the message would otherwise be noisy.
Good point. I dropped it down to |
There was a problem hiding this comment.
Really excited to see this code path exercised more! Thanks @andygrove!
Thanks @mbutrovich. I went with a different approach of disabling the codegen dispatch for these tests rather than changing the test expectations. Let me know what you think when you have time. |
Hm, assuming the tests checked the query results too I'd prefer to keep the feature enabled, since then we'd just be changing the plan structure checks. Not having dug into the tests, if it's just a plan structure check it's fine to disable the tests. |
Thanks for the quick response. Ok, will revert to your approach. |
…patch Enabling the JVM Scala UDF codegen dispatcher by default routes supported ScalaUDF expressions into native execution, changing observable behavior in two upstream Spark tests. Rather than disabling the dispatcher for these tests, adjust the expectations so the feature stays exercised: - SQLQuerySuite "Common subexpression elimination": CometProject and CometHashAggregate do not implement cross-sibling subexpression elimination over ScalaUDF, so the aggregate case invokes the UDF once per reference (count 3 vs 1). The query result is unchanged. Tracking: apache#4516. - SQLQueryTestSuite "udf/postgreSQL/udf-select_having.sql": the divide-by-zero in 1/udf(a) evaluates natively and surfaces as CometNativeException instead of SparkArithmeticException; normalize both to a DIVIDE_BY_ZERO placeholder for the literal compare. Tracking: apache#4517.
962e3d7 to
72286f5
Compare
|
Good call. I dug into the two failing tests and you are right that both check query results, not just plan structure:
So I switched away from disabling the dispatcher and went with adjusting the expectations instead, keeping the feature exercised:
I filed follow-ons for the underlying gaps and linked them in the diff comments: #4516 (cross-sibling CSE over ScalaUDF) and #4517 (native divide-by-zero surfaces the wrong exception type). Regenerated all four diffs (3.4.3 / 3.5.8 / 4.0.2 / 4.1.1) via the clone-apply-modify-regenerate workflow and merged latest main. |
After merging main, apache#4499 routes InitCap through the JVM codegen dispatcher when allowIncompatible is off. With this PR enabling the dispatcher by default, `select initcap(_1)` now runs natively without allowIncompatible, so the "enable incompat expression using dynamic config" test no longer sees the Spark fallback it asserts. Disable the dispatcher within the test so it exercises the serde-level incompatible fallback path it is meant to cover.
Which issue does this PR close?
Closes #4513.
Rationale for this change
Comet's Arrow-direct codegen dispatcher routes eligible Spark
ScalaUDFexpressions through native execution, keeping the project, exchange, and sort operators around a UDF on the Comet path instead of forcing a fall back to Spark and a columnar-to-row roundtrip. Until now this was gated behindspark.comet.exec.scalaUDF.codegen.enabled=falseand documented as experimental.The feature has broad type coverage (scalars, complex types with arbitrary nesting, higher-order functions) and is backed by end-to-end correctness, fuzz, and Iceberg test coverage. Enabling it by default lets users benefit without opt-in.
What changes are included in this PR?
spark.comet.exec.scalaUDF.codegen.enabledtotrue.Users can set
spark.comet.exec.scalaUDF.codegen.enabled=falseto restore the previous fall-back-to-Spark behavior.How are these changes tested?
Covered by existing tests for the codegen dispatcher (
CometCodegenSuite,CometCodegenFuzzSuite,CometCodegenHOFSuite,CometTemporalExpressionSuite, the datetime SQL file tests, andCometIcebergRewriteActionSuite), which exercise both the enabled and disabled paths.