feat: route date_format through codegen dispatcher for non-native cases by andygrove · Pull Request #4373 · apache/datafusion-comet

andygrove · 2026-05-20T17:37:50Z

Which issue does this PR close?

Part of #3202.

Rationale for this change

Comet's date_format serde only emits the native to_char path for UTC sessions with a format literal in a small strftime-mappable whitelist. Every other case (non-UTC session, non-literal format, format outside the whitelist) falls back to Spark, which breaks up the Comet pipeline. Most production workloads run with a non-UTC session timezone, so the fallback is the common case in practice.

The Arrow-direct codegen dispatcher already in tree (CometScalaUDFCodegen, behind spark.comet.exec.scalaUDF.codegen.enabled) closure-serializes a bound Catalyst expression, ships it through a JvmScalarUdf proto, and Janino-compiles Spark's own doGenCode into a per-batch kernel that reads Arrow vectors and writes an Arrow output vector directly. DateFormatClass already has a real doGenCode (not CodegenFallback), is NullIntolerant, and has timeZoneId stamped on it by ResolveTimeZone during analysis. Routing it through the dispatcher therefore reproduces Spark behavior for any timezone or format, without a bespoke UDF class.

What changes are included in this PR?

Two commits.

fix(codegen) — Pre-existing bug surfaced while wiring date_format through the dispatcher. CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every NullIntolerant input, but primitive Arrow vectors (TimeStampMicroTZVector, IntVector, Float8Vector, etc.) are wrapped in CometPlainVector at input-cast time and expose isNullAt, not isNull. Janino rejected the kernel with "method isNull not declared". emitTypedGetters already knew the right method name via nullCheckMethod. The fix exposes that helper so defaultBody picks the same name per column ordinal. New source test pins the chosen method for TimeStampMicroTZVector so the regression can't recur.

feat — Extract the closure-serialize + JvmScalarUdf emission from CometScalaUDF.convert into CometScalaUDF.emitJvmCodegenDispatch(expr, inputs, binding). CometDateFormat.convert keeps the native to_char path for UTC + whitelisted-format cases and now calls the helper for everything else. Gated by spark.comet.exec.scalaUDF.codegen.enabled (default false, marked experimental), so default behavior is unchanged. Non-UTC and non-whitelisted formats still fall back to Spark unless the user opts in.

The dispatcher's kernel cache is keyed on (serialized expression bytes, input column shapes), so distinct (format literal, timezone) combinations get distinct cached kernels just like a bespoke (format, tz) -> formatter cache would. No DateFormatUDF.scala is introduced.

How are these changes tested?

CometTemporalExpressionSuite date_format tests: 10/10 pass. The three "falls back to Spark" tests are paired with a "routes via codegen dispatcher" sibling that enables the flag and asserts in-Comet execution. The disabled-flag variants still assert checkSparkAnswerAndFallbackReason with the dispatcher flag as the reason. date_format - timestamp_ntz input runs checkSparkAnswerAndOperator for every timezone under the codegen flag.
CometSqlFileTestSuite date_format: 8/8 pass. date_format.sql now pins America/Los_Angeles and enables the codegen flag at file scope, with all queries plain query (no expect_fallback).
CometCodegenSourceSuite: 30/30 pass, including the new NullIntolerant short-circuit uses isNullAt for CometPlainVector-wrapped columns regression test.
CometCodegenSuite: 40+/40+ pass. No regressions in the dispatcher's existing surface.

…l short-circuit CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every NullIntolerant input, but primitive Arrow vectors (timestamp / int / float / date / bool / ...) are wrapped in CometPlainVector at input-cast time and expose isNullAt rather than the raw Arrow isNull. The short-circuit therefore failed to compile for any primitive-typed column with a Janino "method isNull not declared" error. Share the existing nullCheckMethod helper between emitTypedGetters and defaultBody so both sites pick the right method name per column. Add a source test that pins the chosen method for TimeStampMicroTZVector inputs.

CometDateFormat keeps the native to_char path for UTC sessions with a format literal in the strftime-mappable whitelist, and now routes every other case through the Arrow-direct codegen dispatcher (CometScalaUDFCodegen) so that non-UTC sessions, non-literal formats, and formats outside the whitelist stay inside the Comet pipeline running Spark's own DateFormatClass.doGenCode. Refactor: extract the closure-serialize + JvmScalarUdf-proto emission from CometScalaUDF.convert into a reusable CometScalaUDF.emitJvmCodegenDispatch helper. Any serde that wants to fall back to a Spark built-in expression through the dispatcher can call it. Gated by COMET_SCALA_UDF_CODEGEN_ENABLED so the default remains a clean Spark fallback for those cases until the dispatcher graduates from experimental. Reasoning notes: - DateFormatClass already has a proper doGenCode (not CodegenFallback), NullIntolerant, and ResolveTimeZone stamps the timeZoneId on it during analysis. Closure-serializing the bound tree therefore reproduces Spark-identical behavior for every timezone. - The kernel cache key already encodes the literal format and timezone via the serialized expression bytes, so (format, tz) combinations get distinct cached kernels just like a bespoke (format, tz) -> formatter cache would. Saves an entire DateFormatUDF.scala class. Tests: - date_format - timestamp_ntz input: now runs checkSparkAnswerAndOperator for every timezone under the codegen flag instead of falling back for non-UTC. - Split each previous "falls back to Spark" Scala test into two: one asserting the codegen-on path stays in Comet, one asserting the codegen-off path falls back with the dispatcher flag as the reason. - date_format.sql now pins a non-UTC session timezone and enables the codegen flag at file scope; all queries are plain query and assert in-Comet execution.

andygrove force-pushed the feat/date-format-jvm-udf branch from aca31e7 to a11231f Compare May 23, 2026 15:11

andygrove changed the title ~~feat: JVM UDF fallback for date_format~~ feat: route date_format through codegen dispatcher for non-native cases May 23, 2026

andygrove added 2 commits May 23, 2026 09:47

andygrove force-pushed the feat/date-format-jvm-udf branch from d46609b to d344ec0 Compare May 23, 2026 15:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: route date_format through codegen dispatcher for non-native cases#4373

feat: route date_format through codegen dispatcher for non-native cases#4373
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/date-format-jvm-udf

andygrove commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andygrove commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

andygrove commented May 20, 2026 •

edited

Loading