Skip to content

feat: route date_format through codegen dispatcher for non-native cases#4373

Draft
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/date-format-jvm-udf
Draft

feat: route date_format through codegen dispatcher for non-native cases#4373
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/date-format-jvm-udf

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented May 20, 2026

Which issue does this PR close?

Part of #3202.

Rationale for this change

Comet's date_format serde only emits the native to_char path for UTC sessions with a format literal in a small strftime-mappable whitelist. Every other case (non-UTC session, non-literal format, format outside the whitelist) falls back to Spark, which breaks up the Comet pipeline. Most production workloads run with a non-UTC session timezone, so the fallback is the common case in practice.

The Arrow-direct codegen dispatcher already in tree (CometScalaUDFCodegen, behind spark.comet.exec.scalaUDF.codegen.enabled) closure-serializes a bound Catalyst expression, ships it through a JvmScalarUdf proto, and Janino-compiles Spark's own doGenCode into a per-batch kernel that reads Arrow vectors and writes an Arrow output vector directly. DateFormatClass already has a real doGenCode (not CodegenFallback), is NullIntolerant, and has timeZoneId stamped on it by ResolveTimeZone during analysis. Routing it through the dispatcher therefore reproduces Spark behavior for any timezone or format, without a bespoke UDF class.

What changes are included in this PR?

Two commits.

fix(codegen) — Pre-existing bug surfaced while wiring date_format through the dispatcher. CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every NullIntolerant input, but primitive Arrow vectors (TimeStampMicroTZVector, IntVector, Float8Vector, etc.) are wrapped in CometPlainVector at input-cast time and expose isNullAt, not isNull. Janino rejected the kernel with "method isNull not declared". emitTypedGetters already knew the right method name via nullCheckMethod. The fix exposes that helper so defaultBody picks the same name per column ordinal. New source test pins the chosen method for TimeStampMicroTZVector so the regression can't recur.

feat — Extract the closure-serialize + JvmScalarUdf emission from CometScalaUDF.convert into CometScalaUDF.emitJvmCodegenDispatch(expr, inputs, binding). CometDateFormat.convert keeps the native to_char path for UTC + whitelisted-format cases and now calls the helper for everything else. Gated by spark.comet.exec.scalaUDF.codegen.enabled (default false, marked experimental), so default behavior is unchanged. Non-UTC and non-whitelisted formats still fall back to Spark unless the user opts in.

The dispatcher's kernel cache is keyed on (serialized expression bytes, input column shapes), so distinct (format literal, timezone) combinations get distinct cached kernels just like a bespoke (format, tz) -> formatter cache would. No DateFormatUDF.scala is introduced.

How are these changes tested?

  • CometTemporalExpressionSuite date_format tests: 10/10 pass. The three "falls back to Spark" tests are paired with a "routes via codegen dispatcher" sibling that enables the flag and asserts in-Comet execution. The disabled-flag variants still assert checkSparkAnswerAndFallbackReason with the dispatcher flag as the reason. date_format - timestamp_ntz input runs checkSparkAnswerAndOperator for every timezone under the codegen flag.
  • CometSqlFileTestSuite date_format: 8/8 pass. date_format.sql now pins America/Los_Angeles and enables the codegen flag at file scope, with all queries plain query (no expect_fallback).
  • CometCodegenSourceSuite: 30/30 pass, including the new NullIntolerant short-circuit uses isNullAt for CometPlainVector-wrapped columns regression test.
  • CometCodegenSuite: 40+/40+ pass. No regressions in the dispatcher's existing surface.

@andygrove andygrove force-pushed the feat/date-format-jvm-udf branch from aca31e7 to a11231f Compare May 23, 2026 15:11
@andygrove andygrove changed the title feat: JVM UDF fallback for date_format feat: route date_format through codegen dispatcher for non-native cases May 23, 2026
andygrove added 2 commits May 23, 2026 09:47
…l short-circuit

CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every
NullIntolerant input, but primitive Arrow vectors (timestamp / int / float /
date / bool / ...) are wrapped in CometPlainVector at input-cast time and
expose isNullAt rather than the raw Arrow isNull. The short-circuit therefore
failed to compile for any primitive-typed column with a Janino "method isNull
not declared" error.

Share the existing nullCheckMethod helper between emitTypedGetters and
defaultBody so both sites pick the right method name per column. Add a source
test that pins the chosen method for TimeStampMicroTZVector inputs.
CometDateFormat keeps the native to_char path for UTC sessions with a format
literal in the strftime-mappable whitelist, and now routes every other case
through the Arrow-direct codegen dispatcher (CometScalaUDFCodegen) so that
non-UTC sessions, non-literal formats, and formats outside the whitelist
stay inside the Comet pipeline running Spark's own DateFormatClass.doGenCode.

Refactor: extract the closure-serialize + JvmScalarUdf-proto emission from
CometScalaUDF.convert into a reusable CometScalaUDF.emitJvmCodegenDispatch
helper. Any serde that wants to fall back to a Spark built-in expression
through the dispatcher can call it. Gated by COMET_SCALA_UDF_CODEGEN_ENABLED
so the default remains a clean Spark fallback for those cases until the
dispatcher graduates from experimental.

Reasoning notes:
  - DateFormatClass already has a proper doGenCode (not CodegenFallback),
    NullIntolerant, and ResolveTimeZone stamps the timeZoneId on it during
    analysis. Closure-serializing the bound tree therefore reproduces
    Spark-identical behavior for every timezone.
  - The kernel cache key already encodes the literal format and timezone via
    the serialized expression bytes, so (format, tz) combinations get
    distinct cached kernels just like a bespoke (format, tz) -> formatter
    cache would. Saves an entire DateFormatUDF.scala class.

Tests:
  - date_format - timestamp_ntz input: now runs checkSparkAnswerAndOperator
    for every timezone under the codegen flag instead of falling back for
    non-UTC.
  - Split each previous "falls back to Spark" Scala test into two: one
    asserting the codegen-on path stays in Comet, one asserting the
    codegen-off path falls back with the dispatcher flag as the reason.
  - date_format.sql now pins a non-UTC session timezone and enables the
    codegen flag at file scope; all queries are plain query and assert
    in-Comet execution.
@andygrove andygrove force-pushed the feat/date-format-jvm-udf branch from d46609b to d344ec0 Compare May 23, 2026 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant