feat: route date_format through codegen dispatcher for non-native cases#4373
Draft
andygrove wants to merge 2 commits into
Draft
feat: route date_format through codegen dispatcher for non-native cases#4373andygrove wants to merge 2 commits into
andygrove wants to merge 2 commits into
Conversation
aca31e7 to
a11231f
Compare
…l short-circuit CometBatchKernelCodegen.defaultBody emitted this.col$ord.isNull(i) for every NullIntolerant input, but primitive Arrow vectors (timestamp / int / float / date / bool / ...) are wrapped in CometPlainVector at input-cast time and expose isNullAt rather than the raw Arrow isNull. The short-circuit therefore failed to compile for any primitive-typed column with a Janino "method isNull not declared" error. Share the existing nullCheckMethod helper between emitTypedGetters and defaultBody so both sites pick the right method name per column. Add a source test that pins the chosen method for TimeStampMicroTZVector inputs.
CometDateFormat keeps the native to_char path for UTC sessions with a format
literal in the strftime-mappable whitelist, and now routes every other case
through the Arrow-direct codegen dispatcher (CometScalaUDFCodegen) so that
non-UTC sessions, non-literal formats, and formats outside the whitelist
stay inside the Comet pipeline running Spark's own DateFormatClass.doGenCode.
Refactor: extract the closure-serialize + JvmScalarUdf-proto emission from
CometScalaUDF.convert into a reusable CometScalaUDF.emitJvmCodegenDispatch
helper. Any serde that wants to fall back to a Spark built-in expression
through the dispatcher can call it. Gated by COMET_SCALA_UDF_CODEGEN_ENABLED
so the default remains a clean Spark fallback for those cases until the
dispatcher graduates from experimental.
Reasoning notes:
- DateFormatClass already has a proper doGenCode (not CodegenFallback),
NullIntolerant, and ResolveTimeZone stamps the timeZoneId on it during
analysis. Closure-serializing the bound tree therefore reproduces
Spark-identical behavior for every timezone.
- The kernel cache key already encodes the literal format and timezone via
the serialized expression bytes, so (format, tz) combinations get
distinct cached kernels just like a bespoke (format, tz) -> formatter
cache would. Saves an entire DateFormatUDF.scala class.
Tests:
- date_format - timestamp_ntz input: now runs checkSparkAnswerAndOperator
for every timezone under the codegen flag instead of falling back for
non-UTC.
- Split each previous "falls back to Spark" Scala test into two: one
asserting the codegen-on path stays in Comet, one asserting the
codegen-off path falls back with the dispatcher flag as the reason.
- date_format.sql now pins a non-UTC session timezone and enables the
codegen flag at file scope; all queries are plain query and assert
in-Comet execution.
d46609b to
d344ec0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Part of #3202.
Rationale for this change
Comet's
date_formatserde only emits the nativeto_charpath for UTC sessions with a format literal in a small strftime-mappable whitelist. Every other case (non-UTC session, non-literal format, format outside the whitelist) falls back to Spark, which breaks up the Comet pipeline. Most production workloads run with a non-UTC session timezone, so the fallback is the common case in practice.The Arrow-direct codegen dispatcher already in tree (
CometScalaUDFCodegen, behindspark.comet.exec.scalaUDF.codegen.enabled) closure-serializes a bound Catalyst expression, ships it through aJvmScalarUdfproto, and Janino-compiles Spark's owndoGenCodeinto a per-batch kernel that reads Arrow vectors and writes an Arrow output vector directly.DateFormatClassalready has a realdoGenCode(notCodegenFallback), isNullIntolerant, and hastimeZoneIdstamped on it byResolveTimeZoneduring analysis. Routing it through the dispatcher therefore reproduces Spark behavior for any timezone or format, without a bespoke UDF class.What changes are included in this PR?
Two commits.
fix(codegen)— Pre-existing bug surfaced while wiringdate_formatthrough the dispatcher.CometBatchKernelCodegen.defaultBodyemittedthis.col$ord.isNull(i)for everyNullIntolerantinput, but primitive Arrow vectors (TimeStampMicroTZVector,IntVector,Float8Vector, etc.) are wrapped inCometPlainVectorat input-cast time and exposeisNullAt, notisNull. Janino rejected the kernel with "methodisNullnot declared".emitTypedGettersalready knew the right method name vianullCheckMethod. The fix exposes that helper sodefaultBodypicks the same name per column ordinal. New source test pins the chosen method forTimeStampMicroTZVectorso the regression can't recur.feat— Extract the closure-serialize +JvmScalarUdfemission fromCometScalaUDF.convertintoCometScalaUDF.emitJvmCodegenDispatch(expr, inputs, binding).CometDateFormat.convertkeeps the nativeto_charpath for UTC + whitelisted-format cases and now calls the helper for everything else. Gated byspark.comet.exec.scalaUDF.codegen.enabled(default false, marked experimental), so default behavior is unchanged. Non-UTC and non-whitelisted formats still fall back to Spark unless the user opts in.The dispatcher's kernel cache is keyed on
(serialized expression bytes, input column shapes), so distinct(format literal, timezone)combinations get distinct cached kernels just like a bespoke(format, tz) -> formattercache would. NoDateFormatUDF.scalais introduced.How are these changes tested?
CometTemporalExpressionSuitedate_formattests: 10/10 pass. The three "falls back to Spark" tests are paired with a "routes via codegen dispatcher" sibling that enables the flag and asserts in-Comet execution. The disabled-flag variants still assertcheckSparkAnswerAndFallbackReasonwith the dispatcher flag as the reason.date_format - timestamp_ntz inputrunscheckSparkAnswerAndOperatorfor every timezone under the codegen flag.CometSqlFileTestSuitedate_format: 8/8 pass.date_format.sqlnow pinsAmerica/Los_Angelesand enables the codegen flag at file scope, with all queries plainquery(noexpect_fallback).CometCodegenSourceSuite: 30/30 pass, including the newNullIntolerant short-circuit uses isNullAt for CometPlainVector-wrapped columnsregression test.CometCodegenSuite: 40+/40+ pass. No regressions in the dispatcher's existing surface.