[SPARK-57557][SQL] Support the TIME data type in quantile aggregates#56889
[SPARK-57557][SQL] Support the TIME data type in quantile aggregates#56889vboo123 wants to merge 2 commits into
Conversation
### What changes were proposed in this pull request? Add `TimeType` as a valid input to the quantile aggregate expressions: `percentile`, `median`, `percentile_cont`/`percentile_disc`, `percentile_approx`/`approx_percentile`, and `histogram_numeric`. `TIME`'s internal value is a `Long` (nanoseconds of day), so it flows through the existing value<->double conversions exactly like `TIMESTAMP` and `DAY-TIME INTERVAL`. The aggregates return `TimeType`, preserving the input precision. The Apache DataSketches aggregates (HLL/Theta/KLL) are intentionally left out of this change and can be added in a follow-up. ### Why are the changes needed? Sub-task of SPARK-57550 (extend TIME support). `TIME` (SPARK-51162, shipped 4.1.0) is an ordered type but was rejected by these aggregates. ### Does this PR introduce any user-facing change? Yes. `percentile`, `median`, `percentile_approx` and `histogram_numeric` now accept `TIME` columns and return `TIME`. ### How was this patch tested? Unit tests in ApproximatePercentileSuite, PercentileSuite and HistogramNumericSuite; end-to-end tests in ApproximatePercentileQuerySuite; golden-file coverage in percentiles.sql. Generated using Kiro (Claude Opus 4.8).
MaxGekk
left a comment
There was a problem hiding this comment.
0 blocking, 1 non-blocking, 0 nits.
Clean, precision-correct TIME extension mirroring the existing Long-internal type handling (DAY-TIME INTERVAL for the exact aggregates, TIMESTAMP for approx/histogram). Conversions are lossless over the full TIME domain, the return type preserves TIME(n), and there's no feature-gating bypass.
Suggestions (1)
percentiles.scala:78: reuse the namedNumericAndAnsiIntervalconstant instead of re-inlining its three types — see inline.
Verification
- Lossless Long<->Double: TIME is nanos-of-day in
[0, 8.64e13)<2^53, soLong -> Double -> Longround-trips exactly (input Long -> double for interpolation -> outputtoLong->dataType = TIME). Strictly safer than DAY-TIME INTERVAL, which can exceed2^53. - Ordering: the exact-percentile key sort uses
PhysicalDataType.ordering(TimeType), which resolves toPhysicalLongTypeviaTimeTypeOps. - Gating:
spark.sql.timeType.enabledis enforced centrally at TIME construction, so acceptingTimeTypeininputTypesdoesn't bypass it.
| case _ => DoubleType | ||
| } | ||
| Seq(NumericAndAnsiInterval, percentageExpType, IntegralType) | ||
| Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType), |
There was a problem hiding this comment.
Non-blocking: this re-inlines the three types that NumericAndAnsiInterval already groups (NumericType, DayTimeIntervalType, YearMonthIntervalType). Since TypeCollection nests/flattens, you can keep the named abstraction and just add the new type:
| Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType), | |
| Seq(TypeCollection(NumericAndAnsiInterval, AnyTimeType), |
That keeps this call site in sync if NumericAndAnsiInterval ever changes. (You'll also keep the existing import ...TypeCollection.NumericAndAnsiInterval rather than removing it.)
| @@ -102,10 +102,10 @@ case class ApproximatePercentile( | |||
| private lazy val accuracy: Long = accuracyNum.longValue | |||
|
|
|||
There was a problem hiding this comment.
The function docs are not updated. The @ExpressionDescription usage text for percentile/median/percentile_cont/percentile_disc still says "numeric or ANSI interval column". These feed the SQL function reference. Adding TIME there accordingly would be more accurate.
| assert(Median(BoundReference(0, TimeType(3), nullable = false)).dataType === TimeType(3)) | ||
|
|
||
| // The exact median of two TIME values interpolates (continuous distribution), returning the | ||
| // midpoint as Long nanos-of-day: median(00:00:00, 00:00:00.00000001) -> 00:00:00.000000005. |
There was a problem hiding this comment.
Nit: misleading test comment. In PercentileSuite, the comment median(00:00:00, 00:00:00.00000001) -> 00:00:00.000000005 uses 9-digit (nanosecond) display, but TimeType max precision is microseconds (6), so that literal isn't representable. The actual assertion (=== 5L on the internal Long) is correct; just the comment's notation is off.
- percentiles.scala: nest the named NumericAndAnsiInterval constant in inputTypes rather than re-inlining its member types (MaxGekk) - Update @ExpressionDescription usage docs for percentile, median, percentile_cont, percentile_disc and percentile_approx to mention TIME (uros-b) - Fix misleading nanosecond-literal comment in PercentileSuite; TimeType max precision is microseconds, so describe the internal Long values (uros-b) - Sync the PercentileSuite accepted-types assertion to the nested TypeCollection message Generated using Kiro (Claude Opus 4.8).
What changes were proposed in this pull request?
This PR adds support for the TIME data type (
TimeType) as an input to the core quantile aggregate expressions. SPARK-57557 is a sub-task of the umbrella SPARK-57550 (extend TIME type support across Spark).The following expressions now accept TIME columns and return TIME:
percentile,median,percentile_cont,percentile_disc(exact quantiles,percentiles.scala)percentile_approx/approx_percentile(ApproximatePercentile)histogram_numeric(HistogramNumeric)TIME's internal representation is a
Long(nanoseconds of day), so it flows through the existing value↔double conversion paths exactly like the other ordered,Long-backed types. The change is mechanical:TimeType(viaAnyTimeType) added to each expression'sinputTypes/TypeCollectionTimeTypebranch added to the update (Long→Double) and eval (Double→Long) conversionschild.dataType, preserving the input precision (e.g.medianover aTIME(3)column returnsTIME(3))Design notes (called out so they can be challenged in review):
percentile/median(continuous distribution), interpolated results mirror existing numeric/interval behavior — e.g. the median of two TIME values returns their midpoint, which may be a value not present in the data; sub-nanosecond fractions are truncated.percentile/medianandhistogram_numeric, TIMESTAMP is not currently supported either, so the implementation mirrors DAY-TIME INTERVAL (alsoLong-internal). Forpercentile_approxandhistogram_numeric, TIMESTAMP was already supported and TIME mirrors it directly.Why are the changes needed?
The TIME data type was added by SPIP SPARK-51162 and shipped in Spark 4.1.0, but the quantile aggregate functions did not accept it. Calling
percentile,median,percentile_approx, orhistogram_numericon a TIME column failed with a type-check error, even though the operation is well-defined: TIME is internally aLong(nanoseconds of day) and quantile computation over an ordered numeric value applies directly. This PR closes that gap.Does this PR introduce any user-facing change?
Yes.
percentile,median,percentile_cont,percentile_disc,percentile_approx/approx_percentile, andhistogram_numericnow accept TIME columns and return TIME values. Previously these expressions rejected TIME input with a type-check error.How was this patch tested?
All scoped suites were run locally and passed:
catalyst/testOnly org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentileSuite org.apache.spark.sql.catalyst.expressions.aggregate.PercentileSuite org.apache.spark.sql.catalyst.expressions.aggregate.HistogramNumericSuite— all tests passed (each suite has a newtest("SPARK-57557: ...")covering TIME input, return type, precision preservation, and median interpolation)sql/testOnly org.apache.spark.sql.ApproximatePercentileQuerySuite— 20 tests passed (newSPARK-57557: percentile_approx supports TIME typetest, scalar and array-of-percentiles paths)SPARK_GENERATE_GOLDEN_FILES=1 sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z percentiles.sql— golden files regenerated and manually reviewed; a new TIME section was added topercentiles.sqlPercentileSuiteassertion (the accepted-types error message) was updated to include TIMEThe full lint/scalastyle and complete test matrix run on GitHub Actions.
Was this patch authored or co-authored using generative AI tooling?
Yes. Generated using Kiro (Claude Opus 4.8)