[SPARK-57557][SQL] Support the TIME data type in quantile aggregates by vboo123 · Pull Request #56889 · apache/spark

vboo123 · 2026-06-30T05:52:55Z

What changes were proposed in this pull request?

This PR adds support for the TIME data type (TimeType) as an input to the core quantile aggregate expressions. SPARK-57557 is a sub-task of the umbrella SPARK-57550 (extend TIME type support across Spark).

The following expressions now accept TIME columns and return TIME:

percentile, median, percentile_cont, percentile_disc (exact quantiles, percentiles.scala)
percentile_approx / approx_percentile (ApproximatePercentile)
histogram_numeric (HistogramNumeric)

TIME's internal representation is a Long (nanoseconds of day), so it flows through the existing value↔double conversion paths exactly like the other ordered, Long-backed types. The change is mechanical:

TimeType (via AnyTimeType) added to each expression's inputTypes / TypeCollection
A TimeType branch added to the update (Long → Double) and eval (Double → Long) conversions
The return type uses child.dataType, preserving the input precision (e.g. median over a TIME(3) column returns TIME(3))

Design notes (called out so they can be challenged in review):

Scope: This PR covers only the core quantile aggregates above. The Apache DataSketches aggregates (HLL, Theta, and the KLL quantile sketches) are intentionally deferred and can be added in a follow-up.
Interpolation: For the exact percentile/median (continuous distribution), interpolated results mirror existing numeric/interval behavior — e.g. the median of two TIME values returns their midpoint, which may be a value not present in the data; sub-nanosecond fractions are truncated.
Template mirrored: For the exact percentile/median and histogram_numeric, TIMESTAMP is not currently supported either, so the implementation mirrors DAY-TIME INTERVAL (also Long-internal). For percentile_approx and histogram_numeric, TIMESTAMP was already supported and TIME mirrors it directly.

Why are the changes needed?

The TIME data type was added by SPIP SPARK-51162 and shipped in Spark 4.1.0, but the quantile aggregate functions did not accept it. Calling percentile, median, percentile_approx, or histogram_numeric on a TIME column failed with a type-check error, even though the operation is well-defined: TIME is internally a Long (nanoseconds of day) and quantile computation over an ordered numeric value applies directly. This PR closes that gap.

Does this PR introduce any user-facing change?

Yes. percentile, median, percentile_cont, percentile_disc, percentile_approx / approx_percentile, and histogram_numeric now accept TIME columns and return TIME values. Previously these expressions rejected TIME input with a type-check error.

How was this patch tested?

All scoped suites were run locally and passed:

catalyst/testOnly org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentileSuite org.apache.spark.sql.catalyst.expressions.aggregate.PercentileSuite org.apache.spark.sql.catalyst.expressions.aggregate.HistogramNumericSuite — all tests passed (each suite has a new test("SPARK-57557: ...") covering TIME input, return type, precision preservation, and median interpolation)
sql/testOnly org.apache.spark.sql.ApproximatePercentileQuerySuite — 20 tests passed (new SPARK-57557: percentile_approx supports TIME type test, scalar and array-of-percentiles paths)
SPARK_GENERATE_GOLDEN_FILES=1 sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z percentiles.sql — golden files regenerated and manually reviewed; a new TIME section was added to percentiles.sql
One pre-existing PercentileSuite assertion (the accepted-types error message) was updated to include TIME

The full lint/scalastyle and complete test matrix run on GitHub Actions.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated using Kiro (Claude Opus 4.8)

### What changes were proposed in this pull request? Add `TimeType` as a valid input to the quantile aggregate expressions: `percentile`, `median`, `percentile_cont`/`percentile_disc`, `percentile_approx`/`approx_percentile`, and `histogram_numeric`. `TIME`'s internal value is a `Long` (nanoseconds of day), so it flows through the existing value<->double conversions exactly like `TIMESTAMP` and `DAY-TIME INTERVAL`. The aggregates return `TimeType`, preserving the input precision. The Apache DataSketches aggregates (HLL/Theta/KLL) are intentionally left out of this change and can be added in a follow-up. ### Why are the changes needed? Sub-task of SPARK-57550 (extend TIME support). `TIME` (SPARK-51162, shipped 4.1.0) is an ordered type but was rejected by these aggregates. ### Does this PR introduce any user-facing change? Yes. `percentile`, `median`, `percentile_approx` and `histogram_numeric` now accept `TIME` columns and return `TIME`. ### How was this patch tested? Unit tests in ApproximatePercentileSuite, PercentileSuite and HistogramNumericSuite; end-to-end tests in ApproximatePercentileQuerySuite; golden-file coverage in percentiles.sql. Generated using Kiro (Claude Opus 4.8).

vboo123 · 2026-06-30T07:22:27Z

@MaxGekk

MaxGekk

0 blocking, 1 non-blocking, 0 nits.
Clean, precision-correct TIME extension mirroring the existing Long-internal type handling (DAY-TIME INTERVAL for the exact aggregates, TIMESTAMP for approx/histogram). Conversions are lossless over the full TIME domain, the return type preserves TIME(n), and there's no feature-gating bypass.

Suggestions (1)

percentiles.scala:78: reuse the named NumericAndAnsiInterval constant instead of re-inlining its three types — see inline.

Verification

Lossless Long<->Double: TIME is nanos-of-day in [0, 8.64e13) < 2^53, so Long -> Double -> Long round-trips exactly (input Long -> double for interpolation -> output toLong -> dataType = TIME). Strictly safer than DAY-TIME INTERVAL, which can exceed 2^53.
Ordering: the exact-percentile key sort uses PhysicalDataType.ordering(TimeType), which resolves to PhysicalLongType via TimeTypeOps.
Gating: spark.sql.timeType.enabled is enforced centrally at TIME construction, so accepting TimeType in inputTypes doesn't bypass it.

MaxGekk · 2026-06-30T08:44:41Z

      case _ => DoubleType
    }
-    Seq(NumericAndAnsiInterval, percentageExpType, IntegralType)
+    Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType),


Non-blocking: this re-inlines the three types that NumericAndAnsiInterval already groups (NumericType, DayTimeIntervalType, YearMonthIntervalType). Since TypeCollection nests/flattens, you can keep the named abstraction and just add the new type:

Suggested change

Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType),

Seq(TypeCollection(NumericAndAnsiInterval, AnyTimeType),

That keeps this call site in sync if NumericAndAnsiInterval ever changes. (You'll also keep the existing import ...TypeCollection.NumericAndAnsiInterval rather than removing it.)

uros-b · 2026-06-30T10:08:25Z

@@ -102,10 +102,10 @@ case class ApproximatePercentile(
  private lazy val accuracy: Long = accuracyNum.longValue



The function docs are not updated. The @ExpressionDescription usage text for percentile/median/percentile_cont/percentile_disc still says "numeric or ANSI interval column". These feed the SQL function reference. Adding TIME there accordingly would be more accurate.

uros-b · 2026-06-30T10:08:47Z

+    assert(Median(BoundReference(0, TimeType(3), nullable = false)).dataType === TimeType(3))
+
+    // The exact median of two TIME values interpolates (continuous distribution), returning the
+    // midpoint as Long nanos-of-day: median(00:00:00, 00:00:00.00000001) -> 00:00:00.000000005.


Nit: misleading test comment. In PercentileSuite, the comment median(00:00:00, 00:00:00.00000001) -> 00:00:00.000000005 uses 9-digit (nanosecond) display, but TimeType max precision is microseconds (6), so that literal isn't representable. The actual assertion (=== 5L on the internal Long) is correct; just the comment's notation is off.

- percentiles.scala: nest the named NumericAndAnsiInterval constant in inputTypes rather than re-inlining its member types (MaxGekk) - Update @ExpressionDescription usage docs for percentile, median, percentile_cont, percentile_disc and percentile_approx to mention TIME (uros-b) - Fix misleading nanosecond-literal comment in PercentileSuite; TimeType max precision is microseconds, so describe the internal Long values (uros-b) - Sync the PercentileSuite accepted-types assertion to the nested TypeCollection message Generated using Kiro (Claude Opus 4.8).

MaxGekk reviewed Jun 30, 2026

View reviewed changes

uros-b reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57557][SQL] Support the TIME data type in quantile aggregates#56889

[SPARK-57557][SQL] Support the TIME data type in quantile aggregates#56889
vboo123 wants to merge 2 commits into
apache:masterfrom
vboo123:SPARK-57557

vboo123 commented Jun 30, 2026

Uh oh!

vboo123 commented Jun 30, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jun 30, 2026

Uh oh!

uros-b Jun 30, 2026

Uh oh!

uros-b Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType),
	Seq(TypeCollection(NumericAndAnsiInterval, AnyTimeType),

		@@ -102,10 +102,10 @@ case class ApproximatePercentile(
		private lazy val accuracy: Long = accuracyNum.longValue

Uh oh!

Conversation

vboo123 commented Jun 30, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

vboo123 commented Jun 30, 2026

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Suggestions (1)

Verification

Uh oh!

MaxGekk Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

uros-b Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

uros-b Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants