Skip to content

[SPARK-57557][SQL] Support the TIME data type in quantile aggregates#56889

Open
vboo123 wants to merge 2 commits into
apache:masterfrom
vboo123:SPARK-57557
Open

[SPARK-57557][SQL] Support the TIME data type in quantile aggregates#56889
vboo123 wants to merge 2 commits into
apache:masterfrom
vboo123:SPARK-57557

Conversation

@vboo123

@vboo123 vboo123 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR adds support for the TIME data type (TimeType) as an input to the core quantile aggregate expressions. SPARK-57557 is a sub-task of the umbrella SPARK-57550 (extend TIME type support across Spark).

The following expressions now accept TIME columns and return TIME:

  • percentile, median, percentile_cont, percentile_disc (exact quantiles, percentiles.scala)
  • percentile_approx / approx_percentile (ApproximatePercentile)
  • histogram_numeric (HistogramNumeric)

TIME's internal representation is a Long (nanoseconds of day), so it flows through the existing value↔double conversion paths exactly like the other ordered, Long-backed types. The change is mechanical:

  • TimeType (via AnyTimeType) added to each expression's inputTypes / TypeCollection
  • A TimeType branch added to the update (LongDouble) and eval (DoubleLong) conversions
  • The return type uses child.dataType, preserving the input precision (e.g. median over a TIME(3) column returns TIME(3))

Design notes (called out so they can be challenged in review):

  • Scope: This PR covers only the core quantile aggregates above. The Apache DataSketches aggregates (HLL, Theta, and the KLL quantile sketches) are intentionally deferred and can be added in a follow-up.
  • Interpolation: For the exact percentile/median (continuous distribution), interpolated results mirror existing numeric/interval behavior — e.g. the median of two TIME values returns their midpoint, which may be a value not present in the data; sub-nanosecond fractions are truncated.
  • Template mirrored: For the exact percentile/median and histogram_numeric, TIMESTAMP is not currently supported either, so the implementation mirrors DAY-TIME INTERVAL (also Long-internal). For percentile_approx and histogram_numeric, TIMESTAMP was already supported and TIME mirrors it directly.

Why are the changes needed?

The TIME data type was added by SPIP SPARK-51162 and shipped in Spark 4.1.0, but the quantile aggregate functions did not accept it. Calling percentile, median, percentile_approx, or histogram_numeric on a TIME column failed with a type-check error, even though the operation is well-defined: TIME is internally a Long (nanoseconds of day) and quantile computation over an ordered numeric value applies directly. This PR closes that gap.

Does this PR introduce any user-facing change?

Yes. percentile, median, percentile_cont, percentile_disc, percentile_approx / approx_percentile, and histogram_numeric now accept TIME columns and return TIME values. Previously these expressions rejected TIME input with a type-check error.

How was this patch tested?

All scoped suites were run locally and passed:

  • catalyst/testOnly org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentileSuite org.apache.spark.sql.catalyst.expressions.aggregate.PercentileSuite org.apache.spark.sql.catalyst.expressions.aggregate.HistogramNumericSuite — all tests passed (each suite has a new test("SPARK-57557: ...") covering TIME input, return type, precision preservation, and median interpolation)
  • sql/testOnly org.apache.spark.sql.ApproximatePercentileQuerySuite — 20 tests passed (new SPARK-57557: percentile_approx supports TIME type test, scalar and array-of-percentiles paths)
  • SPARK_GENERATE_GOLDEN_FILES=1 sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z percentiles.sql — golden files regenerated and manually reviewed; a new TIME section was added to percentiles.sql
  • One pre-existing PercentileSuite assertion (the accepted-types error message) was updated to include TIME

The full lint/scalastyle and complete test matrix run on GitHub Actions.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated using Kiro (Claude Opus 4.8)

### What changes were proposed in this pull request?

Add `TimeType` as a valid input to the quantile aggregate expressions:
`percentile`, `median`, `percentile_cont`/`percentile_disc`,
`percentile_approx`/`approx_percentile`, and `histogram_numeric`.

`TIME`'s internal value is a `Long` (nanoseconds of day), so it flows
through the existing value<->double conversions exactly like `TIMESTAMP`
and `DAY-TIME INTERVAL`. The aggregates return `TimeType`, preserving the
input precision.

The Apache DataSketches aggregates (HLL/Theta/KLL) are intentionally left
out of this change and can be added in a follow-up.

### Why are the changes needed?

Sub-task of SPARK-57550 (extend TIME support). `TIME` (SPARK-51162,
shipped 4.1.0) is an ordered type but was rejected by these aggregates.

### Does this PR introduce any user-facing change?

Yes. `percentile`, `median`, `percentile_approx` and `histogram_numeric`
now accept `TIME` columns and return `TIME`.

### How was this patch tested?

Unit tests in ApproximatePercentileSuite, PercentileSuite and
HistogramNumericSuite; end-to-end tests in ApproximatePercentileQuerySuite;
golden-file coverage in percentiles.sql.

Generated using Kiro (Claude Opus 4.8).
@vboo123

vboo123 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

@MaxGekk

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 1 non-blocking, 0 nits.
Clean, precision-correct TIME extension mirroring the existing Long-internal type handling (DAY-TIME INTERVAL for the exact aggregates, TIMESTAMP for approx/histogram). Conversions are lossless over the full TIME domain, the return type preserves TIME(n), and there's no feature-gating bypass.

Suggestions (1)

  • percentiles.scala:78: reuse the named NumericAndAnsiInterval constant instead of re-inlining its three types — see inline.

Verification

  • Lossless Long<->Double: TIME is nanos-of-day in [0, 8.64e13) < 2^53, so Long -> Double -> Long round-trips exactly (input Long -> double for interpolation -> output toLong -> dataType = TIME). Strictly safer than DAY-TIME INTERVAL, which can exceed 2^53.
  • Ordering: the exact-percentile key sort uses PhysicalDataType.ordering(TimeType), which resolves to PhysicalLongType via TimeTypeOps.
  • Gating: spark.sql.timeType.enabled is enforced centrally at TIME construction, so accepting TimeType in inputTypes doesn't bypass it.

case _ => DoubleType
}
Seq(NumericAndAnsiInterval, percentageExpType, IntegralType)
Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: this re-inlines the three types that NumericAndAnsiInterval already groups (NumericType, DayTimeIntervalType, YearMonthIntervalType). Since TypeCollection nests/flattens, you can keep the named abstraction and just add the new type:

Suggested change
Seq(TypeCollection(NumericType, DayTimeIntervalType, YearMonthIntervalType, AnyTimeType),
Seq(TypeCollection(NumericAndAnsiInterval, AnyTimeType),

That keeps this call site in sync if NumericAndAnsiInterval ever changes. (You'll also keep the existing import ...TypeCollection.NumericAndAnsiInterval rather than removing it.)

@@ -102,10 +102,10 @@ case class ApproximatePercentile(
private lazy val accuracy: Long = accuracyNum.longValue

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function docs are not updated. The @ExpressionDescription usage text for percentile/median/percentile_cont/percentile_disc still says "numeric or ANSI interval column". These feed the SQL function reference. Adding TIME there accordingly would be more accurate.

assert(Median(BoundReference(0, TimeType(3), nullable = false)).dataType === TimeType(3))

// The exact median of two TIME values interpolates (continuous distribution), returning the
// midpoint as Long nanos-of-day: median(00:00:00, 00:00:00.00000001) -> 00:00:00.000000005.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: misleading test comment. In PercentileSuite, the comment median(00:00:00, 00:00:00.00000001) -> 00:00:00.000000005 uses 9-digit (nanosecond) display, but TimeType max precision is microseconds (6), so that literal isn't representable. The actual assertion (=== 5L on the internal Long) is correct; just the comment's notation is off.

- percentiles.scala: nest the named NumericAndAnsiInterval constant in
  inputTypes rather than re-inlining its member types (MaxGekk)
- Update @ExpressionDescription usage docs for percentile, median,
  percentile_cont, percentile_disc and percentile_approx to mention TIME
  (uros-b)
- Fix misleading nanosecond-literal comment in PercentileSuite; TimeType
  max precision is microseconds, so describe the internal Long values (uros-b)
- Sync the PercentileSuite accepted-types assertion to the nested
  TypeCollection message

Generated using Kiro (Claude Opus 4.8).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants