[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation by yadavay-amzn · Pull Request #56910 · apache/spark

yadavay-amzn · 2026-06-30T20:50:50Z

What changes were proposed in this pull request?

Add TimeType support to cost-based optimizer (CBO) statistics estimation so range-based selectivity, join, and union estimation work for TIME columns. TimeType (stored as nanoseconds-of-day in a Long) is handled at every site the other fixed-width temporal types use - EstimationUtils.toDouble/fromDouble, FilterEstimation.evaluateBinary/evaluateInSet, UnionEstimation.isTypeSupported, and CatalogColumnStat min/max external-string (de)serialization (via TimeFormatter) - mirroring the existing TimestampType (micros-as-Long) handling.

Why are the changes needed?

Before this, filters/joins/unions over a TIME column fell back to default selectivity because TimeType was not among the types CBO estimates, yielding poor row-count estimates and plans. TIME maps cleanly to a numeric scale (nanos-of-day), so it can be estimated the same way as DATE/TIMESTAMP.

Does this PR introduce any user-facing change?

No. (Affects plan costing/estimates only, not query results.)

How was this patch tested?

New FilterEstimationSuite tests for TIME range/equality/IN selectivity (asserting exact rowCount/distinctCount/min/max), and extended JoinEstimationSuite/UnionEstimation tests with TIME columns. All existing CBO estimation tests pass; scalastyle clean.

Was this patch authored or co-authored using generative AI tooling?

Authored with assistance by Claude Opus 4.8.

MaxGekk

1 blocking, 1 non-blocking, 1 nit.
The CBO estimation extension itself is complete and consistent — every TimestampType site has a TimeType analogue, the version == 1 legacy path is correctly skipped, JoinEstimation is covered via the shared toDouble/fromDouble, and the Long↔Double round-trip is exact for all TIME values. One blocking issue at the serialization site, plus the test gap that hides it.

Findings

Correctness (1)

(blocking) CatalogColumnStat.toExternalString truncates sub-second precision in TIME min/max — see inline. Persisted min/max for any fractional TIME column is corrupted, defeating the range estimates this PR adds.

Suggestions (1)

Add a sub-second TIME round-trip test for the min/max external-string (de)serialization (would catch the blocking issue) — see inline.

Nits (1)

FilterEstimationSuite selectivity comment: "4/9 hours" is a dimensionless ratio (units cancel) and "rounded to 5" is actually a ceiling — see inline.

MaxGekk · 2026-06-30T22:19:13Z

      case TimestampNTZType =>
        getTimestampFormatter(isParsing = false, forTimestampNTZ = true)
          .format(v.asInstanceOf[Long])
+      case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long])


Blocking: this truncates sub-second precision. TimeFormatter(isParsing = false) resolves to DefaultTimeFormatter, which overrides only parse (→ stringToTimeAnsi, preserves nanos) and inherits format from Iso8601TimeFormatter with the fractionless "HH:mm:ss" pattern. So a TIME min/max of 12:00:00.123456 (nanos 43200123456000) is serialized as "12:00:00" and read back by fromExternalString as 12:00:00.000000 — a wrong, truncated min/max persisted for any fractional TIME column, which silently corrupts the range estimates this PR is adding. (fromExternalString's parse side is fine; only this format side is lossy.)

Contrast TimestampType just above, which formats with the .SSSSSS pattern (lossless to micros). Use a fraction-preserving formatter here:

Suggested change

case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long])

case _: TimeType => TimeFormatter.getFractionFormatter().format(v.asInstanceOf[Long])

FractionTimeFormatter preserves up to nanosecond resolution (trailing zeros trimmed) and round-trips with the stringToTimeAnsi parse on the read side.

This path is reachable and net-new: AnalyzeColumnCommand.supportsType accepts _: DatetimeType (TimeType extends it), so ANALYZE … FOR COLUMNS over a TIME column collects min/max and persists through here. Pre-PR there was no TimeType arm, so it hit the case _ => throw columnStatisticsSerializationNotSupportedError fallthrough — i.e. this PR is what enables ANALYZE-on-TIME, so the lossy serialization ships as new behavior. (Histogram bins serialize as raw Double and are fine; only these min/max companions truncate.)

Fixed - switched the TimeType arm to TimeFormatter.getFractionFormatter(), which preserves up to nanosecond resolution and round-trips losslessly with the stringToTimeAnsi parse on the read side. Verified 12:00:00.123456 (nanos 43200123456000) now survives toExternalString -> fromExternalString unchanged (previously it serialized to "12:00:00" and read back as 43200000000000).

MaxGekk · 2026-06-30T22:19:13Z

      case TimestampType => getTimestampFormatter(isParsing = true).parse(s)
      case TimestampNTZType =>
        getTimestampFormatter(isParsing = true, forTimestampNTZ = true).parse(s)
+      case _: TimeType => TimeFormatter(isParsing = true).parse(s)


These two TIME (de)serialization arms aren't covered by the new tests — the FilterEstimationSuite TIME cases use whole-second values and build ColumnStat in-memory, so they never round-trip through toExternalString/fromExternalString (which is why the truncation on the format side goes unnoticed). Worth a sub-second TIME min/max round-trip in the existing temporal stats coverage (StatisticsCollectionSuite — its type list and checkTimestampStats already cover DateType/TimestampType); a fractional value there would catch the truncation.

Added a sub-second TIME min/max round-trip test in StatisticsCollectionSuite using 12:00:00.123456. Confirmed it FAILS with the old fractionless formatter (deserializes to 43200000000000) and passes with the fraction formatter, so it guards this truncation going forward.

MaxGekk · 2026-06-30T22:19:13Z

+
+  test("ctime < cast('12:00:00' AS TIME)") {
+    // 12:00 is 43200000000000L nanos. Range is [08:00, 17:00] = 10 distinct values.
+    // Fraction: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 hours => ~4.44 => rounded to 5


Minor: 4/9 is a dimensionless selectivity ratio — the hour spans in numerator and denominator cancel, so "hours" is misleading. And ~4.44 => rounded to 5 is actually a ceiling (4.44 wouldn't round to 5). Maybe: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 of 10 rows => ~4.44 => ceil => 5.

Reworded to: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 of 10 rows => ~4.44 => ceil => 5.

… statistics estimation

MaxGekk requested changes Jun 30, 2026

View reviewed changes

[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer…

c003f86

… statistics estimation

yadavay-amzn force-pushed the SPARK-57784 branch from 68b31af to c003f86 Compare June 30, 2026 23:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation#56910

[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation#56910
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:SPARK-57784

yadavay-amzn commented Jun 30, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jun 30, 2026

Uh oh!

yadavay-amzn Jun 30, 2026

Uh oh!

MaxGekk Jun 30, 2026

Uh oh!

yadavay-amzn Jun 30, 2026

Uh oh!

MaxGekk Jun 30, 2026

Uh oh!

yadavay-amzn Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long])
	case _: TimeType => TimeFormatter.getFractionFormatter().format(v.asInstanceOf[Long])

Uh oh!

Conversation

yadavay-amzn commented Jun 30, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Findings

Correctness (1)

Suggestions (1)

Nits (1)

Uh oh!

MaxGekk Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

yadavay-amzn Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

yadavay-amzn Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

yadavay-amzn Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants