Skip to content

[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation#56910

Open
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:SPARK-57784
Open

[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation#56910
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:SPARK-57784

Conversation

@yadavay-amzn

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Add TimeType support to cost-based optimizer (CBO) statistics estimation so range-based selectivity, join, and union estimation work for TIME columns. TimeType (stored as nanoseconds-of-day in a Long) is handled at every site the other fixed-width temporal types use - EstimationUtils.toDouble/fromDouble, FilterEstimation.evaluateBinary/evaluateInSet, UnionEstimation.isTypeSupported, and CatalogColumnStat min/max external-string (de)serialization (via TimeFormatter) - mirroring the existing TimestampType (micros-as-Long) handling.

Why are the changes needed?

Before this, filters/joins/unions over a TIME column fell back to default selectivity because TimeType was not among the types CBO estimates, yielding poor row-count estimates and plans. TIME maps cleanly to a numeric scale (nanos-of-day), so it can be estimated the same way as DATE/TIMESTAMP.

Does this PR introduce any user-facing change?

No. (Affects plan costing/estimates only, not query results.)

How was this patch tested?

New FilterEstimationSuite tests for TIME range/equality/IN selectivity (asserting exact rowCount/distinctCount/min/max), and extended JoinEstimationSuite/UnionEstimation tests with TIME columns. All existing CBO estimation tests pass; scalastyle clean.

Was this patch authored or co-authored using generative AI tooling?

Authored with assistance by Claude Opus 4.8.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 1 non-blocking, 1 nit.
The CBO estimation extension itself is complete and consistent — every TimestampType site has a TimeType analogue, the version == 1 legacy path is correctly skipped, JoinEstimation is covered via the shared toDouble/fromDouble, and the LongDouble round-trip is exact for all TIME values. One blocking issue at the serialization site, plus the test gap that hides it.

Findings

Correctness (1)

  • (blocking) CatalogColumnStat.toExternalString truncates sub-second precision in TIME min/max — see inline. Persisted min/max for any fractional TIME column is corrupted, defeating the range estimates this PR adds.

Suggestions (1)

  • Add a sub-second TIME round-trip test for the min/max external-string (de)serialization (would catch the blocking issue) — see inline.

Nits (1)

  • FilterEstimationSuite selectivity comment: "4/9 hours" is a dimensionless ratio (units cancel) and "rounded to 5" is actually a ceiling — see inline.

case TimestampNTZType =>
getTimestampFormatter(isParsing = false, forTimestampNTZ = true)
.format(v.asInstanceOf[Long])
case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: this truncates sub-second precision. TimeFormatter(isParsing = false) resolves to DefaultTimeFormatter, which overrides only parse (→ stringToTimeAnsi, preserves nanos) and inherits format from Iso8601TimeFormatter with the fractionless "HH:mm:ss" pattern. So a TIME min/max of 12:00:00.123456 (nanos 43200123456000) is serialized as "12:00:00" and read back by fromExternalString as 12:00:00.000000 — a wrong, truncated min/max persisted for any fractional TIME column, which silently corrupts the range estimates this PR is adding. (fromExternalString's parse side is fine; only this format side is lossy.)

Contrast TimestampType just above, which formats with the .SSSSSS pattern (lossless to micros). Use a fraction-preserving formatter here:

Suggested change
case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long])
case _: TimeType => TimeFormatter.getFractionFormatter().format(v.asInstanceOf[Long])

FractionTimeFormatter preserves up to nanosecond resolution (trailing zeros trimmed) and round-trips with the stringToTimeAnsi parse on the read side.

This path is reachable and net-new: AnalyzeColumnCommand.supportsType accepts _: DatetimeType (TimeType extends it), so ANALYZE … FOR COLUMNS over a TIME column collects min/max and persists through here. Pre-PR there was no TimeType arm, so it hit the case _ => throw columnStatisticsSerializationNotSupportedError fallthrough — i.e. this PR is what enables ANALYZE-on-TIME, so the lossy serialization ships as new behavior. (Histogram bins serialize as raw Double and are fine; only these min/max companions truncate.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - switched the TimeType arm to TimeFormatter.getFractionFormatter(), which preserves up to nanosecond resolution and round-trips losslessly with the stringToTimeAnsi parse on the read side. Verified 12:00:00.123456 (nanos 43200123456000) now survives toExternalString -> fromExternalString unchanged (previously it serialized to "12:00:00" and read back as 43200000000000).

case TimestampType => getTimestampFormatter(isParsing = true).parse(s)
case TimestampNTZType =>
getTimestampFormatter(isParsing = true, forTimestampNTZ = true).parse(s)
case _: TimeType => TimeFormatter(isParsing = true).parse(s)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two TIME (de)serialization arms aren't covered by the new tests — the FilterEstimationSuite TIME cases use whole-second values and build ColumnStat in-memory, so they never round-trip through toExternalString/fromExternalString (which is why the truncation on the format side goes unnoticed). Worth a sub-second TIME min/max round-trip in the existing temporal stats coverage (StatisticsCollectionSuite — its type list and checkTimestampStats already cover DateType/TimestampType); a fractional value there would catch the truncation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a sub-second TIME min/max round-trip test in StatisticsCollectionSuite using 12:00:00.123456. Confirmed it FAILS with the old fractionless formatter (deserializes to 43200000000000) and passes with the fraction formatter, so it guards this truncation going forward.


test("ctime < cast('12:00:00' AS TIME)") {
// 12:00 is 43200000000000L nanos. Range is [08:00, 17:00] = 10 distinct values.
// Fraction: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 hours => ~4.44 => rounded to 5

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: 4/9 is a dimensionless selectivity ratio — the hour spans in numerator and denominator cancel, so "hours" is misleading. And ~4.44 => rounded to 5 is actually a ceiling (4.44 wouldn't round to 5). Maybe: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 of 10 rows => ~4.44 => ceil => 5.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 of 10 rows => ~4.44 => ceil => 5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants