[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation#56910
[SPARK-57784][SQL] Support the TIME data type in cost-based optimizer statistics estimation#56910yadavay-amzn wants to merge 1 commit into
Conversation
MaxGekk
left a comment
There was a problem hiding this comment.
1 blocking, 1 non-blocking, 1 nit.
The CBO estimation extension itself is complete and consistent — every TimestampType site has a TimeType analogue, the version == 1 legacy path is correctly skipped, JoinEstimation is covered via the shared toDouble/fromDouble, and the Long↔Double round-trip is exact for all TIME values. One blocking issue at the serialization site, plus the test gap that hides it.
Findings
Correctness (1)
- (blocking)
CatalogColumnStat.toExternalStringtruncates sub-second precision in TIME min/max — see inline. Persisted min/max for any fractional TIME column is corrupted, defeating the range estimates this PR adds.
Suggestions (1)
- Add a sub-second TIME round-trip test for the min/max external-string (de)serialization (would catch the blocking issue) — see inline.
Nits (1)
FilterEstimationSuiteselectivity comment: "4/9 hours" is a dimensionless ratio (units cancel) and "rounded to 5" is actually a ceiling — see inline.
| case TimestampNTZType => | ||
| getTimestampFormatter(isParsing = false, forTimestampNTZ = true) | ||
| .format(v.asInstanceOf[Long]) | ||
| case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long]) |
There was a problem hiding this comment.
Blocking: this truncates sub-second precision. TimeFormatter(isParsing = false) resolves to DefaultTimeFormatter, which overrides only parse (→ stringToTimeAnsi, preserves nanos) and inherits format from Iso8601TimeFormatter with the fractionless "HH:mm:ss" pattern. So a TIME min/max of 12:00:00.123456 (nanos 43200123456000) is serialized as "12:00:00" and read back by fromExternalString as 12:00:00.000000 — a wrong, truncated min/max persisted for any fractional TIME column, which silently corrupts the range estimates this PR is adding. (fromExternalString's parse side is fine; only this format side is lossy.)
Contrast TimestampType just above, which formats with the .SSSSSS pattern (lossless to micros). Use a fraction-preserving formatter here:
| case _: TimeType => TimeFormatter(isParsing = false).format(v.asInstanceOf[Long]) | |
| case _: TimeType => TimeFormatter.getFractionFormatter().format(v.asInstanceOf[Long]) |
FractionTimeFormatter preserves up to nanosecond resolution (trailing zeros trimmed) and round-trips with the stringToTimeAnsi parse on the read side.
This path is reachable and net-new: AnalyzeColumnCommand.supportsType accepts _: DatetimeType (TimeType extends it), so ANALYZE … FOR COLUMNS over a TIME column collects min/max and persists through here. Pre-PR there was no TimeType arm, so it hit the case _ => throw columnStatisticsSerializationNotSupportedError fallthrough — i.e. this PR is what enables ANALYZE-on-TIME, so the lossy serialization ships as new behavior. (Histogram bins serialize as raw Double and are fine; only these min/max companions truncate.)
There was a problem hiding this comment.
Fixed - switched the TimeType arm to TimeFormatter.getFractionFormatter(), which preserves up to nanosecond resolution and round-trips losslessly with the stringToTimeAnsi parse on the read side. Verified 12:00:00.123456 (nanos 43200123456000) now survives toExternalString -> fromExternalString unchanged (previously it serialized to "12:00:00" and read back as 43200000000000).
| case TimestampType => getTimestampFormatter(isParsing = true).parse(s) | ||
| case TimestampNTZType => | ||
| getTimestampFormatter(isParsing = true, forTimestampNTZ = true).parse(s) | ||
| case _: TimeType => TimeFormatter(isParsing = true).parse(s) |
There was a problem hiding this comment.
These two TIME (de)serialization arms aren't covered by the new tests — the FilterEstimationSuite TIME cases use whole-second values and build ColumnStat in-memory, so they never round-trip through toExternalString/fromExternalString (which is why the truncation on the format side goes unnoticed). Worth a sub-second TIME min/max round-trip in the existing temporal stats coverage (StatisticsCollectionSuite — its type list and checkTimestampStats already cover DateType/TimestampType); a fractional value there would catch the truncation.
There was a problem hiding this comment.
Added a sub-second TIME min/max round-trip test in StatisticsCollectionSuite using 12:00:00.123456. Confirmed it FAILS with the old fractionless formatter (deserializes to 43200000000000) and passes with the fraction formatter, so it guards this truncation going forward.
|
|
||
| test("ctime < cast('12:00:00' AS TIME)") { | ||
| // 12:00 is 43200000000000L nanos. Range is [08:00, 17:00] = 10 distinct values. | ||
| // Fraction: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 hours => ~4.44 => rounded to 5 |
There was a problem hiding this comment.
Minor: 4/9 is a dimensionless selectivity ratio — the hour spans in numerator and denominator cancel, so "hours" is misleading. And ~4.44 => rounded to 5 is actually a ceiling (4.44 wouldn't round to 5). Maybe: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 of 10 rows => ~4.44 => ceil => 5.
There was a problem hiding this comment.
Reworded to: (12:00 - 08:00) / (17:00 - 08:00) = 4/9 of 10 rows => ~4.44 => ceil => 5.
… statistics estimation
68b31af to
c003f86
Compare
What changes were proposed in this pull request?
Add
TimeTypesupport to cost-based optimizer (CBO) statistics estimation so range-based selectivity, join, and union estimation work for TIME columns.TimeType(stored as nanoseconds-of-day in a Long) is handled at every site the other fixed-width temporal types use -EstimationUtils.toDouble/fromDouble,FilterEstimation.evaluateBinary/evaluateInSet,UnionEstimation.isTypeSupported, andCatalogColumnStatmin/max external-string (de)serialization (viaTimeFormatter) - mirroring the existingTimestampType(micros-as-Long) handling.Why are the changes needed?
Before this, filters/joins/unions over a TIME column fell back to default selectivity because
TimeTypewas not among the types CBO estimates, yielding poor row-count estimates and plans. TIME maps cleanly to a numeric scale (nanos-of-day), so it can be estimated the same way as DATE/TIMESTAMP.Does this PR introduce any user-facing change?
No. (Affects plan costing/estimates only, not query results.)
How was this patch tested?
New
FilterEstimationSuitetests for TIME range/equality/IN selectivity (asserting exactrowCount/distinctCount/min/max), and extendedJoinEstimationSuite/UnionEstimationtests with TIME columns. All existing CBO estimation tests pass; scalastyle clean.Was this patch authored or co-authored using generative AI tooling?
Authored with assistance by Claude Opus 4.8.