[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource#56854
[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource#56854vinodkc wants to merge 3 commits into
Conversation
MaxGekk
left a comment
There was a problem hiding this comment.
1 blocking, 2 non-blocking, 0 nits.
Clean extension that mirrors the CSV nanos support. One blocking item on the NTZ-nano read path; the rest is an over-broad inference catch-all and the matching test gaps.
Correctness (2)
- StaxXmlParser.scala:606: NTZ-nano read uses the 2-arg
parseWithoutTimeZoneNanos(allowTimeZone=true) and silently accepts+drops a zone component, unlike the micro NTZ path, inference, and CSV (blocking) — see inline. - XmlInferSchema.scala:667: the
AnyTimestampNanoTypecatch-all matches nano paired with any type, so nano + non-datetime infers asTimestampTypeinstead of theStringTypefallback the micro path uses (non-blocking) — see inline.
Suggestions (1)
- Test coverage: none of the new tests exercise (a) NTZ-nano zone rejection, (b) nano + non-timestamp inference, or (c) the actual round-tripped value in the
from_xmltest (it now asserts only non-null). Worth a case for (a)/(b) and tightening (c).
Verification
Traced the compatibleType restructuring (moving the StructType/ArrayType recursive cases ahead of findTightestCommonType): equivalence-preserving for the inputs it actually sees. compatibleType is private[xml] and reached only from inference, whose structs are built via a name-sorted TreeMap (addOrUpdateType), so the new groupBy+sortBy(name) merge yields the same result as the old findTightestCommonType/findTypeForComplex path, and element/field recursion falls through to findTightestCommonType for non-nano types. The motive is valid: AnyTimestampNanoType extends DatetimeType, so findTightestCommonType would route a nested nano pair to findWiderDateTimeType, which has no nano case and would MatchError. The write path, LTZ-nano read, and supportDataType gate change all match the CSV analogue.
MaxGekk
left a comment
There was a problem hiding this comment.
3 addressed, 0 remaining, 1 new.
Thanks for the fixes — all three points from the last round are resolved and re-verified:
StaxXmlParser.scala:606: NTZ-nano now passesallowTimeZone=false(rejects zoned input, matching the micro NTZ path and CSV). ✓XmlInferSchema.scala:667: catch-all narrowed toDatetimeType, so nano + non-datetime falls back toStringType— and nofindWiderDateTimeTypeMatchErroris reintroduced. ✓- Tests:
from_xmlnow asserts the exact truncated round-trip value (UTC-pinned), plus new zone-rejection and nano-vs-non-datetime inference tests. ✓
Suggestions (1)
- One coverage-parity gap vs the sibling JSON PR (#56865): there's no LEGACY time-parser-policy rejection test for nanos. The behavior is already correct — the shared
TimestampFormatter(parseNanos/parseWithoutTimeZoneNanos/formatNanos) raisesUNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSERunder the legacy policy, and XML uses it unchanged — so this is non-blocking. A test asserting it (write + read), mirroring JSON's"...rejects nanosecond timestamps under the LEGACY time parser policy", would lock it in. (XML is v1-only, so JSON's v2JsonTablegate has no XML analogue to add.)
Rewrote the test to compute the expected LocalDateTime/Instant per precision using nanoOfSecTruncator and replaced the != null check with checkAnswer. |
|
For the new coverage gap: |
MaxGekk
left a comment
There was a problem hiding this comment.
1 addressed, 0 remaining, 0 new.
Thanks @vinodkc — the one open item from the last round is resolved and re-verified, so this LGTM.
The suggested LEGACY-time-parser nanos rejection test is added in FileBasedDataSourceSuite, mirroring the JSON #56865 sibling (same suite, same UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER condition + config param, both write and read; read uses FAILFAST so the error surfaces instead of being swallowed as a permissive null). The three round-1 fixes (NTZ-nano allowTimeZone=false, inference catch-all narrowed to DatetimeType, exact-value from_xml round-trip) remain in place.
Verification
- Test scope is correct (LTZ only):
TimestampFormatter.getFormatterselects the legacy formatter only whenlegacyTimeParserPolicy == LEGACY && !forTimestampNTZ, so NTZ-nanos always takes the ISO-8601 path and never hits the legacy-rejection path — there is no untested NTZ rejection to add, and the test comment's rationale is accurate. - XML is v1-only, so JSON's v2
JsonTablegate has no XML analogue — correctly noted.
No new findings.
|
+1, LGTM. Merging to master/4.x. |
…e XML datasource ### What changes were proposed in this pull request? This PR adds support for nanosecond-precision timestamp types (`TimestampLTZNanosType` and `TimestampNTZNanosType`) in the XML datasource, covering: - Writing: `StaxXmlGenerator` handles the two nano timestamp types by calling the nanosecond-aware formatter methods (formatNanos, formatWithoutTimeZoneNanos). - Reading: `StaxXmlParser` routes nano timestamp types to the corresponding `parseNanos` / `parseWithoutTimeZoneNanos` formatter methods in the schema-directed path, and delegates to castTo in the type-coercion path. - Schema inference: `XmlInferSchema` now infers `TimestampNTZNanosType(9)` when a field value carries sub-microsecond fractional seconds (>6 digits) and `TIMESTAMP_NANOS_TYPES_ENABLED` is on. The compatibleType widening function is extended to merge two nano timestamp types (taking the higher precision), downgrade `TimestampNTZNanosType` + `TimestampNTZType` to `TimestampNTZType`, and fall back to `TimestampType` for any other nano/non-nano combination. The `StructType` and `ArrayType` recursive cases are moved into the pre-TypeCoercion block so that nested fields also benefit from the nano-widening logic. ### Why are the changes needed? `xml` rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through `xml`. ### Does this PR introduce _any_ user-facing change? Yes. With `spark.sql.timestampNanosTypes.enabled` set to `true`: - XML files can now be written and read back with `TimestampLTZNanosType` and `TimestampNTZNanosType` columns without error. - Schema inference promotes a timestamp field to `TimestampNTZNanosType(9)` when its string value contains more than 6 fractional-second digits . ### How was this patch tested? `XmlSuite`: Added two new inference tests — one verifying that a 9-digit NTZ timestamp string is inferred as `TimestampNTZNanosType(9)`, and one verifying that a mix of micro-precision and nano-precision NTZ rows in the same file degrades to `TimestampNTZType` Updated `FileBasedDataSourceSuite` and `XmlFunctionsSuite` ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude (Sonnet 4.6) Closes #56854 from vinodkc/SPARK-57458. Authored-by: Vinod KC <vinod.kc.in@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit fe6bc95) Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
This PR adds support for nanosecond-precision timestamp types (
TimestampLTZNanosTypeandTimestampNTZNanosType) in the XML datasource, covering:StaxXmlGeneratorhandles the two nano timestamp types by calling the nanosecond-aware formatter methods (formatNanos, formatWithoutTimeZoneNanos).StaxXmlParserroutes nano timestamp types to the correspondingparseNanos/parseWithoutTimeZoneNanosformatter methods in the schema-directed path, and delegates to castTo in the type-coercion path.XmlInferSchemanow infersTimestampNTZNanosType(9)when a field value carries sub-microsecond fractional seconds (>6 digits) andTIMESTAMP_NANOS_TYPES_ENABLEDis on.The compatibleType widening function is extended to merge two nano timestamp types (taking the higher precision), downgrade
TimestampNTZNanosType+TimestampNTZTypetoTimestampNTZType, and fall back toTimestampTypefor any other nano/non-nano combination. TheStructTypeandArrayTyperecursive cases are moved into the pre-TypeCoercion block so that nested fields also benefit from the nano-widening logic.Why are the changes needed?
xmlrejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read throughxml.Does this PR introduce any user-facing change?
Yes. With
spark.sql.timestampNanosTypes.enabledset totrue:TimestampLTZNanosTypeandTimestampNTZNanosTypecolumns without error.TimestampNTZNanosType(9)when its string value contains more than 6 fractional-second digits .How was this patch tested?
XmlSuite: Added two new inference tests — one verifying that a 9-digit NTZ timestamp string is inferred asTimestampNTZNanosType(9), and one verifying that a mix of micro-precision and nano-precision NTZ rows in the same file degrades toTimestampNTZTypeUpdated
FileBasedDataSourceSuiteandXmlFunctionsSuiteWas this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude (Sonnet 4.6)