[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource by vinodkc · Pull Request #56854 · apache/spark

vinodkc · 2026-06-28T19:02:24Z

What changes were proposed in this pull request?

This PR adds support for nanosecond-precision timestamp types (TimestampLTZNanosType and TimestampNTZNanosType) in the XML datasource, covering:

Writing: StaxXmlGenerator handles the two nano timestamp types by calling the nanosecond-aware formatter methods (formatNanos, formatWithoutTimeZoneNanos).
Reading: StaxXmlParser routes nano timestamp types to the corresponding parseNanos / parseWithoutTimeZoneNanos formatter methods in the schema-directed path, and delegates to castTo in the type-coercion path.
Schema inference: XmlInferSchema now infers TimestampNTZNanosType(9) when a field value carries sub-microsecond fractional seconds (>6 digits) and TIMESTAMP_NANOS_TYPES_ENABLED is on.
The compatibleType widening function is extended to merge two nano timestamp types (taking the higher precision), downgrade TimestampNTZNanosType + TimestampNTZType to TimestampNTZType, and fall back to TimestampType for any other nano/non-nano combination. The StructType and ArrayType recursive cases are moved into the pre-TypeCoercion block so that nested fields also benefit from the nano-widening logic.

Why are the changes needed?

xml rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through xml.

Does this PR introduce any user-facing change?

Yes. With spark.sql.timestampNanosTypes.enabled set to true:

XML files can now be written and read back with TimestampLTZNanosType and TimestampNTZNanosType columns without error.
Schema inference promotes a timestamp field to TimestampNTZNanosType(9) when its string value contains more than 6 fractional-second digits .

How was this patch tested?

XmlSuite: Added two new inference tests — one verifying that a 9-digit NTZ timestamp string is inferred as TimestampNTZNanosType(9), and one verifying that a mix of micro-precision and nano-precision NTZ rows in the same file degrades to TimestampNTZType
Updated FileBasedDataSourceSuite and XmlFunctionsSuite

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude (Sonnet 4.6)

MaxGekk

1 blocking, 2 non-blocking, 0 nits.
Clean extension that mirrors the CSV nanos support. One blocking item on the NTZ-nano read path; the rest is an over-broad inference catch-all and the matching test gaps.

Correctness (2)

StaxXmlParser.scala:606: NTZ-nano read uses the 2-arg parseWithoutTimeZoneNanos (allowTimeZone=true) and silently accepts+drops a zone component, unlike the micro NTZ path, inference, and CSV (blocking) — see inline.
XmlInferSchema.scala:667: the AnyTimestampNanoType catch-all matches nano paired with any type, so nano + non-datetime infers as TimestampType instead of the StringType fallback the micro path uses (non-blocking) — see inline.

Suggestions (1)

Test coverage: none of the new tests exercise (a) NTZ-nano zone rejection, (b) nano + non-timestamp inference, or (c) the actual round-tripped value in the from_xml test (it now asserts only non-null). Worth a case for (a)/(b) and tightening (c).

Verification

Traced the compatibleType restructuring (moving the StructType/ArrayType recursive cases ahead of findTightestCommonType): equivalence-preserving for the inputs it actually sees. compatibleType is private[xml] and reached only from inference, whose structs are built via a name-sorted TreeMap (addOrUpdateType), so the new groupBy+sortBy(name) merge yields the same result as the old findTightestCommonType/findTypeForComplex path, and element/field recursion falls through to findTightestCommonType for non-nano types. The motive is valid: AnyTimestampNanoType extends DatetimeType, so findTightestCommonType would route a nested nano pair to findWiderDateTimeType, which has no nano case and would MatchError. The write path, LTZ-nano read, and supportDataType gate change all match the CSV analogue.

MaxGekk

3 addressed, 0 remaining, 1 new.
Thanks for the fixes — all three points from the last round are resolved and re-verified:

StaxXmlParser.scala:606: NTZ-nano now passes allowTimeZone=false (rejects zoned input, matching the micro NTZ path and CSV). ✓
XmlInferSchema.scala:667: catch-all narrowed to DatetimeType, so nano + non-datetime falls back to StringType — and no findWiderDateTimeType MatchError is reintroduced. ✓
Tests: from_xml now asserts the exact truncated round-trip value (UTC-pinned), plus new zone-rejection and nano-vs-non-datetime inference tests. ✓

Suggestions (1)

One coverage-parity gap vs the sibling JSON PR (#56865): there's no LEGACY time-parser-policy rejection test for nanos. The behavior is already correct — the shared TimestampFormatter (parseNanos / parseWithoutTimeZoneNanos / formatNanos) raises UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER under the legacy policy, and XML uses it unchanged — so this is non-blocking. A test asserting it (write + read), mirroring JSON's "...rejects nanosecond timestamps under the LEGACY time parser policy", would lock it in. (XML is v1-only, so JSON's v2 JsonTable gate has no XML analogue to add.)

vinodkc · 2026-06-29T21:44:36Z

Comment 3 (test coverage): from_xml test only checks non-null, not actual value
Fixed.

Rewrote the test to compute the expected LocalDateTime/Instant per precision using nanoOfSecTruncator and replaced the != null check with checkAnswer.

vinodkc · 2026-06-29T22:02:19Z

For the new coverage gap:
added SPARK-57458: XML rejects nanosecond timestamps under the LEGACY time parser policy in FileBasedDataSourceSuite, covering both the write and read paths with TimestampLTZNanosType(9) under LEGACY_TIME_PARSER_POLICY = LEGACY.

MaxGekk

1 addressed, 0 remaining, 0 new.
Thanks @vinodkc — the one open item from the last round is resolved and re-verified, so this LGTM.

The suggested LEGACY-time-parser nanos rejection test is added in FileBasedDataSourceSuite, mirroring the JSON #56865 sibling (same suite, same UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER condition + config param, both write and read; read uses FAILFAST so the error surfaces instead of being swallowed as a permissive null). The three round-1 fixes (NTZ-nano allowTimeZone=false, inference catch-all narrowed to DatetimeType, exact-value from_xml round-trip) remain in place.

Verification

Test scope is correct (LTZ only): TimestampFormatter.getFormatter selects the legacy formatter only when legacyTimeParserPolicy == LEGACY && !forTimestampNTZ, so NTZ-nanos always takes the ISO-8601 path and never hits the legacy-rejection path — there is no untested NTZ rejection to add, and the test comment's rationale is accurate.
XML is v1-only, so JSON's v2 JsonTable gate has no XML analogue — correctly noted.

No new findings.

MaxGekk · 2026-06-30T09:45:51Z

+1, LGTM. Merging to master/4.x.
Thank you, @vinodkc.

…e XML datasource ### What changes were proposed in this pull request? This PR adds support for nanosecond-precision timestamp types (`TimestampLTZNanosType` and `TimestampNTZNanosType`) in the XML datasource, covering: - Writing: `StaxXmlGenerator` handles the two nano timestamp types by calling the nanosecond-aware formatter methods (formatNanos, formatWithoutTimeZoneNanos). - Reading: `StaxXmlParser` routes nano timestamp types to the corresponding `parseNanos` / `parseWithoutTimeZoneNanos` formatter methods in the schema-directed path, and delegates to castTo in the type-coercion path. - Schema inference: `XmlInferSchema` now infers `TimestampNTZNanosType(9)` when a field value carries sub-microsecond fractional seconds (>6 digits) and `TIMESTAMP_NANOS_TYPES_ENABLED` is on. The compatibleType widening function is extended to merge two nano timestamp types (taking the higher precision), downgrade `TimestampNTZNanosType` + `TimestampNTZType` to `TimestampNTZType`, and fall back to `TimestampType` for any other nano/non-nano combination. The `StructType` and `ArrayType` recursive cases are moved into the pre-TypeCoercion block so that nested fields also benefit from the nano-widening logic. ### Why are the changes needed? `xml` rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through `xml`. ### Does this PR introduce _any_ user-facing change? Yes. With `spark.sql.timestampNanosTypes.enabled` set to `true`: - XML files can now be written and read back with `TimestampLTZNanosType` and `TimestampNTZNanosType` columns without error. - Schema inference promotes a timestamp field to `TimestampNTZNanosType(9)` when its string value contains more than 6 fractional-second digits . ### How was this patch tested? `XmlSuite`: Added two new inference tests — one verifying that a 9-digit NTZ timestamp string is inferred as `TimestampNTZNanosType(9)`, and one verifying that a mix of micro-precision and nano-precision NTZ rows in the same file degrades to `TimestampNTZType` Updated `FileBasedDataSourceSuite` and `XmlFunctionsSuite` ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude (Sonnet 4.6) Closes #56854 from vinodkc/SPARK-57458. Authored-by: Vinod KC <vinod.kc.in@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit fe6bc95) Signed-off-by: Max Gekk <max.gekk@gmail.com>

Support nanosecond-precision timestamp types in the XML datasource

0f03f54

MaxGekk reviewed Jun 29, 2026

View reviewed changes

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala Outdated

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala Outdated

Fix review comments

f88906c

MaxGekk reviewed Jun 29, 2026

View reviewed changes

Fix review comments 2

bcfef10

MaxGekk approved these changes Jun 30, 2026

View reviewed changes

MaxGekk closed this in fe6bc95 Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource#56854

[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource#56854
vinodkc wants to merge 3 commits into
apache:masterfrom
vinodkc:SPARK-57458

vinodkc commented Jun 28, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

Uh oh!

Uh oh!

MaxGekk left a comment

Uh oh!

vinodkc commented Jun 29, 2026

Uh oh!

vinodkc commented Jun 29, 2026 •

edited

Loading

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

vinodkc commented Jun 28, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Correctness (2)

Suggestions (1)

Verification

Uh oh!

Uh oh!

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Suggestions (1)

Uh oh!

vinodkc commented Jun 29, 2026

Uh oh!

vinodkc commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Verification

Uh oh!

MaxGekk commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vinodkc commented Jun 29, 2026 •

edited

Loading