Skip to content

[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource#56854

Closed
vinodkc wants to merge 3 commits into
apache:masterfrom
vinodkc:SPARK-57458
Closed

[SPARK-57458][SQL] Support nanosecond-precision timestamp types in the XML datasource#56854
vinodkc wants to merge 3 commits into
apache:masterfrom
vinodkc:SPARK-57458

Conversation

@vinodkc

@vinodkc vinodkc commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR adds support for nanosecond-precision timestamp types (TimestampLTZNanosType and TimestampNTZNanosType) in the XML datasource, covering:

  • Writing: StaxXmlGenerator handles the two nano timestamp types by calling the nanosecond-aware formatter methods (formatNanos, formatWithoutTimeZoneNanos).
  • Reading: StaxXmlParser routes nano timestamp types to the corresponding parseNanos / parseWithoutTimeZoneNanos formatter methods in the schema-directed path, and delegates to castTo in the type-coercion path.
  • Schema inference: XmlInferSchema now infers TimestampNTZNanosType(9) when a field value carries sub-microsecond fractional seconds (>6 digits) and TIMESTAMP_NANOS_TYPES_ENABLED is on.
    The compatibleType widening function is extended to merge two nano timestamp types (taking the higher precision), downgrade TimestampNTZNanosType + TimestampNTZType to TimestampNTZType, and fall back to TimestampType for any other nano/non-nano combination. The StructType and ArrayType recursive cases are moved into the pre-TypeCoercion block so that nested fields also benefit from the nano-widening logic.

Why are the changes needed?

xml rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through xml.

Does this PR introduce any user-facing change?

Yes. With spark.sql.timestampNanosTypes.enabled set to true:

  • XML files can now be written and read back with TimestampLTZNanosType and TimestampNTZNanosType columns without error.
  • Schema inference promotes a timestamp field to TimestampNTZNanosType(9) when its string value contains more than 6 fractional-second digits .

How was this patch tested?

XmlSuite: Added two new inference tests — one verifying that a 9-digit NTZ timestamp string is inferred as TimestampNTZNanosType(9), and one verifying that a mix of micro-precision and nano-precision NTZ rows in the same file degrades to TimestampNTZType
Updated FileBasedDataSourceSuite and XmlFunctionsSuite

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude (Sonnet 4.6)

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 2 non-blocking, 0 nits.
Clean extension that mirrors the CSV nanos support. One blocking item on the NTZ-nano read path; the rest is an over-broad inference catch-all and the matching test gaps.

Correctness (2)

  • StaxXmlParser.scala:606: NTZ-nano read uses the 2-arg parseWithoutTimeZoneNanos (allowTimeZone=true) and silently accepts+drops a zone component, unlike the micro NTZ path, inference, and CSV (blocking) — see inline.
  • XmlInferSchema.scala:667: the AnyTimestampNanoType catch-all matches nano paired with any type, so nano + non-datetime infers as TimestampType instead of the StringType fallback the micro path uses (non-blocking) — see inline.

Suggestions (1)

  • Test coverage: none of the new tests exercise (a) NTZ-nano zone rejection, (b) nano + non-timestamp inference, or (c) the actual round-tripped value in the from_xml test (it now asserts only non-null). Worth a case for (a)/(b) and tightening (c).

Verification

Traced the compatibleType restructuring (moving the StructType/ArrayType recursive cases ahead of findTightestCommonType): equivalence-preserving for the inputs it actually sees. compatibleType is private[xml] and reached only from inference, whose structs are built via a name-sorted TreeMap (addOrUpdateType), so the new groupBy+sortBy(name) merge yields the same result as the old findTightestCommonType/findTypeForComplex path, and element/field recursion falls through to findTightestCommonType for non-nano types. The motive is valid: AnyTimestampNanoType extends DatetimeType, so findTightestCommonType would route a nested nano pair to findWiderDateTimeType, which has no nano case and would MatchError. The write path, LTZ-nano read, and supportDataType gate change all match the CSV analogue.

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala Outdated

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 addressed, 0 remaining, 1 new.
Thanks for the fixes — all three points from the last round are resolved and re-verified:

  • StaxXmlParser.scala:606: NTZ-nano now passes allowTimeZone=false (rejects zoned input, matching the micro NTZ path and CSV). ✓
  • XmlInferSchema.scala:667: catch-all narrowed to DatetimeType, so nano + non-datetime falls back to StringType — and no findWiderDateTimeType MatchError is reintroduced. ✓
  • Tests: from_xml now asserts the exact truncated round-trip value (UTC-pinned), plus new zone-rejection and nano-vs-non-datetime inference tests. ✓

Suggestions (1)

  • One coverage-parity gap vs the sibling JSON PR (#56865): there's no LEGACY time-parser-policy rejection test for nanos. The behavior is already correct — the shared TimestampFormatter (parseNanos / parseWithoutTimeZoneNanos / formatNanos) raises UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER under the legacy policy, and XML uses it unchanged — so this is non-blocking. A test asserting it (write + read), mirroring JSON's "...rejects nanosecond timestamps under the LEGACY time parser policy", would lock it in. (XML is v1-only, so JSON's v2 JsonTable gate has no XML analogue to add.)

@vinodkc

vinodkc commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Comment 3 (test coverage): from_xml test only checks non-null, not actual value
Fixed.

Rewrote the test to compute the expected LocalDateTime/Instant per precision using nanoOfSecTruncator and replaced the != null check with checkAnswer.

@vinodkc

vinodkc commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

For the new coverage gap:
added SPARK-57458: XML rejects nanosecond timestamps under the LEGACY time parser policy in FileBasedDataSourceSuite, covering both the write and read paths with TimestampLTZNanosType(9) under LEGACY_TIME_PARSER_POLICY = LEGACY.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 addressed, 0 remaining, 0 new.
Thanks @vinodkc — the one open item from the last round is resolved and re-verified, so this LGTM.

The suggested LEGACY-time-parser nanos rejection test is added in FileBasedDataSourceSuite, mirroring the JSON #56865 sibling (same suite, same UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER condition + config param, both write and read; read uses FAILFAST so the error surfaces instead of being swallowed as a permissive null). The three round-1 fixes (NTZ-nano allowTimeZone=false, inference catch-all narrowed to DatetimeType, exact-value from_xml round-trip) remain in place.

Verification

  • Test scope is correct (LTZ only): TimestampFormatter.getFormatter selects the legacy formatter only when legacyTimeParserPolicy == LEGACY && !forTimestampNTZ, so NTZ-nanos always takes the ISO-8601 path and never hits the legacy-rejection path — there is no untested NTZ rejection to add, and the test comment's rationale is accurate.
  • XML is v1-only, so JSON's v2 JsonTable gate has no XML analogue — correctly noted.

No new findings.

@MaxGekk

MaxGekk commented Jun 30, 2026

Copy link
Copy Markdown
Member

+1, LGTM. Merging to master/4.x.
Thank you, @vinodkc.

@MaxGekk MaxGekk closed this in fe6bc95 Jun 30, 2026
MaxGekk pushed a commit that referenced this pull request Jun 30, 2026
…e XML datasource

### What changes were proposed in this pull request?

This PR adds support for nanosecond-precision timestamp types (`TimestampLTZNanosType` and `TimestampNTZNanosType`) in the XML datasource, covering:

- Writing: `StaxXmlGenerator` handles the two nano timestamp types by calling the nanosecond-aware formatter methods (formatNanos, formatWithoutTimeZoneNanos).
- Reading: `StaxXmlParser` routes nano timestamp types to the corresponding `parseNanos` / `parseWithoutTimeZoneNanos` formatter methods in the schema-directed path, and delegates to castTo in the type-coercion path.
- Schema inference: `XmlInferSchema` now infers `TimestampNTZNanosType(9)` when a field value carries sub-microsecond fractional seconds (>6 digits) and `TIMESTAMP_NANOS_TYPES_ENABLED` is on.
The compatibleType widening function is extended to merge two nano timestamp types (taking the higher precision), downgrade `TimestampNTZNanosType` + `TimestampNTZType` to `TimestampNTZType`, and fall back to `TimestampType` for any other nano/non-nano combination. The `StructType` and `ArrayType` recursive cases are moved into the pre-TypeCoercion block so that nested fields also benefit from the nano-widening logic.

### Why are the changes needed?

`xml` rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through `xml`.

### Does this PR introduce _any_ user-facing change?

Yes. With `spark.sql.timestampNanosTypes.enabled` set to `true`:

- XML files can now be written and read back with `TimestampLTZNanosType` and `TimestampNTZNanosType` columns without error.
- Schema inference promotes a timestamp field to `TimestampNTZNanosType(9)` when its string value contains more than 6 fractional-second digits .

### How was this patch tested?

 `XmlSuite`: Added two new inference tests — one verifying that a 9-digit NTZ timestamp string is inferred as `TimestampNTZNanosType(9)`, and one verifying that a mix of micro-precision and nano-precision NTZ rows in the same file degrades to `TimestampNTZType`
Updated `FileBasedDataSourceSuite`  and `XmlFunctionsSuite`

### Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude (Sonnet 4.6)

Closes #56854 from vinodkc/SPARK-57458.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit fe6bc95)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants