[SPARK-57456][SQL] Support nanosecond-precision timestamp types in the JSON datasource (v1 and v2)#56865

Closed
MaxGekk wants to merge 2 commits into apache:master from MaxGekk:nanos-json-ds

@MaxGekk MaxGekk commented Jun 29, 2026

Member

What changes were proposed in this pull request?

Umbrella: SPARK-56822 (Timestamps with nanosecond precision).

This PR adds read and write support for the nanosecond-capable timestamp types TIMESTAMP_NTZ(p) and TIMESTAMP_LTZ(p) (p in 7-9) to the JSON datasource, for both the v1 (JsonFileFormat) and v2 (JsonTable) paths, reaching parity with the microsecond TimestampType / TimestampNTZType, and removes the SPARK-57166 rejection guardrail.

Specifically:

  • JacksonParser: adds TimestampLTZNanosType / TimestampNTZNanosType read cases that delegate to the existing parseNanos / parseWithoutTimeZoneNanos formatter methods with the column precision.
  • JacksonGenerator: adds the corresponding write cases that delegate to formatNanos / formatWithoutTimeZoneNanos.
  • JsonFileFormat (v1) and JsonTable (v2): drop the AnyTimestampNanoType rejection in supportDataType / supportsDataType.

Notes:

  • Schema inference (JsonInferSchema) keeps inferring microsecond TimestampType / TimestampNTZType by default; nanosecond types are reached only via an explicit user schema.
  • No new options: the existing timestampFormat / timestampNTZFormat options drive the nanos path. The column type carries the precision, and the count of S letters in the pattern controls how many fractional-second digits are emitted on write (text output needs up to 9 S for full precision; reads with the default formatter parse the full fraction and truncate to the declared precision).
  • The legacy time parser policy rejects nanos: the legacy LTZ formatter cannot represent sub-microsecond digits, so it raises UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER (the NTZ formatter always uses the ISO-8601 path).
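
The note about the count of S letters can be illustrated outside Spark with plain `java.time`, whose `DateTimeFormatter` patterns treat `S` as fraction-of-second. This is a minimal sketch, not the PR's code: the class name `FractionDigitsSketch` and the helper `format` are hypothetical, and the pattern prefix is chosen only for illustration. It shows that a pattern with fewer than nine `S` letters drops the low-order digits on output.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration class; not part of Spark.
class FractionDigitsSketch {
    // Formats `ts` with `sCount` fraction-of-second letters (1 to 9) in the pattern.
    static String format(LocalDateTime ts, int sCount) {
        StringBuilder pattern = new StringBuilder("yyyy-MM-dd'T'HH:mm:ss.");
        for (int i = 0; i < sCount; i++) {
            pattern.append('S');
        }
        return DateTimeFormatter.ofPattern(pattern.toString()).format(ts);
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2026, 6, 30, 12, 38, 0, 123456789);
        // Six S letters keep only microsecond digits in the text.
        System.out.println(format(ts, 6));
        // Nine S letters are needed to emit every nanosecond digit.
        System.out.println(format(ts, 9));
    }
}
```

In this sketch the value itself always carries nine fractional digits; only the pattern decides how many of them reach the text, which mirrors the note that full precision on write needs up to 9 S.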

Why are the changes needed?

JSON rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through JSON. This extends nanosecond-precision timestamp support (umbrella SPARK-56822) to the JSON datasource, matching the existing microsecond timestamp behavior and the Parquet/ORC/Avro/CSV nanosecond support.

Does this PR introduce any user-facing change?

Yes. With spark.sql.timestampNanosTypes.enabled=true, columns of type TIMESTAMP_NTZ(7-9) / TIMESTAMP_LTZ(7-9) can now be written to and read from JSON files, and parsed/generated by from_json / to_json. Previously such columns were rejected with UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE. This is a change within the unreleased master/branch only.

How was this patch tested?

  • JsonExpressionsSuite: JsonToStructs nanosecond parsing at the catalyst expression level.
  • JsonFunctionsSuite: flipped the existing from_json nanosecond test to assert successful parsing and the truncated value (instead of an unsupported-type error); added to_json and to_json / from_json round-trip tests.
  • FileBasedDataSourceSuite: removed JSON from the SPARK-57166 rejection list; added end-to-end round-trip (precisions 7-9, NTZ and LTZ, v1 and v2), a nested struct/array/map round-trip, and a LEGACY time-parser-policy rejection test (write and read).
  • JsonSuite: DataFrameReader.json(Dataset[String]) read, a custom-schema file round-trip, and a mixed microsecond/nanosecond schema round-trip; these run under the JsonV1Suite, JsonV2Suite, JsonLegacyTimeParserSuite, and JsonUnsafeRowSuite variants.
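
The truncation and round-trip behaviour that these tests assert can be sketched with plain `java.time`. This is an illustration under stated assumptions rather than the suites' actual code: the class `TruncateToPrecisionSketch` and its methods are hypothetical, and the ISO-8601 formatter stands in for whatever formatter the datasource is configured with. The sketch parses the full fraction and then keeps only the digits allowed by the declared precision.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration class; not part of Spark.
class TruncateToPrecisionSketch {
    // Keeps the first `precision` fractional-second digits (0 to 9) and zeroes the rest.
    static LocalDateTime truncate(LocalDateTime ts, int precision) {
        int unit = 1;
        for (int i = precision; i < 9; i++) {
            unit *= 10;
        }
        int nanos = ts.getNano();
        return ts.withNano(nanos - nanos % unit);
    }

    // Parses the full fraction first, then truncates to the declared precision.
    static LocalDateTime parse(String text, int precision) {
        LocalDateTime full = LocalDateTime.parse(text, DateTimeFormatter.ISO_LOCAL_DATE_TIME);
        return truncate(full, precision);
    }

    public static void main(String[] args) {
        String text = "2026-06-30T12:38:00.123456789";
        for (int p = 7; p <= 9; p++) {
            LocalDateTime value = parse(text, p);
            // Writing the truncated value and reading it back yields the same value.
            String written = DateTimeFormatter.ISO_LOCAL_DATE_TIME.format(value);
            System.out.println(p + " -> " + value.getNano() + ", round trip ok: "
                + parse(written, p).equals(value));
        }
    }
}
```

Truncation rather than rounding is used here because the description above states that reads truncate to the declared precision.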

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 2.1, Claude Opus 4.8

@dongjoon-hyun dongjoon-hyun left a comment

Member

+1, LGTM.

@uros-b uros-b left a comment

Member

LGTM, thank you @MaxGekk and @dongjoon-hyun!

MaxGekk added 2 commits June 30, 2026 12:38
…e JSON datasource (v1 and v2)

Umbrella: SPARK-56822 (Timestamps with nanosecond precision).

This PR adds read and write support for the nanosecond-capable timestamp types `TIMESTAMP_NTZ(p)` and `TIMESTAMP_LTZ(p)` (`p` in 7-9) to the JSON datasource (v1 `JsonFileFormat` and v2 `JsonTable`), reaching parity with the microsecond `TimestampType` / `TimestampNTZType`, and removes the SPARK-57166 rejection guardrail.

- `JacksonParser`: adds `TimestampLTZNanosType` / `TimestampNTZNanosType` read cases that delegate to `parseNanos` / `parseWithoutTimeZoneNanos` with the column precision.
- `JacksonGenerator`: adds the corresponding write cases that delegate to `formatNanos` / `formatWithoutTimeZoneNanos`.
- `JsonFileFormat` (v1) and `JsonTable` (v2): drop the `AnyTimestampNanoType` rejection so the types are accepted by the read/write paths.

Schema inference (`JsonInferSchema`) keeps inferring microsecond types by default; nanos are reached only via an explicit user schema. The existing `timestampFormat` / `timestampNTZFormat` options drive the nanos path (no new options): the type carries the precision and the count of `S` letters in the pattern controls the fractional digits emitted on write. Under the LEGACY time parser policy the legacy LTZ formatter cannot represent sub-microsecond digits, so nanos are rejected with `TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER`.

JSON rejected nanos timestamp types in its datasource capability checks and lacked the conversions to round-trip them, so these columns could not be written or read through JSON.

Yes. With `spark.sql.timestampNanosTypes.enabled=true`, columns of type `TIMESTAMP_NTZ(7-9)` / `TIMESTAMP_LTZ(7-9)` can now be written to and read from JSON, and parsed/generated by `from_json` / `to_json`. This is a change within the unreleased master/branch only.

- `JsonExpressionsSuite`: `JsonToStructs` nanos parsing.
- `JsonFunctionsSuite`: flipped the existing `from_json` nanos test to assert success + truncation; added `to_json` and `to_json` / `from_json` round-trip tests.
- `FileBasedDataSourceSuite`: removed JSON from the SPARK-57166 rejection list; added v1/v2 round-trip, nested struct/map/array round-trip, and a LEGACY-policy rejection test.
- `JsonSuite`: `DataFrameReader.json(Dataset[String])` read, custom-schema round-trip, and a mixed microsecond/nanosecond schema round-trip (run under v1, v2, legacy, and unsafe-row variants).

Generated-by: Cursor 2.1, Claude Opus 4.8
… accept only string input

The numeric-epoch shorthand (a JSON integer parsed as epoch seconds) is legacy
TimestampType behavior and is intentionally not carried over to the nanosecond
timestamp types, which accept only string input.

Co-authored-by: Isaac
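
The string-only rule from the second commit can be pictured with a small sketch. Everything in it is an assumption made for illustration: `StringOnlyInputSketch` and `parseNanosValue` are hypothetical names, the `Object` argument stands in for an already-decoded JSON token, and the exception is an arbitrary way to signal rejection (this page does not show how the real parser reports it).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration class; not part of Spark.
class StringOnlyInputSketch {
    // `token` stands in for a decoded JSON value: a String or a Number.
    static LocalDateTime parseNanosValue(Object token) {
        if (token instanceof String) {
            // String input is parsed as timestamp text.
            return LocalDateTime.parse((String) token, DateTimeFormatter.ISO_LOCAL_DATE_TIME);
        }
        // Numeric input is not read as epoch seconds for nanosecond types in this sketch.
        throw new IllegalArgumentException(
            "nanosecond timestamp values must be strings, got: " + token);
    }

    public static void main(String[] args) {
        System.out.println(parseNanosValue("2026-06-30T12:38:00.123456789"));
        try {
            parseNanosValue(Long.valueOf(1782823080L));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```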
MaxGekk commented Jun 30, 2026

Member Author

Merging to master/4.x. Thank you, @dongjoon-hyun and @uros-b, for the review.

@MaxGekk MaxGekk closed this in 59fdb3e Jun 30, 2026
MaxGekk added a commit that referenced this pull request Jun 30, 2026
…e JSON datasource (v1 and v2)

Closes #56865 from MaxGekk/nanos-json-ds.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 59fdb3e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
3 participants