[SPARK-57160][CONNECT] Add Spark Connect protocol support for nanosecond-capable timestamp types and literals #56909
Open
MaxGekk wants to merge 1 commit into
…ond-capable timestamp types and literals Extend the Spark Connect protobuf protocol to represent TimestampNTZNanos(p) and TimestampLTZNanos(p) (p in [7, 9]) both as data types and as literal values, and regenerate the Python stubs.
yadavay-amzn
approved these changes
Jun 30, 2026
yadavay-amzn
left a comment
Contributor
LGTM with a small nit
Clean, well-scoped protocol addition - field numbers are allocated correctly and the encoding faithfully mirrors the Catalyst value type.
Nit: the DataType TimestampNTZNanos/TimestampLTZNanos messages don't state the omitted-precision default, while the literal arms do ("defaults to 9"). Since the type defaults to 9 too (TimestampNTZNanosType.apply()), mirroring that one-liner would keep them symmetric.
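To make the omitted-precision rule concrete, here is a minimal sketch of how a consumer of these messages might resolve the optional precision field. The constant names follow the ones quoted in this review (MIN_PRECISION, MAX_PRECISION, DEFAULT_PRECISION); the function `resolve_precision` is hypothetical and is not part of this PR or of Spark.

```python
from typing import Optional

# Values as stated in this PR: precision is in [7, 9] and defaults to 9.
MIN_PRECISION = 7
MAX_PRECISION = 9
DEFAULT_PRECISION = 9


def resolve_precision(precision: Optional[int] = None) -> int:
    """Return the effective precision for an optional proto field.

    Hypothetical helper: None models an unset optional field.
    """
    if precision is None:
        # An omitted precision falls back to the default.
        return DEFAULT_PRECISION
    if not MIN_PRECISION <= precision <= MAX_PRECISION:
        raise ValueError(
            f"precision must be in [{MIN_PRECISION}, {MAX_PRECISION}], got {precision}"
        )
    return precision
```

Applying the same fallback to both the data-type message and the literal arm is what keeps the two symmetric.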
What I verified:
- Field numbers append past the literal oneof's reserved 27, 28 (Geometry/Geography) and the DataType oneof's used range: additive and wire-compatible, no reuse or renumber.
- epoch_micros + nanos_within_micro in [0, 999] matches Catalyst TimestampNanosVal (epochMicros, MAX_NANOS_WITHIN_MICRO = 999); the two-component encoding matches why a single int64 of nanos can't span the year range.
- Precision 7/8/9 and default 9 match Timestamp{NTZ,LTZ}NanosType (MIN_PRECISION/MAX_PRECISION/DEFAULT_PRECISION).
- NTZ and LTZ as separate kinds/arms is consistent with the existing timestamp vs timestamp_ntz split.
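The range argument behind the two-component encoding can be checked with a short sketch. It assumes floor-division semantics for pre-epoch instants (so the remainder always stays in [0, 999]); the helper names `split_epoch_nanos`, `join_epoch_nanos`, and `fits_int64` are illustrative only and do not come from Spark.

```python
from datetime import datetime, timedelta
from typing import Tuple

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
MAX_NANOS_WITHIN_MICRO = 999
EPOCH = datetime(1970, 1, 1)


def split_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split nanoseconds since the epoch into (epoch_micros, nanos_within_micro)."""
    # divmod floors, so the remainder is in [0, 999] even before the epoch.
    return divmod(epoch_nanos, 1000)


def join_epoch_nanos(epoch_micros: int, nanos_within_micro: int) -> int:
    """Recombine the two components into nanoseconds since the epoch."""
    if not 0 <= nanos_within_micro <= MAX_NANOS_WITHIN_MICRO:
        raise ValueError("nanos_within_micro must be in [0, 999]")
    return epoch_micros * 1000 + nanos_within_micro


def fits_int64(value: int) -> bool:
    return INT64_MIN <= value <= INT64_MAX


# Microseconds since the epoch at both ends of the 0001..9999 year range.
one_micro = timedelta(microseconds=1)
min_micros = (datetime(1, 1, 1) - EPOCH) // one_micro
max_micros = (datetime(9999, 12, 31, 23, 59, 59, 999999) - EPOCH) // one_micro

# Microseconds fit in int64 at both ends; the same instants in nanoseconds do not.
print(fits_int64(min_micros), fits_int64(max_micros))
print(fits_int64(min_micros * 1000), fits_int64(max_micros * 1000 + 999))
```

A single int64 of nanoseconds only reaches about 292 years on either side of 1970, which is why the value is carried as an int64 of microseconds plus a small non-negative remainder.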
What changes were proposed in this pull request?
This PR adds the Spark Connect protocol surface for nanosecond timestamps so they can travel over the wire, both as types and as literals. There is no behavior change yet -- the converters that consume these messages land in follow-up sub-tasks of SPARK-56822.
- types.proto: two new data-type kinds, TimestampNTZNanos and TimestampLTZNanos, each with an optional precision (7..9).
- expressions.proto: matching literal arms that carry the value as epoch_micros + nanos_within_micro (0..999) plus an optional precision. Two components are used instead of a single int64 of nanoseconds because nanoseconds-since-epoch cannot cover the full 0001..9999 year range; this mirrors the Catalyst value TimestampNanosVal.
- Regenerated Python stubs under python/pyspark/sql/connect/proto/.
NTZ and LTZ are kept as separate kinds/arms (like timestamp vs timestamp_ntz), and non-negative fields use uint32.
Why are the changes needed?
Today the Connect DataType message has only microsecond timestamp kinds (timestamp, timestamp_ntz) with no precision field, and the Expression.Literal message encodes timestamp literals as a single int64 of microseconds. There is no way to express a nanosecond-capable timestamp type or a sub-microsecond literal over the wire, so no Connect client/server path can carry the new types. The protocol must be extended before any converter, Arrow, or client work can proceed.
Does this PR introduce any user-facing change?
No. This only adds protobuf message definitions; the new types remain gated behind spark.sql.timestampNanosTypes.enabled once the consuming paths are implemented.
How was this patch tested?
- buf build / buf lint succeed for the modified protos (field numbers appended, no reuse/renumber).
- ./dev/connect-gen-protos.sh regenerates the committed Python stubs; ./dev/check-protos.py reports no drift (pyspark-connect and pyspark-streaming: SUCCESS).
- build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite" (44 tests) and build/sbt "connect-client-jvm/testOnly *ColumnNodeToProtoConverterSuite" (18 tests) pass, confirming the additive proto fields do not break existing proto plumbing.
No functional tests in this PR (there are no consumers of the new fields yet); behavior is covered by the converter and end-to-end sub-tasks.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)