
[SPARK-57160][CONNECT] Add Spark Connect protocol support for nanosecond-capable timestamp types and literals#56909

Open
MaxGekk wants to merge 1 commit into apache:master from MaxGekk:nanos-proto


MaxGekk commented Jun 30, 2026

Member

What changes were proposed in this pull request?

This PR adds the Spark Connect protocol surface for nanosecond-capable timestamps so they can travel over the wire, both as types and as literals. There is no behavior change yet: the converters that consume these messages land in follow-up sub-tasks of SPARK-56822.

  • types.proto: two new data-type kinds, TimestampNTZNanos and TimestampLTZNanos, each with an optional precision (7..9).
  • expressions.proto: matching literal arms that carry the value as epoch_micros + nanos_within_micro (0..999) plus an optional precision. Two components are used instead of a single int64 of nanoseconds because an int64 count of nanoseconds since the epoch spans only about 292 years on either side of 1970, so it cannot cover the full 0001..9999 year range; this mirrors the Catalyst value TimestampNanosVal.
  • Regenerated the Python stubs under python/pyspark/sql/connect/proto/.

NTZ and LTZ are kept as separate kinds/arms (like timestamp vs timestamp_ntz), and non-negative fields use uint32.
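
To illustrate the two-component encoding described above, here is a minimal Python sketch. It is not code from this PR: the helper names (`fits_int64`, `micros_since_epoch`, `split_epoch_nanos`, `join_epoch_nanos`) are hypothetical, and it assumes floor semantics for the split so that `nanos_within_micro` stays in 0..999 for pre-epoch values as well.

```python
from datetime import datetime, timezone
from typing import Tuple

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fits_int64(value: int) -> bool:
    """Return True if the value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def micros_since_epoch(dt: datetime) -> int:
    """Exact integer microseconds between the Unix epoch and an aware datetime."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def split_epoch_nanos(epoch_nanos: int) -> Tuple[int, int]:
    """Split nanoseconds since epoch into (epoch_micros, nanos_within_micro).

    Python's divmod floors, so the remainder is always in 0..999 (assumed
    here to match the range the PR gives for nanos_within_micro).
    """
    return divmod(epoch_nanos, 1000)


def join_epoch_nanos(epoch_micros: int, nanos_within_micro: int) -> int:
    """Inverse of split_epoch_nanos, as an unbounded Python int."""
    if not 0 <= nanos_within_micro <= 999:
        raise ValueError("nanos_within_micro must be in 0..999")
    return epoch_micros * 1000 + nanos_within_micro


# Boundaries of the 0001..9999 year range, in microseconds since the epoch.
min_micros = micros_since_epoch(datetime(1, 1, 1, tzinfo=timezone.utc))
max_micros = micros_since_epoch(
    datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=timezone.utc)
)

# Microseconds fit in int64 at both boundaries; nanoseconds do not.
print(fits_int64(min_micros), fits_int64(max_micros))
print(fits_int64(min_micros * 1000), fits_int64(max_micros * 1000 + 999))
```

With floor semantics, one nanosecond before the epoch splits into `(-1, 999)` rather than `(0, -1)`, which keeps the second component non-negative and therefore compatible with a uint32 field.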

Why are the changes needed?

Today the Connect DataType message has only microsecond timestamp kinds (timestamp, timestamp_ntz) with no precision field, and the Expression.Literal message encodes timestamp literals as a single int64 of microseconds. There is no way to express a nanosecond-capable timestamp type or a sub-microsecond literal over the wire, so no Connect client/server path can carry the new types. The protocol must be extended before any converter, Arrow, or client work can proceed.

Does this PR introduce any user-facing change?

No. This only adds protobuf message definitions. The new types will stay gated behind spark.sql.timestampNanosTypes.enabled when the consuming paths are implemented.

How was this patch tested?

  • buf build / buf lint succeed for the modified protos (field numbers appended, no reuse/renumber).
  • ./dev/connect-gen-protos.sh regenerates the committed Python stubs; ./dev/check-protos.py reports no drift (pyspark-connect and pyspark-streaming: SUCCESS).
  • build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite" (44 tests) and build/sbt "connect-client-jvm/testOnly *ColumnNodeToProtoConverterSuite" (18 tests) pass, confirming the additive proto fields do not break existing proto plumbing.

No functional tests in this PR (there are no consumers of the new fields yet); behavior is covered by the converter and end-to-end sub-tasks.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)

…ond-capable timestamp types and literals

Extend the Spark Connect protobuf protocol to represent TimestampNTZNanos(p)
and TimestampLTZNanos(p) (p in [7, 9]) both as data types and as literal
values, and regenerate the Python stubs.

@yadavay-amzn yadavay-amzn left a comment

Contributor


LGTM with a small nit

Clean, well-scoped protocol addition - field numbers are allocated correctly and the encoding faithfully mirrors the Catalyst value type.

Nit: the DataType TimestampNTZNanos/TimestampLTZNanos messages don't state the omitted-precision default, while the literal arms do ("defaults to 9"). Since the type defaults to 9 too (TimestampNTZNanosType.apply()), mirroring that one-liner would keep them symmetric.
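
For readers following the nit, a rough sketch of how a consumer might resolve an omitted precision is shown below. This is an assumption about future converter behavior, not code from this PR: `resolve_precision` is a hypothetical helper, and the constants only reuse the names and values cited in this review (MIN_PRECISION, MAX_PRECISION, DEFAULT_PRECISION).

```python
from typing import Optional

# Values as stated in the review; the constant names echo the Catalyst ones.
MIN_PRECISION = 7
MAX_PRECISION = 9
DEFAULT_PRECISION = 9


def resolve_precision(precision: Optional[int]) -> int:
    """Return the effective precision for a nanosecond-capable timestamp type.

    None models an unset optional proto field and falls back to the default.
    Hypothetical helper: the real converters land in follow-up sub-tasks.
    """
    if precision is None:
        return DEFAULT_PRECISION
    if not MIN_PRECISION <= precision <= MAX_PRECISION:
        raise ValueError(
            f"precision must be in [{MIN_PRECISION}, {MAX_PRECISION}], got {precision}"
        )
    return precision


print(resolve_precision(None))  # falls back to DEFAULT_PRECISION
print(resolve_precision(7))     # explicit value is kept
```

Documenting the default in both the DataType messages and the literal arms would let one helper like this serve both paths.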

What I verified:

  • Field numbers append past the literal oneof's reserved 27, 28 (Geometry/Geography) and the DataType oneof's used range - additive and wire-compatible, no reuse or renumber.
  • epoch_micros + nanos_within_micro in [0, 999] matches Catalyst TimestampNanosVal (epochMicros, MAX_NANOS_WITHIN_MICRO = 999); the two-component encoding is consistent with the stated reason that a single int64 of nanos can't span the year range.
  • Precision 7/8/9 and default 9 match Timestamp{NTZ,LTZ}NanosType (MIN_PRECISION/MAX_PRECISION/DEFAULT_PRECISION).
  • NTZ and LTZ as separate kinds/arms is consistent with the existing timestamp vs timestamp_ntz split.
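
The "additive, no reuse or renumber" check from the first bullet can be sketched as a small Python function. This is only an illustration of the rule, not the tooling used in this review (buf performs the real checks). The field names and numbers in the example data are made up, except the reserved numbers 27 and 28 mentioned above.

```python
from typing import Dict, Iterable, List


def check_additive(
    old: Dict[str, int], new: Dict[str, int], reserved: Iterable[int] = ()
) -> List[str]:
    """Report changes that would break wire compatibility of a proto oneof.

    old and new map field names to field numbers. An empty result means the
    change only appends fields with fresh, unreserved numbers.
    """
    problems: List[str] = []
    reserved_set = set(reserved)
    old_numbers = set(old.values())

    # Existing fields must keep both their name and their number.
    for name, number in old.items():
        if name not in new:
            problems.append(f"{name}: removed")
        elif new[name] != number:
            problems.append(f"{name}: renumbered {number} -> {new[name]}")

    # Added fields must not take a number that is in use or reserved.
    for name, number in new.items():
        if name in old:
            continue
        if number in old_numbers:
            problems.append(f"{name}: reuses number {number}")
        if number in reserved_set:
            problems.append(f"{name}: uses reserved number {number}")
    return problems


# Illustrative data only: placeholder names and numbers, not the real protos.
old_fields = {"field_a": 1, "field_b": 2}
good_change = {"field_a": 1, "field_b": 2, "new_ntz": 29, "new_ltz": 30}
bad_change = {"field_a": 1, "field_b": 3, "new_ntz": 27}

print(check_additive(old_fields, good_change, reserved=(27, 28)))
print(check_additive(old_fields, bad_change, reserved=(27, 28)))
```

The first call returns an empty list, while the second reports the renumbered `field_b` and the reserved number taken by `new_ntz`.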
