Fix timezone-shifted TIMESTAMP in nested complex types (ES-1978662) #1492

Open
sreekanth-db wants to merge 1 commit into databricks:main from sreekanth-db:fix/es-1978662-nested-timestamp-tz
@sreekanth-db

Collaborator

Problem

With EnableComplexDatatypeSupport=1, retrieving a TIMESTAMP field nested inside a complex type (STRUCT/ARRAY/MAP) returned a timezone-shifted value, while scalar TIMESTAMP retrieval was correct.

For example, on a JVM with a UTC-5 default timezone, a scalar timestamp returned 2017-03-26 01:01:02.345 (correct), but the same value inside a STRUCT returned 2017-03-25 20:01:02.345 (a -5h shift). Both getString() and getObject() were affected.

Root cause

Arrow serializes nested TIMESTAMP fields as epoch microseconds. ComplexDataTypeParser.convertPrimitive() built the value with Timestamp.from(instant), which anchors an absolute instant. getString()/getObject() then re-render it in the JVM default timezone, producing the offset. The scalar path (ArrowToJavaObjectConverter.convertToTimestamp) avoids this by rebuilding a LocalDateTime and using Timestamp.valueOf(...), so the JVM zone cancels out on render.
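
The difference between the two constructions can be reproduced outside the driver. The following is a minimal sketch, not the driver's code: the class and method names are hypothetical, and only standard `java.sql` / `java.time` calls are used. It pins the JVM default zone to America/Bogota and renders the same instant both ways.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.TimeZone;

// Hypothetical demo class; it only illustrates the two construction styles.
final class NestedTimestampShiftDemo {

    // Style used by the old nested path: anchors the absolute instant,
    // so toString() renders it in the JVM default zone.
    static Timestamp viaFrom(Instant instant) {
        return Timestamp.from(instant);
    }

    // Style used by the scalar path: takes the UTC wall-clock fields and
    // re-anchors them in the JVM default zone, so the zone cancels on render.
    static Timestamp viaValueOf(Instant instant) {
        return Timestamp.valueOf(LocalDateTime.ofInstant(instant, ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        TimeZone saved = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone("America/Bogota")); // UTC-5
            Instant instant = Instant.parse("2017-03-26T01:01:02.345Z");
            System.out.println(viaFrom(instant));    // 2017-03-25 20:01:02.345
            System.out.println(viaValueOf(instant)); // 2017-03-26 01:01:02.345
        } finally {
            TimeZone.setDefault(saved); // do not leak the zone change
        }
    }
}
```

Both `Timestamp` objects are valid. They differ only in which absolute instant they hold, which is why the discrepancy shows up at render time rather than at parse time.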

Fix

Build the Timestamp from the UTC wall-clock (Timestamp.valueOf(LocalDateTime.ofInstant(instant, ZoneOffset.UTC))), mirroring the scalar path. Since convertPrimitive() is shared by struct/array/map parsing, this fixes nested timestamps in all three.

This is a no-op for UTC JVMs (where Timestamp.from already produced the correct result), so there is no regression for the previously-correct case.
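
As a rough illustration of the fix and of the UTC no-op claim, the sketch below converts an epoch-microsecond value to a `Timestamp` through the UTC wall-clock. The helper names (`instantFromEpochMicros`, `timestampFromEpochMicros`) are assumptions made for this example and are not taken from `ComplexDataTypeParser`. The literal used is 2017-03-26T01:01:02.345Z expressed in microseconds.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.TimeZone;

// Hypothetical helper class sketching the conversion described above.
final class EpochMicrosToTimestamp {

    // Split microseconds into whole seconds and a nanosecond remainder.
    // floorDiv/floorMod keep pre-1970 (negative) values correct.
    static Instant instantFromEpochMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long nanos = Math.floorMod(micros, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Build the Timestamp from the UTC wall-clock, as in the fix.
    static Timestamp timestampFromEpochMicros(long micros) {
        Instant instant = instantFromEpochMicros(micros);
        return Timestamp.valueOf(LocalDateTime.ofInstant(instant, ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        long micros = 1_490_490_062_345_000L; // 2017-03-26T01:01:02.345Z
        TimeZone saved = TimeZone.getDefault();
        try {
            // On a UTC JVM both constructions hold the same instant.
            TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
            Timestamp fixed = timestampFromEpochMicros(micros);
            Timestamp old = Timestamp.from(instantFromEpochMicros(micros));
            System.out.println(fixed.equals(old)); // true
        } finally {
            TimeZone.setDefault(saved);
        }
    }
}
```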

Tests

Added unit tests for the STRUCT and ARRAY paths that force a non-UTC JVM default timezone (America/Bogota, UTC-5, no DST) and assert no shift. Verified they fail against the old code (expected: <2017-03-26 01:01:02.345> but was: <2017-03-25 20:01:02.345>) and pass with the fix.
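
For readers who want to reproduce the test setup without the driver, here is a framework-free sketch of the zone-pinning pattern. It is not the PR's test code: `withDefaultZone`, `hasNoRecurringDst`, and `renderFixed` are hypothetical names, and `renderFixed` is only a stand-in for the fixed nested conversion. Rendering happens inside the pinned scope because `Timestamp.toString()` reads the default zone at call time.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.TimeZone;
import java.util.function.Supplier;

// Hypothetical sketch of a zone-pinned check; not the PR's actual tests.
final class ZonePinnedTimestampCheck {

    // Run body with the given JVM default zone, then always restore it.
    static <T> T withDefaultZone(String zoneId, Supplier<T> body) {
        TimeZone saved = TimeZone.getDefault();
        TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
        try {
            return body.get();
        } finally {
            TimeZone.setDefault(saved);
        }
    }

    // True when the zone has no recurring DST transition rules.
    static boolean hasNoRecurringDst(String zoneId) {
        return ZoneId.of(zoneId).getRules().getTransitionRules().isEmpty();
    }

    // Stand-in for the fixed nested conversion (UTC wall-clock -> Timestamp).
    static String renderFixed(Instant instant) {
        return Timestamp.valueOf(LocalDateTime.ofInstant(instant, ZoneOffset.UTC)).toString();
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2017-03-26T01:01:02.345Z");
        String rendered = withDefaultZone("America/Bogota", () -> renderFixed(instant));
        if (!"2017-03-26 01:01:02.345".equals(rendered)) {
            throw new AssertionError("unexpected shift: " + rendered);
        }
        System.out.println("no shift: " + rendered);
        System.out.println("no recurring DST: " + hasNoRecurringDst("America/Bogota"));
    }
}
```

Restoring the default zone in a `finally` block matters because `TimeZone.setDefault` is JVM-wide and would otherwise affect other tests in the same process. Choosing a zone without DST keeps the expected string stable regardless of the date under test.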

This pull request and its description were written by Isaac.

TIMESTAMP fields inside nested complex types (STRUCT/ARRAY/MAP) are
serialized by Arrow as epoch microseconds. ComplexDataTypeParser built
the value via Timestamp.from(instant), which anchors an absolute instant
that getString()/getObject() then re-render in the JVM default timezone,
producing a spurious offset (e.g. a -5h shift) for nested timestamps
while scalar TIMESTAMP retrieval was unaffected.

Build the Timestamp from the UTC wall-clock instead, mirroring the scalar
conversion path (ArrowToJavaObjectConverter.convertToTimestamp), so the
JVM zone cancels out on render. This also fixes nested timestamps in
ARRAY and MAP, which share the same parsing path.

Adds unit tests (STRUCT and ARRAY) that force a non-UTC JVM zone and
assert no shift.

Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>