## Which issue does this PR close?
N/A
## Rationale for this change
`coerce_int96_to_resolution` currently produces `Timestamp(unit, None)`
for every INT96-derived column. Some downstream readers need the
resulting Arrow type to carry a timezone, because the *absence* of a
timezone is itself meaningful.
The motivating case is Apache DataFusion Comet (a Spark accelerator)
trying to enforce [SPARK-36182: pre-Spark-4 Spark rejects reading a
Parquet TimestampLTZ column as
TimestampNTZ](https://issues.apache.org/jira/browse/SPARK-36182).
Comet's schema adapter pattern-matches `Timestamp(_, Some(_)) ->
Timestamp(_, None)` to detect this case, but for INT96 columns the
post-coerce type is `Timestamp(unit, None)` — indistinguishable from a
true TimestampNTZ source. The LTZ signal is destroyed at the wrong
layer.
Spark and other systems write INT96 as UTC-adjusted instants, so a
caller can ask for the column to surface as `Timestamp(unit,
Some("UTC"))`, preserving the LTZ semantic at the Arrow level.
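
To illustrate the ambiguity, here is a minimal sketch of the kind of check described above. `TimeUnit` and `DataType` are simplified local stand-ins for the Arrow types, and `is_ltz_read_as_ntz` is a hypothetical helper, not Comet's actual code.

```rust
use std::sync::Arc;

// Simplified stand-ins for Arrow's `TimeUnit` and `DataType`;
// the real definitions live in the `arrow` crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<Arc<str>>),
}

/// Hypothetical check: returns true when a timezone-aware (LTZ) file
/// column would be read as a timezone-less (NTZ) column.
pub fn is_ltz_read_as_ntz(file_type: &DataType, requested: &DataType) -> bool {
    matches!(
        (file_type, requested),
        (DataType::Timestamp(_, Some(_)), DataType::Timestamp(_, None))
    )
}
```

With this model, an INT96 column coerced to `Timestamp(unit, None)` never trips the check, because it looks the same as a genuine NTZ column. The same column coerced to `Timestamp(unit, Some("UTC"))` does.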
## What changes are included in this PR?
- New `TableParquetOptions.global.coerce_int96_tz: Option<String>`
config field (defaults to `None`).
- `coerce_int96_to_resolution` gains a `timezone: Option<Arc<str>>`
parameter and threads it into the constructed `Timestamp` type.
- The new option is plumbed through `ParquetSource` -> `ParquetOpener` /
`ParquetMorselizer` -> `DFParquetMetadata`.
- `with_coerce_int96_tz` builder method on `DFParquetMetadata`.
- Default behavior is unchanged when the option is unset.
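
The shape of the change can be sketched roughly as follows. This is a simplified model, not the DataFusion code: `ParquetInt96Options` and `int96_target_type` are hypothetical names, `TimeUnit` and `DataType` are local stand-ins for the Arrow types, and the real `coerce_int96_to_resolution` operates on Parquet schemas rather than a single type.

```rust
use std::sync::Arc;

// Local stand-ins for Arrow's `TimeUnit` and `DataType::Timestamp`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Timestamp(TimeUnit, Option<Arc<str>>),
}

/// Hypothetical options holder; only the `coerce_int96_tz` field name
/// mirrors the PR. It defaults to `None`.
#[derive(Debug, Clone, Default)]
pub struct ParquetInt96Options {
    pub coerce_int96_tz: Option<String>,
}

impl ParquetInt96Options {
    /// Builder-style setter, analogous in spirit to `with_coerce_int96_tz`.
    pub fn with_coerce_int96_tz(mut self, tz: Option<String>) -> Self {
        self.coerce_int96_tz = tz;
        self
    }
}

/// Hypothetical helper: the Arrow type an INT96 column is coerced to.
/// The configured timezone, if any, is threaded into the `Timestamp`.
pub fn int96_target_type(unit: TimeUnit, opts: &ParquetInt96Options) -> DataType {
    let timezone: Option<Arc<str>> = opts.coerce_int96_tz.as_deref().map(Arc::<str>::from);
    DataType::Timestamp(unit, timezone)
}
```

In this sketch, leaving the option unset yields `Timestamp(unit, None)` as before, while setting it to `Some("UTC".to_string())` yields `Timestamp(unit, Some("UTC"))`.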
## Are these changes tested?
Yes, see apache/datafusion-comet#4357
## Are there any user-facing changes?
A new `coerce_int96_tz` config option. No change in behavior for the
default value.
---------
Co-authored-by: Oleks V <comphead@users.noreply.github.com>