| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | + |
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | + |
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +# Supported Spark Configurations |
| 21 | + |
| 22 | +This document tracks Spark SQL configurations that affect Comet's behavior. For each |
| 23 | +configuration we record which Comet expressions or operators are influenced, what |
| 24 | +verification has been performed, and any known gaps. |
| 25 | + |
| 26 | +## How to Read This Document |
| 27 | + |
| 28 | +The Status field of each entry takes one of these values: |
| 29 | + |
| 30 | +- **Supported** -- Comet runs the affected expressions natively under every value of |
| 31 | + the config, and produces results matching Spark. |
| 32 | +- **Partial** -- Comet runs natively for some values of the config but falls back to |
| 33 | + Spark for others, or runs natively but with documented incompatibilities. |
| 34 | +- **Falls back** -- Comet does not run the affected expressions natively under this |
| 35 | + config and always defers to Spark. |
| 36 | +- **Unaudited** -- the config's interaction with Comet has not yet been verified. |
| 37 | + |
| 38 | +## Audited Configurations |
| 39 | + |
| 40 | +- `spark.sql.legacy.timeParserPolicy` |
| 41 | + - Default: `EXCEPTION` (Spark 3.x), `CORRECTED` (Spark 4.0+) |
| 42 | + - Status: Falls back (see notes) |
| 43 | + - Affected expressions: `date_format`, `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp`, `to_timestamp_ntz`, `to_date`, `try_to_timestamp` (Spark 4+) |
| 44 | + - Spark versions checked: 3.4.3, 3.5.8, 4.0.1 |
| 45 | + - Date: 2026-05-02 |
| 46 | + |
| 47 | +## Audit Notes |
| 48 | + |
| 49 | +### `spark.sql.legacy.timeParserPolicy` |
| 50 | + |
| 51 | +**Source.** `SQLConf.LEGACY_TIME_PARSER_POLICY` selects the formatter used by |
| 52 | +`TimestampFormatter` and `DateFormatter`: |
| 53 | + |
| 54 | +- `LEGACY` -- `java.text.SimpleDateFormat` / `FastDateFormat`. Lenient parsing. |
| 55 | +- `CORRECTED` -- `java.time.DateTimeFormatter` via `Iso8601TimestampFormatter`. Strict. |
| 56 | +- `EXCEPTION` (default in Spark 3.x) -- same parser as `CORRECTED`, plus |
| 57 | + `DateTimeFormatterHelper.checkParsedDiff` raises `SparkUpgradeException` |
| 58 | + (`INCONSISTENT_BEHAVIOR_CROSS_VERSION`) when the new parser fails on input that the |
| 59 | + legacy parser would have accepted. Pattern validation also raises |
| 60 | + `SparkUpgradeException` when a pattern is recognized only by the legacy formatter |
| 61 | + (this check applies under both `CORRECTED` and `EXCEPTION`). |
| 62 | + |
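The three policies can be approximated with the plain JDK classes named above. The following is a minimal Java sketch, not Spark's code: the class `TimeParserPolicySketch`, its `Policy` enum, and the `outcome` helper are hypothetical names, and the sketch assumes non-ANSI behavior where a failed parse yields null. Spark wraps these JDK formatters with its own pattern handling, so real results can differ in detail.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration only; none of these names exist in Spark or Comet.
class TimeParserPolicySketch {
    enum Policy { LEGACY, CORRECTED, EXCEPTION }

    // Legacy-style parsing: SimpleDateFormat is lenient by default.
    static boolean legacyAccepts(String pattern, String input) {
        try {
            new SimpleDateFormat(pattern).parse(input);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // New-style parsing: DateTimeFormatter rejects malformed input.
    static boolean strictAccepts(String pattern, String input) {
        try {
            LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Simplified model of the policy: "value", "null", or "error".
    static String outcome(Policy policy, String pattern, String input) {
        if (policy == Policy.LEGACY) {
            return legacyAccepts(pattern, input) ? "value" : "null";
        }
        if (strictAccepts(pattern, input)) {
            return "value";
        }
        // EXCEPTION differs from CORRECTED only when the legacy parser
        // would have accepted input that the new parser rejected.
        if (policy == Policy.EXCEPTION && legacyAccepts(pattern, input)) {
            return "error";
        }
        return "null";
    }

    public static void main(String[] args) {
        String[] inputs = {"2020-01-05", "2020-1-5", "2020-13-45", "2020-01-05xyz"};
        for (String input : inputs) {
            for (Policy policy : Policy.values()) {
                System.out.println(input + " " + policy + " -> "
                    + outcome(policy, "yyyy-MM-dd", input));
            }
        }
    }
}
```

In this model, well-formed input such as `2020-01-05` yields a value under every policy, while single-digit fields, out-of-range values, and trailing characters are accepted only by the lenient parser and therefore diverge across the three policies.
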
| 63 | +**Affected expressions.** Determined by tracing `TimestampFormatterHelper`, |
| 64 | +`TimestampFormatter(...)`, and `DateFormatter(...)` usage in |
| 65 | +`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala` |
| 66 | +across Spark 3.4, 3.5, 4.0, and 4.1. Three expression classes mix in |
| 67 | +`TimestampFormatterHelper`: |
| 68 | + |
| 69 | +- `DateFormatClass` -- `date_format` |
| 70 | +- `FromUnixTime` -- `from_unixtime` |
| 71 | +- `ToTimestamp` (abstract) -- `UnixTimestamp` (`unix_timestamp`), |
| 72 | + `ToUnixTimestamp` (`to_unix_timestamp`), `GetTimestamp` (used by |
| 73 | + `ParseToTimestamp` for `to_timestamp` / `to_timestamp_ntz`, `ParseToDate` for |
| 74 | + `to_date`, and Spark 4's `try_to_timestamp`) |
| 75 | + |
| 76 | +`Cast` between strings and date / timestamp also reads the policy via the default |
| 77 | +formatters but is tested separately by `CometCastSuite` and is out of scope here. |
| 78 | + |
| 79 | +**Comet status.** None of the listed expressions consult `legacyTimeParserPolicy` in |
| 80 | +their Comet serde. The native implementations of `date_format`, `from_unixtime`, and |
| 81 | +`unix_timestamp` use a fixed strftime-style mapping that does not vary with policy; |
| 82 | +the remaining five (`to_unix_timestamp`, `to_timestamp`, `to_timestamp_ntz`, `to_date`, |
| 83 | +`try_to_timestamp`) have no native implementation and fall back to Spark. Ignoring |
| 84 | +the policy is safe today because: |
| 85 | + |
| 86 | +- `date_format` is `Compatible` only for a small whitelist of formats under UTC; the |
| 87 | + whitelisted formats happen to produce identical output under all three policies. |
| 88 | +- `from_unixtime` is marked `Incompatible` and falls back unless |
| 89 | + `spark.comet.expression.FromUnixTime.allowIncompatible=true` is set. |
| 90 | +- `unix_timestamp(<timestamp_or_date>)` does not call the formatter at all; the |
| 91 | + string-input overload falls back. |
| 92 | + |
| 93 | +If a Comet contributor adds native string-format parsing or extends the `date_format` |
| 94 | +whitelist, this audit must be revisited and the new code must honor the policy explicitly. |
| 95 | + |
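One way to picture "honoring the policy explicitly" is a support check that declines native execution whenever the policy could change the result. The Java sketch below is purely hypothetical: `PolicyAwareSupportSketch`, `fallbackReason`, and the rule it encodes are assumptions made for illustration and do not correspond to any existing Comet serde API. It assumes a future native parser that implements only the strict (`CORRECTED`) semantics.

```java
import java.util.Optional;

// Hypothetical guard; not part of Comet. An empty result means
// "safe to run natively", otherwise the string is the fallback reason.
class PolicyAwareSupportSketch {
    enum Policy { LEGACY, CORRECTED, EXCEPTION }

    static Optional<String> fallbackReason(Policy policy, boolean usesFormatPattern) {
        // Expressions that never touch a formatter are unaffected by the policy.
        if (!usesFormatPattern) {
            return Optional.empty();
        }
        switch (policy) {
            case CORRECTED:
                // Assumed: the native parser matches the strict formatter.
                return Optional.empty();
            case LEGACY:
                return Optional.of("lenient legacy parsing is not implemented natively");
            default:
                // EXCEPTION needs the legacy parser to detect divergent input.
                return Optional.of("EXCEPTION policy requires a legacy-parser comparison");
        }
    }

    public static void main(String[] args) {
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> "
                + fallbackReason(policy, true).orElse("run natively"));
        }
    }
}
```

The point of the sketch is only that the decision takes the policy as an explicit input, so a newly added native code path cannot silently ignore it.
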
| 96 | +**Test coverage.** Tests live under `spark/src/test/resources/sql-tests/expressions/datetime/`: |
| 97 | + |
| 98 | +- One ConfigMatrix file per expression covering convergent inputs under |
| 99 | + `LEGACY,CORRECTED,EXCEPTION` (`*_time_parser_policy.sql`). |
| 100 | +- Per-policy files locking in divergent behavior: |
| 101 | + - `_legacy.sql` -- lenient inputs (single-digit fields, out-of-range values, |
| 102 | + trailing characters) and legacy-only pattern tokens (`'aaaa'`). |
| 103 | + - `_corrected.sql` -- the same inputs return null; legacy-only tokens raise |
| 104 | + `INCONSISTENT_BEHAVIOR_CROSS_VERSION.DATETIME_PATTERN_RECOGNITION` at formatter |
| 105 | + creation. |
| 106 | + - `_exception.sql` -- the same inputs raise |
| 107 | + `INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER` at parse time. |
| 108 | + |
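The legacy-only token `'aaaa'` mentioned above can be reproduced with the JDK formatters alone. This is a hedged Java sketch with hypothetical names (`LegacyOnlyPatternSketch`, `isLegacyOnly`); Spark performs its own pattern validation and raises `SparkUpgradeException`, whereas the raw JDK classes used here signal the problem with `IllegalArgumentException`.

```java
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; Spark's real check lives in its own helpers.
class LegacyOnlyPatternSketch {
    // SimpleDateFormat validates the pattern in its constructor.
    static boolean legacyRecognizes(String pattern) {
        try {
            new SimpleDateFormat(pattern);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // DateTimeFormatter validates the pattern when the formatter is built.
    static boolean newRecognizes(String pattern) {
        try {
            DateTimeFormatter.ofPattern(pattern);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // A pattern is "legacy-only" when just the old formatter accepts it.
    static boolean isLegacyOnly(String pattern) {
        return legacyRecognizes(pattern) && !newRecognizes(pattern);
    }

    public static void main(String[] args) {
        for (String pattern : new String[] {"aaaa", "yyyy-MM-dd"}) {
            System.out.println(pattern + " legacy-only: " + isLegacyOnly(pattern));
        }
    }
}
```

Because the failure happens while the formatter is constructed, no input row is needed to trigger it, which matches the note that the error is raised at formatter creation.
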
| 109 | +**Findings.** All 42 generated test cases pass on Spark 3.4.3, 3.5.8, and 4.0.1. No |
| 110 | +Comet bugs were uncovered by the audit. The tests use `query spark_answer_only` so |
| 111 | +that result correctness is enforced regardless of whether Comet runs the expression |
| 112 | +natively or falls back. |