[SPARK-57556][SQL] Raise a clear error for the TIME data type in Hive SerDe interop#56850
[SPARK-57556][SQL] Raise a clear error for the TIME data type in Hive SerDe interop#56850MaxGekk wants to merge 3 commits into
Conversation
… SerDe interop ### What changes were proposed in this pull request? Apache Hive has no TIME type, so `TimeType` has no faithful representation in Hive SerDe interop. This PR makes `TimeType` produce a clear `AnalysisException` instead of a `scala.MatchError` or an internal error when it reaches the `HiveInspectors` mapping functions, and rejects it in the Hive SerDe write path: - `HiveInspectors.toInspector(dataType)`, `toInspector(expr)` (TIME literal) and `toTypeInfo` now throw `UNSUPPORTED_DATATYPE` via a shared helper. - `HiveFileFormat.supportDataType` rejects `TimeType` (recursing into nested types) so Hive SerDe writes raise `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`. ### Why are the changes needed? `HiveInspectors` had no `TimeType` case, so object-inspector creation and TypeInfo mapping fell through to a `MatchError`/internal error when a TIME column or literal reached Hive SerDe paths (e.g. a Hive UDF argument). This makes the behavior explicit and documented, consistent with the existing TIME rejection for Hive ORC (SPARK-51590). ### Does this PR introduce any user-facing change? Yes. Using TIME with Hive UDFs or Hive SerDe writes now fails with a clear error message naming the unsupported TIME type rather than a MatchError/internal error. ### How was this patch tested? Added tests in `HiveInspectorSuite`, `HiveUDFSuite` and `InsertSuite`, and documented the limitation in `docs/sql-ref-datatypes.md`.
|
Could you review when you have a moment? cc @cloud-fan @dongjoon-hyun |
|
@uros-b Could you review this PR, please. |
| assert(e.getMessage.contains("UNSUPPORTED_DATATYPE")) | ||
| assert(e.getMessage.contains(TimeType().sql)) |
There was a problem hiding this comment.
Nit: please use checkError instead of assert/contains.
There was a problem hiding this comment.
I looked into this but checkError does not cleanly apply here. The Hive UDF resolver in HiveSessionStateBuilder.makeFunctionExpression catches the inner UNSUPPORTED_DATATYPE and re-wraps it into a new _LEGACY_ERROR_TEMP_3084 AnalysisException, capturing the original only as the e string parameter (e.toString) and copying its stack trace -- it does not attach it as a cause (the 2-arg AnalysisException(errorClass, messageParameters) constructor sets cause = None). So getCause is null and there is no inner condition to assert against; the only structured option would be checkError on _LEGACY_ERROR_TEMP_3084 with a brittle full-message e param.
I kept assert(...contains...) and filed SPARK-57750 (sub-task of SPARK-37935) to give that legacy error a proper name and attach the cause, after which this can become a clean nested checkError.
There was a problem hiding this comment.
It would be better to merge this #56867 first of all.
| condition = "UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE", | ||
| parameters = Map( | ||
| "columnName" -> "`c`", | ||
| "columnType" -> s"\"${ArrayType(TimeType()).sql}\"", |
There was a problem hiding this comment.
Nit: how about map/struct? Shall we do those too?
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Nit: how about a more direct table-write case, e.g. INSERT INTO <metastore Hive serde table>?
There was a problem hiding this comment.
Good idea, but it turns out a Hive serde table cannot hold a TIME column in the first place, so a managed-table INSERT INTO never reaches the new write-path check. CREATE TABLE t (c TIME) STORED AS PARQUET fails at metastore creation:
AnalysisException: HiveException: IllegalArgumentException:
Error: type expected at the position 0 of 'time(6)' but 'time' is found.
at HiveExternalCatalog.createTable ... at CreateTableCommand.run
HiveClientImpl.toHiveColumn emits the catalog string time(6) as the Hive type, which Hive cannot parse. So the INSERT OVERWRITE DIRECTORY ... STORED AS cases (scalar + nested) are the ones that exercise HiveFileFormat.supportDataType directly; I left this as-is.
| time-zone. | ||
| - `TimeType(precision)`: Represents values comprising values of fields hour, minute and second with the number of decimal digits `precision` following the decimal point in the seconds field, without a time-zone. | ||
| The range of values is from `00:00:00` to `23:59:59` for min precision `0`, and to `23:59:59.999999999` for max precision `9`. The default precision is `6`. | ||
| - Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value. |
There was a problem hiding this comment.
| - Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value. | |
| - Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value. |
…UDF assert - De-indent the Hive-serde TIME note to align with other datetime types. - Cover nested TIME in map and struct (not just array) in the directory-write test. - Keep assert/contains for the wrapped Hive UDF error; the inner UNSUPPORTED_DATATYPE has no cause attached on _LEGACY_ERROR_TEMP_3084 (tracked by SPARK-57750). Co-authored-by: Isaac
|
Thank you for the reviews, @uros-b! Addressed your comments: applied the doc note suggestion, extended the nested-type coverage to map and struct, and replied inline on the direct table-write case and the |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM. Thank you, @MaxGekk .
|
Follow-up for the |
What changes were proposed in this pull request?
Apache Hive has no TIME type, so
TimeTypehas no faithful representation in Hive SerDe interop. This PR (the Option B / "clear, documented error" path from SPARK-57556) makesTimeTypeproduce a clearAnalysisExceptioninstead of ascala.MatchError/internal error when it reaches theHiveInspectorsmapping functions, and rejects it in the Hive SerDe write path:HiveInspectors.toInspector(dataType),toInspector(expr)(TIME literal) andtoTypeInfonow throwUNSUPPORTED_DATATYPEvia a sharedunsupportedHiveTypehelper. PreviouslytoInspector(dataType)had noTimeTypecase and no default branch, so a TIME column hit a rawscala.MatchError.HiveFileFormat.supportDataTyperejectsTimeType(recursing into nested struct/array/map/UDT types, preserving the prior default for all other types) so Hive SerDe writes raiseUNSUPPORTED_DATA_TYPE_FOR_DATASOURCE(formatHive) viaFileFormatWriter.verifySchema.docs/sql-ref-datatypes.md.Why are the changes needed?
HiveInspectorshad noTimeTypecase, so object-inspector creation and TypeInfo mapping fell through to aMatchError/internal error when a TIME column or literal reached Hive SerDe paths (for example, a TIME argument to a Hive UDF/UDAF/UDTF). This makes the behavior explicit and documented, consistent with the existing TIME rejection for Hive ORC (SPARK-51590).Does this PR introduce any user-facing change?
Yes. Using TIME with Hive UDFs or in a Hive SerDe write now fails with a clear error that names the unsupported TIME type, instead of a
MatchError/internal error. For example,SELECT myHiveUDF(TIME'12:01:02')now reports[UNSUPPORTED_DATATYPE] Unsupported data type "TIME(6)"(wrapped by the Hive UDF resolver), and writing a TIME column through the Hive SerDe write path reports[UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The Hive datasource doesn't support the column ... of the type "TIME(6)".How was this patch tested?
Added tests and ran them locally (
build/sbt 'hive/testOnly *HiveInspectorSuite *HiveUDFSuite *InsertSuite'):HiveInspectorSuite:toInspector(TimeType()), a TIME literal, andTimeType().toTypeInforaiseUNSUPPORTED_DATATYPE.HiveUDFSuite: passingTIME'12:01:02'to a HiveGenericUDFHashfails with a message naming the unsupported TIME type.InsertSuite:INSERT OVERWRITE LOCAL DIRECTORY ... STORED AS PARQUET SELECT TIME'...'(withspark.sql.hive.convertMetastoreInsertDir=false) raisesUNSUPPORTED_DATA_TYPE_FOR_DATASOURCE.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)