Skip to content

[SPARK-57556][SQL] Raise a clear error for the TIME data type in Hive SerDe interop#56850

Open
MaxGekk wants to merge 3 commits into
apache:masterfrom
MaxGekk:time-hive-serde
Open

[SPARK-57556][SQL] Raise a clear error for the TIME data type in Hive SerDe interop#56850
MaxGekk wants to merge 3 commits into
apache:masterfrom
MaxGekk:time-hive-serde

Conversation

@MaxGekk

@MaxGekk MaxGekk commented Jun 28, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Apache Hive has no TIME type, so TimeType has no faithful representation in Hive SerDe interop. This PR (the Option B / "clear, documented error" path from SPARK-57556) makes TimeType produce a clear AnalysisException instead of a scala.MatchError/internal error when it reaches the HiveInspectors mapping functions, and rejects it in the Hive SerDe write path:

  • HiveInspectors.toInspector(dataType), toInspector(expr) (TIME literal) and toTypeInfo now throw UNSUPPORTED_DATATYPE via a shared unsupportedHiveType helper. Previously toInspector(dataType) had no TimeType case and no default branch, so a TIME column hit a raw scala.MatchError.
  • HiveFileFormat.supportDataType rejects TimeType (recursing into nested struct/array/map/UDT types, preserving the prior default for all other types) so Hive SerDe writes raise UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE (format Hive) via FileFormatWriter.verifySchema.
  • Documented the limitation on the TIME entry in docs/sql-ref-datatypes.md.

Why are the changes needed?

HiveInspectors had no TimeType case, so object-inspector creation and TypeInfo mapping fell through to a MatchError/internal error when a TIME column or literal reached Hive SerDe paths (for example, a TIME argument to a Hive UDF/UDAF/UDTF). This makes the behavior explicit and documented, consistent with the existing TIME rejection for Hive ORC (SPARK-51590).

Does this PR introduce any user-facing change?

Yes. Using TIME with Hive UDFs or in a Hive SerDe write now fails with a clear error that names the unsupported TIME type, instead of a MatchError/internal error. For example, SELECT myHiveUDF(TIME'12:01:02') now reports [UNSUPPORTED_DATATYPE] Unsupported data type "TIME(6)" (wrapped by the Hive UDF resolver), and writing a TIME column through the Hive SerDe write path reports [UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The Hive datasource doesn't support the column ... of the type "TIME(6)".

How was this patch tested?

Added tests and ran them locally (build/sbt 'hive/testOnly *HiveInspectorSuite *HiveUDFSuite *InsertSuite'):

  • HiveInspectorSuite: toInspector(TimeType()), a TIME literal, and TimeType().toTypeInfo raise UNSUPPORTED_DATATYPE.
  • HiveUDFSuite: passing TIME'12:01:02' to a Hive GenericUDFHash fails with a message naming the unsupported TIME type.
  • InsertSuite: INSERT OVERWRITE LOCAL DIRECTORY ... STORED AS PARQUET SELECT TIME'...' (with spark.sql.hive.convertMetastoreInsertDir=false) raises UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)

MaxGekk added 2 commits June 28, 2026 17:27
… SerDe interop

### What changes were proposed in this pull request?

Apache Hive has no TIME type, so `TimeType` has no faithful representation in
Hive SerDe interop. This PR makes `TimeType` produce a clear `AnalysisException`
instead of a `scala.MatchError` or an internal error when it reaches the
`HiveInspectors` mapping functions, and rejects it in the Hive SerDe write path:

- `HiveInspectors.toInspector(dataType)`, `toInspector(expr)` (TIME literal) and
  `toTypeInfo` now throw `UNSUPPORTED_DATATYPE` via a shared helper.
- `HiveFileFormat.supportDataType` rejects `TimeType` (recursing into nested
  types) so Hive SerDe writes raise `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`.

### Why are the changes needed?

`HiveInspectors` had no `TimeType` case, so object-inspector creation and
TypeInfo mapping fell through to a `MatchError`/internal error when a TIME
column or literal reached Hive SerDe paths (e.g. a Hive UDF argument). This
makes the behavior explicit and documented, consistent with the existing TIME
rejection for Hive ORC (SPARK-51590).

### Does this PR introduce any user-facing change?

Yes. Using TIME with Hive UDFs or Hive SerDe writes now fails with a clear
error message naming the unsupported TIME type rather than a MatchError/internal
error.

### How was this patch tested?

Added tests in `HiveInspectorSuite`, `HiveUDFSuite` and `InsertSuite`, and
documented the limitation in `docs/sql-ref-datatypes.md`.
@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Could you review when you have a moment? cc @cloud-fan @dongjoon-hyun

@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

@uros-b Could you review this PR, please.

Comment on lines +419 to +420
assert(e.getMessage.contains("UNSUPPORTED_DATATYPE"))
assert(e.getMessage.contains(TimeType().sql))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please use checkError instead of assert/contains.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this but checkError does not cleanly apply here. The Hive UDF resolver in HiveSessionStateBuilder.makeFunctionExpression catches the inner UNSUPPORTED_DATATYPE and re-wraps it into a new _LEGACY_ERROR_TEMP_3084 AnalysisException, capturing the original only as the e string parameter (e.toString) and copying its stack trace -- it does not attach it as a cause (the 2-arg AnalysisException(errorClass, messageParameters) constructor sets cause = None). So getCause is null and there is no inner condition to assert against; the only structured option would be checkError on _LEGACY_ERROR_TEMP_3084 with a brittle full-message e param.

I kept assert(...contains...) and filed SPARK-57750 (sub-task of SPARK-37935) to give that legacy error a proper name and attach the cause, after which this can become a clean nested checkError.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to merge this #56867 first of all.

condition = "UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE",
parameters = Map(
"columnName" -> "`c`",
"columnType" -> s"\"${ArrayType(TimeType()).sql}\"",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: how about map/struct? Shall we do those too?

}
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: how about a more direct table-write case, e.g. INSERT INTO <metastore Hive serde table>?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, but it turns out a Hive serde table cannot hold a TIME column in the first place, so a managed-table INSERT INTO never reaches the new write-path check. CREATE TABLE t (c TIME) STORED AS PARQUET fails at metastore creation:

AnalysisException: HiveException: IllegalArgumentException:
  Error: type expected at the position 0 of 'time(6)' but 'time' is found.
  at HiveExternalCatalog.createTable ... at CreateTableCommand.run

HiveClientImpl.toHiveColumn emits the catalog string time(6) as the Hive type, which Hive cannot parse. So the INSERT OVERWRITE DIRECTORY ... STORED AS cases (scalar + nested) are the ones that exercise HiveFileFormat.supportDataType directly; I left this as-is.

Comment thread docs/sql-ref-datatypes.md Outdated
time-zone.
- `TimeType(precision)`: Represents values comprising values of fields hour, minute and second with the number of decimal digits `precision` following the decimal point in the seconds field, without a time-zone.
The range of values is from `00:00:00` to `23:59:59` for min precision `0`, and to `23:59:59.999999999` for max precision `9`. The default precision is `6`.
- Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value.
- Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value.

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, but otherwise LGTM - thank you @MaxGekk!

…UDF assert

- De-indent the Hive-serde TIME note to align with other datetime types.
- Cover nested TIME in map and struct (not just array) in the directory-write test.
- Keep assert/contains for the wrapped Hive UDF error; the inner UNSUPPORTED_DATATYPE
  has no cause attached on _LEGACY_ERROR_TEMP_3084 (tracked by SPARK-57750).

Co-authored-by: Isaac
@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Thank you for the reviews, @uros-b! Addressed your comments: applied the doc note suggestion, extended the nested-type coverage to map and struct, and replied inline on the direct table-write case and the checkError suggestion (the latter tracked by SPARK-57750).

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @MaxGekk .

@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Follow-up for the checkError discussion above: I opened #56867 to give _LEGACY_ERROR_TEMP_3084 a proper name and attach the original exception as the cause (SPARK-57750). Once that lands, the Hive UDF test here can assert the inner error directly via checkError. @uros-b would you mind taking a look at #56867?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants