[SPARK-57777][SQL][CONNECT] Distinguish explicit collation when rendering string literals to SQL#56892
Conversation
…ring string literals to SQL ### What changes were proposed in this pull request? Make `Literal.sql` render a `collate` clause for string literals with an explicit collation (including explicit `UTF8_BINARY`), while keeping the default `StringType` clause-less. Also remove the `PlanGenerationTestSuite` normalization that stamped `UTF8_BINARY` onto empty-collation proto string types, and regenerate the affected Spark Connect golden files so they reflect the real wire format. ### Why are the changes needed? A default `StringType` (undetermined, eligible to inherit a default collation) and an explicitly-`UTF8_BINARY` `StringType` are semantically distinct; rendering the latter without a `collate` clause is lossy on re-parse. The test shim also made the Connect golden files misrepresent the protocol. ### Does this PR introduce any user-facing change? Yes. `Literal.sql` now appends ` collate UTF8_BINARY` for explicit-`UTF8_BINARY` string literals; default literals and other collations are unchanged. ### How was this patch tested? LiteralExpressionSuite, PlanGenerationTestSuite, ProtoToParsedPlanTestSuite, collation SQLQueryTestSuite inputs, and v1/v2 ShowCreateTableSuite all pass; goldens regenerated. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8)
| case (v: UTF8String, StringType) => | ||
| // Escapes all backslashes and single quotes. | ||
| "'" + v.toString.replace("\\", "\\\\").replace("'", "\\'") + "'" | ||
| case (v: UTF8String, st: StringType) => |
There was a problem hiding this comment.
Nit: there's no direct unit assertion for the central behavior change: an explicit-UTF8_BINARY literal now rendering collate UTF8_BINARY (and the default singleton staying clause-free) is covered only indirectly via regenerated Connect goldens. A ~3-line LiteralExpressionSuite test pinning StringType singleton -> 'x', StringType("UTF8_BINARY") (non-eq) -> 'x' collate UTF8_BINARY, StringType("UTF8_LCASE") unchanged would lock in the regression guard cheaply.
There was a problem hiding this comment.
Good call — added a direct LiteralExpressionSuite test (SPARK-57777: render explicit collation in string literal SQL) pinning the three cases: default singleton -> 'x', explicit StringType("UTF8_BINARY") -> 'x' collate UTF8_BINARY, and UTF8_LCASE unchanged. Thanks for the review!
uros-b
left a comment
There was a problem hiding this comment.
Thank you @cloud-fan! I left just one comment, but otherwise looks good.
…iteral SQL Address review nit (uros-b): pin the Literal.sql behavior directly instead of only via regenerated Connect goldens.
What changes were proposed in this pull request?
This PR makes
Literal.sqlrender an explicitcollateclause for any string literal whosetype carries an explicit collation, including an explicit
UTF8_BINARY, while still renderingthe default (un-collated)
StringTypewithout a clause.Literal.sqlpreviously had two arms for string literals:The first arm matches via the
StringTypecase object'sequals(which comparescollationIdand
constraint), so it collapsed both the defaultStringTypeand an explicitly collatedUTF8_BINARYStringTypeinto the same clause-less output. The two arms are now merged into onethat decides the clause via
DataTypeUtils.isDefaultStringCharOrVarcharType(the samesingleton-identity check used elsewhere, e.g. by
ApplyDefaultCollationandSHOW CREATE TABLE):For non-default collations the produced string is byte-for-byte identical to the previous
st.typeName.substring(6)output (typeNameiss"string collate $collationName").This PR also removes a test-only normalization in
PlanGenerationTestSuitethat stampedUTF8_BINARYonto every string-type proto whosecollationfield was empty before writing thegolden files. That shim made the generated
query-testsgolden artifacts (.proto.bin/.json,and the downstream
.explainfiles) misrepresent the real wire format: a defaultStringTypeisserialized with an empty
collationfield (the "undetermined / default collation" sentinel),not
UTF8_BINARY. The affected golden files were regenerated so they now reflect what a realclient actually sends.
Why are the changes needed?
A default
StringTypeand an explicitly-UTF8_BINARYStringTypeare semantically different:the former is "undetermined" and is eligible to inherit a default collation during analysis (e.g.
CREATE TABLE ... DEFAULT COLLATION UTF8_LCASE AS SELECT 'x'), while the latter is explicitlypinned and must not inherit.
Literal.sqlis used in view text,SHOW CREATE TABLE, errormessages, etc.; rendering an explicitly-collated
UTF8_BINARYliteral without acollateclauseloses that distinction and is not faithful on re-parse. This aligns string-literal SQL rendering
with how the rest of the engine already distinguishes explicit collation.
The test-normalization removal is a correctness fix for the Spark Connect golden files: they are
meant to be a faithful record of the protocol, and they were showing
collate UTF8_BINARYondefault strings where the actual proto omits the collation.
Does this PR introduce any user-facing change?
Yes.
Literal.sqlnow appendscollate UTF8_BINARYwhen a string literal's type is an explicit(non-default)
UTF8_BINARYStringType. A plain default string literal is unchanged (no clause),and literals with other explicit collations are unchanged. This affects SQL text generated from
literals (e.g. view definitions,
SHOW CREATE TABLE, error messages).How was this patch tested?
Existing suites, all passing:
catalyst/testOnly org.apache.spark.sql.catalyst.expressions.LiteralExpressionSuite(49)connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite(727; golden files regenerated and reviewed)connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite(732; explain golden regenerated)sql/testOnly org.apache.spark.sql.SQLQueryTestSuitefor the collation inputs (collations-basic,view-with-default-collation,collations-padding-trim,collations-string-functions,collations-aliases,listagg-collations) (12; no golden changes)sql/testOnly org.apache.spark.sql.execution.command.{v1,v2}.ShowCreateTableSuite(47)The regenerated golden diff is limited to dropping the test-stamped
collation: "UTF8_BINARY"from default-string proto types and the corresponding
CAST(NULL AS STRING COLLATE UTF8_BINARY)->
CAST(NULL AS STRING)in one explain file.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)