Skip to content

[SPARK-57777][SQL][CONNECT] Distinguish explicit collation when rendering string literals to SQL#56892

Open
cloud-fan wants to merge 2 commits into
apache:masterfrom
cloud-fan:cloud-fan/string-literal-default-collation
Open

[SPARK-57777][SQL][CONNECT] Distinguish explicit collation when rendering string literals to SQL#56892
cloud-fan wants to merge 2 commits into
apache:masterfrom
cloud-fan:cloud-fan/string-literal-default-collation

Conversation

@cloud-fan

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR makes Literal.sql render an explicit collate clause for any string literal whose
type carries an explicit collation, including an explicit UTF8_BINARY, while still rendering
the default (un-collated) StringType without a clause.

Literal.sql previously had two arms for string literals:

case (v: UTF8String, StringType) =>            // matched any UTF8_BINARY StringType by value equality
  "'" + escaped + "'"
case (v: UTF8String, st: StringType) =>        // only reached for non-UTF8_BINARY collations
  "'" + escaped + "'" + st.typeName.substring(6)

The first arm matches via the StringType case object's equals (which compares collationId
and constraint), so it collapsed both the default StringType and an explicitly collated
UTF8_BINARY StringType into the same clause-less output. The two arms are now merged into one
that decides the clause via DataTypeUtils.isDefaultStringCharOrVarcharType (the same
singleton-identity check used elsewhere, e.g. by ApplyDefaultCollation and SHOW CREATE TABLE):

case (v: UTF8String, st: StringType) =>
  val collateClause =
    if (DataTypeUtils.isDefaultStringCharOrVarcharType(st)) "" else s" collate ${st.collationName}"
  "'" + escaped + "'" + collateClause

For non-default collations the produced string is byte-for-byte identical to the previous
st.typeName.substring(6) output (typeName is s"string collate $collationName").

This PR also removes a test-only normalization in PlanGenerationTestSuite that stamped
UTF8_BINARY onto every string-type proto whose collation field was empty before writing the
golden files. That shim made the generated query-tests golden artifacts (.proto.bin / .json,
and the downstream .explain files) misrepresent the real wire format: a default StringType is
serialized with an empty collation field (the "undetermined / default collation" sentinel),
not UTF8_BINARY. The affected golden files were regenerated so they now reflect what a real
client actually sends.

Why are the changes needed?

A default StringType and an explicitly-UTF8_BINARY StringType are semantically different:
the former is "undetermined" and is eligible to inherit a default collation during analysis (e.g.
CREATE TABLE ... DEFAULT COLLATION UTF8_LCASE AS SELECT 'x'), while the latter is explicitly
pinned and must not inherit. Literal.sql is used in view text, SHOW CREATE TABLE, error
messages, etc.; rendering an explicitly-collated UTF8_BINARY literal without a collate clause
loses that distinction and is not faithful on re-parse. This aligns string-literal SQL rendering
with how the rest of the engine already distinguishes explicit collation.

The test-normalization removal is a correctness fix for the Spark Connect golden files: they are
meant to be a faithful record of the protocol, and they were showing collate UTF8_BINARY on
default strings where the actual proto omits the collation.

Does this PR introduce any user-facing change?

Yes. Literal.sql now appends collate UTF8_BINARY when a string literal's type is an explicit
(non-default) UTF8_BINARY StringType. A plain default string literal is unchanged (no clause),
and literals with other explicit collations are unchanged. This affects SQL text generated from
literals (e.g. view definitions, SHOW CREATE TABLE, error messages).

How was this patch tested?

Existing suites, all passing:

  • catalyst/testOnly org.apache.spark.sql.catalyst.expressions.LiteralExpressionSuite (49)
  • connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite (727; golden files regenerated and reviewed)
  • connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite (732; explain golden regenerated)
  • sql/testOnly org.apache.spark.sql.SQLQueryTestSuite for the collation inputs (collations-basic, view-with-default-collation, collations-padding-trim, collations-string-functions, collations-aliases, listagg-collations) (12; no golden changes)
  • sql/testOnly org.apache.spark.sql.execution.command.{v1,v2}.ShowCreateTableSuite (47)

The regenerated golden diff is limited to dropping the test-stamped collation: "UTF8_BINARY"
from default-string proto types and the corresponding CAST(NULL AS STRING COLLATE UTF8_BINARY)
-> CAST(NULL AS STRING) in one explain file.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

…ring string literals to SQL

### What changes were proposed in this pull request?

Make `Literal.sql` render a `collate` clause for string literals with an explicit
collation (including explicit `UTF8_BINARY`), while keeping the default `StringType`
clause-less. Also remove the `PlanGenerationTestSuite` normalization that stamped
`UTF8_BINARY` onto empty-collation proto string types, and regenerate the affected
Spark Connect golden files so they reflect the real wire format.

### Why are the changes needed?

A default `StringType` (undetermined, eligible to inherit a default collation) and an
explicitly-`UTF8_BINARY` `StringType` are semantically distinct; rendering the latter
without a `collate` clause is lossy on re-parse. The test shim also made the Connect
golden files misrepresent the protocol.

### Does this PR introduce any user-facing change?

Yes. `Literal.sql` now appends ` collate UTF8_BINARY` for explicit-`UTF8_BINARY` string
literals; default literals and other collations are unchanged.

### How was this patch tested?

LiteralExpressionSuite, PlanGenerationTestSuite, ProtoToParsedPlanTestSuite, collation
SQLQueryTestSuite inputs, and v1/v2 ShowCreateTableSuite all pass; goldens regenerated.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)
case (v: UTF8String, StringType) =>
// Escapes all backslashes and single quotes.
"'" + v.toString.replace("\\", "\\\\").replace("'", "\\'") + "'"
case (v: UTF8String, st: StringType) =>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: there's no direct unit assertion for the central behavior change: an explicit-UTF8_BINARY literal now rendering collate UTF8_BINARY (and the default singleton staying clause-free) is covered only indirectly via regenerated Connect goldens. A ~3-line LiteralExpressionSuite test pinning StringType singleton -> 'x', StringType("UTF8_BINARY") (non-eq) -> 'x' collate UTF8_BINARY, StringType("UTF8_LCASE") unchanged would lock in the regression guard cheaply.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — added a direct LiteralExpressionSuite test (SPARK-57777: render explicit collation in string literal SQL) pinning the three cases: default singleton -> 'x', explicit StringType("UTF8_BINARY") -> 'x' collate UTF8_BINARY, and UTF8_LCASE unchanged. Thanks for the review!

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @cloud-fan! I left just one comment, but otherwise looks good.

…iteral SQL

Address review nit (uros-b): pin the Literal.sql behavior directly instead of
only via regenerated Connect goldens.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants