Skip to content

Commit e6694b1

Browse files
committed
chore(audit): address PR review feedback on string-expressions audit
Apply the recurring patterns surfaced in PR review comments rather than documenting them as Known Limitation prose. - Remove the unreachable `CometScalarFunction("btrim")` serde mapping. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim` before serde runs, so the entry was dead code. Updated the support-doc sub-bullet for `btrim` to point readers at the `trim` entry. - Rewrite `CometCaseConversionBase` to use the standard `Incompatible` opt-in instead of the bespoke `spark.comet.caseConversion.enabled` conf. Override `getSupportLevel` to return `Incompatible(Some(reason))` and drop the in-`convert` config check. Remove `COMET_CASE_CONVERSION_ENABLED` and update the two callers and the string-expression benchmark to use the per-expression `spark.comet.expression.{Upper,Lower,InitCap}.allowIncompatible` keys. Closes the gating asymmetry called out in #4467. - Mark `replace` as `Incompatible` when `search` is a literal empty string. DataFusion's `replace` inserts the replacement between every character whereas Spark short-circuits to return `src` unchanged. Refile under #4497 (#3344 was closed previously with the bug body describing the divergence in the wrong direction). - Drop the spurious `CometStringRepeat.getCompatibleNotes()` claim that DataFusion `repeat` throws on negative counts. Verified in `expressions/string/string_repeat.sql` that negative counts produce the empty string Spark expects. The audit's prose claim was wrong, so #4462 was closed as already-fixed. - Update the `replace` SQL fixture to assert the new fallback message rather than the prior `ignore(...)` shape. Update `lower.sql` / `upper.sql` `expect_fallback(...)` strings to match the new reason text. Switch `lower_enabled.sql` / `upper_enabled.sql` from `spark.comet.caseConversion.enabled` to the per-expression `allowIncompatible` keys. - Cross-reference the collation umbrella issue (#4496) from every support-doc sub-bullet that previously documented "non-default collations may diverge silently" without an issue link. Add the cross-reference to the entries that route through Spark 4.0 `CollationSupport.X.exec` but had no collation note at all (`bit_length`, `concat_ws`, `endswith`, `initcap`, `instr`, `lpad`, `octet_length`, `repeat`, `replace`, `right`, `rpad`, `rtrim`, `split`, `startswith`, `substring`, `substring_index`, `translate`). - Replace the stale "Known limitation: enforced inside `convert`" prose on `left`, `right`, `substring`, and `length`. The literal-only argument restriction has already been moved into `getSupportLevel`; the prose just hadn't been updated. Tests: - `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/"` (42 pass) - `./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite"` (33 pass)
1 parent d994671 commit e6694b1

11 files changed

Lines changed: 78 additions & 84 deletions

File tree

docs/source/contributor-guide/spark_expressions_support.md

Lines changed: 37 additions & 43 deletions
Large diffs are not rendered by default.

spark/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -745,14 +745,6 @@ object CometConf extends ShimCometConf {
745745
.toSequence
746746
.createWithDefault(Seq("Range,InMemoryTableScan,RDDScan,OneRowRelation"))
747747

748-
val COMET_CASE_CONVERSION_ENABLED: ConfigEntry[Boolean] =
749-
conf("spark.comet.caseConversion.enabled")
750-
.category(CATEGORY_EXEC)
751-
.doc("Java uses locale-specific rules when converting strings to upper or lower case and " +
752-
"Rust does not, so we disable upper and lower by default.")
753-
.booleanConf
754-
.createWithDefault(false)
755-
756748
val COMET_PARQUET_UNSIGNED_SMALL_INT_CHECK: ConfigEntry[Boolean] =
757749
conf("spark.comet.scan.unsignedSmallIntSafetyCheck")
758750
.category(CATEGORY_SCAN)

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -191,14 +191,13 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
191191
classOf[StartsWith] -> CometScalarFunction("starts_with"),
192192
classOf[StringInstr] -> CometScalarFunction("instr"),
193193
classOf[StringRepeat] -> CometStringRepeat,
194-
classOf[StringReplace] -> CometScalarFunction("replace"),
194+
classOf[StringReplace] -> CometStringReplace,
195195
classOf[StringRPad] -> CometStringRPad,
196196
classOf[StringLPad] -> CometStringLPad,
197197
classOf[StringSpace] -> CometScalarFunction("space"),
198198
classOf[StringSplit] -> CometStringSplit,
199199
classOf[StringTranslate] -> CometScalarFunction("translate"),
200200
classOf[StringTrim] -> CometScalarFunction("trim"),
201-
classOf[StringTrimBoth] -> CometScalarFunction("btrim"),
202201
classOf[StringTrimLeft] -> CometScalarFunction("ltrim"),
203202
classOf[StringTrimRight] -> CometScalarFunction("rtrim"),
204203
classOf[Left] -> CometLeft,

spark/src/main/scala/org/apache/comet/serde/strings.scala

Lines changed: 22 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ package org.apache.comet.serde
2121

2222
import java.util.Locale
2323

24-
import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Concat, ConcatWs, Expression, GetJsonObject, If, InitCap, IsNull, Left, Length, Like, Literal, Lower, RegExpReplace, Right, RLike, StringLPad, StringRepeat, StringRPad, StringSplit, Substring, SubstringIndex, Upper}
24+
import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Concat, ConcatWs, Expression, GetJsonObject, If, InitCap, IsNull, Left, Length, Like, Literal, Lower, RegExpReplace, Right, RLike, StringLPad, StringRepeat, StringReplace, StringRPad, StringSplit, Substring, SubstringIndex, Upper}
2525
import org.apache.spark.sql.types.{BinaryType, DataTypes, LongType, StringType}
2626
import org.apache.spark.unsafe.types.UTF8String
2727

@@ -33,10 +33,6 @@ import org.apache.comet.serde.QueryPlanSerde.{createBinaryExpr, exprToProtoInter
3333

3434
object CometStringRepeat extends CometExpressionSerde[StringRepeat] {
3535

36-
override def getCompatibleNotes(): Seq[String] = Seq(
37-
"A negative argument for the number of times to repeat throws an exception" +
38-
" instead of returning an empty string as Spark does")
39-
4036
override def convert(
4137
expr: StringRepeat,
4238
inputs: Seq[Attribute],
@@ -54,21 +50,13 @@ object CometStringRepeat extends CometExpressionSerde[StringRepeat] {
5450
class CometCaseConversionBase[T <: Expression](function: String)
5551
extends CometScalarFunction[T](function) {
5652

57-
override def getIncompatibleReasons(): Seq[String] = Seq(
58-
"Results can vary depending on locale and character set." +
59-
s" Requires `${CometConf.COMET_CASE_CONVERSION_ENABLED.key}=true` to enable.")
53+
private val incompatReason =
54+
"Results can vary depending on locale and character set " +
55+
"(https://github.com/apache/datafusion-comet/issues/2190)."
6056

61-
override def convert(expr: T, inputs: Seq[Attribute], binding: Boolean): Option[Expr] = {
62-
if (!CometConf.COMET_CASE_CONVERSION_ENABLED.get()) {
63-
withInfo(
64-
expr,
65-
"Comet is not compatible with Spark for case conversion in " +
66-
s"locale-specific cases. Set ${CometConf.COMET_CASE_CONVERSION_ENABLED.key}=true " +
67-
"to enable it anyway.")
68-
return None
69-
}
70-
super.convert(expr, inputs, binding)
71-
}
57+
override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)
58+
59+
override def getSupportLevel(expr: T): SupportLevel = Incompatible(Some(incompatReason))
7260
}
7361

7462
object CometUpper extends CometCaseConversionBase[Upper]("upper")
@@ -228,6 +216,21 @@ object CometRight extends CometExpressionSerde[Right] {
228216
}
229217
}
230218

219+
object CometStringReplace extends CometScalarFunction[StringReplace]("replace") {
220+
221+
private val emptySearchReason =
222+
"Empty `search` string produces different output: Spark returns `str` unchanged, " +
223+
"DataFusion inserts `replace` between every character " +
224+
"(https://github.com/apache/datafusion-comet/issues/4497)."
225+
226+
override def getIncompatibleReasons(): Seq[String] = Seq(emptySearchReason)
227+
228+
override def getSupportLevel(expr: StringReplace): SupportLevel = expr.searchExpr match {
229+
case Literal(s: UTF8String, _) if s.numBytes == 0 => Incompatible(Some(emptySearchReason))
230+
case _ => Compatible()
231+
}
232+
}
233+
231234
object CometConcat extends CometScalarFunction[Concat]("concat") {
232235
val unsupportedReason = "CONCAT supports only string input parameters"
233236

spark/src/test/resources/sql-tests/expressions/string/lower.sql

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,9 @@ CREATE TABLE test_lower(s string) USING parquet
2121
statement
2222
INSERT INTO test_lower VALUES ('HELLO'), ('hello'), ('Hello World'), (''), (NULL), ('123ABC')
2323

24-
query expect_fallback(case conversion)
24+
query expect_fallback(locale and character set)
2525
SELECT lower(s) FROM test_lower
2626

2727
-- literal arguments
28-
query expect_fallback(case conversion)
28+
query expect_fallback(locale and character set)
2929
SELECT lower('HELLO'), lower(''), lower(NULL)

spark/src/test/resources/sql-tests/expressions/string/lower_enabled.sql

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@
1515
-- specific language governing permissions and limitations
1616
-- under the License.
1717

18-
-- Test lower() with case conversion enabled (happy path)
19-
-- Config: spark.comet.caseConversion.enabled=true
18+
-- Test lower() with the standard allowIncompatible opt-in (happy path)
19+
-- Config: spark.comet.expression.Lower.allowIncompatible=true
2020

2121
statement
2222
CREATE TABLE test_lower_enabled(s string) USING parquet

spark/src/test/resources/sql-tests/expressions/string/string_replace.sql

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,8 @@ INSERT INTO test_str_replace VALUES ('hello world', 'world', 'there'), ('aaa', '
2424
query
2525
SELECT replace(s, search, replace) FROM test_str_replace
2626

27-
query ignore(https://github.com/apache/datafusion-comet/issues/3344)
27+
-- Comet returns 'xhxexlxlxox' where Spark returns 'hello' (short-circuits on empty search).
28+
query expect_fallback(Empty `search`)
2829
SELECT replace('hello', '', 'x')
2930

3031
-- column + literal + literal

spark/src/test/resources/sql-tests/expressions/string/upper.sql

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,9 @@ CREATE TABLE test_upper(s string) USING parquet
2121
statement
2222
INSERT INTO test_upper VALUES ('hello'), ('HELLO'), ('Hello World'), (''), (NULL), ('123abc')
2323

24-
query expect_fallback(case conversion)
24+
query expect_fallback(locale and character set)
2525
SELECT upper(s) FROM test_upper
2626

2727
-- literal arguments
28-
query expect_fallback(case conversion)
28+
query expect_fallback(locale and character set)
2929
SELECT upper('hello'), upper(''), upper(NULL)

spark/src/test/resources/sql-tests/expressions/string/upper_enabled.sql

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@
1515
-- specific language governing permissions and limitations
1616
-- under the License.
1717

18-
-- Test upper() with case conversion enabled (happy path)
19-
-- Config: spark.comet.caseConversion.enabled=true
18+
-- Test upper() with the standard allowIncompatible opt-in (happy path)
19+
-- Config: spark.comet.expression.Upper.allowIncompatible=true
2020

2121
statement
2222
CREATE TABLE test_upper_enabled(s string) USING parquet

spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -261,7 +261,9 @@ class CometStringExpressionSuite extends CometTestBase {
261261
}
262262

263263
test("Upper and Lower") {
264-
withSQLConf(CometConf.COMET_CASE_CONVERSION_ENABLED.key -> "true") {
264+
withSQLConf(
265+
CometConf.getExprAllowIncompatConfigKey("Upper") -> "true",
266+
CometConf.getExprAllowIncompatConfigKey("Lower") -> "true") {
265267
val table = "names"
266268
withTable(table) {
267269
sql(s"create table $table(id int, name varchar(20)) using parquet")
@@ -339,7 +341,7 @@ class CometStringExpressionSuite extends CometTestBase {
339341
}
340342

341343
test("trim") {
342-
withSQLConf(CometConf.COMET_CASE_CONVERSION_ENABLED.key -> "true") {
344+
withSQLConf(CometConf.getExprAllowIncompatConfigKey("Upper") -> "true") {
343345
val table = "test"
344346
withTable(table) {
345347
sql(s"create table $table(col varchar(20)) using parquet")

0 commit comments

Comments
 (0)