Skip to content

Commit 4ead54d

Browse files
committed
feat: under engine=rust, fall through to JVM dispatcher for unimplemented regex
Previously engine=rust returned Spark fallback for regexp_extract, regexp_extract_all, and regexp_instr because they have no native Rust path. With the codegen dispatcher now enabled by default, prefer the JVM dispatcher over Spark in that case so users still get Comet acceleration with full Spark semantics. Only decline to native and dispatcher are both unavailable. Also document the per-expression spark.comet.expression.<ClassName>.enabled disable knobs in the regex compatibility guide, and add a regression test that exercises the new rust→JVM fallthrough.
1 parent 8c9deeb commit 4ead54d

4 files changed

Lines changed: 50 additions & 15 deletions

File tree

docs/source/user-guide/latest/compatibility/regex.md

Lines changed: 24 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -30,19 +30,36 @@ expression inside Comet's Arrow-direct codegen dispatcher (the same dispatcher u
3030
- `rust` — run the Rust engine when an expression has a native implementation. Setting this is itself
3131
the opt-in for the semantic differences between Java and Rust regex (no separate `allowIncompatible`
3232
flag needed). Expressions without a native Rust implementation (`regexp_extract`,
33-
`regexp_extract_all`, `regexp_instr`) fall back to Spark.
33+
`regexp_extract_all`, `regexp_instr`) fall through to the Java engine so users still get Comet
34+
acceleration with full Spark semantics.
3435

3536
With pure defaults (`engine=java`, `scalaUDF.codegen.enabled=true`), all regex expressions run on
3637
the Comet path with full Spark compatibility.
3738

39+
## Disabling Comet for individual regex expressions
40+
41+
Each regex expression has a per-class `spark.comet.expression.<ClassName>.enabled` flag (default
42+
`true`) that disables Comet's serde for that expression and forces a Spark fallback. This is
43+
useful for narrowing a regression or comparing performance on a single operator without changing
44+
the engine selector:
45+
46+
| Expression | Config |
47+
| ------------------- | -------------------------------------------------------- |
48+
| `rlike` | `spark.comet.expression.RLike.enabled=false` |
49+
| `regexp_extract` | `spark.comet.expression.RegExpExtract.enabled=false` |
50+
| `regexp_extract_all`| `spark.comet.expression.RegExpExtractAll.enabled=false` |
51+
| `regexp_instr` | `spark.comet.expression.RegExpInStr.enabled=false` |
52+
| `regexp_replace` | `spark.comet.expression.RegExpReplace.enabled=false` |
53+
| `split` | `spark.comet.expression.StringSplit.enabled=false` |
54+
3855
## Choosing an engine
3956

40-
| | Rust engine | Java engine (default) |
41-
| -------------------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
42-
| **Compatibility** | Differs from Java regex (see below) | 100% compatible with Spark |
43-
| **Feature coverage** | `rlike`, `regexp_replace`, `split` only | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
44-
| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
45-
| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |
57+
| | Rust engine | Java engine (default) |
58+
| -------------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
59+
| **Compatibility** | Differs from Java regex (see below) | 100% compatible with Spark |
60+
| **Feature coverage** | `rlike`, `regexp_replace`, `split` natively; `regexp_extract`, `regexp_extract_all`, `regexp_instr` via fallthrough | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
61+
| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
62+
| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |
4663

4764
The **Rust engine** is faster but cannot match Java regex semantics for every pattern. Because the engine
4865
choice is itself the opt-in, setting `spark.comet.exec.regexp.engine=rust` declares acceptance of those

spark/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -388,9 +388,10 @@ object CometConf extends ShimCometConf {
388388
s"${COMET_SCALA_UDF_CODEGEN_ENABLED.key}=false. `$REGEXP_ENGINE_RUST` runs the " +
389389
"native DataFusion regexp engine when an implementation exists; setting this is " +
390390
"itself the opt-in for the semantic differences between Java and Rust regex. " +
391-
"Affected expressions: rlike, regexp_extract, regexp_extract_all, regexp_replace, " +
392-
"regexp_instr, and split (the extract/instr family has no native Rust path; they " +
393-
s"fall back to Spark under `$REGEXP_ENGINE_RUST`).")
391+
"Expressions without a native Rust implementation (`regexp_extract`, " +
392+
"`regexp_extract_all`, `regexp_instr`) fall through to the JVM codegen dispatcher " +
393+
s"under `$REGEXP_ENGINE_RUST` so users still get Comet acceleration with full " +
394+
"Spark semantics.")
394395
.stringConf
395396
.transform(_.toLowerCase(Locale.ROOT))
396397
.checkValues(Set(REGEXP_ENGINE_RUST, REGEXP_ENGINE_JAVA))

spark/src/main/scala/org/apache/comet/serde/strings.scala

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -64,7 +64,9 @@ private object RegexpRoute {
6464
* Pick a route given the user's config and whether a native Rust implementation exists for the
6565
* expression. `engine=java` (default) routes through the codegen dispatcher when
6666
* [[CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED]] is true (also the default); otherwise Spark
67-
* fallback. `engine=rust` runs native if available; else Spark fallback.
67+
* fallback. `engine=rust` runs native if available, otherwise transparently falls back to the
68+
* JVM codegen dispatcher when it is enabled, and only declines (Spark fallback) when neither
69+
* the native path nor the dispatcher can serve the expression.
6870
*/
6971
def choose(exprName: String, hasNative: Boolean): RegexpRoute = {
7072
val engine = CometConf.COMET_REGEXP_ENGINE.get()
@@ -74,12 +76,13 @@ private object RegexpRoute {
7476
case CometConf.REGEXP_ENGINE_RUST =>
7577
if (hasNative) {
7678
Native
79+
} else if (codegenEnabled) {
80+
JvmCodegen
7781
} else {
7882
Fallback(
79-
s"$exprName has no native Rust implementation. Set " +
80-
s"${CometConf.COMET_REGEXP_ENGINE.key}=${CometConf.REGEXP_ENGINE_JAVA} with " +
81-
s"${CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED.key}=true to route through the " +
82-
"codegen dispatcher.")
83+
s"$exprName has no native Rust implementation and " +
84+
s"${CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED.key}=false, so neither the rust " +
85+
"engine nor the JVM codegen dispatcher can serve this expression.")
8386
}
8487

8588
case CometConf.REGEXP_ENGINE_JAVA =>

spark/src/test/scala/org/apache/comet/CometRegExpJvmSuite.scala

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -390,4 +390,18 @@ class CometRegExpJvmSuite extends CometTestBase with AdaptiveSparkPlanHelper {
390390
|""".stripMargin))
391391
}
392392
}
393+
394+
test("engine=rust falls through to JVM dispatcher for expressions with no native path") {
395+
withSQLConf(CometConf.COMET_REGEXP_ENGINE.key -> CometConf.REGEXP_ENGINE_RUST) {
396+
withSubjects("abc123def", "no match", null, "xyz789") {
397+
// regexp_extract / regexp_extract_all / regexp_instr have no native rust path; under
398+
// engine=rust they should still run on Comet via the JVM codegen dispatcher rather than
399+
// falling back to Spark.
400+
checkSparkAnswerAndOperator(
401+
sql("SELECT s, regexp_extract(s, '([a-z]+)(\\\\d+)', 2) FROM t"))
402+
checkSparkAnswerAndOperator(sql("SELECT s, regexp_extract_all(s, '\\\\d+', 0) FROM t"))
403+
checkSparkAnswerAndOperator(sql("SELECT s, regexp_instr(s, '\\\\d+', 0) FROM t"))
404+
}
405+
}
406+
}
393407
}

0 commit comments

Comments
 (0)