Commit ea939ce

fix: default regexp engine back to rust, mark java engine experimental
The native Rust engine remains the default. The JVM-backed engine is available as an experimental opt-in for workloads that need 100% Java-regex-compatible semantics.
1 parent 336ec6e commit ea939ce

2 files changed

Lines changed: 25 additions & 23 deletions

common/src/main/scala/org/apache/comet/CometConf.scala

Lines changed: 6 additions & 6 deletions
@@ -387,16 +387,16 @@ object CometConf extends ShimCometConf {
 conf("spark.comet.exec.regexp.engine")
 .category(CATEGORY_EXEC)
 .doc(
-"Experimental. Selects the engine used to evaluate supported regular-expression " +
+"Selects the engine used to evaluate supported regular-expression " +
 s"expressions. `$REGEXP_ENGINE_RUST` uses the native DataFusion regexp engine. " +
-s"`$REGEXP_ENGINE_JAVA` routes through a JVM-side UDF (java.util.regex.Pattern) for " +
-"Spark-compatible semantics, at the cost of JNI roundtrips per batch. Expressions " +
-"routed when set to java: rlike, regexp_extract, regexp_extract_all, regexp_replace, " +
-"regexp_instr, and split.")
+s"`$REGEXP_ENGINE_JAVA` is experimental and routes through a JVM-side UDF " +
+"(java.util.regex.Pattern) for Spark-compatible semantics, at the cost of JNI " +
+"roundtrips per batch. Expressions routed when set to java: rlike, regexp_extract, " +
+"regexp_extract_all, regexp_replace, regexp_instr, and split.")
 .stringConf
 .transform(_.toLowerCase(Locale.ROOT))
 .checkValues(Set(REGEXP_ENGINE_RUST, REGEXP_ENGINE_JAVA))
-.createWithDefault(REGEXP_ENGINE_JAVA)
+.createWithDefault(REGEXP_ENGINE_RUST)
 
 val COMET_EXEC_SHUFFLE_WITH_HASH_PARTITIONING_ENABLED: ConfigEntry[Boolean] =
 conf("spark.comet.native.shuffle.partitioning.hash.enabled")
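
Note on the hunk above: the builder chain lower-cases the configured value, checks it against the two allowed engines, and now defaults to Rust. The following standalone sketch mimics that behavior in plain Scala. It is not Comet's implementation: `RegexpEngineSketch` and `resolve` are hypothetical names, the literal values `rust` and `java` are assumed from the documentation changed in this commit, and the ordering (transform first, then validate) is assumed to follow the order of the chained calls.

```scala
import java.util.Locale

object RegexpEngineSketch {
  // Assumed literal values for REGEXP_ENGINE_RUST and REGEXP_ENGINE_JAVA.
  val Rust = "rust"
  val Java = "java"
  private val Allowed = Set(Rust, Java)

  /** Hypothetical helper: normalize the raw setting, validate it, and
    * fall back to the Rust engine when nothing is configured. */
  def resolve(raw: Option[String]): String = raw match {
    case None => Rust // new default after this commit
    case Some(value) =>
      // Mirrors .transform(_.toLowerCase(Locale.ROOT))
      val normalized = value.toLowerCase(Locale.ROOT)
      // Mirrors .checkValues(Set(...)): reject anything else
      require(Allowed.contains(normalized), s"Unsupported regexp engine: $value")
      normalized
  }

  def main(args: Array[String]): Unit = {
    println(resolve(None))         // unset: Rust engine
    println(resolve(Some("JAVA"))) // case-insensitive opt-in to the Java engine
  }
}
```

Under these assumptions, an unset value resolves to the Rust engine, and `JAVA` in any letter case selects the experimental Java engine.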

docs/source/user-guide/latest/compatibility/regex.md

Lines changed: 19 additions & 17 deletions
@@ -19,35 +19,37 @@ under the License.
 
 # Regular Expressions
 
-Comet provides two regexp engines for evaluating regular expressions: a **Java engine** that calls back into
-the JVM and a **Rust engine** that uses the Rust [`regex`] crate natively. The engine is selected with:
+Comet provides two regexp engines for evaluating regular expressions: a **Rust engine** that uses the Rust
+[`regex`] crate natively and an experimental **Java engine** that calls back into the JVM. The engine is
+selected with:
 
 ```
-spark.comet.exec.regexp.engine=java # default
-spark.comet.exec.regexp.engine=rust
+spark.comet.exec.regexp.engine=rust # default
+spark.comet.exec.regexp.engine=java # experimental
 ```
 
 ## Choosing an engine
 
-| | Java engine | Rust engine |
+| | Rust engine | Java engine (experimental) |
 |---|---|---|
-| **Compatibility** | 100% compatible with Spark | Pattern-dependent differences |
-| **Feature coverage** | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) | `rlike`, `regexp_replace`, `split` only |
-| **Performance** | One JNI round-trip per batch (Arrow vectors stay columnar) | Fully native, no JNI overhead |
-| **Pattern support** | All Java regex features (backreferences, lookaround, etc.) | Linear-time subset only |
+| **Compatibility** | Pattern-dependent differences | 100% compatible with Spark |
+| **Feature coverage** | `rlike`, `regexp_replace`, `split` only | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
+| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
+| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |
 
-The **Java engine** (default) is recommended for correctness-sensitive workloads. It evaluates expressions by
-passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing identical results to Spark for
-all patterns.
-
-The **Rust engine** is faster but only supports a subset of patterns. When it encounters a pattern it cannot
-handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet considers
-potentially incompatible, set:
+The **Rust engine** (default) is faster but only supports a subset of patterns. When it encounters a pattern
+it cannot handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet
+considers potentially incompatible, set:
 
 ```
 spark.comet.expression.regexp.allowIncompatible=true
 ```
 
+The **Java engine** is an experimental option for correctness-sensitive workloads. It evaluates expressions
+by passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing identical results to Spark
+for all patterns. Because it is experimental, the behavior, configuration, and supported expressions may
+change in future releases.
+
 ## Why the engines differ
 
 Java's `java.util.regex` is a backtracking engine in the Perl/PCRE family. It supports the full range of
@@ -108,7 +110,7 @@ shape and want to avoid the JNI overhead of the Java engine, switching to the Ru
 `allowIncompatible=true` is generally safe.
 
 For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
-defaults, use the Java engine (the default).
+defaults, use the experimental Java engine.
 
 [`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
 [`regex`]: https://docs.rs/regex/latest/regex/
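
Note on the documentation change above: it points users at the Java engine for patterns with backreferences or lookaround. As a rough illustration of what those features look like, here is a small standalone Scala sketch that calls `java.util.regex.Pattern` directly. It does not go through Comet or Spark. The object name and the sample patterns are made up for this example. The Rust `regex` crate documents that it does not support backreferences or lookaround, so patterns like these fall outside the linear-time subset described in the table.

```scala
import java.util.regex.Pattern

object JavaRegexFeatures {
  // Backreference: \1 must repeat exactly what group 1 captured,
  // so this finds a word that is immediately repeated.
  val repeatedWord: Pattern = Pattern.compile("\\b(\\w+) \\1\\b")

  // Lookahead: match a run of digits only when "px" follows,
  // without including "px" in the match.
  val pixelValue: Pattern = Pattern.compile("\\d+(?=px)")

  def main(args: Array[String]): Unit = {
    // "is is" is a repeated word, so find() succeeds.
    println(repeatedWord.matcher("this is is a test").find())

    // The match is the digits alone; the lookahead consumes nothing.
    val m = pixelValue.matcher("width: 120px")
    if (m.find()) println(m.group())
  }
}
```

Both patterns compile and match under `java.util.regex`, which is the engine Spark itself uses, so a workload that depends on them is a candidate for the experimental Java engine or for the automatic fallback to Spark.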
