@@ -19,35 +19,37 @@ under the License.
 
 # Regular Expressions
 
-Comet provides two regexp engines for evaluating regular expressions: a **Java engine** that calls back into
-the JVM and a **Rust engine** that uses the Rust [`regex`] crate natively. The engine is selected with:
+Comet provides two regexp engines for evaluating regular expressions: a **Rust engine** that uses the Rust
+[`regex`] crate natively and an experimental **Java engine** that calls back into the JVM. The engine is
+selected with:
 
 ```
-spark.comet.exec.regexp.engine=java   # default
-spark.comet.exec.regexp.engine=rust
+spark.comet.exec.regexp.engine=rust   # default
+spark.comet.exec.regexp.engine=java   # experimental
 ```
 
 ## Choosing an engine
 
-| | Java engine | Rust engine |
+| | Rust engine | Java engine (experimental) |
 | ---| ---| ---|
-| **Compatibility** | 100% compatible with Spark | Pattern-dependent differences |
-| **Feature coverage** | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) | `rlike`, `regexp_replace`, `split` only |
-| **Performance** | One JNI round-trip per batch (Arrow vectors stay columnar) | Fully native, no JNI overhead |
-| **Pattern support** | All Java regex features (backreferences, lookaround, etc.) | Linear-time subset only |
+| **Compatibility** | Pattern-dependent differences | 100% compatible with Spark |
+| **Feature coverage** | `rlike`, `regexp_replace`, `split` only | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
+| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
+| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |
 
-The **Java engine** (default) is recommended for correctness-sensitive workloads. It evaluates expressions by
-passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing identical results to Spark for
-all patterns.
-
-The **Rust engine** is faster but only supports a subset of patterns. When it encounters a pattern it cannot
-handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet considers
-potentially incompatible, set:
+The **Rust engine** (default) is faster but only supports a subset of patterns. When it encounters a pattern
+it cannot handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet
+considers potentially incompatible, set:
 
 ```
 spark.comet.expression.regexp.allowIncompatible=true
 ```
 
+The **Java engine** is an experimental option for correctness-sensitive workloads. It evaluates expressions
+by passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing identical results to Spark
+for all patterns. Because it is experimental, the behavior, configuration, and supported expressions may
+change in future releases.
+
 ## Why the engines differ
 
 Java's `java.util.regex` is a backtracking engine in the Perl/PCRE family. It supports the full range of
@@ -108,7 +110,7 @@ shape and want to avoid the JNI overhead of the Java engine, switching to the Ru
 `allowIncompatible=true` is generally safe.
 
 For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
-defaults, use the Java engine (the default).
+defaults, use the experimental Java engine.
 
 [`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
 [`regex`]: https://docs.rs/regex/latest/regex/
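
Review note: the closing guidance in the changed doc (use the Java engine for backreferences and lookaround) could be illustrated with a small pre-check. The sketch below is a hypothetical helper, not Comet's actual fallback logic. `needs_backtracking_engine` and `suggest_engine` are names invented here, and the scan is a heuristic that only looks for the two features the doc names. It does not detect the Unicode or line-handling differences, and it ignores Java-specific syntax such as `\Q...\E` quoting and nested character classes.

```python
# Hypothetical helper: guess which value of spark.comet.exec.regexp.engine
# suits a pattern, based only on backreferences and lookaround.
BACKREF_STARTS = frozenset("123456789k")  # \1..\9 and \k<name>
LOOKAROUND_PREFIXES = ("(?=", "(?!", "(?<=", "(?<!")


def needs_backtracking_engine(pattern: str) -> bool:
    """Return True if the pattern appears to use backreferences or lookaround."""
    i, n = 0, len(pattern)
    in_class = False  # True while scanning inside a [...] character class
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1] if i + 1 < n else ""
            # A backreference only counts outside a character class.
            if not in_class and nxt in BACKREF_STARTS:
                return True
            i += 2  # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif pattern.startswith(LOOKAROUND_PREFIXES, i):
            return True
        i += 1
    return False


def suggest_engine(pattern: str) -> str:
    """Map the heuristic onto the two engine names used in the doc."""
    return "java" if needs_backtracking_engine(pattern) else "rust"
```

For example, `suggest_engine(r"(a)\1")` and `suggest_engine(r"foo(?=bar)")` both suggest the Java engine, while a plain pattern such as `r"[a-z]+\d{2}"` stays on the Rust default. A named group like `(?<name>a)` is not treated as lookaround.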