@@ -19,6 +19,97 @@ under the License.
1919
2020# Regular Expressions
2121
22- Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
23- regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
24- this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
22+ Comet provides two engines for evaluating regular expressions: a **Java engine** that calls back into
23+ the JVM and a **Rust engine** that uses the Rust [`regex`] crate natively. The engine is selected with:
24+
25+ ```
26+ spark.comet.exec.regexp.engine=java # default
27+ spark.comet.exec.regexp.engine=rust
28+ ```
29+
30+ ## Choosing an engine
31+
32+ | | Java engine | Rust engine |
33+ | --- | --- | --- |
34+ | **Compatibility** | 100% compatible with Spark | Pattern-dependent differences |
35+ | **Feature coverage** | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) | `rlike`, `regexp_replace`, `split` only |
36+ | **Performance** | One JNI round-trip per batch (Arrow vectors stay columnar) | Fully native, no JNI overhead |
37+ | **Pattern support** | All Java regex features (backreferences, lookaround, etc.) | Linear-time subset only |
38+
39+ The **Java engine** (default) is recommended for correctness-sensitive workloads. It evaluates expressions by
40+ passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing results identical to Spark's for
41+ all patterns.
42+
43+ The **Rust engine** is faster but only supports a subset of patterns. When it encounters a pattern it cannot
44+ handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet considers
45+ potentially incompatible, set:
46+
47+ ```
48+ spark.comet.expression.regexp.allowIncompatible=true
49+ ```
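As a minimal sketch of how the two settings combine, the helper below builds `spark-submit --conf` arguments from the configuration keys shown above. The function name `comet_regexp_conf` and its parameters are hypothetical and not part of Comet; only the two keys come from this page.

```python
# Config keys are taken from this page; the helper itself is hypothetical.
ENGINE_KEY = "spark.comet.exec.regexp.engine"
ALLOW_INCOMPATIBLE_KEY = "spark.comet.expression.regexp.allowIncompatible"


def comet_regexp_conf(engine="java", allow_incompatible=False):
    """Return spark-submit --conf arguments that select a Comet regexp engine."""
    if engine not in ("java", "rust"):
        raise ValueError(f"unknown regexp engine: {engine!r}")
    args = ["--conf", f"{ENGINE_KEY}={engine}"]
    if allow_incompatible:
        # Opt in to native evaluation of patterns Comet flags as potentially incompatible.
        args += ["--conf", f"{ALLOW_INCOMPATIBLE_KEY}=true"]
    return args


print(" ".join(comet_regexp_conf("rust", allow_incompatible=True)))
```

The same key/value pairs could equally be set in `spark-defaults.conf` or on a session builder; the command-line form is used here only because it needs no Spark installation to illustrate.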
50+
51+ ## Why the engines differ
52+
53+ Java's `java.util.regex` is a backtracking engine in the Perl/PCRE family. It supports most of the features
54+ that style of engine provides, such as backreferences and lookaround, and for some patterns its worst-case
55+ running time grows exponentially with the length of the input.
56+
57+ Rust's [`regex`] crate is a finite-automaton engine in the [RE2] family. It deliberately omits features that
58+ cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
59+ in time linear in the size of the input. This is the same trade-off RE2, Go's `regexp`, and several other
60+ engines make.
61+
62+ The practical consequence is that Java accepts many patterns that the Rust engine rejects, and
63+ several constructs that look the same in source have different semantics on the two sides.
64+
65+ ## Features supported by Java but not by the Rust engine
66+
67+ Patterns that use any of the following will not compile in Comet's Rust engine and must run on Spark (or use
68+ the Java engine):
69+
70+ - **Backreferences** such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match
71+ a previously captured group.
72+ - **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
73+ - **Atomic groups** (`(?>...)`).
74+ - **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not
75+ possessive.
76+ - **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. These are Perl/PCRE
77+ constructs that `java.util.regex` does not support either, so neither engine accepts them.
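The list above can be turned into a rough pre-flight check. The sketch below is a heuristic scanner, not Comet's actual fallback logic: the function name `unsupported_constructs` is hypothetical, it looks only for backreferences, lookaround, atomic groups, and possessive quantifiers, and its character-class tracking ignores edge cases such as nested classes.

```python
LOOKAROUND_OPENERS = ("(?=", "(?!", "(?<=", "(?<!")
BACKREF_DIGITS = set("123456789")


def unsupported_constructs(pattern):
    """Return sorted names of constructs in `pattern` that the Rust engine is said to reject."""
    found = set()
    in_class = False          # inside [...], where ( ? + are literals
    after_quantifier = False  # previous token was *, +, ?, or a closing }
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1:i + 2]
            if not in_class and (nxt in BACKREF_DIGITS or pattern.startswith("k<", i + 1)):
                found.add("backreference")
            i += 2  # skip the escaped character as well
            after_quantifier = False
            continue
        if in_class:
            in_class = ch != "]"
            i += 1
            continue
        if ch == "[":
            in_class = True
        elif pattern.startswith(LOOKAROUND_OPENERS, i):
            found.add("lookaround")
        elif pattern.startswith("(?>", i):
            found.add("atomic group")
        elif ch == "+" and after_quantifier:
            found.add("possessive quantifier")
        # A "?" right after "(" opens a group modifier and is not a quantifier.
        after_quantifier = ch in "*+?}" and not (ch == "?" and pattern[i - 1:i] == "(")
        i += 1
    return sorted(found)


print(unsupported_constructs(r"(\w+) \1"))
print(unsupported_constructs(r"[a-z]+\d{2,3}"))
```

An empty result only means none of these four constructs was spotted; it says nothing about the semantic differences between the engines.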
78+
79+ ## Features that exist on both sides but behave differently
80+
81+ Even where both engines accept a construct, the matching behavior is not always the same.
82+
83+ - **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, `\s`, and `.` are Unicode-aware by
84+ default, so `\d` matches every digit codepoint defined by Unicode rather than only `0`-`9`. Java's defaults
85+ match ASCII only and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to Unicode
86+ semantics. The same pattern can therefore match a different set of characters on each side.
87+ - **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line
88+ separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF
89+ mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
90+ - **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding (unless
91+ `UNICODE_CASE` is also set) while the Rust engine uses Unicode simple case folding when Unicode mode is on.
92+ Patterns that match characters outside ASCII can produce different results.
93+ - **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket
94+ expressions but not Java's `\p{Alpha}` shorthand. Java uses the `\p{Alpha}` form; it compiles `[[:alpha:]]` too, but as a nested character class of the literal characters `:`, `a`, `l`, `p`, and `h`, not as a POSIX class. Unicode property escapes (`\p{L}`,
95+ `\p{Greek}`, etc.) are supported by both engines but cover slightly different sets of properties.
96+ - **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. The Rust
97+ `regex` crate rejects octal escapes by default; for codepoints it accepts `\x{...}` and `\u{...}` as well as the four-digit `\uXXXX` form.
98+ - **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty
99+ strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different
100+ rules, so split results can differ in edge cases involving empty matches even when the pattern itself is
101+ identical on both sides.
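The Unicode-related differences above are easy to reproduce in miniature. The sketch below uses Python's `re` module purely as a stand-in, since it can run the same pattern in both modes: its default Unicode mode behaves like the Unicode-aware defaults described for the Rust engine, and `re.ASCII` behaves like the ASCII-only defaults described for Java. It does not call either engine.

```python
import re

ARABIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit

# Unicode-aware \d: stand-in for the Rust engine's default described above.
unicode_digit = re.compile(r"\d")
# ASCII-only \d: stand-in for Java's default described above.
ascii_digit = re.compile(r"\d", re.ASCII)

print(bool(unicode_digit.fullmatch(ARABIC_THREE)))  # True
print(bool(ascii_digit.fullmatch(ARABIC_THREE)))    # False

# Case-insensitive matching outside ASCII diverges in the same way.
unicode_fold = re.compile("\u00e9", re.IGNORECASE)           # pattern is "é"
ascii_fold = re.compile("\u00e9", re.IGNORECASE | re.ASCII)

print(bool(unicode_fold.fullmatch("\u00c9")))  # True: "É" folds to "é"
print(bool(ascii_fold.fullmatch("\u00c9")))    # False: only ASCII letters are folded
```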
102+
103+ ## When the Rust engine is safe
104+
105+ For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
106+ ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
107+ shape and want to avoid the JNI overhead of the Java engine, switching to the Rust engine with
108+ `spark.comet.expression.regexp.allowIncompatible=true` is generally safe.
109+
110+ For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
111+ defaults, use the Java engine (the default).
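One way to apply this guidance mechanically is a conservative allowlist. The sketch below accepts only ASCII letters, digits, spaces, simple non-negated bracket classes, and the greedy quantifiers `*`, `+`, `?`, and `{n,m}`. The name `fits_simple_shape` is hypothetical, and the check is deliberately stricter than the prose: it rejects anchors, escapes, `.`, groups, and lazy quantifiers outright. Passing it is an indication, not a guarantee, that both engines agree.

```python
import re

# One token: an ASCII letter, digit, or space, or a simple class such as [a-z0-9].
_SAFE_TOKEN = r"(?:[A-Za-z0-9 ]|\[[A-Za-z0-9 \-]+\])"
# Optional greedy quantifier: *, +, ?, {n}, {n,}, or {n,m}.
_SAFE_QUANT = r"(?:[*+?]|\{\d+(?:,\d*)?\})?"
_SAFE_PATTERN = re.compile(rf"(?:{_SAFE_TOKEN}{_SAFE_QUANT})+", re.ASCII)


def fits_simple_shape(pattern):
    """Return True if `pattern` uses only the allowlisted ASCII constructs."""
    return _SAFE_PATTERN.fullmatch(pattern) is not None


for candidate in ["ab+c", "[a-z]+[0-9]{2,3}", "^abc$", r"(\w+)\1", "caf\u00e9"]:
    print(candidate, fits_simple_shape(candidate))
```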
112+
113+ [`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
114+ [`regex`]: https://docs.rs/regex/latest/regex/
115+ [RE2]: https://github.com/google/re2/wiki/Syntax