Commit 6cac094

docs: update regexp compatibility guide for java vs rust engine
1 parent 4683199 commit 6cac094

1 file changed

Lines changed: 94 additions & 3 deletions

docs/source/user-guide/latest/compatibility/regex.md

@@ -19,6 +19,97 @@ under the License.
1919

2020
# Regular Expressions
2121

22-
Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
23-
regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
24-
this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
22+
Comet provides two engines for evaluating regular expressions: a **Java engine** that calls back into
23+
the JVM and a **Rust engine** that uses the Rust [`regex`] crate natively. The engine is selected with:
24+
25+
```
26+
spark.comet.exec.regexp.engine=java # default
27+
spark.comet.exec.regexp.engine=rust
28+
```
29+
30+
## Choosing an engine
31+
32+
| | Java engine | Rust engine |
33+
|---|---|---|
34+
| **Compatibility** | 100% compatible with Spark | Pattern-dependent differences |
35+
| **Feature coverage** | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) | `rlike`, `regexp_replace`, `split` only |
36+
| **Performance** | One JNI round-trip per batch (Arrow vectors stay columnar) | Fully native, no JNI overhead |
37+
| **Pattern support** | All Java regex features (backreferences, lookaround, etc.) | Linear-time subset only |
38+
39+
The **Java engine** (default) is recommended for correctness-sensitive workloads. It evaluates expressions by
40+
passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing results identical to Spark's for
41+
all patterns.
42+
43+
The **Rust engine** is faster but only supports a subset of patterns. When it encounters a pattern it cannot
44+
handle, it falls back to Spark automatically. To opt in to native evaluation for patterns Comet considers
45+
potentially incompatible, set:
46+
47+
```
48+
spark.comet.expression.regexp.allowIncompatible=true
49+
```
50+
51+
## Why the engines differ
52+
53+
Java's [`java.util.regex`] is a backtracking engine in the Perl/PCRE family. It supports the full range of
54+
features that style of engine provides, including some whose worst-case running time grows exponentially with
55+
the input.
56+
57+
Rust's [`regex`] crate is a finite-automaton engine in the [RE2] family. It deliberately omits features that
58+
cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
59+
in time linear in the size of the input. This is the same trade-off RE2, Go's `regexp`, and several other
60+
engines make.
61+
62+
The practical consequence is that Java accepts many patterns that the Rust engine rejects, and
63+
several constructs that look the same in source have different semantics on the two sides.
64+
65+
## Features supported by Java but not by the Rust engine
66+
67+
Patterns that use any of the following will not compile in Comet's Rust engine and must run on Spark (or use
68+
the Java engine):
69+
70+
- **Backreferences** such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match
71+
a previously captured group.
72+
- **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
73+
- **Atomic groups** (`(?>...)`).
74+
- **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not
75+
possessive.
76+
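- **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. Rust accepts none of
77+
these. Note that `java.util.regex` does not support these Perl/PCRE constructs either, so they fail in both engines.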
78+
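As an illustration of the Java-side behavior listed above, the following standalone sketch uses `java.util.regex` directly, outside Spark and Comet, to show a backreference, lookaround, a possessive quantifier, and an atomic group at work. The class and method names are illustrative only and are not part of Comet. According to the list above, none of these patterns would compile in the Rust engine.

```java
import java.util.regex.Pattern;

class JavaOnlyConstructs {
    // Returns true if the regex is found anywhere in the input (java.util.regex semantics).
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Backreference: \1 must repeat exactly what group 1 captured.
        System.out.println(find("(\\w+) \\1", "hello hello")); // true
        System.out.println(find("(\\w+) \\1", "hello world")); // false

        // Lookahead: "foo" only when it is followed by "bar".
        System.out.println(find("foo(?=bar)", "foobar")); // true
        System.out.println(find("foo(?=bar)", "foobaz")); // false

        // Lookbehind: digits only when they are preceded by a dollar sign.
        System.out.println(find("(?<=\\$)[0-9]+", "cost: $42")); // true

        // Possessive quantifier: a*+ keeps every 'a' it consumed, so the
        // trailing 'a' can never match. The greedy form backtracks and succeeds.
        System.out.println(find("a*+a", "aaa")); // false
        System.out.println(find("a*a", "aaa"));  // true

        // Atomic group: same no-backtracking behavior as the possessive form.
        System.out.println(find("(?>a*)a", "aaa")); // false
    }
}
```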
79+
## Features that exist on both sides but behave differently
80+
81+
Even where both engines accept a construct, the matching behavior is not always the same.
82+
83+
- **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, and `\s` are Unicode-aware by
84+
default, so `\d` matches every decimal digit codepoint defined by Unicode rather than only `0`-`9`. Java's defaults
85+
for these classes match ASCII only and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to Unicode
86+
semantics. The same pattern can therefore match a different set of characters on each side.
87+
- **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line
88+
separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF
89+
mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
90+
- **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding (Unicode
91+
folding requires the `UNICODE_CASE` flag, or `(?u)` inline), while the Rust engine uses Unicode simple case
92+
folding when Unicode mode is on. Patterns that match characters outside ASCII can produce different results.
93+
- **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket
94+
expressions but not Java's `\p{Alpha}` shorthand. Java accepts `\p{Alpha}`; it also compiles `[[:alpha:]]`, but as a
nested class of the literal characters between the brackets rather than as a POSIX class. Unicode property escapes (`\p{L}`,
95+
`\p{Greek}`, etc.) are supported by both engines but cover slightly different sets of properties.
96+
- **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. The Rust
97+
[`regex`] crate disables octal escapes by default. It writes arbitrary codepoints as `\x{...}` or `\u{...}` and also accepts the four-digit `\uXXXX` form.
98+
- **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty
99+
strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different
100+
rules, so split results can differ in edge cases involving empty matches even when the pattern itself is
101+
identical on both sides.
102+
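The Java-side defaults described in the list above can be observed with a small standalone sketch against `java.util.regex`. It does not run the Rust engine. The `(?d)` (`UNIX_LINES`) case is included only as an approximation of the LF-only line handling that the list attributes to the Rust engine. The class and method names are illustrative, not part of Comet.

```java
import java.util.regex.Pattern;

class JavaRegexDefaults {
    // True if the whole input matches the regex under java.util.regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex is found anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside 0-9.
        String arabicIndicThree = "\u0663";

        // \d is ASCII-only by default; (?U) switches to Unicode character classes.
        System.out.println(matches("\\d", arabicIndicThree));     // false
        System.out.println(matches("(?U)\\d", arabicIndicThree)); // true

        // (?i) alone folds ASCII letters only; adding the u flag enables Unicode folding.
        // Pattern is a lowercase e-acute, input is an uppercase E-acute.
        System.out.println(matches("(?i)\u00e9", "\u00c9"));  // false
        System.out.println(matches("(?iu)\u00e9", "\u00c9")); // true

        // In multiline mode Java treats a lone carriage return as a line boundary.
        // With (?d) only a line feed counts, so the same pattern no longer matches.
        System.out.println(find("(?m)^b$", "a\rb\rc"));  // true
        System.out.println(find("(?md)^b$", "a\rb\rc")); // false
    }
}
```

A query whose data contains non-ASCII digits, accented letters, or bare carriage returns is therefore a good candidate for a side-by-side check before switching engines.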
103+
## When the Rust engine is safe
104+
105+
For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
106+
ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
107+
shape and want to avoid the JNI overhead of the Java engine, switching to the Rust engine with
108+
`spark.comet.expression.regexp.allowIncompatible=true` is generally safe.
109+
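One way to review a set of patterns against this shape is a conservative pre-check. The sketch below is hypothetical: `PortableShapeCheck` and `looksPortable` are names invented here, and this is not how Comet decides when to fall back. It accepts only ASCII word characters and spaces as literals, simple bracket classes of characters and ranges, and greedy or lazy quantifiers. It rejects everything else, including escapes, anchors, groups, and possessive quantifiers. Passing the check is not a guarantee of identical results; for instance, it says nothing about patterns that can match the empty string.

```java
import java.util.regex.Pattern;

class PortableShapeCheck {
    // One ASCII word character, usable as a literal or inside a class.
    private static final String CH = "[A-Za-z0-9_]";
    // A simple class such as [a-z0-9_]: plain characters and ranges only.
    private static final String CLASS = "\\[(?:" + CH + "-" + CH + "|" + CH + ")+\\]";
    // An atom is a literal word character, a space, or a simple class.
    private static final String ATOM = "(?:" + CH + "| |" + CLASS + ")";
    // Greedy or lazy quantifier; a trailing '+' (possessive) is not accepted.
    private static final String QUANT = "(?:[*+?]|\\{[0-9]+(?:,[0-9]*)?\\})\\??";
    // The whole pattern must be a sequence of optionally quantified atoms.
    private static final Pattern SHAPE =
        Pattern.compile("(?:" + ATOM + "(?:" + QUANT + ")?)*");

    // Hypothetical helper: true if the pattern fits the conservative shape above.
    static boolean looksPortable(String pattern) {
        return SHAPE.matcher(pattern).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksPortable("[a-z]+[0-9]{2,4}")); // true
        System.out.println(looksPortable("abc def"));          // true
        System.out.println(looksPortable("a*?"));              // true  (lazy quantifier)
        System.out.println(looksPortable("a++"));              // false (possessive)
        System.out.println(looksPortable("(\\w+) \\1"));       // false (group, escape, backreference)
        System.out.println(looksPortable("foo(?=bar)"));       // false (lookahead)
        System.out.println(looksPortable("^abc$"));            // false (anchors)
        System.out.println(looksPortable("caf\u00e9"));        // false (non-ASCII literal)
    }
}
```

The check errs on the side of rejection: many patterns it refuses are still handled identically by both engines, so a `false` result only means the pattern deserves a closer look.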
110+
For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
111+
defaults, use the Java engine (the default).
112+
113+
[`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
114+
[`regex`]: https://docs.rs/regex/latest/regex/
115+
[RE2]: https://github.com/google/re2/wiki/Syntax
