Add a built-in RE2 implementation based on re2js#290
Add a built-in RE2 implementation based on re2js#290jonbodner-buf wants to merge 18 commits intomainfrom
Conversation
…h broken imports.
srikrsna
left a comment
There was a problem hiding this comment.
Left some comments regarding the integration. I'll review the RE2 implementation in a second pass.
| /\\c[A-Z]/, // control character eg: /\cM\cJ/ | ||
| /\\u[0-9a-fA-F]{4}/, // UTF-16 code-unit | ||
| /\\0(?!\d)/, // NUL | ||
| /\[\\b.*\]/, // Backspace eg: [\b] |
| // can probably delete this since the RE2 engine will already reject them, but keep for now | ||
| for (const invalidPattern of invalidPatterns) { | ||
| if (invalidPattern.test(pattern)) { | ||
| throw new Error( | ||
| `Error evaluating pattern ${pattern}, invalid RE2 syntax`, | ||
| ); | ||
| } |
| } | ||
| const re = new RegExp(pattern, flags); | ||
| return re.test(this); | ||
| const re: RE2JS = RE2JS.compile(pattern, flagVal); |
There was a problem hiding this comment.
My understanding is that flags are part of the syntax in RE2? Can we add support for them in the library instead of trying to identify them here? Or am I missing something?
cel-go also passes them directly to regex engine without any preprocessing: https://github.com/google/cel-go/blob/646cdc1728643aec9499e3a00236ef1007a5d3fa/common/types/string.go#L156
There was a problem hiding this comment.
yeah, not needed. Removing the code.
| "@bufbuild/re2": "0.4.0" | ||
| }, | ||
| "devDependencies": { | ||
| "@unicode/unicode-16.0.0": "^1.6.16", |
There was a problem hiding this comment.
Is this a peer dependency that is required?
| "peggy": "^5.0.6", | ||
| "peggy-ts": "github:hudlow/peggy-ts#v0.0.9", | ||
| "expect-type": "^1.3.0" | ||
| "unicode-property-value-aliases": "^3.9.0" |
There was a problem hiding this comment.
Is this a peer dependency that is required?
| "@unicode/unicode-16.0.0": "^1.6.16", | ||
| "unicode-property-value-aliases": "^3.9.0" |
There was a problem hiding this comment.
I see that these are added as dev dependencies in the cel package as well, will users end up needing them?
There was a problem hiding this comment.
no, they are used for tests and to build a unicode lookup table.
| }, | ||
| "dependencies": { | ||
| "@bufbuild/re2": "^0.1.0" |
There was a problem hiding this comment.
I think it's needed? The cel package now depends on the re2 package to run.
There was a problem hiding this comment.
But this is the root workspace package.json not for the CEL package.
|
|
||
| // biome-ignore format: table | ||
| export default [ | ||
| // ! |
There was a problem hiding this comment.
We should revert the formatting changes here. It was deliberately formatted like this for readability
There was a problem hiding this comment.
yeah, this wasn't intentional. I'll revert it.
| SemanticAdorner, | ||
| toDebugString, | ||
| } from "@bufbuild/cel-spec/testdata/to-debug-string.js"; | ||
| } from "../../cel-spec/dist/cjs/testdata/to-debug-string.js"; |
There was a problem hiding this comment.
| } from "../../cel-spec/dist/cjs/testdata/to-debug-string.js"; | |
| } from "@bufbuild/cel-spec/testdata/to-debug-string.js"; |
|
|
||
| ## Credits | ||
| This code is a fork of the [RE2JS](https://re2js.leopard.in.ua) project. It has been converted to TypeScript and has a feature set tailored for | ||
| CEL and Protovalidate-es. No newline at end of file |
There was a problem hiding this comment.
I don't think it does anything special for protovalidate, does it?
| CEL and Protovalidate-es. | |
| CEL. |
There was a problem hiding this comment.
Not yet, but when native rules are added to protovalidate-es, we will want to use the re2 engine for string pattern rules. Agree that we should remove that for now.
| pattern = pattern.substring(flagMatches[0].length); | ||
| } | ||
| const re = new RegExp(pattern, flags); | ||
| const re: RE2JS = RE2JS.compile(pattern); |
There was a problem hiding this comment.
Seconding Krishna's comment - return RE2JS.compile(pattern).test(this); is idiomatic style.
| celMethod(olc.ENDS_WITH, STRING, [STRING], BOOL, String.prototype.endsWith), | ||
| celMethod(olc.STARTS_WITH, STRING, [STRING], BOOL, String.prototype.startsWith), | ||
| celMethod(olc.MATCHES, STRING, [STRING], BOOL, matches), | ||
| // ! |
There was a problem hiding this comment.
The previous formatting looks more readable to me, and there's a formatter ignore comment on line 86 to keep it. Any reason to change the formatting? If yes, let's drop the formatter ignore.
There was a problem hiding this comment.
format is back to the way it was (this was an unintentional change)
| SemanticAdorner, | ||
| toDebugString, | ||
| } from "@bufbuild/cel-spec/testdata/to-debug-string.js"; | ||
| } from "../../cel-spec/dist/cjs/testdata/to-debug-string.js"; |
There was a problem hiding this comment.
This can break in subtle and mysterious ways. Need to keep the previous import.
There was a problem hiding this comment.
I'm unclear. The current code is:
} from "../../cel-spec/dist/cjs/testdata/to-debug-string.js";
Do I change it to:
} from "@bufbuild/cel-spec/testdata/to-debug-string.js";
Or do I leave it as-is?
| "include": ["src/**/*.test.ts"], | ||
| "exclude": ["./src/__tests__", "./src/__fixtures__", "./src/__utils__"] |
There was a problem hiding this comment.
This is broken - the exclude means we aren't catching any compiler errors in tests, fixtures, or utils.
There was a problem hiding this comment.
I've moved the tests out of tests and the utils out of utils into src. The fixtures directory is now being scanned as well.
| dist/*/testing.js | ||
| dist/*/testing.d.ts |
There was a problem hiding this comment.
Those two lines aren't needed here. The picture will change with a fixed tsconfig.json that includes tests and fixtures.
c008b4c to
5530de2
Compare
…e2 package. Fix all issues found.
|
|
||
| describe("bug-hunt verification", () => { | ||
| // Phase 1c: DFA.match ANCHOR_START with pos>0 | ||
| test("executeEngine with ANCHOR_START and pos>0 finds substring match", () => { |
There was a problem hiding this comment.
The test expects ANCHOR_START to mean "anchor the match to the current pos index (3)". However, in RE2 semantics, ANCHOR_START strictly means "the match must start at the absolute beginning of the input string (index 0)". So test is incorrect
P.S. update re2js vendoring, because latest will not pass here
|
|
||
| // Phase 1b: equalsIgnoreCase EOF handling | ||
| test("equalsIgnoreCase(-1, X) returns true per current implementation", () => { | ||
| assert.strictEqual(equalsIgnoreCase(-1, 0x41), true); |
There was a problem hiding this comment.
-1 represents EOF. Case-insensitively comparing EOF to a valid character like A (0x41) should absolutely be false. So it is incorrect test case
P.S. update re2js vendoring, because latest will not pass here
| @@ -0,0 +1,49 @@ | |||
| { | |||
| "name": "@bufbuild/re2", | |||
| "version": "0.4.0", | |||
There was a problem hiding this comment.
re2js already 2.3.1. Before v2 version it will be slow, because have only Pike VM (NFA), no DFA engines - https://github.com/le0pard/re2js#re2js-vs-re2-node-c-bindings
| "// returns stride-1 ranges. Surrogates are included — String.fromCodePoint", | ||
| "// returns the lone surrogate char and platform regex matches \\p{Cs} on it.", | ||
| "const sweepPlatform = (pattern: string): Uint32Array => {", | ||
| ' const re = new RegExp(pattern, "u")', |
There was a problem hiding this comment.
Cool idea about optimization, but just for info about problems to run on older js engines: the pattern passed into this function contains strings like \p{General_Category=L} or \p{Script=Latin}. Passing this to the native RegExp constructor alongside the "u" flag requires ES2018 support for Unicode property escapes. Older browsers, older versions of React Native, or legacy Node.js environments will immediately throw a SyntaxError here, crashing the script (which maybe not a problem, if not need support Node.js older than 10 version)
| ' const re = new RegExp(pattern, "u")', | ||
| " const ranges: number[] = []", | ||
| " let start = -1", | ||
| " for (let cp = 0; cp <= 0x10ffff; cp++) {", |
There was a problem hiding this comment.
This for loop executes 1,114,112 times, allocating a string and executing a regex test on every single Unicode codepoint. Because JavaScript is single-threaded, this entirely blocks the main thread until the loop finishes. First-use CPU Spike, which blocking the main thread will happening here. That is why original re2js pays a penalty in bundle size by pre-compiling all arrays, but its compilation speed is consistently fast
| "", | ||
| " const base = sweepPlatform(pattern)", | ||
| ' const delta = kind === "category" ? _DELTA_CATEGORIES.get(name) : _DELTA_SCRIPTS.get(name)', | ||
| " const merged = delta ? mergeRanges(base, delta) : base", |
There was a problem hiding this comment.
Just for information: the delta mechanism assumes the host is running at least Unicode 15.0. If the host environment is running an ancient version of V8/Node with Unicode 12.0 or 13.0, applying the 15.0 -> 16.0 delta on top of the host's 12.0 base will result in an inaccurate and broken Unicode table (which maybe not a problem, if not need support Node.js older than 18 version)
The CEL spec says that its regular expressions meet the RE2 spec, but the existing ES implementation defaults to the regex implementation that's built into ES runtimes, which uses a backtracking implementation that has pathological cases susceptible to ReDOS attacks.
This PR adds a stripped-down version of RE2JS (https://github.com/le0pard/re2js) as a package, and integrates it into CEL as the default regex engine.