Skip to content

Commit 94f8bfb

Browse files
authored
Merge pull request #295 from EsadCetiner/main
chore: sync contribution guidelines
2 parents 7290ed8 + 752b1a4 commit 94f8bfb

1 file changed

Lines changed: 47 additions & 0 deletions

File tree

content/6-development/6-1-contribution-guidelines.md

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -266,6 +266,53 @@ Lazy and greedy matching change the order in which a regular expression engine p
266266

267267
Possessive quantifiers (e.g., `x++`) and atomic groups (e.g., `(?>x)`) are tools that can be used to prevent a backtracking engine from backtracking. They _can_ be used for performance optimization but are only supported by backtracking engines and, therefore, are not permitted in CRS rules.
268268

269+
### Avoiding Catastrophic Backtracking (Ambiguity)
270+
271+
CRS cannot use possessive quantifiers (e.g., `++`, `?+`) or atomic groups (also known as possessive groups, e.g., `(?>ab|cd)`) to bound backtracking, because neither are supported by RE2. Therefore, the only defense against catastrophic backtracking is to write expressions that are not ambiguous in the first place. The root cause of catastrophic backtracking is **ambiguity, not greediness**: it happens when a quantified group (i.e., a group with a quantifier, e.g., `(?:...)+`) can match the same input in more than one way, and a later part of the expression fails to match. The engine is then forced to try every possible split before giving up. Switching greedy quantifiers to lazy ones (e.g., `+?` instead of `+`) does _not_ fix this — it only changes the order in which the splits are tried.
272+
273+
The principle to follow is: **inside a quantified group, every input character should be consumable by exactly one branch (and one position), and any given run of whitespace should be consumable in exactly one place.** Two shapes commonly violate this:
274+
275+
* A bare whitespace branch (`\s`) placed next to a branch whose body also matches whitespace, for example line comments (`#.*`, `//.*`) or block comments (`/\*.*\*/`). The same space or tab can be matched either by the `\s` branch or by the comment body, so every whitespace character doubles the number of paths to explore.
276+
* A token wrapped in `\s*` on _both_ sides inside a repeated group, for example `(?:\s*\(\s*)*`, especially when the group is itself surrounded by `\s*` separators. The trailing `\s*` of one iteration and the leading `\s*` of the next overlap on the same whitespace.
277+
278+
For example, this suffix (matching optional whitespace and PHP comments before a `(`) is ambiguous, because the `\s` branch overlaps the comment bodies:
279+
280+
```python
281+
(?:\s|/\*.*\*/|(?:#|//).*)*\(.*\)
282+
```
283+
284+
The de-ambiguated version uses the "unrolling-the-loop" form for block comments, so that whitespace inside a comment can only be consumed by the comment branch, and requires line comments to terminate at a line break (a `#` or `//` comment before `(` is only reachable across a newline):
285+
286+
```python
287+
(?:\s|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|(?:#|//)[^\n\r]*[\n\r])*\(.*\)
288+
```
289+
290+
The block-comment branch `/\*[^*]*\*+(?:[^/*][^*]*\*+)*/` is the "unrolling-the-loop" technique described by Jeffrey Friedl in [_Mastering Regular Expressions_](https://www.oreilly.com/library/view/mastering-regular-expressions/0596528124/) (O'Reilly). It replaces the ambiguous `/\*.*\*/` with an equivalent that matches the comment body in exactly one way. Read part by part:
291+
292+
* `/\*` — the literal `/*` opener.
293+
* `[^*]*` — a run of characters that are not `*` (the bulk of the comment body).
294+
* `\*+` — one or more `*` (a candidate closer).
295+
* `(?:[^/*][^*]*\*+)*` — zero or more repeats of: a character that is _neither_ `/` nor `*` (so the preceding `*`s were not the closer), then more non-`*` characters, then one or more `*` again.
296+
* `/` — the literal `/` that finally closes the comment.
297+
298+
Because every position is consumable by exactly one part, the engine never has two ways to match the same comment, so there is nothing to backtrack over. The general pattern — `START [^Q]* Q+ (?:[^EQ] [^Q]* Q+)* END` — applies to any "delimited block" construct, not just C-style comments, where:
299+
300+
* `Q` is the inner terminator character — the character that signals a *candidate* end of the block (`*` here).
301+
* `E` is the closing-delimiter character that, following a run of `Q`, actually ends the block (`/` here; it is the same character as `END`).
302+
303+
So `[^EQ]` matches a character that neither closes the block (`E`) nor extends the run of `Q`, which is what makes the preceding `Q+` unambiguously "not the closer".
304+
305+
When checking an expression for catastrophic backtracking, two points specific to CRS are worth keeping in mind:
306+
307+
* **Test with `.` matching newlines (DOTALL).** ModSecurity runs `@rx` patterns in DOTALL mode, so `.` matches newline characters. A `.+` or `.*` therefore consumes newlines and whitespace, which feeds backtracking. A quick test in an engine or checker that does _not_ enable DOTALL (the default in many tools, including [devina.io](https://devina.io/redos-checker)) can under-report the problem.
308+
* **The meaningful signal is whether PCRE/PCRE2 actually exceeds its backtracking limit, not whether a checker reports theoretical "exponential" complexity.** RE2 is immune to ReDoS, and PCRE2's optimizer (for example, scanning for a required literal that is absent) tames many theoretically-ambiguous expressions. However, when PCRE2 _does_ exceed its backtracking limit, it returns an error rather than a result — which makes the rule fail to match (a potential bypass) while still consuming CPU. That outcome is more serious than a merely slow match and should be treated accordingly.
309+
310+
### Curly braces
311+
312+
Curly braces are used for repetition quantifiers (`{m}`, `{m,}`, `{m,n}`). When a brace is intended as a literal character, especially an opening brace, it must be escaped (`\{`) so it is not parsed as the start of a quantifier. Some regex engines treat an unescaped `{` that does not form a valid quantifier as a syntax error rather than a literal.
313+
314+
Inside character classes (`[...]`), braces are literal characters and must not be escaped, as escaping is unnecessary and may reduce readability or portability.
315+
269316
### Writing Regular Expressions for Non-Backtracking Compatibility
270317
271318
Traditional regular expression engines use backtracking to solve some additional problems, such as finding a string that is preceded or followed by another string. While this functionality can certainly come in handy and has its place in certain applications, it can also lead to performance issues and, in uncontrolled environments, open up possibilities for attacks (the term "[ReDoS](https://en.wikipedia.org/wiki/ReDoS)" is often used to describe an attack that exhausts process or system resources due to excessive backtracking).

0 commit comments

Comments
 (0)