Commit f08f376

SEE LOG. Merge branch 'sv/integration-target--combinable-DFA-capture-resolution' into sv/integrate-combinable-DFA-capture-resolution

This was a really gnarly merge and took a while to get through. Both the branch implementing captures and the branch implementing `fsm_union_repeated_pattern_group` made interface changes and added significant behavior to `ast_analysis.c`, in ways that were at times tricky to reconcile. `ast_compile.c` was also restructured, but some of the changes were necessary to inform capture and unioning functionality.

As of now _almost_ all the tests are passing, but there are a couple specific things to note:

- This brings in an interface change to `fsm_endid_set`: it now returns an enum rather than just an int. Because if you squint enough they are both just numbers (according to the C language spec), code using `if (!fsm_endid_set(...)) { ... }` will not get a warning for the changed meaning of the return code. I prefer having this be an enum, but it IS an interface change, and I'm not opposed to changing it back in a later commit.

- `build/tests/capture/res_test_case_list:FAIL`

  I'm going to fix this in a later commit. I'm not 100% sure yet, but I think it's related to conflicting changes in the parser code, which I'm waiting to regenerate until after this merge commit.

- `build/tests/endids/res10_minimise_partial_overlap:FAIL`

  This has to do with `AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE` or `AST_ANALYSIS_ERROR_UNSUPPORTED_CAPTURE` not being handled yet by code specific to the native dialect. I'm going to handle that later, but will have to check whether native should behave like PCRE in those particular cases or not. Either the error handling code needs to be updated, or the code raising the UNSUPPORTED error needs to check which regex dialect is in effect.

- fuzz/target.c

  I have confirmed that the merged fuzzer harness code builds, but haven't yet spent time re-fuzzing anything.

  In my experience libfuzzer has bit-rotted over the last couple clang/LLVM releases and tends to nondeterministically crash in combination with some of the clang sanitizers now, so we may want to retarget this to using AFL++. That's well outside the scope of this PR, though.

- src/lx/parser.act

  I updated this but haven't re-generated the parsers yet, and I updated the generated code directly with a one-line change to reflect the `fsm_endid_set` interface change. As mentioned above, I'll re-generate that code in a separate commit.

Conflicts:

- fuzz/target.c
- include/adt/stateset.h
- include/re/re.h
- src/adt/stateset.c
- src/fsm/main.c
- src/libfsm/Makefile
- src/libfsm/capture.c
- src/libfsm/clone.c
- src/libfsm/closure.c
- src/libfsm/consolidate.c
- src/libfsm/determinise.c
- src/libfsm/determinise_internal.h
- src/libfsm/endids.c
- src/libfsm/epsilons.c
- src/libfsm/exec.c
- src/libfsm/internal.h
- src/libfsm/merge.c
- src/libfsm/minimise.c
- src/libfsm/state.c
- src/libre/ast.h
- src/libre/ast_analysis.c
- src/libre/ast_analysis.h
- src/libre/ast_compile.c
- src/libre/ast_rewrite.c
- src/libre/re.c
- src/lx/parser.act
- src/re/main.c
- tests/capture/captest.c
- tests/capture/captest.h
- tests/capture/capture3.c
- tests/capture/capture4.c
- tests/capture/capture5.c
- tests/capture/capture_concat1.c
- tests/capture/capture_concat2.c
- tests/capture/capture_union1.c
- tests/minimise/minimise_test_case_list.c

2 parents 239927c + 7ea5421 commit f08f376

462 files changed

Lines changed: 31818 additions & 9468 deletions

.github/workflows/ci.yml

Lines changed: 97 additions & 76 deletions
Large diffs are not rendered by default.

Makefile

Lines changed: 10 additions & 3 deletions
@@ -71,6 +71,7 @@ PKG += libtheft
 .if !defined(NODOC)
 SUBDIR += man/fsm.1
 SUBDIR += man/re.1
+SUBDIR += man/rx.1
 SUBDIR += man/lx.1
 SUBDIR += man/fsm_print.3
 SUBDIR += man/libfsm.3
@@ -101,6 +102,7 @@ SUBDIR += src/libre/print
 SUBDIR += src/libre
 SUBDIR += src/fsm
 SUBDIR += src/re
+SUBDIR += src/rx
 SUBDIR += src/retest
 SUBDIR += src/lx/print
 SUBDIR += src/lx
@@ -114,14 +116,16 @@ SUBDIR += tests/intersect
 SUBDIR += tests/eclosure
 SUBDIR += tests/equals
 SUBDIR += tests/subtract
+SUBDIR += tests/detect_required
 SUBDIR += tests/determinise
+SUBDIR += tests/eager_output
 SUBDIR += tests/endids
 SUBDIR += tests/epsilons
+SUBDIR += tests/fsm
 SUBDIR += tests/glob
 SUBDIR += tests/like
 SUBDIR += tests/literal
-# FIXME: commenting this out for now due to Makefile error
-#SUBDIR += tests/lxpos
+SUBDIR += tests/lxpos
 SUBDIR += tests/minimise
 SUBDIR += tests/native
 SUBDIR += tests/pcre
@@ -131,6 +135,8 @@ SUBDIR += tests/pcre-flags
 SUBDIR += tests/pcre-repeat
 SUBDIR += tests/pred
 SUBDIR += tests/re_literal
+SUBDIR += tests/re_strings
+SUBDIR += tests/regressions
 SUBDIR += tests/reverse
 SUBDIR += tests/trim
 SUBDIR += tests/union
@@ -141,6 +147,7 @@ SUBDIR += tests/sql
 SUBDIR += tests/queue
 SUBDIR += tests/aho_corasick
 SUBDIR += tests/retest
+SUBDIR += tests/re_interpolate_groups
 SUBDIR += tests
 .if make(theft) || make(${BUILD}/theft/theft)
 SUBDIR += theft
@@ -184,6 +191,6 @@ STAGE_BUILD := ${STAGE_BUILD:Nbin/cvtpcre}
 
 .if make(test)
 .END::
-	grep FAIL ${BUILD}/tests/*/res*; [ $$? -ne 0 ]
+	grep -I FAIL ${BUILD}/tests/*/*res*; [ $$? -ne 0 ]
 .endif

README.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,19 +4,23 @@
44
; re -cb -pl dot '[Ll]ibf+(sm)*' '[Ll]ibre' | dot
55
![libfsm.svg](doc/tutorial/libfsm.svg)
66

7+
libfsm is not a drop-in replacement for other regex engines, and it only supports patterns that can be compiled to deterministic FSMs. In return, supported patterns run in linear time.
8+
79
Getting started:
810

911
* See the [tutorial introduction](doc/tutorial/re.md) for a quick overview
1012
of the re(1) command line interface.
1113
* [Compilation phases](doc/tutorial/phases.md) for typical applications
1214
which compile regular expressions to code.
15+
* [Advice on using libfsm](doc/advice.md) for suggestions around compilation time, unsupported features, common usage patterns, and examples.
1316

1417
You get:
1518

1619
* libfsm — library for manipulating FSM (NFA and DFA)
1720
* libre — library for compiling regular expressions to NFA
1821
* fsm(1) — command line interface for FSM
19-
* re(1) — command line interface for executing regular expressions
22+
* re(1) — command line interface for regular expressions
23+
* rx(1) — command line interface for compiling sets of regular expressions
2024
* lx(1) — lexer generator
2125

2226
lx is an attempt to produce a simple, expressive, and unobtrusive

doc/advice.md

Lines changed: 276 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,276 @@
1+
# Advice on using libfsm for high-performance pattern matching
2+
3+
libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**.
4+
5+
Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6+
This means the same pattern can have different execution costs depending on the input.
7+
8+
libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9+
At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10+
11+
As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
12+
13+
**libfsm is not a drop-in replacement for traditional regex engines.** It only supports patterns that can be compiled to FSMs.
14+
15+
### **Topics**
16+
17+
- [What libfsm Cannot Do](#what-libfsm-cannot-do)
18+
- [Quick Start](#quick-start)
19+
- [Supported Code Generation Targets](#supported-code-generation-targets)
20+
- [Workflow Overview](#workflow-overview)
21+
- [Writing Effective libfsm Patterns](#writing-effective-libfsm-patterns)
22+
- [Byte Search Optimization](#byte-search-optimization-optional)
23+
- [Troubleshooting](#troubleshooting)
24+
- [Pattern Matches Empty String Unintentionally](#pattern-matches-empty-string-unintentionally)
25+
26+
## What libfsm Cannot Do
27+
28+
These PCRE features will not compile:
29+
30+
* Word boundaries (`\b`)
31+
* Non-greedy quantifiers (`*?`, `+?`, `??`)
32+
* Group capture (coming soon!) and backreferences
33+
* Lookahead/lookbehind assertions (`(?=`, `(?!`, `(?<=`, `(?<!`)
34+
* Conditional expressions (`(?(condition)then|else)`)
35+
* Recursion and subroutines (`(?R)`, `(?1)`)
36+
37+
## Quick Start
38+
39+
Generate a matcher from a regex:
40+
41+
```sh
42+
# Generate a Go matcher
43+
re -p -r pcre -l go -k str 'user\d+' > user_detector.go
44+
```
45+
46+
This produces a standalone matcher function.
47+
48+
## Supported Code Generation Targets
49+
50+
libfsm provides stable, “first-class” code generation for:
51+
- High-level languages: C (via `-l vmc`), Go, Rust
52+
- LLVM IR
53+
- Native WebAssembly
54+
55+
Adding code generation for new languages is straightforward and is defined in [src/libfsm/print/](../src/libfsm/print/).
56+
57+
## Workflow Overview
58+
59+
libfsm provides two main tools for pattern matching:
60+
- **`re`** takes patterns from the command line
61+
- **`rx`** takes patterns from a file
62+
63+
A recommended workflow when using libfsm is:
64+
65+
1. Validate the regex
66+
67+
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
68+
69+
2. Verify libfsm compatibility
70+
71+
If unsupported constructs exist, libfsm reports the failing location:
72+
```sh
73+
re -r pcre -l ast 'x*?'
74+
# Output: /x*?/:3: Unsupported operator
75+
```
76+
In this example, `:3` indicates that the character at byte offset three in the pattern is an unsupported feature.
77+
78+
```sh
79+
# patterns with unsupported operators are output to declined.txt
80+
rx -r pcre -l ast -d declined.txt 'x*?'
81+
```
82+
83+
84+
3. Generate code
85+
86+
```sh
87+
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
88+
```
89+
90+
4. Use multiple patterns
91+
92+
Execution complexity for the generated code is proportional to the length of the text being matched, not to the number of patterns.
93+
Assuming your generated code isn't too large to compile, this means you can have as many patterns as you want,
94+
for the same time it takes to execute a single pattern.
95+
96+
Take advantage of this.
97+
98+
```sh
99+
# re - patterns from command line:
100+
re -p -r pcre -l go -k str '^x?a b+c$' '^x*def?$' '^x$'
101+
102+
# rx - patterns from file:
103+
rx -p -r pcre -l vmc -k str -d skipped.txt patterns.txt > detectors.c
104+
```
105+
106+
5. Call the generated code from your program somehow
107+
108+
You're on your own for this. `-k` controls the API for the generated code to read in data to match. Try different options for the language you're using and see which suits you.
109+
110+
The generated API can also vary depending on how you want libfsm to handle ambiguities between different patterns. See the `AMBIG_*` flags in [include/fsm/options.h](../include/fsm/options.h) for different approaches there.
111+
112+
Both tools:
113+
* Combine all patterns into one function (like using `|` to join them)
114+
* Generate code that can return `(bool, int)` for the match status and pattern ID
115+
* Pattern ID is argument position for `re`, line number for `rx`
116+
* When encountering unsupported patterns: `rx` can decline them to `-d` file and generates code with working patterns only; `re` fails completely
117+
118+
### Common Flags
119+
120+
| Flag | Purpose | Common Options | Notes |
121+
|:----:|:---------------------------- |:------------------------------------------ |:---------------------------------------------------------------- |
122+
| `-r` | Regex dialect | `pcre`, `literal`, `glob`, `native`, `sql` | `pcre` supports the widest set of features |
123+
| `-l` | Output language for printing | `go`, `rust`, `vmc`, `llvm`, `wasm`, `dot` | Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
124+
| `-k` | Generated function I/O API | `str`, `getc`, `pair` | `str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
125+
| `-p` | Print mode | *(no value)* | Abbrv. of `-l fsm`. Print the constructed fsm, rather than executing it. |
126+
| `-d` | Declined patterns | filename | Only applies to `rx` (batch mode) |
127+
128+
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
129+
The man pages can be built by running `bmake -r doc`, then view with `man build/man/re.1/re.1`.
130+
131+
## Writing Effective libfsm Patterns
132+
133+
Generally, to keep generated code compact, stick to the least expressive subset of features.
134+
135+
libfsm has no way to know in advance what text you'll be passing to its generated code.
136+
For example, are you matching a string that you know will never contain a newline?
137+
libfsm doesn't know that.
138+
It has to generate code that's capable of handling any input.
139+
You can help it out by making your patterns precise.
140+
141+
Think about what you intend your pattern to match, and what it's actually capable of matching given arbitrary text.
142+
This helps restrict the scope of your pattern from arbitrary text to exactly what you mean.
143+
The following bits of advice illustrate various specific ways to bring down this scope.
144+
145+
1. Replace broad wildcards
146+
147+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
148+
149+
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
150+
151+
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
152+
153+
Use negated character classes to match only the allowed content:
154+
155+
| Avoid | Better |
156+
| ---------- | -------------- |
157+
| `<.*>` | `<[^>]*>` |
158+
| `\((.*)\)` | `\([^)]*\)`|
159+
| `price=.+` | `price=[0-9]+` |
160+
| `var\s.+=` | `var\s[^=]+=` |
161+
162+
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
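As a rough illustration of the first table row, the sketch below uses Go's standard `regexp` package (not libfsm, and not a measure of DFA size) to show how much more text `<.*>` accepts than `<[^>]*>`. The helper names and the sample input are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	broad  = regexp.MustCompile(`<.*>`)    // wildcard: may run across '>' characters
	narrow = regexp.MustCompile(`<[^>]*>`) // negated class: stops at the first '>'
)

// broadMatch returns the first match of <.*> in s, or "" if there is none.
func broadMatch(s string) string { return broad.FindString(s) }

// narrowMatch returns the first match of <[^>]*> in s, or "" if there is none.
func narrowMatch(s string) string { return narrow.FindString(s) }

func main() {
	in := "<a> text <b>"
	fmt.Printf("%q\n", broadMatch(in))  // "<a> text <b>"
	fmt.Printf("%q\n", narrowMatch(in)) // "<a>"
}
```

The wildcard version swallows everything up to the last `>`, which is rarely what a tag-like pattern means; the negated class states the intent directly.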
+
+2. Take care with bounded repetition
+
+If you have the pattern `^x{3,5}$`, libfsm's resulting DFA will be structured like "match an x, then match an x, then match an x, then match an x or skip it, then match an x or skip it, then report an overall match if at the end of input". It has to repeat the pattern, noting each time whether it's required or optional (beyond the lower count in `{min,max}`), because DFA execution doesn't have a counter, just the current state within the overall DFA.
+
+When the subexpression (represented by `x`) unintentionally matches too many things, they all have to be spelled out every time.
+So pay especially close attention to tightening up subexpressions in bounded repetition clauses.
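To make the unrolling concrete, here is a hand-written DFA for `^x{3,5}$` in Go. It is a sketch under stated assumptions, not libfsm output: the state names are invented for this example, and `$` is treated as the strict end of input for simplicity. There is no counter, only one state per position in the repetition.

```go
package main

import "fmt"

// States of a hand-written DFA for ^x{3,5}$ (illustrative names).
const (
	s0   = iota // start: no 'x' seen yet
	s1          // one required 'x' seen
	s2          // two required 'x' seen
	s3          // three 'x' seen: accepting
	s4          // fourth (optional) 'x' seen: accepting
	s5          // fifth (optional) 'x' seen: accepting
	dead        // no match is possible from here
)

// step returns the next state after reading byte c in the given state.
func step(state int, c byte) int {
	if c != 'x' {
		return dead
	}
	switch state {
	case s0:
		return s1
	case s1:
		return s2
	case s2:
		return s3
	case s3:
		return s4
	case s4:
		return s5
	default: // s5 (a sixth 'x') or dead
		return dead
	}
}

// matchX3to5 runs the DFA over the whole input in a single pass.
func matchX3to5(in string) bool {
	state := s0
	for i := 0; i < len(in); i++ {
		state = step(state, in[i])
	}
	return state == s3 || state == s4 || state == s5
}

func main() {
	for _, in := range []string{"xx", "xxx", "xxxxx", "xxxxxx"} {
		fmt.Println(in, matchX3to5(in))
	}
}
```

Each repetition costs a state here; if `x` were a larger subexpression, its whole sub-automaton would be duplicated five times in the same way.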
+
+3. Anchor when matching full string
+
+When the intention is to match an entire string, use anchors.
+Use `^` at the beginning and `\z` for the true end of the string.
+
+```regex
+# Correct: matches only this exact hostname
+# Matches "web12.example.com"
+# Does not match "foo-web12.example.com-bar"
+^web\d+\.example\.com\z
+
+# Incorrect: would match inside a larger string
+# Matches "web12.example.com"
+# Also matches "foo-web12.example.com-bar"
+web\d+\.example\.com
+```
+
+4. Prefer `\z` over `$` for End-of-String
+
+`\z` always matches the end of the string.
+`$` will also match a trailing newline at the end of the string,
+so if you use this in combination with capturing groups, you may not be capturing what you expect.
+Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
+
+```regex
+# Preferred: matches only if the string ends with "bar"
+# Matches "/foo/bar"
+# Does NOT match "/foo/bar\n"
+/bar\z
+
+# Incorrect: allows a trailing newline,
+# which is usually unintended and adds unnecessary complexity
+# Matches "/foo/bar"
+# Also matches "/foo/bar\n"
+/bar$
+```
+
+5. Escape special characters when used as literals
+
+Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
+If you mean to match them literally, escape them:
+
+| Literal You Want       | Correct Regex          | Explanation                                       |
+|------------------------|------------------------|---------------------------------------------------|
+| `example.com`          | `example\.com`         | `.` matches any character unless escaped          |
+| `a+b`                  | `a\+b`                 | `+` means “one or more”                           |
+| `price?`               | `price\?`              | `?` means “optional”                              |
+| `[value]`              | `\[value\]`            | `[` and `]` start/end a character class           |
+| `(test)`               | `\(test\)`             | `(` and `)` begin/end a group                     |
+| Markdown link `[t](u)` | `(\[[^]]*\]\([^)]*\))` | Matches `[text](url)` without crossing `]` or `)` |
+
+The `.` wildcard in particular is often mistakenly left unescaped in practice.
+On testing, it will match a literal `.` as intended. But it will also match any other character.
+This means that not only is your pattern incorrect (write negative test cases!),
+but also this part of your FSM is 256 times larger than it should be.
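A negative test catches the unescaped dot quickly. The sketch below uses Go's standard `regexp` package as a stand-in for a generated matcher (an assumption made for illustration; the helper names are hypothetical and this is not libfsm's API).

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	unescaped = regexp.MustCompile(`^example.com\z`)  // '.' is a wildcard here
	escaped   = regexp.MustCompile(`^example\.com\z`) // '.' is a literal dot
)

// matchesUnescaped reports whether s matches the pattern with the bare dot.
func matchesUnescaped(s string) bool { return unescaped.MatchString(s) }

// matchesEscaped reports whether s matches the pattern with the escaped dot.
func matchesEscaped(s string) bool { return escaped.MatchString(s) }

func main() {
	// Positive case: both patterns accept the intended hostname.
	fmt.Println(matchesUnescaped("example.com"), matchesEscaped("example.com")) // true true
	// Negative case: only the unescaped pattern accepts a wrong byte.
	fmt.Println(matchesUnescaped("exampleXcom"), matchesEscaped("exampleXcom")) // true false
}
```

A test suite with only the positive case passes for both patterns, which is why the mistake tends to survive review.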
+
+6. Use non-capturing groups
+
+Capture groups are _currently_ not supported (coming soon!).
+
+If you don't need to capture things, don't use capture.
+If you need grouping for alternation or precedence, use PCRE's non-capturing syntax `(?:...)`:
+
+```regex
+# Correct
+(?:private|no-store)
+
+# Not what's intended
+(private|no-store)
+```
+
+## Byte Search Optimization
+
+Patterns that start with an uncommon character can be accelerated using an initial byte scan before running the FSM.
+This quickly jumps to likely match positions instead of scanning every byte.
+
+Good candidates are patterns that start with uncommon prefix characters, for example:
+
+```regex
+#tag-[a-z]+
+@user-[0-9]+
+\[section\]
+{"key":
+"name='[^']+'"
+```
+
+These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) are rare in normal text, so a byte search can skip ahead before running the matcher.
+
+We found using `strings.IndexByte` before calling the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
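A minimal sketch of that pre-scan in Go, for the `#tag-[a-z]+` example. `matchTag` is a hand-written, hypothetical stand-in for a generated matcher that is anchored at the start of its input; real generated code will have a different name and shape, so treat the wiring as an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// matchTag is a stand-in for a generated matcher (hypothetical, not libfsm
// output). It reports whether s begins with "#tag-" followed by at least
// one lowercase ASCII letter.
func matchTag(s string) bool {
	const prefix = "#tag-"
	if !strings.HasPrefix(s, prefix) {
		return false
	}
	rest := s[len(prefix):]
	return len(rest) > 0 && rest[0] >= 'a' && rest[0] <= 'z'
}

// findTag returns the byte offset of the first position where matchTag
// succeeds, or -1. strings.IndexByte jumps from one '#' to the next, so the
// matcher only runs at candidate positions.
func findTag(s string) int {
	off := 0
	for {
		i := strings.IndexByte(s[off:], '#')
		if i < 0 {
			return -1
		}
		off += i
		if matchTag(s[off:]) {
			return off
		}
		off++ // this '#' was not a match; resume scanning after it
	}
}

func main() {
	text := strings.Repeat("plain text ", 1000) + "# not a tag, #tag-go here"
	fmt.Println(findTag(text))
}
```

The long run of ordinary text is skipped by the byte search alone; the matcher is invoked only twice, once per `#`.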
+
+## Pattern Matches Empty String Unintentionally
+
+Pattern:
+
+```regex
+\s*
+```
+
+Will compile to code that always returns true.
+
+This is only an issue if that is not what you intend.
+
+**Fix options:**
+
+* Require at least one match: `\s+`
+* Anchor context: `^\s+$` or alternatively, use `-Fb` flag
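The effect is easy to reproduce with any unanchored matcher. The sketch below uses Go's standard `regexp` package rather than libfsm-generated code (an assumption for illustration; the helper names are hypothetical) to compare `\s*` with the two pattern-level fixes.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	star     = regexp.MustCompile(`\s*`)   // matches the empty string anywhere
	plus     = regexp.MustCompile(`\s+`)   // needs at least one whitespace byte
	anchored = regexp.MustCompile(`^\s+$`) // whole input must be whitespace
)

// The helpers report whether s contains a match for each pattern.
func matchStar(s string) bool     { return star.MatchString(s) }
func matchPlus(s string) bool     { return plus.MatchString(s) }
func matchAnchored(s string) bool { return anchored.MatchString(s) }

func main() {
	for _, in := range []string{"", "abc", "a b", " \t"} {
		fmt.Printf("%q star=%v plus=%v anchored=%v\n",
			in, matchStar(in), matchPlus(in), matchAnchored(in))
	}
}
```

`\s*` accepts every input, including ones with no whitespace at all, because a zero-length match is always available.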

examples/bm/Makefile

Lines changed: 3 additions & 0 deletions
@@ -10,3 +10,6 @@ pcre: pcre.c
 libfsm: libfsm.c
 	gcc -o libfsm -O3 -Wall -std=c99 ${BM_CFLAGS} libfsm.c -I ../../include ../../build/lib/libre.a ../../build/lib/libfsm.a
 
+clean:
+	rm -f pcre libfsm
+

examples/bm/libfsm.c

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@ main(int argc, char *argv[])
 	opt.io = FSM_IO_STR;
 
 	p = argv[0];
-	fsm = re_comp(RE_PCRE, fsm_sgetc, &p, &opt, flags, &e);
+	fsm = re_comp(RE_PCRE, fsm_sgetc, &p, NULL, flags, &e);
 	if (fsm == NULL) {
 		re_perror(RE_LITERAL, &e, NULL, s);
 		return 1;
@@ -80,7 +80,7 @@ main(int argc, char *argv[])
 	printf("#include <time.h>\n");
 	printf("\n");
 
-	fsm_print_c(stdout, fsm);
+	fsm_print(stdout, fsm, &opt, NULL, FSM_PRINT_C);
 
 	printf("int\n");
 	printf("main(void)\n");
