notes/17-regular-expressions.typ
17 lines changed: 17 additions & 0 deletions
@@ -60,6 +60,21 @@ A famous method for building an NFA from a regular expression. The resulting NFA
 - Each state has at most two outgoing $epsilon$-transitions or one non-$epsilon$-transition.
 ]
 
+#info_box(title: "Lemmas on NFA(e) Construction and Size")[
+*Claim:* A regular expression $e$ of size $m$ (the number of occurrences of alphabet characters) can be written so that it contains at most:
+- $2m$ parentheses
+- $m$ binary operators (union `|` and concatenation, which is written implicitly, e.g., in `(ab)`)
+- $2m$ occurrences of the Kleene star operator `*`
+
+*Corollary:* The total length of such a regular expression $e$, denoted $|e|$, is at most $m + 2m + m + 2m = 6m$.
+
+*Claim:* If NFA($e$) = $(V, E)$ is the NFA constructed from $e$ using Thompson's construction, then:
+- The number of states $|V|$ is at most $8m$.
+- The number of edges $|E|$ is at most $13m$.
+
+*Corollary:* An NFA for a regular expression $e$ of size $m$ can be constructed in $O(m)$ time.
+]
+
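The size bounds can be checked on a small example. The following is a minimal sketch, not the construction from the notes verbatim: it assumes the regex is already parsed into nested tuples (a hypothetical AST format with the tags `"char"`, `"cat"`, `"alt"`, `"star"`), and it assumes a common variant of Thompson's construction in which concatenation links the two fragments with one $epsilon$-edge. Under these assumptions each character adds 2 states and 1 edge, each union or star adds 2 states and 4 edges, and each concatenation adds 1 edge.

```python
from itertools import count

EPS = None  # label used for epsilon-transitions (an assumption of this sketch)


def thompson(node, edges, fresh):
    """Build the fragment for `node`; return its (start, accept) states.

    `edges` collects (source, label, target) triples, `fresh` yields new state ids.
    """
    kind = node[0]
    if kind == "char":
        s, t = next(fresh), next(fresh)
        edges.append((s, node[1], t))
        return s, t
    if kind == "cat":
        s1, t1 = thompson(node[1], edges, fresh)
        s2, t2 = thompson(node[2], edges, fresh)
        edges.append((t1, EPS, s2))  # link the two fragments
        return s1, t2
    if kind == "alt":
        s1, t1 = thompson(node[1], edges, fresh)
        s2, t2 = thompson(node[2], edges, fresh)
        s, t = next(fresh), next(fresh)
        edges += [(s, EPS, s1), (s, EPS, s2), (t1, EPS, t), (t2, EPS, t)]
        return s, t
    if kind == "star":
        s1, t1 = thompson(node[1], edges, fresh)
        s, t = next(fresh), next(fresh)
        edges += [(s, EPS, s1), (s, EPS, t), (t1, EPS, s1), (t1, EPS, t)]
        return s, t
    raise ValueError(f"unknown node kind: {kind}")


def nfa_size(ast):
    """Return (number of states, number of edges) of the NFA built for `ast`."""
    edges, fresh = [], count()
    thompson(ast, edges, fresh)
    return next(fresh), len(edges)  # the next unused id equals the state count


# (a|b)*c has m = 3 character occurrences.
ast = ("cat", ("star", ("alt", ("char", "a"), ("char", "b"))), ("char", "c"))
states, num_edges = nfa_size(ast)
print(states, num_edges)  # prints "10 12", within 8m = 24 and 13m = 39
```

The construction visits every AST node once and does constant work per node, which is consistent with the $O(m)$ construction time stated in the corollary.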
 === NFA Simulation
 
 To find matches of a regex $v$ in a text $T$, we *build an NFA for the regex $Sigma^* v$*. This allows a match to begin at any point in the text.
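One way to picture the effect of the $Sigma^*$ prefix is a state-set simulation that puts the start state back into the active set after every character. The sketch below makes several assumptions: the NFA is a hand-written edge list for the example regex `ab*c` (not produced by Thompson's construction), `None` labels $epsilon$-edges, and the helper names `eps_closure` and `match_ends` are hypothetical. It scans the whole edge list at each step for brevity; adjacency lists would be needed to actually reach the $O(m n)$ bound.

```python
def eps_closure(states, edges):
    """Return all states reachable from `states` via epsilon-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for p, label, q in edges:
            if p == u and label is None and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def match_ends(text, edges, start, accept):
    """Return every end position i such that some substring text[j:i] matches."""
    start_set = eps_closure({start}, edges)
    current = set(start_set)
    ends = [0] if accept in current else []
    for i, ch in enumerate(text):
        moved = {q for p, label, q in edges if p in current and label == ch}
        # Re-adding start_set emulates the Sigma^* loop: a match may begin here.
        current = eps_closure(moved, edges) | start_set
        if accept in current:
            ends.append(i + 1)
    return ends


# Hand-written NFA for ab*c: 0 -a-> 1 -eps-> 2, 2 -b-> 2, 2 -c-> 3 (accepting).
EDGES = [(0, "a", 1), (1, None, 2), (2, "b", 2), (2, "c", 3)]
print(match_ends("xabbcac", EDGES, start=0, accept=3))  # prints [5, 7]
```

The two reported positions are the ends of the matches `abbc` and `ac`; without the re-added start set, only matches beginning at position 0 would be found.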
@@ -164,3 +179,5 @@ The subset construction creates a DFA state for every possible *subset* of NFA s
 The primary advantage is speed, especially for large files and complex patterns. Regex engines, particularly NFA-based ones, can be relatively slow ($O(m n)$ for a pattern of size $m$ and a text of length $n$). Many patterns contain simple, literal substrings (or "necessary factors") that must exist for a match to be possible.
 1. *Fast Pre-filtering:* Searching for a fixed string is extremely fast (e.g., using Boyer-Moore for a single string, which is often sublinear in practice, or Aho-Corasick for a set of strings, which runs in $O(n)$ plus the number of reported matches).
 2. *Reducing Expensive Work:* By first identifying the locations of these mandatory substrings, the expensive, full regex engine only needs to be run on a few small portions of the text. If the necessary factor is rare, this can eliminate over 99% of the text from consideration, leading to a massive performance improvement.
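The two steps can be sketched for a line-oriented search, in the style of `grep`. This is an illustration under assumptions: the function name `grep_with_prefilter` is hypothetical, the caller is responsible for supplying a `factor` that really occurs in every match of `pattern`, and Python's built-in substring test stands in for Boyer-Moore or Aho-Corasick.

```python
import re


def grep_with_prefilter(lines, pattern, factor):
    """Yield the lines that match `pattern`.

    `factor` must be a literal string contained in every possible match,
    so a line without it can be skipped without running the regex engine.
    """
    regex = re.compile(pattern)
    for line in lines:
        # Cheap literal test first; the regex only runs on candidate lines.
        if factor in line and regex.search(line):
            yield line


lines = [
    "alice@example.com wrote a note",
    "no address on this line",
    "bob@example.org replied",
]
hits = list(grep_with_prefilter(lines, r"\w+@example\.com", "@example.com"))
print(hits)  # prints ['alice@example.com wrote a note']
```

Only the first line reaches the regex engine here; the other two are rejected by the literal test alone. Restricting the regex to whole lines keeps the sketch simple, whereas narrowing the search to a window around each literal hit would need a bound on the match length.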