Commit 82d3bef

refactor(16): Minor changes
1 parent c44b1d0 commit 82d3bef

1 file changed

Lines changed: 31 additions & 14 deletions

File tree

notes/16-approximate-string-matching-2.typ

@@ -25,17 +25,19 @@ A match of the entire pattern $P$ with at most $k$ errors is found at text posit
2525
The recurrence relation from the DP matrix can be translated into a series of bitwise operations. For each character $T[i]$, the new state vectors $s'_q$ are computed from the old state vectors $s_q$.
2626

2727
The update for $s'_q$ involves several components:
28-
- *Substitution*: Handled by OR-ing the previous state $s_q-1$ with the character mask $t[T[i]]$.
28+
- *Substitution*: Handled by OR-ing the previous state $s_(q-1)$ with the character mask $t[T[i]]$.
2929
- *Deletion*: A right shift of the current state $s_q$.
30-
- *Insertion*: A right shift of the previous state $s_q-1$.
30+
- *Insertion*: A right shift of the previous state $s_(q-1)$.
3131

32-
Combining these for the Levenshtein distance ($d_L$, where substitution costs 2) gives a complex but efficient update formula. A simplified version for edit distance ($d_E$, sub cost = 1) is:
32+
For the Levenshtein distance $d_L$ the substitution cost is $S=2$; for the edit distance $d_E$ it is $S=1$. The update formula is:
3333

3434
$
35-
s'_q = ((s_q <<< 1) | t[T[i]]) & ((s_q-1 <<< 1)) & (s_q-1 <<< 1 | t[T[i]]) & s_q-1
35+
s_i[q,j] = s_(i-1) [q-1,j] and s_i [q-1,j-1] and (s_(i-1) [q-S, j-1] and (s_(i-1) [q,j-1] or t[T[i],j-1]))
3636
$
3737

3838
where $t[c]$ is the precomputed character mask for character $c$.
39+
The terms taken from rows with fewer errors than $q$ (rows $q-1$ and $q-S$) correspond to the regular DP update equations.
40+
If an error is already being propagated and the or term contributes a new one, the number of errors increases.
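The bitwise update can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the formula from the slides: it uses unit-cost substitution (the $S=1$ case), the Shift-Or convention (a 0 bit means "active"), bit $j$ standing for the pattern prefix of length $j+1$, and a non-empty pattern. The function name `wu_manber` and the labels in the comments are illustrative choices; shift direction and term names depend on the bit order and may differ from the notes.

```python
def wu_manber(pattern, text, k):
    """Return 0-based end positions in text where pattern matches with <= k errors.

    Sketch only: unit-cost insert/delete/substitute, Shift-Or convention (0 = active).
    """
    m = len(pattern)
    full = (1 << m) - 1  # m one-bits; Python ints are unbounded, so we mask by hand

    # Character masks: bit j of t[c] is 0 iff pattern[j] == c.
    t = {}
    for j, ch in enumerate(pattern):
        t[ch] = t.get(ch, full) & ~(1 << j)

    # Before reading any text, a prefix of length <= q matches by deleting it entirely.
    s = [full & ~((1 << q) - 1) for q in range(k + 1)]

    ends = []
    for i, ch in enumerate(text):
        mask = t.get(ch, full)  # characters not in the pattern match nowhere
        old_prev = s[0]         # old state of row q-1
        s[0] = ((s[0] << 1) | mask) & full
        for q in range(1, k + 1):
            old_q = s[q]
            s[q] = (((old_q << 1) | mask)  # match: same row, character must agree
                    & (old_prev << 1)      # substitution: old row q-1, advance pattern
                    & (s[q - 1] << 1)      # skip a pattern character: new row q-1
                    & old_prev) & full     # skip a text character: old row q-1
            old_prev = old_q
        if (s[k] & (1 << (m - 1))) == 0:   # top bit clear: whole pattern matched
            ends.append(i)
    return ends
```

For example, `wu_manber("abc", "abd", 1)` reports end positions 1 and 2: `ab` matches with one missing character, and `abd` matches with one substitution.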
3941

4042
== Algorithm Pseudocode
4143

@@ -98,10 +100,31 @@ where $t[c]$ is the precomputed character mask for character $c$.
98100

99101
The Wu-Manber framework is highly flexible.
100102

103+
=== Finding the Minimal $k$ for a Match
104+
105+
Instead of searching for a match with a fixed number of errors $k$, we can also solve the problem of finding the *minimal* number of errors $k^*$ for which an approximate match of pattern $P$ in text $T$ exists.
106+
107+
- *Problem:* Find the minimal $k^*$ such that there is at least one $k^*$-approximate occurrence of $P$ in $T$.
108+
109+
- *Solution:* The solution involves an iterative application of the Wu-Manber algorithm.
110+
1. First, test for an exact match ($k=0$) using a fast algorithm like Shift-Or.
111+
2. If no exact match is found, iterate the Wu-Manber algorithm for exponentially increasing values of $k$. For example, test $k = 1, 3, 7, ..., 2^B-1$.
112+
3. Stop when a match is found for some $k=2^B -1$. We then know that the minimal $k^*$ is in the range $[2^(B-1), 2^B -1]$. A binary search can be performed in this range to find the exact $k^*$, but simply running WM for each value in the range is often sufficient.
113+
114+
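- *Time Complexity:* A Wu-Manber run with error bound $k$ costs $O(k ceil(m/w) n)$, where $w$ is the word size. Because the tested values of $k$ roughly double, the work of the doubling phase is dominated by its last (successful) iteration: $sum_(b=1)^B (2^b - 1) < 2^(B+1) <= 4k^*$, since $k^* >= 2^(B-1)$. This gives the total complexity $O(4k^* ceil(m/w) n)$, which is very efficient if the best match is a good one (i.e., $k^*$ is small).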
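The doubling strategy can be sketched independently of the matcher. In the sketch below, `has_match` is a hypothetical callable that reports whether some occurrence with at most $k$ errors exists (for instance a wrapper around a Wu-Manber run); it is assumed to be monotone in $k$. The sketch also assumes that $k = m$ always succeeds, since a pattern of length $m$ matches the empty string with $m$ errors.

```python
def minimal_k(has_match, m):
    """Smallest k with has_match(k) true, assuming monotonicity and has_match(m) true.

    has_match is a hypothetical predicate supplied by the caller.
    """
    if has_match(0):          # step 1: exact match
        return 0
    lo = 0                    # largest k known to fail
    hi = 1
    # Step 2: test k = 1, 3, 7, ..., never going beyond m.
    while hi < m and not has_match(hi):
        lo = hi
        hi = 2 * hi + 1
    hi = min(hi, m)           # k = m succeeds by assumption
    # Step 3: binary search in (lo, hi]; has_match(lo) is false, has_match(hi) is true.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_match(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

With a predicate that first succeeds at $k = 5$ and $m = 10$, the sketch tests $k = 0, 1, 3, 7$ and then narrows the range $(3, 7]$ down to 5.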
115+
101116
=== Searching with Wildcards
102-
The algorithm can be adapted to handle wildcards like `?` (matches any single character) or `*` (matches any sequence of characters).
117+
The algorithm can be adapted to handle wildcards like `?` (matches any single character) or the Kleene star `*` (matches any sequence of zero or more characters).
103118
- For `?`, the corresponding position in all character masks $t[c]$ is set to 0, ensuring it always matches.
104-
- For `*`, the logic must be modified to allow zero-cost transitions that span multiple text characters, typically by manipulating an additional bitmask representing the wildcard positions.
119+
- For `*`, the logic is more involved. Let the pattern be $P = p_1 * p_2 * ... * p_r$.
120+
121+
We introduce a new bit vector $t^*$ of length $m$, where $t^*[j] = 0$ if $P[j]$ is a `*` and 1 otherwise.
122+
123+
The definition of the state vectors $s[q]$ is modified to handle the zero-cost matching of `*`. The key observation is that a $q$-approximate match ending just before a `*` can be extended, at no extra cost, to a $q$-approximate match at the `*` position and at any position after it.
124+
125+
The update rule for the state vector $s[q]$ is adjusted to reflect this. The new state $s'[q]$ is computed, and then it is AND-ed with a mask derived from the previous state and the $t^*$ vector.
126+
$ s'_q = s'_q and (s_(q-1) or t^*) $
127+
This operation effectively says that if a $(q-1)$-approximate match was possible at the previous step, the current position counts as a match regardless of the character, provided it is a wildcard position.
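The simpler `?` case from the list above can be illustrated with exact Shift-Or matching. This is a minimal sketch under assumptions: 0 bits mean "active", bit $j$ stands for the pattern prefix of length $j+1$, the pattern is non-empty, and the function name `shift_or_with_wildcard` is illustrative. The only change compared with plain Shift-Or is that every character mask, including the default mask for characters that do not occur in the pattern, has a 0 at each `?` position.

```python
def shift_or_with_wildcard(pattern, text, wildcard="?"):
    """Return 0-based end positions of exact matches, where `wildcard` matches any character."""
    m = len(pattern)
    full = (1 << m) - 1

    # Default mask: 1 everywhere except at wildcard positions, which match everything.
    default = full
    for j, p in enumerate(pattern):
        if p == wildcard:
            default &= ~(1 << j)

    # Per-character masks start from the default and clear the bits where the character occurs.
    masks = {}
    for j, p in enumerate(pattern):
        if p != wildcard:
            masks[p] = masks.get(p, default) & ~(1 << j)

    s = full
    ends = []
    for i, c in enumerate(text):
        s = ((s << 1) | masks.get(c, default)) & full
        if (s & (1 << (m - 1))) == 0:  # top bit clear: whole pattern matched
            ends.append(i)
    return ends
```

For the pattern `a?c` and the text `abcaxcac`, the sketch reports matches ending at positions 2 (`abc`) and 5 (`axc`).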
105128

106129
=== Multiple Patterns
107130
To search for a finite set of patterns, the Wu-Manber algorithm can be adapted by concatenating all patterns into a single, long super-pattern. The core algorithm is then run on this super-pattern, but it requires special handling to manage the boundaries between the original patterns.
@@ -119,12 +142,6 @@ To search for a finite set of patterns, the Wu-Manber algorithm can be adapted b
119142

120143
This technique allows the algorithm to find all occurrences of any of the patterns from the set, while maintaining its efficiency.
121144

122-
=== Finding Minimal $k$
123-
Instead of a fixed $k$, we can find the best match for a pattern. This is done by running the algorithm iteratively.
124-
1. Test for $k=0$ (exact match).
125-
2. If no match, run WM for $k=1$, then $k=2, 3, ...$
126-
3. Stop at the first $k$ for which a match is found. This $k$ is the minimum edit distance.
127-
128145
=== Combining Exact and Approximate Matching
129146
The algorithm can be modified to enforce that some parts of the pattern must match exactly, while others can match approximately. This is useful for patterns where certain characters are more significant than others.
130147

@@ -137,9 +154,9 @@ The state update formula is then modified to incorporate this mask. The original
137154

138155
The simplified update part $(s^1 & s^2 & s^3)$ (which combines the results of deletion, insertion, and substitution from the previous states) is OR-ed with the exact mask $e$. This forces the bits corresponding to exact-match positions to $1$, effectively preventing an approximate match from being considered valid at those positions.
139156

140-
The modified update looks like this:
157+
The modified update looks approximately like this (see the slides for the exact form):
141158
$
142-
s'_q = ((s_q <<< 1) | t[T[i]]) & (((s_q-1 <<< 1)) & (s_q-1 <<< 1 | t[T[i]]) & s_q-1) | e
159+
s'_q = ((((s_q <<< 1) or t[T[i]]) and (s_(q-1) <<< 1)) or e) and ((s_(q-1) <<< 1) or t[T[i]]) and s_(q-1)
143160
$
144161

145162
This ensures that for any position $j$ where $e[j]=1$, the state bit $s'[q, j]$ can only become 0 if there is an exact match path, not through an error-introducing path.
