Commit a02da8e

committed
feat(06): Explain LCP search on suffix arrays
1 parent 5db4a09 commit a02da8e

1 file changed

Lines changed: 95 additions & 4 deletions

notes/06-suffix-array.typ

@@ -142,12 +142,103 @@ The algorithm iterates through the suffixes in the order of their starting posit
142142

143143
The linear running time follows from the fact that $h$ is decreased at most $n$ times in total and that $h$ is bounded: $0<=h<=n$. Together these imply that $h$ is increased at most $2n$ times over the whole run.
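
The counting argument can be made concrete with a minimal Python sketch of the LCP construction. The function name `kasai_lcp` and the convention that `lcp[i]` holds the LCP of the suffixes at ranks `i-1` and `i` are assumptions made for this illustration; they may differ from the notation used earlier in these notes.

```python
def kasai_lcp(text, sa):
    """Sketch of a Kasai-style LCP construction (names are illustrative).

    lcp[r] is the LCP of the suffixes at ranks r-1 and r; lcp[0] stays 0.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0  # current match length, carried over between iterations
    for i in range(n):  # suffixes in order of their starting position
        if rank[i] == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        j = sa[rank[i] - 1]  # start of the suffix just before i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # h drops by at most one per iteration
    return lcp


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
print(kasai_lcp(text, sa))
```

Because `h` drops by at most one per outer iteration and never exceeds $n$, the inner `while` loop performs $O(n)$ character comparisons in total.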
144144

145-
== Searching for a Pattern
145+
== Searching for a Pattern in a Suffix Array
146146

147-
A pattern $P$ of length $m$ can be found in the text $T$ by performing a binary search on the suffix array.
147+
Once a suffix array is built, it can be used to efficiently find all occurrences of a pattern $P$ of length $m$ in the text $T$. The core idea is to use binary search on the sorted suffix array. All suffixes that start with the pattern $P$ will form a contiguous block in the suffix array. The goal is to find the boundaries of this block.
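
As a small illustration of the contiguous-block property, the following Python sketch builds the suffix array of the example text `"banana"` by plain sorting (an assumption made for brevity, not an efficient construction) and lists the ranks of all suffixes that start with a pattern.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a tiny example.
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana"
sa = build_suffix_array(text)
for rank, start in enumerate(sa):
    print(rank, start, text[start:])

pattern = "an"
# Ranks of the suffixes that start with the pattern.
ranks = [r for r, start in enumerate(sa) if text.startswith(pattern, start)]
print(ranks)
```

The ranks printed on the last line are consecutive, which is exactly the block whose boundaries the searches below try to locate.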
148148

149-
- A simple binary search takes $O(m log n + k)$ time, as each comparison between $P$ and a suffix takes $O(m)$ time.
150-
- With the LCP array, this can be accelerated. By keeping track of the LCP between the pattern and the suffixes at the low, mid, and high pointers of the binary search, we can avoid re-comparing the same prefixes. This improved binary search runs in $O(m + log n + k)$ time.
149+
=== Naive Binary Search (SANaive)
150+
151+
The most straightforward approach is a standard binary search. In each step, we compare the pattern $P$ with the suffix at the middle of the current search interval.
152+
153+
*Algorithm:*
154+
1. Initialize search interval boundaries, $L = -1$ and $R = n$.
155+
2. While the interval is valid ($R - L > 1$):
156+
1. Compute the midpoint $M = floor((L+R)/2)$.
157+
2. Compare the pattern $P$ with the suffix starting at $"SA"[M]$ character by character.
158+
3. If $P$ is lexicographically smaller than the suffix, set $R = M$.
159+
4. If $P$ is lexicographically larger, set $L = M$.
160+
5. If $P$ is a prefix of the suffix (their first $m$ characters are equal), we have found one occurrence. We then need to perform two more binary searches to find the leftmost and rightmost occurrences, defining the full range of matches.
161+
162+
#code_box([
163+
#smallcaps([SA-Naive-Search]) ($P$, $T$, $"SA"$)
164+
```
165+
L, R = -1, n
166+
while R - L > 1:
167+
    M = floor((L+R) / 2)
168+
    // Direct comparison of P with the suffix
169+
    suffix = T[SA[M]:]
170+
    if P == suffix[:m]:
171+
        // Found one, now find boundaries
172+
        return "Found at", M
173+
    elif P < suffix:
174+
        R = M
175+
    else:
176+
        L = M
177+
return "Not found"
178+
```
179+
])
180+
181+
- *Complexity:* The binary search performs $O(log n)$ iterations. In each iteration, we compare the pattern with a suffix, which takes up to $O(m)$ time. This results in a total time complexity of $O(m log n)$. After finding one match, two more binary searches are needed to find the block of all occurrences, but this does not change the overall complexity. Reporting all $k$ occurrences in the block adds $O(k)$.
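
The boundary searches can be sketched in Python as follows. This is one possible variant, not a transcription of the pseudocode above: instead of first locating an arbitrary match, it runs two binary searches directly for the left and right ends of the block. The name `sa_naive_range` and the half-open result `(lo, hi)` are choices made for this sketch.

```python
def sa_naive_range(pattern, text, sa):
    """Return (lo, hi) such that sa[lo:hi] holds every occurrence of pattern."""
    m = len(pattern)

    def first_rank(strictly_greater):
        # Binary search with sentinels L = -1 and R = n, as in the pseudocode.
        left, right = -1, len(sa)
        while right - left > 1:
            mid = (left + right) // 2
            prefix = text[sa[mid]:sa[mid] + m]  # O(m) slice and comparison
            if prefix > pattern or (not strictly_greater and prefix == pattern):
                right = mid
            else:
                left = mid
        return right

    lo = first_rank(strictly_greater=False)  # first suffix with m-prefix >= pattern
    hi = first_rank(strictly_greater=True)   # first suffix with m-prefix > pattern
    return lo, hi


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
lo, hi = sa_naive_range("ana", text, sa)
print(sorted(sa[lo:hi]))  # starting positions of "ana" in the text: [1, 3]
```

An empty range (`lo == hi`) means the pattern does not occur. Each of the two searches costs $O(m log n)$, matching the bound stated above.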
182+
183+
=== Simple Accelerated Search (SASimple)
184+
185+
We can improve the naive search by avoiding re-comparing prefixes that we already know match. This version makes use of the Longest Common Prefix (LCP) between the pattern and the boundaries of our search interval.
186+
187+
*Algorithm:*
188+
Let $"lcp"(p, s)$ be the length of the longest common prefix of pattern $p$ and suffix $s$.
189+
1. Initialize $L = -1, R = n$. Let $"lcp"_L$ be $"lcp"(P, T["SA"[L]:])$ and $"lcp"_R$ be $"lcp"(P, T["SA"[R]:])$. Both start at 0, because $L = -1$ and $R = n$ are sentinels outside the array and no suffix has been compared yet.
190+
2. Let $"lcp"_"known" = min("lcp"_L, "lcp"_R)$. Because the suffixes are sorted, every suffix in the interval $(L, R)$ agrees with the pattern on its first $"lcp"_"known"$ characters.
191+
3. In each binary search step, when comparing $P$ with the suffix at $"SA"[M]$, we only need to start the comparison from character $"lcp"_"known"$.
192+
4. The LCP of $P$ with the middle suffix, $"lcp"_M$, is $"lcp"_"known" + "lcp"(P["lcp"_"known":], T["SA"[M]+"lcp"_"known":])$.
193+
5. We then update $L$ or $R$ and the corresponding $"lcp"_L$ or $"lcp"_R$ value with $"lcp"_M$.
194+
195+
#code_box([
196+
#smallcaps([SA-Simple-Search]) ($P$, $T$, $"SA"$)
197+
```
198+
L, R = -1, n
199+
lcpL, lcpR = 0, 0
200+
while R - L > 1:
201+
    M = floor((L+R)/2)
202+
    lcp_known = min(lcpL, lcpR)
203+
204+
    // Resume at the first character not yet known to match
205+
    lcpM = lcp_known + lcp(P[lcp_known:], T[SA[M]+lcp_known:])
206+
207+
    if lcpM == m:
208+
        return "Found at", M // Found a full match
209+
210+
    // Next character decides the direction; an exhausted suffix is a prefix of P, so P is larger
211+
    if SA[M]+lcpM == n or P[lcpM] > T[SA[M]+lcpM]:
212+
        L, lcpL = M, lcpM
213+
    else:
214+
        R, lcpR = M, lcpM
215+
return "Not found"
216+
```
217+
])
218+
219+
- *Complexity:* Although this optimization significantly speeds up the search in practice by avoiding redundant comparisons, the worst-case time complexity remains $O(m log n)$. This is because $min("lcp"_L, "lcp"_R)$ can stay small when only one boundary shares a long prefix with the pattern, so a step may still re-compare almost $m$ characters against the middle suffix.
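
A runnable Python sketch of this search is given below. The helper `common_prefix_from` and the choice to return a rank (or `None`) are assumptions of the sketch. It also treats a suffix that ends before a mismatch is found as smaller than the pattern, a case the steps above do not spell out.

```python
def common_prefix_from(pattern, text, start, offset):
    """Length of the common prefix of pattern and text[start:].

    The first `offset` characters are assumed to match already and are skipped.
    """
    k = offset
    while k < len(pattern) and start + k < len(text) and pattern[k] == text[start + k]:
        k += 1
    return k


def sa_simple_search(pattern, text, sa):
    """Return the rank of some suffix that starts with pattern, or None."""
    m, n = len(pattern), len(text)
    left, right = -1, len(sa)
    lcp_left, lcp_right = 0, 0
    while right - left > 1:
        mid = (left + right) // 2
        known = min(lcp_left, lcp_right)
        lcp_mid = common_prefix_from(pattern, text, sa[mid], known)
        if lcp_mid == m:
            return mid  # the whole pattern matched
        # A suffix that ends here is a proper prefix of the pattern,
        # so the pattern is lexicographically larger.
        suffix_exhausted = sa[mid] + lcp_mid == n
        if suffix_exhausted or pattern[lcp_mid] > text[sa[mid] + lcp_mid]:
            left, lcp_left = mid, lcp_mid
        else:
            right, lcp_right = mid, lcp_mid
    return None


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
rank = sa_simple_search("nan", text, sa)
print(rank, sa[rank])
```

As with the naive version, the returned rank is only one member of the block of matches; the block boundaries still have to be located separately.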
220+
221+
=== Complex Accelerated Search (SAComplex)
222+
223+
This is the most advanced approach, achieving a search time of $O(m + log n)$. The key idea is to precompute LCP information for *all intervals* that will be examined during the binary search.
224+
225+
*Core Idea:*
226+
The binary search process on an array of size $n$ always explores the same set of intervals. We can model this as a "binary search comparison tree". This tree has $O(n)$ nodes, where each node corresponds to an interval $(L, R)$ that the binary search would examine. For any such interval, we can precompute the LCP between the suffix at $"SA"[L]$ and $"SA"[R]$. The crucial recursive relationship for LCPs simplifies this: $"lcp"(T["SA"[L]:], T["SA"[R]:]) = min("lcp"(T["SA"[L]:], T["SA"[M]:]), "lcp"(T["SA"[M]:], T["SA"[R]:]))$, where $M$ is the midpoint of $(L, R)$.
227+
228+
1. *Preprocessing:*
229+
- Construct the standard LCP array (length of LCP between adjacent suffixes $"SA"[i-1]$ and $"SA"[i]$) in $O(n)$ time using Kasai's algorithm.
230+
- Traverse the conceptual binary search comparison tree. For each node (interval $L, R$) in this tree, precompute and store the value $"lcp"(T["SA"[L]:], T["SA"[R]:])$. This entire precomputation can be done in $O(n)$ time by utilizing the recursive LCP property. A Range Minimum Query (RMQ) data structure over the LCP array can alternatively be used to find the LCP of any two suffixes $"SA"[i]$ and $"SA"[j]$ in $O(1)$ time, providing a different way to access these LCP values during the search.
231+
232+
2. *Searching:*
233+
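- The search proceeds like `SASimple`, but before reading any character it looks up, in $O(1)$ time, the precomputed LCP between the middle suffix and the boundary suffix that shares the longer prefix with $P$. If this value differs from that boundary's LCP with $P$, the direction of the next step is already determined; otherwise the character comparison resumes at position $max("lcp"_L, "lcp"_R)$ rather than $min("lcp"_L, "lcp"_R)$.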
234+
- The total number of character comparisons is reduced to $O(m + log n)$.
235+
236+
- *Complexity:*
237+
- Preprocessing: $O(n)$
238+
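- Searching: $O(m + log n)$ to locate the block of matching suffixes, plus $O(k)$ to list all $k$ occurrences. For comparison, a suffix tree over a constant-size alphabet achieves $O(m + k)$, so the extra cost is only the additive $log n$ term.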
239+
- Space: The precomputed data structures require $O(n)$ additional space.
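
The preprocessing step can be sketched in Python as a recursive walk over the binary search comparison tree. The function name `interval_lcp_table`, the dictionary keyed by `(L, R)`, and the convention that the sentinels $L = -1$ and $R = n$ have an LCP of 0 are assumptions of this sketch; a real implementation would more likely store the values in two arrays indexed by the midpoint.

```python
def interval_lcp_table(lcp):
    """Map each binary-search interval (L, R) to the LCP of the suffixes at
    ranks L and R.

    lcp[i] is assumed to hold the LCP of the suffixes at ranks i-1 and i
    (lcp[0] is unused). Intervals touching a sentinel get the value 0.
    """
    n = len(lcp)
    table = {}

    def fill(left, right):
        if right - left == 1:
            # Adjacent ranks: read the LCP array, unless a sentinel is involved.
            value = lcp[right] if left >= 0 and right < n else 0
        else:
            mid = (left + right) // 2
            # lcp(L, R) = min(lcp(L, M), lcp(M, R)) for sorted suffixes.
            value = min(fill(left, mid), fill(mid, right))
        table[(left, right)] = value
        return value

    fill(-1, n)
    return table


lcp = [0, 1, 3, 0, 0, 2]  # LCP array of "banana" with SA = [5, 3, 1, 0, 4, 2]
table = interval_lcp_table(lcp)
print(table[(0, 2)])  # LCP of the suffixes "a" and "anana"
```

Each interval of the comparison tree is visited once and the recursion depth is $O(log n)$, so the table has $O(n)$ entries and is filled in $O(n)$ time.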
240+
241+
While `SAComplex` has the best asymptotic complexity, `SASimple` is often faster in practice due to its simpler logic and lower constant factors, as noted by Manber and Myers themselves.
151242

152243
== Burrows-Wheeler Transform (BWT)
153244
