Commit 009c9a0

feat(09): Add additional explanation to FM-index search.
1 parent 04e0526 commit 009c9a0

1 file changed

Lines changed: 4 additions & 4 deletions

notes/09-fm-index.typ

@@ -86,12 +86,12 @@ To achieve a constant $O(1)$ query time for $"Occ"(a, i)$, a multi-level data st
 - Offset in sub-block: $i_3 = i mod log n$
 
 The final count is calculated by summing the pre-computed values and the rank in the final segment:
-$"Occ"(a, i) = P[i_1] + Q[i_1, i_2] + "rank"_a("at sub-block", i_3)$
+$"Occ"(a, i) = P[i_1] + Q[i_1, i_2] + "rank"_a ("at sub-block", i_3)$
 - $P[i_1]$ provides the count up to the boundary of the large block.
 - $Q[i_1, i_2]$ adds the count up to the boundary of the sub-block.
 - The `rank` function provides the final count within the last sub-block segment of length $i_3$.
 
-This structure is built for each character, allowing constant-time queries for any of them. The total space complexity can be optimized to $O(n)$ for the entire alphabet, making it very efficient in theory, though often more complex to implement than Wavelet Trees.
+This structure is built for each character, allowing constant-time queries for any of them. The total space complexity can be optimized to $O(|Sigma| n)$ for the entire alphabet, making it very efficient in theory, though often more complex to implement than Wavelet Trees.
 
 ==== Wavelet Trees\*
 A more powerful and standard solution is to use a *wavelet tree*. A wavelet tree is a data structure built on the BWT string that can answer `rank` (Occ), `select`, and `access` queries in logarithmic time with respect to the alphabet size. This topic is not discussed in the lectures.
@@ -125,12 +125,12 @@ A more powerful and standard solution is to use a *wavelet tree*. A wavelet tree
 - You use the `rank` operation on the node's bit-vector to count how many characters from that part appeared before position $i$. This gives you the new, smaller position for the query in the child node.
 - When you reach the leaf for $c$, the final position you calculated is the $"Occ"$ value.
 
-With this structure, $"Occ"$ queries take $O(log|Sigma|)$ time (where $|Sigma|$ is the alphabet size), which is extremely fast. The space required is $O(n log|Sigma|)$.
+With this structure, $"Occ"$ queries take $O(log|Sigma|)$ time, which is extremely fast. The space required is $O(n log|Sigma|)$.
 ]
 
 == LF-Mapping (Last-to-First Mapping)
 
-The core of the FM-index search is the LF-mapping property. For the $i$-th character of the BWT (which is $T["SA"[i]-1]$), its corresponding character (same character in a string one cyclic rotation away) in the first column is at index $j = C["BWT"[i]] + "Occ"("BWT"[i], i)$ (number of string's characters smaller than $"BWT"[i]$ + occurrences of the same character in $"BWT"$ before $i$). This allows us to move from a character in the last column to its corresponding position in the first column.
+The core of the FM-index search is the LF-mapping property. For the $i$-th character of the BWT (which is $T["SA"[i]-1]$, the character preceding the first character of the $i$-th row), its corresponding character (the same character in the string one cyclic rotation away) in the first column is at index $j = C["BWT"[i]] + "Occ"("BWT"[i], i)$ (the number of characters in the string smaller than $"BWT"[i]$, plus the occurrences of the same character in $"BWT"$ before $i$). The array $C$ skips all smaller characters in the first column. The relative order of equal characters in the BWT is then preserved (stable) after cyclic rotation and sorting: rows ending in the same character are sorted by the characters before it, and once that character is rotated to the front, the rows share their first character and are again sorted by those same remaining characters. This allows us to move from a character in the last column to its corresponding position in the first column.
 
 #example_box(title: "LF-Mapping Example")[
 Let's use our example where $T = "banana$"$ and BWT = "annb\$aa". The C-table is `C = ('$': 0, 'a': 1, 'b': 4, 'n': 5)`.
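
The LF-mapping formula from the changed paragraph can be checked on the same `banana$` example with a short Python sketch. This is an illustration and not part of the commit. It assumes 0-based row indices and an `Occ(c, i)` that counts occurrences strictly before position `i`. It uses a naive linear scan for `Occ` in place of the block structure or wavelet tree described in the notes. The helper names (`bwt_via_rotations`, `build_c_table`, `occ`, `lf`) are hypothetical.

```python
def bwt_via_rotations(text):
    """Build the BWT by sorting all cyclic rotations (fine for tiny inputs only)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def build_c_table(bwt):
    """C[c] = number of characters in the text that are strictly smaller than c."""
    c_table, total = {}, 0
    for ch in sorted(set(bwt)):
        c_table[ch] = total
        total += bwt.count(ch)
    return c_table


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[0:i]; a naive O(i) scan, assumed 0-based and exclusive."""
    return bwt[:i].count(ch)


def lf(bwt, c_table, i):
    """Map row i of the last column to the row of the same character in the first column."""
    ch = bwt[i]
    return c_table[ch] + occ(bwt, ch, i)


text = "banana$"
bwt = bwt_via_rotations(text)
c_table = build_c_table(bwt)
first_column = "".join(sorted(bwt))
lf_map = [lf(bwt, c_table, i) for i in range(len(bwt))]

print(bwt)      # annb$aa
print(c_table)  # {'$': 0, 'a': 1, 'b': 4, 'n': 5}
print(lf_map)   # [1, 5, 6, 4, 0, 2, 3]
```

Each entry satisfies `first_column[lf_map[i]] == bwt[i]`, and the three `a` characters in the BWT (rows 0, 5, 6) map to rows 1, 2, 3 in that same order, which is the stability property the paragraph relies on.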