Commit 3cc8f02

feat(06): Add explanation of BWS benefits
1 parent 009c9a0 commit 3cc8f02

1 file changed

Lines changed: 8 additions & 4 deletions

notes/06-suffix-array.typ

@@ -151,14 +151,12 @@ A pattern $P$ of length $m$ can be found in the text $T$ by performing a binary
151151

152152
== Burrows-Wheeler Transform (BWT)
153153

154-
The Burrows-Wheeler Transform (BWT) is a reversible permutation of a string that is extremely useful for compression. While not a compression algorithm itself, it groups similar characters together, making the transformed string much more compressible by algorithms like run-length encoding or move-to-front transform followed by Huffman or arithmetic coding. You will learn more about BWT in Chapter 9.
155-
156154
The BWT of a text $T$ is created as follows:
157155
1. Create a matrix where each row is a cyclic shift of $T$.
158156
2. Sort the rows of this matrix lexicographically.
159157
3. The BWT is the last column of the sorted matrix.
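
The three steps above can be sketched directly in Python. This is a minimal illustration, not an efficient implementation: the function name `bwt_naive` is hypothetical, and the code assumes the text ends with a unique sentinel (`$`) that compares smaller than every other character. It materialises all $n$ rotations, so it needs quadratic memory.

```python
def bwt_naive(text: str) -> str:
    """Return the BWT of `text` by sorting all of its cyclic shifts.

    Assumption: `text` ends with a unique sentinel such as '$'.
    """
    n = len(text)
    # Step 1: every row of the matrix is a cyclic shift of the text.
    rotations = [text[i:] + text[:i] for i in range(n)]
    # Step 2: sort the rows lexicographically.
    rotations.sort()
    # Step 3: the BWT is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt_naive("banana$"))  # annb$aa
```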
160158

161-
The connection to suffix arrays is that the sorted rows of the BWT matrix are equivalent to the sorted suffixes of the text (if we consider the cyclic shifts as suffixes). The last column of the BWT matrix corresponds to the character preceding each suffix in the original text.
159+
The connection to suffix arrays is that the sorted rows of the BWT matrix appear in the same order as the sorted suffixes of the text, provided $T$ ends with a unique sentinel such as `$` that is smaller than every other character (each cyclic shift then begins with a distinct suffix). The last column of the BWT matrix corresponds to the character preceding each suffix in the original text. BWT is discussed further in Chapter 09.
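
This relationship can be written as a short sketch: the $i$-th character of the BWT is the character that cyclically precedes the suffix at position $i$ of the suffix array. The helper names `suffix_array` and `bwt_from_suffix_array` are hypothetical, the suffix array is built by naive sorting purely for illustration, and the text is assumed to end with a unique sentinel `$` that is smaller than every other character.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str) -> str:
    """Read the BWT off the suffix array.

    Assumption: `text` ends with a unique, smallest sentinel such as '$'.
    """
    sa = suffix_array(text)
    # text[i - 1] is the character preceding suffix i; for i == 0 the
    # index -1 wraps around to the sentinel, matching the cyclic shift.
    return "".join(text[i - 1] for i in sa)


print(suffix_array("banana$"))           # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_suffix_array("banana$"))  # annb$aa
```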
162160

163161
#example_box(title: "Example")[
164162
Let $T = "banana$"$.
@@ -196,7 +194,13 @@ We build the original text from found first-last character pairs, starting with
196194
6. Repeat $n$ times to reconstruct the full string.
197195

198196
#info_box(title: "BWT and Compression")[
199-
The BWT is not a compression algorithm on its own, but it's a crucial preprocessing step for many compression tools, most notably `bzip2`. By grouping identical characters together, BWT increases the effectiveness of other compression techniques that thrive on runs of identical characters, like Move-to-Front (MTF) and Run-Length Encoding (RLE).
197+
The BWT is not a compression algorithm on its own, but it's a crucial preprocessing step for many compression tools, most notably `bzip2`. The reason for its effectiveness lies in its ability to group identical characters together, which makes the transformed string highly compressible.
198+
199+
This grouping happens because the BWT sorts all cyclic shifts of the text. If a text contains many occurrences of the same word, for example "the", then the rows starting with "he " (from "the ") will be adjacent in the sorted matrix, and the characters preceding them (which form the last column, i.e. the BWT) will mostly be 't'. This creates long runs of identical characters in the BWT's output string.
200+
201+
This property is then exploited by other compression algorithms:
202+
- *Move-to-Front (MTF):* After BWT, the transformed string is often processed with MTF, which replaces each character with its current index in a list of symbols and then moves that character to the front of the list. Since BWT creates runs of identical characters, MTF outputs a sequence of small numbers (often zeros), which can be compressed very efficiently with an entropy coder such as Huffman or arithmetic coding.
203+
- *Run-Length Encoding (RLE):* RLE is effective at compressing sequences with long runs of identical characters. The output of BWT is often full of such runs, which RLE can compress significantly.
200204
]
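
To make the two post-processing steps concrete, here is a minimal sketch of Move-to-Front and Run-Length Encoding. The function names and the choice of output formats (a list of indices for MTF, a list of `(character, count)` pairs for RLE) are assumptions made for illustration; real tools such as `bzip2` use more elaborate variants.

```python
from itertools import groupby


def move_to_front(data: str, alphabet: list[str]) -> list[int]:
    """Encode each character as its current index, then move it to the front."""
    symbols = list(alphabet)  # copy so the caller's list is not modified
    output = []
    for ch in data:
        idx = symbols.index(ch)
        output.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return output


def run_length_encode(data: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into (character, run length)."""
    return [(ch, len(list(run))) for ch, run in groupby(data)]


# A string with runs, as a BWT output typically has.
print(move_to_front("aaabbbaa", ["a", "b"]))  # [0, 0, 0, 1, 0, 0, 1, 0]
print(run_length_encode("aaabbbaa"))          # [('a', 3), ('b', 3), ('a', 2)]
```

Within a run, MTF emits zeros after the first character, which is why runs in the BWT output translate into a highly skewed (and therefore cheaply encodable) sequence of numbers.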
201205

202206
