Preprocessing text to make it more compressible
The Endeavour 2024-10-15
Repetitive text compresses efficiently. Text like the lyrics to Jingle Bells ought to compress more efficiently than ordinary prose, assuming the compression algorithm can exploit the repetition.
The idea of the Burrows-Wheeler transform is to permute text in before compressing it. The hope is that the permutation will make the repetition in the text easier for a compression algorithm to take advantage of. The permutation must be reversible, so the compressed file can be uncompressed later.
It’s not immediately obvious that the Burrows-Wheeler transform is reversible; that may be the topic of another post some day. This post will explain what the transform is and show that it rearranges text in a way that makes run-length encoding more efficient.
Algorithm and example
The Burrows-Wheeler transform (BWT) lists all the rotations of a string in a table, sorts the table, then takes the last column as the transform value. (Software implementing the BWT does not need to actually construct the table, but it gives the same result as if it did.)
Before tabulating the rotations, the algorithm adds a character for beginning of string and one for end of string. We’ll use ^ and $ respectively, based on the meaning of these symbols in regular expressions.
Here is an example applied to wabisabi
.
^wabisabi$ ^wabisabi$ $^wabisabi $^wabisabi i$^wabisab abi$^wabis bi$^wabisa abisabi$^w abi$^wabis bi$^wabisa sabi$^wabi bisabi$^wa isabi$^wab i$^wabisab bisabi$^wa isabi$^wab abisabi$^w sabi$^wabi wabisabi$^ wabisabi$^
The table on the left lists all rotations of ^wabisabi$
, and the table on the right is the sorted form of the table on the left. The last column, $iswaabbi^
, is the BWT of wabisabi
. Note that the a
‘s and the b
‘s are grouped together in the transformed string.
When we sort the table we run into an annoying complication: what is the sort order of the markers for beginning of text and end of text? I used the Python code from the Wikipedia article on BWT for the examples below, so I followed the code’s sorting convention that the begin string symbol comes first, then the end string symbol, then ordinary text.
Run-length encoding
At the top of the post I mentioned the lyrics to Jingle Bells as an example of text with repetition. If we calculate the BTW of
jingle bells, jingle bells, jingle all the way.
we get
$.eee,,lessy w lllhbbnnntjjj ^lgggaeelliiill a
Now we have a lot of repeated letters, which means we could compress the string with run-length encoding. For example, we could write j3
rather than jjj
to indicate a string of three j’s.
$.e3,2les2y w 2l3hb2n3tj3 2^lg3ae2l2i3l2 2a
This isn’t much sorter, though in practice run length encoding would be implemented in binary and would not use an ASCII character 2
to indicate that a character is repeated.
We could make a string of text even more compressible if we simply sorted it. But sorting is not reversible: any permutation of the text would lead to the same sorted text. This is what we’d get if we sorted the text in the example above
,,.aabbeeeeeeggghiiijjjlllllllllnnnsstwy
Run length encoding compresses this even better, though the result cannot be uncompressed.
72.a2b2e6g3h3j3l9n3s2twy
The way to think of BWT is that it partially sorts text in a reversible way. This is kinda paradoxical: sorting is not reversible. You either sort the characters, or you perform a reversible permutation. You can’t do both. But due to the repetition patterns in natural language text [1] you can in practice do both.
The BWT applied to arbitrary text won’t partially sort it. You can’t apply a reversible operation to random input and make it more ordered. But when applied to ordinary prose, BWT does tend to cluster letters together, especially when applied to longer inputs.
As an example of longer input, I applied the BWT to Lincoln’s Gettysburg Address (1454 characters). Here’s an excerpt from the middle of the output.
arrouiuuiga sttcctsss tttttttttwtttwt TtttttttttTsttttt tttwtttww ttttgwww tsgggtLddddddhhddffhmvw f tvtnttvafattttttttteb h hh hrn sflleelcllsrallllllaiap ueretppptg aiuaaaa laobhooeeo e n oioiieaoaaaaoooooieoeooi aooiaaaaaauuei uiiioioiieiir o
This shows that the output is indeed partially sorted.
Related posts
[1] The BWT is applied to more than natural text. It is used, for example, in compressing genetic sequence data. The key property of the input is that some symbols tend to follow other symbols. This is the kind of repetition that the algorithm is intended to exploit.
The post Preprocessing text to make it more compressible first appeared on John D. Cook.