The Burrows-Wheeler Transform

In this interactive article, we explore the borderline-magical algorithm known as the Burrows-Wheeler Transform (BWT). It powers data compression in bzip2, and is used by sequence alignment tools like bowtie and bwa, both of which were named after the algorithm.

Before we dive in, you should know that, besides being reversible and tending to cluster repeated characters, the BWT has a third, unofficial property: it is not intuitive. Many of the steps in the algorithm will seem arbitrary, and it might not be clear why you're even doing them. I'm hoping this article helps you build some intuition around the BWT.

What's the dollar sign for?

The $ marks the end of the string, and is needed to make the BWT reversible. Without that marker, you could still regenerate the matrix of sorted rotations, but you wouldn't know which row contains the original string. If it's an English word, you might guess it's banana and not nabana, but that's harder to do with DNA because most rotations will look reasonable.
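The role of the marker can be sketched in a few lines of Python. This is a naive illustration rather than how bzip2 or the aligners implement the transform. The helper names (rotations, last_column, rebuild_matrix) are made up for this example, and it assumes $ sorts before every other character, which holds for Python's default string ordering when the input is lowercase letters.

```python
def rotations(s):
    """All cyclic rotations of s, in order of starting position."""
    return [s[i:] + s[:i] for i in range(len(s))]


def last_column(s):
    """Last column of the sorted rotation matrix (the BWT when s ends in '$')."""
    return "".join(row[-1] for row in sorted(rotations(s)))


def rebuild_matrix(last):
    """Rebuild the sorted rotation matrix from its last column alone.

    Naive approach: prepend the last column to the rows built so far,
    re-sort, and repeat once per character.
    """
    rows = [""] * len(last)
    for _ in range(len(last)):
        rows = sorted(c + row for c, row in zip(last, rows))
    return rows


# Without a marker, banana and nabana are rotations of each other,
# so they produce the same output and cannot be told apart.
print(last_column("banana"), last_column("nabana"))

# The matrix can still be rebuilt, but any of its rows could be the original.
print(rebuild_matrix(last_column("banana")))

# With the marker, exactly one row ends in '$': the original string.
matrix = rebuild_matrix(last_column("banana$"))
print([row for row in matrix if row.endswith("$")])
```

Sorting every rotation is fine for short strings like these, but it would be far too slow and memory-hungry for a genome; treat it as a way to see the idea, not as a practical encoder.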

Intuition behind the BWT

In the sorting step, rows that start the same way are likely to end up grouped together. As a result, the characters that come right before them (i.e. the characters in the last column) are also likely to be similar, because of repeated patterns in the English language, and also in DNA sequences!

For example, in the BWT of coconut, the letter c is grouped because it's always followed by an o. Although o is followed by either c or n, it still clusters in the BWT because its corresponding rows end up being sorted next to each other.
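To see the clustering in coconut for yourself, here is a minimal Python sketch. The function names bwt and runs are hypothetical helpers written for this example, and the encoder is the naive sort-all-rotations version, assuming lowercase input so that $ sorts first.

```python
from itertools import groupby


def bwt(text):
    """Naive BWT: append '$', sort all rotations, read off the last column."""
    s = text + "$"
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def runs(s):
    """Collapse a string into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


encoded = bwt("coconut")
print(encoded)        # t$ooccun
print(runs(encoded))  # the two o's and the two c's each form a single run

# Each sorted row next to the character that precedes it (the last column),
# to see why the o's cluster.
s = "coconut$"
for row in sorted(s[i:] + s[:i] for i in range(len(s))):
    print(row[:-1], "<-", row[-1])
```

In the printed rows, the two that end in o start with co and nu. Nothing else sorts between them, which is why the o's land side by side.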

If there were a row in the sorted matrix that started with a letter between c and n, the o's would no longer cluster. For example, what happens to the o's if you add an i to coconut: icoconut? Would the o's cluster if you tried ucoconut?
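If you'd rather check your answers by running code, the sketch below compares the three words. Both bwt and is_clustered are hypothetical helpers for this example. The encoder is the naive version that sorts every rotation, and it assumes lowercase input so that $ sorts before all letters.

```python
def bwt(text):
    """Naive BWT: append '$', sort all rotations, read off the last column."""
    s = text + "$"
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def is_clustered(encoded, ch):
    """True if every occurrence of ch in encoded sits in one contiguous run."""
    first, last = encoded.find(ch), encoded.rfind(ch)
    return first != -1 and encoded[first:last + 1] == ch * (last - first + 1)


# Print each word, its BWT, and whether the o's form a single run.
for word in ["coconut", "icoconut", "ucoconut"]:
    encoded = bwt(word)
    print(word, encoded, is_clustered(encoded, "o"))
```

Try swapping the added letter for others and watch where its row lands relative to the rows that start with c and n.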

Now it's your turn: try encoding your name or a repetitive string. Which characters can you add or remove to make the BWT cluster more or less?

Decoding the BWT

... continue reading