The Burrows-Wheeler Transform

Before we dive in, you should know that the BWT has a third, unofficial property: it is not intuitive . Many of the steps in the algorithm will seem arbitrary and it might not be clear why you're even doing them. I'm hoping this article helps you build some intuition around the BWT. In this interactive article, we explore the borderline-magical algorithm known as the Burrows-Wheeler Transform (BWT). It powers data compression in bzip2 , and is used by sequence alignment tools like bowtie and bwa , both of which were named after the algorithm. What's the dollar sign for? The $ marks the end of the string, and is needed to make the BWT reversible. Without that marker, you could still regenerate the matrix in Step , but you wouldn't know which row contains the original string. If it's an English word, you might guess it's banana and not nabana , but that's harder to do with DNA because most rotations will look reasonable. Intuition behind the BWT In Step , the sorting causes rows that start the same to be more likely to be grouped together. As a result, the character that comes right before (i.e. the character in the last column) is also likely to be similar, based on repeated patterns in the English language, and also in DNA sequences! For example, in the BWT of coconut , the letter c is grouped because it's always followed by an o . Although o is followed by either c or n , it still clusters in the BWT because its corresponding rows end up being sorted next to each other. If there was a row in Step that started with a letter in between c and n , the o 's would no longer cluster. For example, what happens to the o 's if you add an i to coconut : icoconut . Would the o 's cluster if you tried ucoconut ? Now it's your turn: try encoding your name or a repetitive string. Which characters can you add or remove to make the BWT cluster more or less? Decoding the BWT Starting from the BWT string, we can recover the original string from the BWT by more or less doing the opposite of the encoding algorithm: Start with an empty matrix, prepend the BWT string, sort the strings, and repeat until the matrix is filled. Step 0: Prev Next Play animation Slower Faster Once the matrix is filled, we can read off the string from any of the rows since we have the $ marker. How to use BWT for sequence alignment So far, we've seen how to use the Burrows-Wheeler Transform to encode and decode strings. That's nice and all, but how can we use the BWT for sequence alignment, i.e. looking for a small string in a much larger string? To do that, I first need to introduce yet another magical property of the BWT: Last-to-first Mapping. Last-to-first Mapping property This property states that the order in which you see a letter in the first column is the same order in which you see it in the last column! Let's consider the word banana : if we annotate each letter with the number of times it occurs in the string before creating the BWT matrix, the letter a appears in the same order in both the first and last column: a 2 , a 1 , a 0 ! $ undefined b undefined a undefined n undefined a undefined n undefined a undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined b undefined a undefined n undefined a undefined n undefined a undefined $ undefined n undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined a undefined Find a substring With that in mind, let's find all occurences of the pattern an within banana , using only the first and last columns. Let's begin by finding rows that start with the last character of the pattern (i.e. n )—you'll see why in a second: $ undefined b undefined a undefined n undefined a undefined n undefined a undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined b undefined a undefined n undefined a undefined n undefined a undefined $ undefined n undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined a undefined Now that we have an n in the first column, we know that the last column is the character that comes right before n , so we can look for an a in that last column. We find two matches: a 1 and a 0 , so we can visit rows that have those characters in the first column: $ undefined b undefined a undefined n undefined a undefined n undefined a undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined b undefined a undefined n undefined a undefined n undefined a undefined $ undefined n undefined a undefined $ undefined b undefined a undefined n undefined a undefined n undefined a undefined n undefined a undefined $ undefined b undefined a undefined And voilà, we found the only matches for our search query of an within banana . Wait a minute... A few sections ago, we decoded the BWT string by recreating the entire BWT matrix, which was a lot of work. Could we instead use this LF property to decode the BWT string? As a matter of fact, we can! You can think of decoding the BWT string as a special case of searching, where we're looking for whichever string ends with $ . So we can start by finding the $ character, then hop around between the first and column until we find the $ once more and we'll have recreated the reverse of the original string. If you somehow made it all the way here (let me know at [email protected]) , and you can't get enough of the BWT, there's a lot more you can learn about: Suffix Arrays : It turns out the way we generated the BWT transform above was quite inefficient. A string of length n has n rotations, so sorting that list of strings has a time complexity of O(n) rotations * O(n log n) comparisons = O(n 2 log n) . There's an interesting data structure called a Suffix Array that you can use to more efficiently generate that matrix. You can learn more about that from Ben Langmead's lecture notes about suffix arrays and BWT and the FM index . Ben's lab maintains the bowtie2 sequence aligner, so his slides are a great in-depth resource. : It turns out the way we generated the BWT transform above was quite inefficient. A string of length has rotations, so sorting that list of strings has a time complexity of rotations * comparisons = . There's an interesting data structure called a Suffix Array that you can use to more efficiently generate that matrix. You can learn more about that from Ben Langmead's lecture notes about suffix arrays and BWT and the FM index . Ben's lab maintains the bowtie2 sequence aligner, so his slides are a great in-depth resource. FM Index : In the examples above, the searches were small enough that we hopped between the first and last column by eye by looking at every single row. But making millions of queries within a large string like the human genome with 3 billion basepairs would be too slow. If you want to learn about how to mitigate those issues in practice, check out Ben Langmead's lecture notes on the FM index . : In the examples above, the searches were small enough that we hopped between the first and last column by eye by looking at every single row. But making millions of queries within a large string like the human genome with 3 billion basepairs would be too slow. If you want to learn about how to mitigate those issues in practice, check out Ben Langmead's lecture notes on the FM index . Compression: As I mentioned above, BWT helps with compression because of how it tends to cluster characters together. You can learn more about compression from Carl Kingsford's lecture notes on BWT and compression . ✨ Thanks to Ben Langmead, Niema Moshiri, and Maria Nattestad for their insightful feedback on this article.

The Burrows-Wheeler Transform

Share this article

Related Articles