Visualizing Ukkonen's Suffix Tree Algorithm

Learning algorithms from books

I learned most of what I know about algorithms by poring over a copy of Introduction to Algorithms I got while in university. The book is very well known, especially among folks who got a formal education in computer science.

If you have studied it, you know the book: it is over a thousand pages long and it weighs enough to double as a doorstop.

I worked through large sections of it, pen in hand, trying to trace through increasingly complex algorithms, building intuition for their behavior and tradeoffs. The book covers the theory in great depth: correctness proofs, recurrence relations, asymptotic analysis.

But there was often a gap between reading an algorithm and truly understanding it. The book would present pseudocode, sometimes a few diagrams showing state at key moments and theorems about performance characteristics. The work of tracing what actually happens was left as an exercise to the reader. I did that work with pen and paper, drawing trees, crossing out nodes, scribbling indices in the margins. It worked, eventually. But it was slow, error-prone, and the understanding felt fragile.

Implementing from a paper

Years later, I ran into this gap again. I was working on a programming puzzle that required near-instant substring search over a large dataset. After some research, I settled on a Generalized Suffix Tree: a data structure that indexes all suffixes of a set of strings, enabling O ( m ) O(m) O(m) lookups where m m m is the length of the search pattern, even over an extremely large corpus.

The algorithm I chose for building the tree was Ukkonen’s, described in a 1995 paper. The paper is well written and includes the full algorithm in pseudocode:

One of several pseudocode snippets from Ukkonen's paper, describing the update function. Clear on paper, but its translation to working code is much more verbose than this.

It took me a few hours to get right. Not because the pseudocode was wrong: it was precise and correct. The difficulty was that the algorithm manipulates a tree in non-obvious ways. There is an “active point” that walks around the tree. Suffix links connect internal nodes as shortcuts. Three different extension rules fire depending on what is already in the tree and what is being added. The pseudocode tells you what to do, but building an intuition for why it works requires watching it happen.

... continue reading