Tech News
← Back to articles

Index 1.6B Keys with Automata and Rust (2015)

read original related products more articles

Index 1,600,000,000 Keys with Automata and Rust

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.

In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps. This includes introducing an implementation written in Rust called the fst crate. It comes with complete API documentation. I will also show you how to build them using a simple command line tool. Finally, I will discuss a few experiments culminating in indexing over 1,600,000,000 URLs (134 GB) from the July 2015 Common Crawl Archive.

The technique presented in this article is also how Lucene represents a part of its inverted index.

Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations.

Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required.

Teaser

As a teaser to show where we’re headed, let’s take a quick look at an example. We won’t look at 1,600,000,000 strings quite yet. Instead, consider ~16,000,000 Wikipedia article titles ( 384 MB ). Here’s how to index them:

$ time fst set --sorted wiki-titles wiki-titles.fst real 0m18.310

... continue reading