Bloom filters are good for search that does not scale

A great blog post from 2013 describes using bloom filters to build a space-efficient full text search index for small numbers of documents. The algorithm is simple: Per document, create a bloom filter of all its words. To query, simply check each document's bloom filter for the query terms.

With a query time complexity of O(number-of-documents), we can forget about using this on big corpuses, right? In this blog post I propose a way of scaling the technique to large document corpuses (e.g. the web) and discuss why that is a bad idea.

Fun fact: There is a nice implementation of this exact algorithm that is still used in the wild. But let's get into it.

Why even try this?

The bloom filter index's big selling point is its small size. It allows static websites with dozens of pages to ship a full text search index to the client that is as small as a small image. An equivalent inverted index, which is the traditional textbook approach for keyword-based full text search, would be multiple times bigger.

But index size is not only relevant on small blog websites. If we could scale this technique to larger document corpuses and achive similar space savings, that would be huge!

The main thing that seems to stand in our way is query performance. Instead of always checking every document's bloom filter, we will try to construct an index that only checks a small subset of filters, but still finds all matching documents.

Some Ideas that don't work at all

Look, I brainstormed a bunch of ideas for how to improve the bloom filter based index. I will quickly go over two of them, because identifying and discarding ideas that will not work is an important part of science and engineering.

Sort the filters

... continue reading