Extending That XOR Trick to Billions of Rows

Can we extend the XOR trick for finding one or two missing numbers in a list to finding thousands of missing IDs in a billion-row table?

Yes, we can! This is possible using a data structure called an Invertible Bloom Filter (IBF) that compares two sets with space complexity based only on the size of the difference. Using a generalization of the XOR trick [1], all the values that are identical cancel out, so the size of this data structure depends only on the size of the difference.

Most explanations of Invertible Bloom Filters start with standard Bloom filters, which support two operations: insert and maybeContains . Next, they extend to counting Bloom filters, which enables a delete operation. Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation. In this article, I will take a different approach and build up to an IBF from the XOR trick.

IBFs have remained relatively obscure in the software development community while the XOR trick is a well-known technique thanks to leetcode. My goal with this article is to connect IBFs to the XOR trick so that more developers understand this fascinating and powerful data structure.

Finding 3 Missing Numbers

Let's start with a concrete example:

A = [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 ] B = [ 1 , 2 , 4 , 5 , 7 , 8 , 10 ]

Set B is missing 3 numbers: {3, 6, 9}.

The classic XOR trick works for finding 1 or 2 missing numbers, but what about 3 or more? If we don't know how many numbers are missing, we can use a count field to detect when the usual XOR trick will fail:

XOR ( A ) = 1 ^ 2 ^ ... ^ 10 = 11 COUNT ( A ) = 10 COUNT ( B ) = 7

... continue reading