Cuckoo hashing improves SIMD hash tables (and other hash table tradeoffs)

There are many options when designing a hash table. Cuckoo hashing is a curious design that is popular in academia but unused in some of industry's most popular designs, such as Google's Swiss Tables and Meta's F14 tables. Cuckoo hashing is often avoided because it has worse memory system performance and is beaten by SIMD-accelerated probing. This doesn't have to be the case! With careful engineering, you can combine SIMD-accelerated probing with cuckoo hashing to beat the standard implementations in many scenarios.

Benchmark highlights

Adding cuckoo hashing to a strong baseline is typically close to performance-neutral at low load factors, and greatly improves performance at high load factors. We show performance across the range of load factors used by Swiss Tables. The best baseline design differs by use case; in each case we take the best baseline design and then add cuckoo hashing.

[Figure: successful lookups on u64 keys and values, "Direct SIMD" baseline. Lookup time (ns) versus load factor (50% to 87.5%) for the baseline, branchy cuckoo, and branchless cuckoo, both in cache (512 KiB) and out of cache (512 MiB).]

[Figure: failed lookups on u64 keys and values, "Indirect SIMD" baseline. Lookup time (ns) versus load factor (50% to 87.5%) for the baseline, branchy cuckoo, and branchless cuckoo, both in cache (512 KiB) and out of cache (512 MiB).]

[Figure: insertions on u64 keys and values, "Indirect SIMD" baseline. Insertion time (ns) versus load factor (50% to 87.5%) for the baseline and cuckoo, both in cache (512 KiB) and out of cache (512 MiB).]

Benchmarks are available. No library release is available yet, but perhaps in future.

What is cuckoo hashing?

All hash tables we consider use open addressing, where keys are stored in a flat array. The designs differ in the probe sequence: which elements, and in what order, we search during a lookup.
Baseline SIMD hash tables such as Swiss Tables use SIMD quadratic probing. This starts searching from a position determined by the hash function, searches 4–16 consecutive keys in parallel with SIMD, then steps by 1 and searches, then steps by 2 and searches, then 3, and so on:

[Figure: a quadratic probe sequence. Starting at the hashed position, one SIMD-width search, then further searches stepping by 1, 2, and 3.]

The search continues until either the matching key is found, or an empty entry in the array is found. In cuckoo hashing, we use two hash functions, do a single SIMD search at each of those two positions, and then stop:

[Figure: a cuckoo probe sequence. One SIMD-width search at each of the positions given by hash 0 and hash 1.]

The advantage of cuckoo hashing is that we search only a small fixed number of positions during lookups. The surprising fact about cuckoo hashing is that this probe sequence even works! What if I try to insert a new key and all 8 candidate entries are already full? The answer: to make space for the new key, you can kick out any of the 8 candidate entries and move it to one of its 4 alternative locations. For example, here are the alternatives for the entries in the "hash 0" bucket:

[Figure: each entry in the "hash 0" bucket has 4 alternative locations elsewhere in the table.]

If you still don't find space in any of the 8 × 4 = 32 alternative entries, look in their alternatives, and then their alternatives, and so on, in a fanout-4 tree. Once you find an empty location, you can perform a chain of moves—each entry moving to one of its alternative locations—to ultimately make space for the inserted key. With high probability this process completes almost immediately, with just 0 or 1 moves for almost all keys.

When and why does cuckoo hashing win?

Small (in-cache) tables

Quadratic probing performs two equality tests and branches per SIMD probe:

    while (true) {
        if (group_contains(key)) { return found; }
        if (group_has_empty()) { return not_found; }
        advance_to_next_group();
    }

The branches are necessary in quadratic probing because we don't know in advance how many probes to perform, and at least one of the branches is difficult to predict: it's nearly random which keys will be at probe position 0 versus probe position 1.
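To make the loop concrete, here is a scalar sketch of that branchy probe, with a plain inner loop standing in for the SIMD group comparison. All names, the empty-slot sentinel, and the group width are illustrative, not taken from any real implementation:

```rust
// Scalar sketch of SIMD quadratic probing. A real implementation compares
// GROUP consecutive slots with one SIMD instruction; here a plain loop
// stands in for that comparison.
const GROUP: usize = 8;
const EMPTY: u64 = u64::MAX; // illustrative sentinel marking an empty slot

fn quadratic_lookup(table: &[u64], hash: usize, key: u64) -> Option<usize> {
    let mask = table.len() - 1; // table length must be a power of two
    let mut pos = hash & mask;
    let mut step = 1;
    loop {
        // "SIMD" search of one group of GROUP consecutive slots:
        for off in 0..GROUP {
            let i = (pos + off) & mask;
            if table[i] == key {
                return Some(i); // branch 1: key found
            }
            if table[i] == EMPTY {
                return None; // branch 2: an empty slot ends the probe sequence
            }
        }
        pos = (pos + step * GROUP) & mask; // step by 1 group, then 2, then 3...
        step += 1;
    }
}
```

Both `if`s inside the loop are the branches discussed above: the loop cannot be unrolled to a fixed depth because the probe sequence has no a priori bound.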
With cuckoo probing we know there are at most two probes, so we can unroll the loop, skip the check for empty, and optionally make the entire process branchless by using conditional move or conditional select instructions. The probe then reduces to two unconditional group searches, one per hash function, with the results combined by a select. For small tables that fit in the CPU cache, the savings in branch mispredictions and in instruction count make cuckoo hashing a win.

Large (out-of-cache) tables, failed lookups

For large tables that don't fit in the CPU cache, cache line fetches become a more important consideration than branch mispredictions or instruction count. The branchless cuckoo implementation fetches a cache line for each of group 0 and group 1, whereas the branchy cuckoo implementation fetches either 1 or 2 cache lines depending on whether the search terminated after the first or second group. Thus the branchy implementation sometimes pays a branch mispredict in order to sometimes save a cache line fetch. At lower load factors this tradeoff leads the branchy implementation to beat the branchless implementation.

Whereas in cuckoo hashing the second probe is almost always on a different cache line than the first, in quadratic probing the two are often on the same cache line, since quadratic probing makes the first two probes adjacent. Depending on the layout of the hash table, this can save cache line traffic. We'll consider the layouts that we found to be best across different workloads. One popular design is the Indirect SIMD design from Google's Swiss Tables, in which there is an array of 1-byte tags and a separate array of (key, value) payloads.
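For concreteness, the two-array layout might be declared like this (an illustrative sketch, not Swiss Tables' actual definition):

```rust
// Illustrative Indirect SIMD layout: a flat array of 1-byte tags that is
// probed with SIMD, plus a parallel payload array holding the full
// (key, value) pairs at the same indices.
struct IndirectTable {
    tags: Vec<u8>,            // one 1-byte tag per slot
    payload: Vec<(u64, u64)>, // same length as `tags`
}
```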
SIMD probing is performed on the tag array, and then any matches on the tag array are confirmed by a second (non-SIMD) test against the payload array:

[Figure: Indirect SIMD layout. A SIMD-width probe over the tag array; each tag match is confirmed against the real key in the (key, value) payload array, which occupies separate cache lines.]

The Indirect SIMD design performs very well for failed lookups (where the key isn't present in the table), because they can typically be serviced entirely by the tag array without fetching the payload array. Under quadratic probing with the Indirect SIMD layout, probe lengths of 1–2 can typically be handled with just one cache line fetch. Under cuckoo probing with the Indirect SIMD layout, probe length 2 requires two cache line fetches, which is worse than quadratic probing. Because of the extra cache line fetch at probe length 2, this case (failed lookups on out-of-cache tables) is the worst case for cuckoo hashing: it only starts to pull ahead as load factors grow past 75%. At high load factors, quadratic probing starts to require long probe sequences whereas cuckoo hashing keeps them to at most 2:

[Figure: distribution of probe lengths (1–13) as a percentage of operations, quadratic versus cuckoo. Quadratic probing has a long tail of probe lengths, while cuckoo probing never exceeds 2.]

Large (out-of-cache) tables, successful lookups

For out-of-cache successful lookups, the Indirect SIMD design of the previous section is not optimal: it requires a minimum of two cache misses per lookup, one for the tag array and one for the payload array. There are better layouts which reduce this to just one in the best case. For integer keys, I found the best layout to be the Direct SIMD layout, with each cache line containing an array of keys and an array of values:

[Figure: Direct SIMD layout. Each cache line holds a SIMD-width array of keys together with their values.]

The advantage of Direct SIMD layout is that probe-length-1 lookups can be serviced with just one cache line fetch.
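For u64 keys and values, such a cache line might look like the following (an illustrative sketch of the layout, not the author's actual definition):

```rust
// Illustrative Direct SIMD cache line for u64 keys and values: the keys are
// stored contiguously so a single SIMD load can compare all of them, and the
// values sit alongside in the same 64-byte line. Four of each fit exactly.
#[repr(C, align(64))]
struct DirectLine {
    keys: [u64; 4],
    values: [u64; 4],
}
```

A probe that starts at a line like this and matches one of the four keys never needs to touch a second cache line.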
The disadvantage relative to Indirect SIMD layout is that only a small number of keys fit into a cache line: for u64 keys and values, we can fit only 4 keys per cache line, whereas the Indirect SIMD layout fits 64 keys per cache line. For short enough probe lengths, Direct SIMD layout wins, but for very long probe lengths, Indirect SIMD layout wins. Overall, Direct SIMD layout tends to win on successful lookups but lose on failed lookups, and this is true for both quadratic probing and cuckoo probing. Within the context of Direct SIMD layout, cuckoo hashing is a large improvement over quadratic probing, because of the sensitivity of Direct SIMD to probe length. It wins at almost all load factors, by keeping probe lengths short:

[Figure: distribution of probe lengths (1–7) as a percentage of operations under Direct SIMD layout, quadratic versus cuckoo.]

Side benefits of cuckoo probing

Cuckoo probing brings several practical benefits besides raw performance. Cuckoo hashing supports arbitrary table sizes, rather than being restricted to power-of-2 table sizes like quadratic probing. Depending on the number of elements being stored, this can save up to 2× in memory footprint.

For power-of-2-sized cuckoo tables, the table size can be doubled very efficiently, in a completely branchless way without any search process, as follows. When growing the table size from N to 2N, take all elements that were stored in bucket B and store them in either bucket B or B + N according to the newly unmasked hash bit. Since bucket B hadn't overflowed before, neither bucket B nor B + N will overflow afterwards, and so no search is required.
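The doubling step can be sketched in a few lines. This toy version uses one slot per bucket and an identity hash to keep the example self-contained; a real implementation would rehash the stored key (or reuse a stored hash):

```rust
const EMPTY: u64 = u64::MAX; // illustrative sentinel marking an empty slot

// Toy hash function: identity, for illustration only.
fn hash(key: u64) -> u64 {
    key
}

// Branchless growth from n to 2n slots: the element in slot i moves to
// slot i or slot i + n according to the newly unmasked hash bit, with no
// search process required.
fn grow(old: &[u64]) -> Vec<u64> {
    let n = old.len(); // a power of two
    let mut grown = vec![EMPTY; 2 * n];
    for (i, &k) in old.iter().enumerate() {
        if k != EMPTY {
            // `hash(k) & n` is either 0 or n: it selects slot i or i + n.
            grown[i + (hash(k) as usize & n)] = k;
        }
    }
    grown
}
```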
This rehashing operation is extremely efficient: you can grow the table with realloc (avoiding the need to have 3N memory live simultaneously) and then do a streaming pass over the buckets, with perfect memory system behavior for hardware prefetchers (memory is accessed sequentially) and cache line locality (the working set is just two cache lines).

Deleted elements in quadratic probing must leave "tombstone" entries in their place, rather than emptying out the element. Tombstones increase probe lengths, and if too many accumulate, the table may need to be rebuilt from scratch to clear them out. Cuckoo probing does not require tombstones.

Cuckoo hash tables are a great wire format for sending dictionaries. Both JSON and protocol buffers natively support sending dictionaries. As part of deserializing, the receiver must do the work of building the hash table. Typically the sender has already done that work! Why can't the sender send the exact hash table they had built? With quadratic probing, this opens you up to a denial-of-service attack by a malicious sender: they could construct the hash table such that probe lengths are extremely long and every query takes an extremely long time. With cuckoo probing, this is impossible: probes are bounded at length 2 by construction. Directly sending cuckoo tables on the wire seems plausible and efficient for binary serialization formats like Cap'n Proto or FlatBuffers.

Conclusion

Depending on the use case, different hash table layouts are optimal: whether you prioritize successful lookups versus failed lookups versus inserts; how much memory you are willing to spend to save time; whether your table is in cache or out of cache. Generally we find that, in cases where you care about memory footprint somewhat, i.e. for load factors larger than 50%, the best choice tends to be one that uses cuckoo hashing. In many real-world scenarios, the simpler algorithm beats the more complex algorithm.
Cuckoo hashing appears to be a case where the more complex algorithm is actually worthwhile, provided it is engineered carefully.

Appendix: engineering a high-performance cuckoo probing scheme

Cuckoo hashing is more complex than quadratic probing, and you have to get several details right in order to make cuckoo probing competitive with the best hash tables.

Unaligned buckets

When performing SIMD probing, it is natural to start each probe at a position that is a multiple of the SIMD width, e.g. so that all SIMD loads are aligned on 16-element boundaries. This divides the table into 16-element "buckets", and is known as a bucketed hash table. This is simple but typically not optimal. For both quadratic probing and cuckoo probing it is typically better to allow SIMD probes to be unaligned, starting at a position determined directly by the initial hash code. This reduces "clustering" effects and lowers average probe length:

[Figure: probe length distribution for unaligned versus aligned cuckoo probing; unaligned probing resolves more lookups at probe length 1.]

[Figure: probe length distribution (1–8) for unaligned versus aligned quadratic probing; unaligned probing shifts the distribution toward shorter probes.]

For cuckoo hashing there is partial theoretical validation of this approach. Caveats: unaligned probing has slightly worse insertion times than aligned probing, mostly because the tricks of the following section do not apply. Additionally, unaligned probing is only compatible with Indirect SIMD layout.

Producing two hash functions

An overhead of cuckoo hashing is that you must evaluate two hash functions rather than one, and evaluating hash functions can be expensive. Here are a few options, all of which are fast and work well in practice, even though they somewhat violate the "hash function independence" assumptions required by the cuckoo hashing theory.

If you prioritize lookup performance over insertion performance, the fastest choice is to compute the second hash function from the first by hash1 = hash0.rotate_left(32).
This takes just 1 instruction and 1 clock cycle on most current CPUs.

If you care about insertion performance and you are using the Indirect SIMD table layout, you'd like to be able to switch between the two hash functions using only the single byte in the tag array, without consulting the payload array. A good choice, recommended by the MemC3 paper, is XOR: hash1 = hash0 ^ hash(tag), where tag is the 1-byte tag. As the paper notes, this construction lets you go from the current bucket b of an entry to the alternate bucket b2 by b2 = b ^ hash(tag), without needing to know whether b came from hash0 or hash1. Many options for computing hash(tag) work, so long as they produce 64-bit outputs rather than 8-bit outputs. A simple option is to compute hash(tag) by lookup in a static 256 × u64 table populated at compile time with random numbers. Another option is tag * MUL for some randomly generated odd number MUL.

If you are using Direct SIMD layout rather than Indirect SIMD layout, you have the whole key available. A simple option is then hash1 = hash0 ^ hash0.rotate_left(32). You can move between buckets b and b2 by b2 = b ^ hash(key).rotate_left(32).

The previous two approaches rely on a power-of-2 table size for the "XOR trick" to work. The trick can be generalized to arbitrary table sizes with a few more instructions. Let bucket_count be the number of buckets. Mathematically, we'd like to compute hash1 = (multiply_high(tag_hash, bucket_count) - hash0) % bucket_count. To avoid the expensive modulo (%) operator, we instead lower it to a predicated subtraction: compute t = multiply_high(tag_hash, bucket_count), subtract hash0, and add bucket_count back if the subtraction wrapped below zero. Because this maps b to (t - b) mod bucket_count, applying it twice returns to the original bucket, just like the XOR trick. The whole snippet executes in just 4 fast instructions on x86-64 and AArch64.

Efficient search during insertion

When inserting into a cuckoo table and there's no free space in either bucket, you need to search for a sequence of element moves that makes space.
The main strategies from the literature are (a) "random walk": randomly pick an element to evict and recurse, or (b) "breadth-first search": consider evicting all elements in parallel, recursively. Much of the literature implies that random walk is faster in practice, because the search state can be kept in CPU registers, whereas breadth-first search requires maintaining a BFS queue in memory. However, consistent with the second libcuckoo paper, we find that breadth-first search is faster. As that paper notes, breadth-first search fetches many buckets in parallel (improving memory-level parallelism), whereas random walk amounts to following a linked list and offers zero memory-level parallelism.

The only necessary search state is a BFS queue of bucket pointers. Parent pointers of the search tree don't need to be explicitly stored; they can be recovered by index arithmetic similar to binary heaps, by observing that the BFS queue stores (two interleaved) level-ordered complete N-ary trees of buckets.

Given that the breadth-first search almost always terminates within 1–4 nodes, the most important optimization is minimizing the setup time. A stack-allocated, uninitialized BFS queue achieves that; this makes BFS much less attractive in managed languages like Java that don't support this functionality. In addition to being faster, BFS can also sustain higher load factors than random walk, which saves memory.
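One way to realize that index arithmetic is to give the flat queue two roots, at slots 0 and 1 (the key's two candidate buckets), and store each node's children contiguously. This is a sketch under those assumptions; real implementations may lay the queue out differently:

```rust
// Pointer-free BFS queue layout: two level-ordered complete F-ary trees
// share one flat array. Slots 0 and 1 are the roots; the j-th child of
// node i sits at slot F*i + 2 + j, so a node's parent is recovered by
// arithmetic alone, as in a binary heap.
const F: usize = 4; // candidate moves explored per evicted entry

fn child_slot(i: usize, j: usize) -> usize {
    F * i + 2 + j
}

fn parent_slot(k: usize) -> usize {
    debug_assert!(k >= 2, "slots 0 and 1 are roots");
    (k - 2) / F
}
```

Once the search finds an empty bucket at some slot, repeatedly applying `parent_slot` recovers the chain of evictions back to one of the two roots, without any stored parent pointers.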