My favourite small hash table

I'm the kind of person who thinks about the design and implementation of hash tables. One design which I find particularly cute, and I think deserves a bit more publicity, is Robin Hood open-addressing with linear probing and power-of-two table size. If you're not familiar with hash table terminology, that might look like a smorgasbord of random words, but it should become clearer as we look at some actual code.

To keep the code simple to start with, I'm going to assume:

Keys are randomly-distributed 32-bit integers. Values are also 32-bit integers. If the key 0 is present, its value is not 0 . The table occupies at most 32 GiB of memory.

Each slot in the table is either empty, or holds a key and a value. The combination of properties (1) and (2) allows a key/value pair to be stored as a 64-bit integer, and property (3) means that the 64-bit value 0 can be used to represent an empty slot (some hash table designs also need a special value for representing tombstones, but this design doesn't need tombstones). Combining a key and a value into 64 bits couldn't be easier: the low 32 bits hold the key, and the high 32 bits hold the value.

The structure for the table itself needs a pointer to the array of slots, the length of said array, and the number of non-empty slots. As the length is always a power of two, it's more useful to store length - 1 instead of length , which leads to mask rather than length , and property (4) means that mask can be stored as 32 bits. As the load factor should be less than 100%, we can assume count < length , and hence count can also be 32 bits. This leads to a mundane-looking:

struct hash_table_t { uint64_t * slots; uint32_t mask; uint32_t count; };

Property (1) means that we don't need to hash keys, as they're already randomly distributed. Every possible key K has a "natural position" in the slots array, which is just K & mask . If there are collisions, the slot in which a key actually ends up might be different to its natural position. The "linear probing" part of the design means that if K cannot be in its natural position, the next slot to be considered is (K + 1) & mask , and if not that slot then (K + 2) & mask , then (K + 3) & mask , and so on. This leads to the definition of a "chain": if K is some key present in the table, C K denotes the sequence of slots starting with K 's natural position and ending with K 's actual position. We have the usual property of open-addressing: none of the slots in C K are empty slots. The "Robin Hood" part of the design then imposes an additional rather interesting property: for each slot S in C K , Score(S.Index, S.Key) ≥ Score(S.Index, K) , where:

S.Index is the index of S in the slots array (not the index of it in C K ).

is the index of in the array (not the index of it in ). S.Key is the key present in slot S (i.e. the low 32 bits of slots[S.Index] ).

is the key present in slot (i.e. the low 32 bits of ). Score(Index, Key) is (Index - Key) & mask .

... continue reading