Tech News
← Back to articles

My favourite small hash table

read original related products more articles

I'm the kind of person who thinks about the design and implementation of hash tables. One design which I find particularly cute, and I think deserves a bit more publicity, is Robin Hood open-addressing with linear probing and power-of-two table size. If you're not familiar with hash table terminology, that might look like a smorgasbord of random words, but it should become clearer as we look at some actual code.

To keep the code simple to start with, I'm going to assume:

Keys are randomly-distributed 32-bit integers. Values are also 32-bit integers. If the key 0 is present, its value is not 0 . The table occupies at most 32 GiB of memory.

Each slot in the table is either empty, or holds a key and a value. The combination of properties (1) and (2) allows a key/value pair to be stored as a 64-bit integer, and property (3) means that the 64-bit value 0 can be used to represent an empty slot (some hash table designs also need a special value for representing tombstones, but this design doesn't need tombstones). Combining a key and a value into 64 bits couldn't be easier: the low 32 bits hold the key, and the high 32 bits hold the value.

The structure for the table itself needs a pointer to the array of slots, the length of said array, and the number of non-empty slots. As the length is always a power of two, it's more useful to store length - 1 instead of length , which leads to mask rather than length , and property (4) means that mask can be stored as 32 bits. As the load factor should be less than 100%, we can assume count < length , and hence count can also be 32 bits. This leads to a mundane-looking:

struct hash_table_t { uint64_t * slots; uint32_t mask; uint32_t count; };

Property (1) means that we don't need to hash keys, as they're already randomly distributed. Every possible key K has a "natural position" in the slots array, which is just K & mask . If there are collisions, the slot in which a key actually ends up might be different to its natural position. The "linear probing" part of the design means that if K cannot be in its natural position, the next slot to be considered is (K + 1) & mask , and if not that slot then (K + 2) & mask , then (K + 3) & mask , and so on. This leads to the definition of a "chain": if K is some key present in the table, C K denotes the sequence of slots starting with K 's natural position and ending with K 's actual position. We have the usual property of open-addressing: none of the slots in C K are empty slots. The "Robin Hood" part of the design then imposes an additional rather interesting property: for each slot S in C K , Score(S.Index, S.Key) ≥ Score(S.Index, K) , where:

S.Index is the index of S in the slots array (not the index of it in C K ).

is the index of in the array (not the index of it in ). S.Key is the key present in slot S (i.e. the low 32 bits of slots[S.Index] ).

is the key present in slot (i.e. the low 32 bits of ). Score(Index, Key) is (Index - Key) & mask .

... continue reading