The probability of a hash collision
Tags: probability
A hash function takes arbitrarily complex input - a word, a website, an image, a human being - and maps it to a single number. This is useful for various computer science stuffs, such as data storage and cryptography.
For example, let's say you want to store a book in one of $N$ boxes. If you put the book in a random box, it's quite likely that you'll forget which box you picked, especially as $N$ gets bigger. What you can do instead is apply a hash function to the title of the book (probably The Notebook, knowing you), which will spit out a number, $i$. You then put your book in the $i$-th box. When you want to retrieve The Notebook again, you recompute the hash function and it tells you to look in the $i$-th box, without you needing to remember where you stored anything! You won't lose a book from your Nicholas Sparks collection ever again.
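As a quick illustration (my own sketch, not something from the post), here's the box-indexing idea in Python, using SHA-256 as the hash function purely because it's readily available:

```python
import hashlib

N = 100  # number of boxes (an arbitrary choice for this example)

def box_for(title: str) -> int:
    """Hash a book title and map it to a box index in 0..N-1."""
    digest = hashlib.sha256(title.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N

i = box_for("The Notebook")
print(f"'The Notebook' goes in box {i}")
# Recomputing box_for("The Notebook") later gives the same i,
# so there's no need to remember where the book went.
```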
We won't get into the specifics of what a hash function looks like. This post is instead concerned with hash collisions, which happen when your hash function assigns the same value to two different items. This might be a Bad Thing. Our hash function wouldn't be very useful, for instance, if it told us to put all our books in the 1st box, because we'd have to spend a lot of time rooting around in that box to find a specific book. For this reason, a good hash function should have evenly distributed outputs. But even then, it's only a matter of time before a hash collision occurs as we add more and more books to our collection. By the time we reach $N+1$ books, there won't be enough boxes to store the books individually and there will definitely be at least 1 hash collision.
Given $N$ boxes and $k$ books, how do you figure out the probability of a hash collision? Hash collisions can be a Bad Thing, but rather than trying to eliminate them entirely (an impossible task), you might instead buy enough boxes that the probability of a hash collision is relatively low.
Jeff Preshing wrote a neat article on how to calculate hash collision probabilities, but there are some gaps in his explanation. This is my attempt to fill those gaps. To follow along you'll need a basic understanding of probability and an iron will.
The Birthday Problem
Consider the following problem statements.
1. When you compute the hash values of 500 different book titles from your magnificent book collection, and put each book into the corresponding box in a row of 100,000 boxes (yes, it's a lot of boxes), what's the probability that at least 2 books will be in the same box? (~71.3%)
2. When you put 50 balls into 100 buckets at random, what's the probability that at least 2 balls will end up in the same bucket? (~99.99997%)
3. When there are 30 people in a room, what's the probability that at least 2 of them will share a birthday? (~70.6%)
These are all versions of the notorious birthday problem. It's notorious because it's not immediately intuitive why the probabilities are so high. The key is that even though there are only 50 balls in problem #2, there are 1225 pairs of balls, and only one of those pairs needs to be in the same bucket. Problems #1 and #3 have the exact same structure. In problem #1, you can think of the books as balls and the boxes as buckets, and in problem #3, the people are balls and the days of the year are buckets.
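(If that 1225 seems to come out of nowhere: it's just the number of ways to pick 2 balls out of 50.)

$$\binom{50}{2} = \frac{50 \times 49}{2} = 1225.$$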
The problem of calculating hash collision probabilities IS the birthday problem. Let's look at how to solve the birthday problem.
Exact calculation of birthday probability
To calculate the exact probability of two or more people sharing a birthday in a room of $k$ people, it's easiest to first figure out the probability of none of them sharing a birthday. In this case, $N=365$, since instead of boxes there are 365 possible birthdays.
The first person to enter the room has probability
$$\frac{N}{N} = 1$$
of not sharing a birthday with anyone, because duh, there's nobody else in the room!
The probability that the first and second person to enter the room don't share a birthday is
$$\frac{N}{N} \times \frac{N-1}{N},$$
because there are $N-1$ out of $N$ days free for the second person.
The probability that the first, second and third person don't share a birthday is
$$\frac{N}{N} \times \frac{N-1}{N} \times \frac{N-2}{N}.$$
This continues until everyone is in the room and we get that the probability of no shared birthday is
$$\frac{N}{N} \times \frac{N-1}{N} \times \dots \times \frac{N-(k-1)}{N}.$$
The probability of at least 1 shared birthday, i.e. of a hash collision, is then the complement of there being no shared birthdays:
$$1 - \frac{N}{N} \times \frac{N-1}{N} \times \dots \times \frac{N-(k-1)}{N}. \tag{1}$$
And... that's it. That's the EXACT probability of a hash collision for $k$ items and $N$ possible hash function outputs! All right, pack your bags and go home. Nothing more to see here.
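If you'd rather let a computer do the multiplying, equation (1) translates directly into a few lines of Python (my own sketch, not code from Preshing's article):

```python
def exact_collision_probability(n: int, k: int) -> float:
    """Exact probability of at least one collision when k items
    are hashed into n possible outputs, per equation (1)."""
    p_no_collision = 1.0
    for i in range(k):
        p_no_collision *= (n - i) / n
    return 1.0 - p_no_collision

print(exact_collision_probability(365, 30))  # ~0.706, the classic birthday answer
```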
But actually, this formula gets clunky as $N$ and $k$ get big. Imagine you have 1 billion books. Do you really want to multiply 1 billion numbers together? I bet not. Even a computer will take a long time to multiply 1 billion numbers. For this reason, approximations of the formula are often used, as we will see next.
e-asy approximation
A first approximation of the hash collision probability can be derived by rewriting equation (1) so that the terms in the product are written as complements:
$$1 - \left(1 - 0\right)\left(1 - \frac{1}{N}\right)\dots\left(1-\frac{k-1}{N}\right)$$
Now, here comes a trick. It's a fact that $1-x \approx e^{-x}$ as $x$ gets small (see Appendix A for a proof, which was omitted from Preshing's article). You might notice that we rewrote the terms in the product to match the form $1-x$, where $x=i/N$ for varying $i$. Since $N$ is presumably much larger than $i$ and $x$ is therefore small, we can use our cute little approximation to get
$$\begin{aligned} &1 - \left(1 - 0\right)\left(1 - \frac{1}{N}\right)\dots\left(1-\frac{k-1}{N}\right) \\ &= 1 - e^{0} \times e^{-1/N} \times \dots \times e^{-(k-1)/N} \\ &= 1 - e^{-\frac{1+2+\dots+(k-1)}{N}} \\ &= 1 - e^{-\frac{k(k-1)}{2N}}. \tag{2} \end{aligned}$$
The last equality comes from the fact that $1+2+\dots+n = n(n+1)/2$, applied here with $n = k-1$.
This approximation reduces the complexity of computing the probability from $\mathcal{O}(k)$ to $\mathcal{O}(1)$. Or, phrased in non-computer-science terms: instead of doing around $k$ math operations, we only have to do a small fixed number of them! What's more, the approximation becomes more and more accurate as we increase $N$.
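As a sketch of what that buys you in code (again, my own illustration), equation (2) is a single expression:

```python
import math

def approx_collision_probability(n: int, k: int) -> float:
    """Approximate collision probability from equation (2):
    1 - exp(-k(k-1) / (2n)). Constant work, no loop over k."""
    return 1.0 - math.exp(-k * (k - 1) / (2 * n))

print(approx_collision_probability(365, 30))  # ~0.696, vs. the exact ~0.706
```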
Approximations all the way down
It's possible to simplify this equation even further, using the exact same trick as before! Note that the final line of equation (2) is in the form $1-e^{-y}$, where $y=\frac{k(k-1)}{2N}$. We've already seen that $1-e^{-y} \approx y$ when $y$ is small, and so the hash collision probability can be further approximated by:
$$\frac{k(k-1)}{2N}. \tag{3}$$
...which is funny, because $\frac{k(k-1)}{2}$ is the number of pairs of items, which I brought up earlier to provide an intuition for the birthday problem!
The final approximation suggested by Preshing is that $k(k-1)$ starts to look an awful lot like $k^2$ as $k$ gets bigger (see Appendix A), so the formula can be written simply as
$$\frac{k^2}{2N}. \tag{4}$$
And there you have it. 4 different methods of calculating the hash collision probability, with varying levels of accuracy.
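To see all four side by side, here's a self-contained sketch that runs them on problem #1 from earlier (500 books, 100,000 boxes); the numbers in the comments are my own back-of-the-envelope expectations:

```python
import math

N, k = 100_000, 500  # problem #1: 500 books, 100,000 boxes

def exact(n: int, k: int) -> float:
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return 1.0 - p

exp_approx    = 1.0 - math.exp(-k * (k - 1) / (2 * N))  # equation (2)
simple_approx = k * (k - 1) / (2 * N)                    # equation (3)
square_approx = k ** 2 / (2 * N)                         # equation (4)

print(exact(N, k))      # ~0.713
print(exp_approx)       # ~0.713 as well
print(simple_approx)    # 1.2475 -- not even a valid probability at this k
print(square_approx)    # 1.25   -- ditto; k is too large for the pair-count formulas
```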
Comparisons
How do the approximations fare? Here's a plot of their relative error compared to the exact birthday calculation, for $N=10000$.
[Figure: Relative errors of the three approximation methods: the exponential approximation (exp), the simplified version (simple), and the even more simplified version (simple-square).]
The relative error between two quantities $x$ and $y$ is $err(x,y)=\frac{\lvert x - y \rvert}{x}$, and the smaller it is, the better. If $err(x,y)<1$, the approximation is within an order of magnitude of the true value. We see that the exponential approximation gradually gets better as $k$ gets smaller. But also, once $k>300$ or thereabouts, the relative error drops like a stone. The simpler approximations are kinda reasonable for $k<100$, but after that point they go bonkers.
HERE'S A SAMPLE DATAPOINT. At $k=10$, the real collision probability is about $0.0044913$. The exponential approximation is about $0.0044899$, and if you squint a little, you could mistake them for the same answer. The simplified approximation is about $0.0045$, which is, you know, still quite good. And the simplest approximation is $0.005$, which is less quite good.
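If you want to reproduce that datapoint (and the relative errors behind the plot) yourself, here's a small self-contained snippet of my own:

```python
import math

def rel_err(x: float, y: float) -> float:
    """Relative error |x - y| / x, where x is the true value."""
    return abs(x - y) / x

N, k = 10_000, 10

exact         = 1.0 - math.prod((N - i) / N for i in range(k))  # ~0.0044913
exp_approx    = 1.0 - math.exp(-k * (k - 1) / (2 * N))          # ~0.0044899
simple_approx = k * (k - 1) / (2 * N)                           # 0.0045
square_approx = k ** 2 / (2 * N)                                # 0.005

for name, approx in [("exp", exp_approx),
                     ("simple", simple_approx),
                     ("simple-square", square_approx)]:
    print(name, rel_err(exact, approx))
```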
I was going to dig into the error behaviour of the approximations, but I think I'll get a sandwich instead. Something something exercise for the reader.
Conclusions
We have seen how to calculate the probability of a hash collision, as well as 3 different ways to approximate this probability. The exponential approximation appears to be robust. The other two are convenient for back-of-the-envelope calculations, but may lose their nerve as you add more books to your collection. Finally, if you want to play around with the birthday problem parameters, I've posted a hash collision calculator on the site.
Appendix A: Supporting proofs for approximations
How do we prove that the approximations we've used are "good" in some sense? Or more specifically, that:
1. $k(k-1) \sim k^2$ as $k \to \infty$.
2. $1-x \sim e^{-x}$ as $x \to 0$.
This might seem obvious (in the case of #2, just set x=0 and they're the same!), but I was wondering how you would go about it with slightly more rigour.
One approach is based on ratios. Use $f(x)$ to denote the function being approximated, and $g(x)$ to denote the approximation, and show that $\lim_{x \to a} \frac{f(x)}{g(x)}=1$. Since $\frac{u}{v}=1$ implies $u=v$, this means that $f(x)$ and $g(x)$ converge in some sense.
I've had a shot at the proofs, below. I'm not a mathematician, so don't take them seriously, although the sages of r/learnmath did review them for me.
For #1, the proof is:
$$\frac{k^2}{k(k-1)} = \frac{k^2}{k^2-k} = \frac{k^2}{k^2} \times \frac{1}{1-1/k} = \frac{1}{1-1/k} \mathrel{\underset{k \to \infty}{=}} \frac{1}{1-0} = 1.$$
For #2, we use the Taylor series expansion of $e^{-x}$,
$$e^{-x} = 1 - \frac{x}{1!} + \frac{x^2}{2!} - \dots$$
Applying this expansion to the ratio, we get
$$\frac{1-x}{e^{-x}} = \frac{1-x}{1-\frac{x}{1!}+\frac{x^2}{2!} - \dots} \mathrel{\underset{x \to 0}{=}} \frac{1 - 0}{1 - 0 + 0 - \dots} = 1.$$
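And a quick numerical sanity check of both ratios (my own addition, not part of the r/learnmath-reviewed proofs above):

```python
import math

# Ratio #1: k^2 / (k(k-1)) should approach 1 as k grows.
for k in (10, 100, 1_000, 10_000):
    print(k, k**2 / (k * (k - 1)))

# Ratio #2: (1 - x) / e^{-x} should approach 1 as x shrinks towards 0.
for x in (0.5, 0.1, 0.01, 0.001):
    print(x, (1 - x) / math.exp(-x))
```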
I'd be happy to hear from you at [email protected].