Our teammate Daniel introduced Count-Min Sketches in Instant (a sync engine you can spin up in less than a minute). Sketches were so small and so fast that I got into a rabbit hole learning about them. The following post came out of the process.
I have read and re-read just about every one of PG Wodehouse’s 71 novels. He’s one of my favorite authors. Wodehouse can take seemingly silly plots (quite a few involving stealing pigs) and twist them until you’re rapt with attention. And he’s a master of the English language.
Wodehouse is known for eccentric diction. Instead of "Freddie walked over", he’ll say "Freddie (shimmied | beetled | ambled) over". You may wonder, how many times did he use the word 'beetle'?
Well I could tell you approximately how many times Wodehouse used any word in his entire lexicon, just by loading the data structure embedded in this image:
Compressed, it's 50 kilobytes and covers a 23 megabyte text file, or 3.7 million words. We can use it to answer count estimates with 0.05% error rate and 99% confidence. (If you aren't familiar with the probability terms here, no worries, we'll go over them in this post.)
You can try it yourself right here:
Words in Wodehouse chap castle soul beetle Loading all counts... — Estimate — Actual
The Count-Min Sketch
The magic needed to make this happen is called the Count-Min Sketch — a data structure that can give you frequency estimates over giant amounts of data without becoming a giant object itself.
[ 1 ] You could use it to make passwords safer: track all known passwords on the internet, and detect whenever someone chooses a common password.
... continue reading