Our teammate Daniel introduced Count-Min Sketches in Instant (a sync engine you can spin up in less than a minute). Sketches were so small and so fast that I went down a rabbit hole learning about them. This post came out of the process.

I have read and re-read just about every one of P. G. Wodehouse's 71 novels. He's one of my favorite authors. Wodehouse can take seemingly silly plots (quite a few involving stealing pigs) and twist them until you're rapt with attention. And he's a master of the English language.

Wodehouse is known for eccentric diction. Instead of "Freddie walked over", he'll say "Freddie (shimmied | beetled | ambled) over". You may wonder: how many times did he use the word "beetle"? Well, I could tell you approximately how many times Wodehouse used any word in his entire lexicon, just by loading the data structure embedded in this image:

[Image: the sketch, saved as a PNG]

Compressed, it's 50 kilobytes and covers a 23 megabyte text file, or 3.7 million words. We can use it to answer count estimates with a 0.05% error rate and 99% confidence. (If you aren't familiar with the probability terms here, no worries, we'll go over them in this post.)

You can try it yourself right here:

[Interactive demo: type a word (chap, castle, soul, beetle, ...) and compare the sketch's estimate with the actual count]

## The Count-Min Sketch

The magic needed to make this happen is called the Count-Min Sketch: a data structure that can give you frequency estimates over giant amounts of data without becoming a giant object itself.

You could use it to make passwords safer: track all known passwords on the internet, and detect whenever someone chooses a common password. Or you could use it to estimate the popularity of links: update a sketch whenever a user looks at a tweet, and you can query for approximate views. Or use it to make databases faster: track the values of different columns, so you can estimate how many rows a filter would return. This is how we use them in Instant: our query planner decides which indexes to use based on estimates from sketches.

So how do Count-Min Sketches work? In this post we'll find out by building one from scratch, in JavaScript!

## Setup

Let's dust off Bun and spin up a project:

```bash
mkdir sketches
cd sketches
bun init
cat > wodehouse.txt << 'EOF'
At the open window of the great library of Blandings Castle, drooping like a wet sock, as was his habit when he had nothing to prop his spine against, the Earl of Emsworth, that amiable and boneheaded peer, stood gazing out over his domain.
EOF
```

We've just made an `index.ts` file, and a little toy `wodehouse.txt` that we can play with as we go along. Time to `bun run --watch`, and we're ready to hack!

```bash
bun run --watch index.ts
```

## An exact solution

First things first: let's write a straightforward algorithm. If we wanted to count words exactly, how would we do it? Well, we could read `wodehouse.txt`, parse each word, and count them. Here we go:

```typescript
import fs from 'fs';

const wodehouse = fs.readFileSync('wodehouse.txt', 'utf-8');

function toWords(text: string): string[] {
  return text
    .split('\n')
    .flatMap((line) => line.split(' '))
    .map((w) => w.trim().toLowerCase())
    .filter((w) => w);
}

function countWords(words: string[]): { [w: string]: number } {
  const result: { [w: string]: number } = {};
  for (const word of words) {
    result[word] = (result[word] || 0) + 1;
  }
  return result;
}

const exactCounts = countWords(toWords(wodehouse));
console.log('exactCounts', exactCounts);
```
This logs a little map in our terminal:

```
exactCounts {
  at: 1,
  the: 3,
  // ...
  "castle,": 1,
  drooping: 1,
  // ...
  "domain.": 1,
}
```

It works, but we have a few problems.

## Stems

What if the word "castle" was used without a comma? Or if instead of "drooping" Wodehouse wrote "drooped"? We would get different counts. It would be nice if we could normalize each word so that no matter how Wodehouse wrote "droop", we'd get the same count.

This is a common natural-language-processing task called "stemming". There are some great algorithms and libraries for this, but for our post we can write a rough function ourselves:

```typescript
function stem(word: string) {
  let w = word.toLowerCase().replaceAll(/[^a-z]/g, '');
  if (w.endsWith('ing') && w.length > 4) {
    w = w.slice(0, -3);
  } else if (w.endsWith('ed') && w.length > 3) {
    w = w.slice(0, -2);
  } else if (w.endsWith('s') && w.length > 3 && !w.endsWith('ss')) {
    w = w.slice(0, -1);
  } else if (w.endsWith('ly') && w.length > 3) {
    w = w.slice(0, -2);
  } else if (w.endsWith('er') && w.length > 4) {
    w = w.slice(0, -2);
  } else if (w.endsWith('est') && w.length > 4) {
    w = w.slice(0, -3);
  }
  return w;
}

function toWords(text: string): string[] {
  return text
    .split('\n')
    .flatMap((line) => line.split(' '))
    .map(stem)
    .filter((w) => w);
}
```

With it our `console.log` starts to show stemmed words:

```
exactCounts {
  at: 1,
  the: 3,
  // ...
  castle: 1, // No more `,`
  droop: 1,  // No more `ing`!
  // ...
  domain: 1, // No more `.`
}
```

And now we have better exact counts. But there's another problem.

## Growth

What happens when you look at more words? Our `exactCounts` grows with the vocabulary of words:

[Interactive demo: add words and watch the counts object grow, e.g. adding "castle" produces `{ "castle": 1 }`]

This isn't too big of an issue with Wodehouse specifically: after all, the English dictionary itself could fit in memory. But as our vocabulary gets larger, our data structure gets more annoying. Imagine if we had to track combinations of words: suddenly keeping counts would take more space than the words themselves. Could we do something different?

## An intuition for sketches

Ideally, we would be able to divorce the size of our vocabulary from the size of our counts data structure. Here's one way to do that.

### Columns of Buckets

Our `exactCounts` was an unbounded hash map. Let's make a bounded version. We can spin up a fixed number of buckets. Each bucket stores a count. We then take a word, hash it, and increment its corresponding bucket. Here's how this could work:

[Interactive demo: `hash1()` maps "castle" (+454), "peer" (+189), and "wet" (+168) into a fixed row of buckets]

When we want to know the count of a word, we hash it, find the corresponding bucket, and that's our count:

[Interactive demo: querying "castle" returns 622, because "castle" (+454) and "wet" (+168) collided in one bucket; "peer" sits alone in a bucket holding 189]

With this we've solved our growth problem! No matter how large our vocabulary gets, our buckets stay a fixed size. But of course this comes with new consequences.

### The 'sketch' in sketches

Our counts become estimates. If you look at the demo, both 'wet' and 'castle' ended up in the same bucket. If we asked "How many times is 'castle' used?", we'd get 622.

Now, it does suck that we got 622 instead of 454 for 'castle'. But if you think about it, it's not such a big deal. Both words are used infrequently. Even when you put them together, they pale in comparison to more common words.
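To make the intuition concrete, here's a minimal single-row version of the idea. This isn't the data structure we'll build below (that one has multiple rows); it's just the bounded-buckets trick on its own, using Bun's built-in `xxHash3` as the hash function and a made-up bucket count of 4:

```typescript
// A toy, single-row "bucketed counter": a fixed array of buckets and one hash function.
// Purely illustrative; the real Count-Min Sketch comes later in the post.
const BUCKET_COUNT = 4;
const toyBuckets = new Uint32Array(BUCKET_COUNT);

function toyBucketIndex(word: string): number {
  // Hash the word and wrap it into one of our fixed buckets.
  return Number(Bun.hash.xxHash3(word, BigInt(0)) % BigInt(BUCKET_COUNT));
}

function toyAdd(word: string) {
  toyBuckets[toyBucketIndex(word)]!++;
}

function toyCheck(word: string): number {
  return toyBuckets[toyBucketIndex(word)]!;
}

toyAdd('castle');
toyAdd('wet');
// If 'castle' and 'wet' happen to collide, this estimate will be inflated.
console.log('toy estimate for castle', toyCheck('castle'));
```

Notice that the memory used is fixed by `BUCKET_COUNT`, no matter how many distinct words we feed it. The price is exactly the collision problem we just saw.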
And if you're worried about errors, we can already intuit a way to reduce them.

### More buckets, fewer errors

To reduce errors we can add more buckets. The more buckets we have, the fewer collisions we'll have, and the lower our chances of errors are. (You may wonder: how much lower do our errors get? We'll get to that soon!)

[Interactive demo: add a bucket and watch colliding words like "castle" (+454) and "wet" (+168) get split apart]

We may be feeling pretty good here, but we're not done yet. We're going to have a serious problem with high-frequency words.

### Managing frequencies

What happens if we add a word like 'like'? Say it landed where 'peer' was:

[Interactive demo: querying "peer" now returns 9,262, because "like" (+9,073) landed in the same bucket as "peer" (+189)]

If we asked for the count of 'peer', we'd now get back 9,262. That estimation is wildly inflated by 'like'. Not very useful. If we want to make our estimations better, we need a way to reduce the chance of very-high-frequency words influencing counts. How can we do this?

### Rows of Hashes

Here's one way to reduce the influence of high-frequency words: we'll add more hashes! We can set up several hash functions, each with its own row of buckets. To add a word, we go through each row, hash the word, and increment the corresponding bucket. Here's how this looks:

[Interactive demo: "castle" (+454), "peer" (+189), "like" (+9,073), and "wet" (+168) are each hashed by `hash1()` and `hash2()` into their own row of buckets]

When we want to know the count, we go through each row, find the corresponding bucket, and pick the minimum value we find.

[Interactive demo: querying "castle" gives min(622, 9,527) = 622; querying "peer" gives min(9,262, 189) = 189]

This is pretty cool: a particular word could get unlucky in one hash function, but as long as it gets a lucky bucket from some row, we'll get a respectable count.

We can look at 'peer' again for an example. `hash1` got us into the same bucket as 'like'. But `hash2` got us into our own bucket. That means a better estimation! And it also means we can intuit a way to improve our confidence even more.

### More hash functions...more confidence

To improve confidence we can add more hash functions. The more hash functions we have, the higher the chance that we find at least one good bucket. (You may wonder: how much more confident do we get? We'll get to that soon!)

[Interactive demo: add a row and watch "peer"'s estimate of 9,262 improve once a new hash gives it a cleaner bucket]

Of course, this depends on how correlated the hash functions are. We'll want to be sure that they are independent of each other, so adding a new hash function fully shuffles around the words.

If we do this right, and we build out columns of buckets and rows of hashes, we'll have our Count-Min Sketch!

## Implementing the Sketch

Let's go ahead and write out our ideas in code then.

### Creating a sketch

We'll kick off by typing our `Sketch`:

```typescript
type Sketch = {
  rows: number;
  columns: number;
  buckets: Uint32Array;
};
```

We keep track of `rows`, `columns`, and all of our `buckets`. Technically buckets are arranged as a matrix, so we could use an array of arrays to store them. But a single flat array of buckets is more efficient.
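To picture the flat layout: bucket `(rowIdx, columnIdx)` lives at offset `rowIdx * columns + columnIdx`. Here's a small, purely illustrative helper spelling out that mapping; the real `add` and `check` below just inline the same arithmetic:

```typescript
// Row-major mapping from (rowIdx, columnIdx) into the flat Uint32Array.
// Illustrative only: add() and check() below inline this calculation.
function bucketIndex(sketch: Sketch, rowIdx: number, columnIdx: number): number {
  return rowIdx * sketch.columns + columnIdx;
}
```

With 2 rows and 5 columns, for example, row 1, column 3 lands at index `1 * 5 + 3 = 8`.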
To make life easier, let's create a little builder function:

```typescript
function createSketch({
  rows,
  columns,
}: {
  rows: number;
  columns: number;
}): Sketch {
  return { rows, columns, buckets: new Uint32Array(rows * columns) };
}
```

If we use it, we've got ourselves a sketch!

```typescript
const sketch = createSketch({ rows: 2, columns: 5 });
console.log('created: ', sketch);
```

Our `console.log` shows us a nifty object!

```
created: {
  rows: 2,
  columns: 5,
  buckets: Uint32Array(10) [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
}
```

### Adding words

Alright, now for the meat and potatoes. Let's implement `add`. We want to say:

1. Take a word
2. For each row, hash it and find its corresponding bucket
3. Increment the corresponding bucket

Here we go:

```typescript
function add({ rows, columns, buckets }: Sketch, word: string) {
  for (let rowIdx = 0; rowIdx < rows; rowIdx++) {
    const hash = Bun.hash.xxHash3(word, BigInt(rowIdx));
    const columnIdx = Number(hash % BigInt(columns));
    const globalIdx = rowIdx * columns + columnIdx;
    buckets[globalIdx]!++;
  }
}
```

We go through each row. `xxHash3` takes a seed argument. We can pass the `rowIdx` in as our seed, so for every row we produce an independent hash value!

```typescript
const hash = Bun.hash.xxHash3(word, BigInt(rowIdx));
```

`columnIdx` tells us which bucket to use inside a particular row:

```typescript
const columnIdx = Number(hash % BigInt(columns));
```

And `globalIdx` accounts for the particular row that we're looking at:

```typescript
const globalIdx = rowIdx * columns + columnIdx;
```

Increment that bucket, and we're done!

```typescript
buckets[globalIdx]!++;
```

We can try it out and see how it feels.

```typescript
add(sketch, stem('castle'));
console.log('after castle', sketch);
```

```
after castle {
  rows: 2,
  columns: 5,
  buckets: Uint32Array(10) [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 0 ],
}
```

Suuper cool! Notice the two increments in `buckets`, accounting for our different rows.

### Getting counts

All that's left is to get a count. This is going to look similar to `add`. We want to:

1. Take a word
2. For each row, hash it and nab the corresponding bucket
3. Find the minimum value from all the corresponding buckets

Let's do it:

```typescript
function check({ rows, columns, buckets }: Sketch, word: string) {
  let approx = Infinity;
  for (let rowIdx = 0; rowIdx < rows; rowIdx++) {
    const hash = Bun.hash.xxHash3(word, BigInt(rowIdx));
    const columnIdx = Number(hash % BigInt(columns));
    const globalIdx = rowIdx * columns + columnIdx;
    approx = Math.min(approx, buckets[globalIdx]!);
  }
  return approx;
}
```

We do the same math to get our `globalIdx` for each row as we did in `add`. We track the minimum number we see, and we have our `check`! Let's try it out:

```typescript
console.log('check castle', check(sketch, 'castle'));
```

Aaand we get our result!

```
check castle 1
```

Congratulations, you've implemented a Count-Min Sketch!
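As a quick aside, this tiny API is already enough for the use cases from the intro. Here's a rough sketch of the common-password idea, with a hypothetical `recordLeakedPassword` helper; the sketch dimensions and the threshold of 100 are made up for illustration:

```typescript
// Hypothetical sketch of the "common password" idea from the intro.
const passwordSketch = createSketch({ rows: 5, columns: 100_000 });

// Feed every known leaked password into the sketch.
function recordLeakedPassword(password: string) {
  add(passwordSketch, password);
}

// A Count-Min Sketch only ever overcounts, so this may occasionally flag a rare
// password, but it will never wave a genuinely common one through.
function looksTooCommon(password: string): boolean {
  return check(passwordSketch, password) > 100;
}
```

The same shape works for the tweet-views and query-planner examples: `add` on every event, `check` whenever you need an estimate.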
## Getting real

Alright, now that we have a real Count-Min Sketch, let's put it to the test. We'll find out approximately how many times 'beetle' is used in Wodehouse's texts.

### Get all of Wodehouse

I went ahead and compiled all 61 novels from Project Gutenberg into one giant text file. You can go ahead and download it:

```bash
curl https://www.instantdb.com/posts/count_min_sketch/wodehouse-full.txt \
  -o wodehouse-full.txt
```

We have a `wodehouse-full.txt` file we can play with now. Let's load it up:

```typescript
const allWodehouse = fs.readFileSync('wodehouse-full.txt', 'utf-8');
```

### Getting exact counts

We can use our `toWords` and `countWords` to get a feel for the vocabulary:

```typescript
const allWords = toWords(allWodehouse);
const allExactCounts = countWords(allWords);
console.log('exact beetle', allExactCounts[stem('beetle')]);
```

If we look at "beetle", we can see it's used exactly 59 times. What would a sketch return?

### Trying out sketches

Let's create a sketch for our Wodehouse words:

```typescript
const allSketch = createSketch({ rows: 5, columns: 5437 });
```

And add our words:

```typescript
for (const word of allWords) {
  add(allSketch, word);
}
```

Now if we check out 'beetle':

```typescript
console.log('allSketch beetle', check(allSketch, stem('beetle')));
```

We'll see 78!

```
allSketch beetle 78
```

A bit over, but not so bad. If you're curious, try out different sizes and see what you get:

[Interactive demo: change the number of columns and rows, then compare estimates and actual counts for words like chap, castle, soul, beetle]

## A breather to celebrate

Congratulations! You just built a Count-Min Sketch from scratch, and used it on Wodehouse. If you'd like to see the full code example, I put it up in its entirety on GitHub. Hope you had a lot of fun :). If you're still curious, there's more to learn here. I present to you... 2 bonus sections!

## Bonus 1: Probabilities

When we created our sketch for Wodehouse, we chose some seemingly random numbers: 5437 columns and 5 rows. Is there a method to this madness? Absolutely. We can use some math to help set bounds around our estimations.

### Error Rate & Confidence

There are two numbers we can play with:

- The `errorRate` tells us how far off we expect our estimation to be.
- The `confidence` tells us how likely it is that we are actually within our estimation.

Let's make them concrete. The full text for Wodehouse is about 3.7 million words long (not unique words; here we are counting every occurrence). Say we want an error rate of 0.05% and a 99% confidence. 0.05% of 3.7 million is 1850. We are in effect saying:

"You can expect the estimation we give you to be overcounted by at most 1850, and we'll be right 99% of the time."

That's pretty cool! How can we be certain like this?

### Formulas

Turns out, you can tie the `errorRate` and the `confidence` to the number of rows and columns in a sketch! Here are the formulas:

Given an `errorRate`, get this many `columns`:

$$columns = \frac{e}{errorRate}$$

Given a `confidence`, get this many `rows`:

$$rows = \ln\left(\frac{1}{1 - confidence}\right)$$
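As a quick sanity check, plugging in the numbers from earlier in the post (and rounding up, since rows and columns must be whole numbers) gives back exactly the 5437 and 5 we used for the Wodehouse sketch:

$$columns = \left\lceil \frac{e}{0.0005} \right\rceil = \lceil 5436.56\ldots \rceil = 5437$$

$$rows = \left\lceil \ln\left(\frac{1}{1 - 0.99}\right) \right\rceil = \lceil \ln(100) \rceil = \lceil 4.605\ldots \rceil = 5$$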
Now how did we get these formulas? Let's derive them.

### Variables

We can start by writing out some of the numbers that we just went through. We have:

- The `totalWords`. This tells us how many occurrences have been counted in our sketch. For Wodehouse, that's 3.7M.
- The `errorRate`. How far off we expect our estimation to be, as a percentage of `totalWords`. For us it's 0.05%.
- The `maximumOvercount`. Our maximum allowed overestimation for a particular `totalWords`. In our case, it's 1850.
- The `confidence`. This tells us how likely we are to be within our estimation. We want 99%.

And our sketch has two properties that we can influence:

- The `columns`. This is the number of buckets in one row. We somehow picked 5,437 for our Wodehouse sketch.
- The `rows`. This is the number of hash functions in our sketch. We somehow picked 5 rows for our Wodehouse sketch.

### Goal

Our goal is to relate `errorRate` and `confidence` to a specific number of `columns` and `rows`.

### Tying errorRate to columns

To build our intuition, let's consider a sketch with only 1 row:

[Diagram: a single row of buckets fed by `hash1()`]

Say we ask for the count of a word ('wet'). Our hash function will direct us to a bucket. What would we see if we looked into that bucket?

[Diagram: the 'wet' bucket holds 420, made up of 'wet' (168) plus noise (252)]

Well, it would be composed of the "actual number of times" 'wet' was used, and the noise that comes from all the other collisions that hit our bucket. If we write this out:

$$bucket_{word} = actualCount_{word} + noise_{word}$$

### Expected Noise

Now here's a question: what is the expected value of our noise for a word?

The first thing we can remember is that our hash function distributes words uniformly across columns. This means that each word has a $\frac{1}{columns}$ chance of hitting our particular bucket. So if we write out our expectation, it would be:

$$expectedNoise_{word} = \frac{totalWords - actualCount_{word}}{columns}$$

### Simplifying Noise

If you think about it, do we really need to subtract the $actualCount_{word}$? We can simplify this formula by getting more conservative about what we promise. We can bound ourselves to the worst-case scenario, where we ask for a word that hasn't been seen before:

$$expectedNoise_{word} \le \frac{totalWords}{columns}$$

Pretty cool. Now we have a simple relation for our expected noise!
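To keep this grounded in our Wodehouse numbers (3.7 million total words, and the 5437 columns we'll end up with), the bound works out to roughly:

$$expectedNoise_{word} \le \frac{3{,}700{,}000}{5437} \approx 680$$

Keep that ~680 in mind: multiplied by $e$ it comes out to about 1850, which is exactly the `maximumOvercount` we promised earlier. That multiplication by $e$ is the next step.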
### Help from Markov

But an expected value for noise isn't useful yet. It just gives us an average. What we want is the probability that something is below a `maximumOvercount`. That's where Markov's Inequality comes in. Markov's Inequality is a result about random variables that says:

For any non-negative random variable, the probability that it is at least $n$ times its expected value is at most $\frac{1}{n}$.

To get concrete, if we plug $n = e$ into Markov's Inequality, we get:

The probability that something is at least $e$ times its expected value is at most $\frac{1}{e}$.

Well, our noise is a non-negative random variable. And we have its expected value. If we use Markov's Inequality we'll get a real probability that we can use!

$$P(\text{Noise} \ge e \times expectedNoise_{word}) \le \frac{1}{e}$$

### A maximumOvercount with about 63% confidence

Let's look at that probability a bit more:

$$P(\text{Noise} \ge e \times expectedNoise_{word}) \le \frac{1}{e}$$

Let's take its complement:

$$P(\text{Noise} \le e \times expectedNoise_{word}) \ge 1 - \frac{1}{e}$$

And to make things more concrete, $1 - \frac{1}{e}$ is about 0.63. What is it saying then? Let's write it out in English:

"The probability that noise is at most e times expectedNoise is at least ~63%."

If you squint, we are talking about a `maximumOvercount` with about 63% confidence! If we set `maximumOvercount` to $e \times expectedNoise$, we can say with $1 - \frac{1}{e}$ confidence that our estimation will be within our bounds!

### An errorRate with about 63% confidence

Now that we have a probability that uses `maximumOvercount`, let's start tying things back to `errorRate`. We said before:

"You can expect the estimation we give you to be overcounted by at most 1850."

Translated to a formula, this was:

$$3.7 \text{ million} \times 0.05\% \le 1850$$

If we use variables:

$$totalWords \times errorRate \le maximumOvercount$$

Now let's start expanding `maximumOvercount`, and see where we get:

$$totalWords \times errorRate \le e \times expectedNoise$$

And since we know `expectedNoise`:

$$totalWords \times errorRate \le \frac{e \times totalWords}{columns}$$

We've just tied `errorRate` and `columns` together! Let's keep going:

$$errorRate \le \frac{e}{columns}$$

$$columns \ge \frac{e}{errorRate}$$

Voila! We've gotten a formula for columns.

### A solution for 1 row

If our goal was to get a particular error rate with about 63% confidence, we could just set:

$$columns = \frac{e}{errorRate}$$

$$rows = 1$$

But 63% confidence kind of sucks. How can we improve that?

### Tying confidence to rows

Let's remember our initial Markov Inequality:

$$P(\text{Noise} \ge e \times expectedNoise_{word}) \le \frac{1}{e}$$

### All bad rows

When Noise > `maximumOvercount`, it basically means that our estimation has failed. We've gotten a "bad row", where the bucket has highly frequent words in it. In this case we can paraphrase our probability to:

$$P(\text{1 row is bad}) \le \frac{1}{e}$$

Now what happens if we add more rows?
What is the chance that 2 rows are bad? Since our hash functions are independent, we know that our probabilities will be too. This means:

$$P(\text{2 rows are bad}) \le \left(\frac{1}{e}\right)^{2}$$

Which generalizes. Given some number of rows, what is the probability that all rows are bad?

$$P(\text{all rows are bad}) \le \left(\frac{1}{e}\right)^{rows}$$

And now that we know the formula for "all rows are bad", we actually also know the formula for confidence.

### Confidence

As long as we get 1 good row, we know that we'll return a number within our estimation. In that case we can say our confidence is:

$$confidence = P(\text{at least 1 good row})$$

So what's the probability of at least 1 good row? It's the complement of getting all bad rows:

$$P(\text{at least 1 good row}) = 1 - P(\text{all rows are bad})$$

Which gets us:

$$confidence = 1 - P(\text{all rows are bad})$$

### Expanding things out

Since we know $P(\text{all rows are bad})$, let's expand it:

$$confidence = 1 - \left(\frac{1}{e}\right)^{rows}$$

Aand we've just connected `confidence` to `rows`! Let's keep going. Isolate the term for rows:

$$\left(\frac{1}{e}\right)^{rows} = 1 - confidence$$

Remember, $\left(\frac{1}{e}\right)^{rows}$ is the same as $e^{-rows}$:

$$e^{-rows} = 1 - confidence$$

Take the natural log of both sides:

$$\ln(e^{-rows}) = \ln(1 - confidence)$$

Simplify the left side:

$$-rows = \ln(1 - confidence)$$

$$rows = -\ln(1 - confidence)$$

Push the minus inside the ln:

$$rows = \ln\left(\frac{1}{1 - confidence}\right)$$

And we've gotten our formula for rows!

### Formulas to Code

Now we have formulas for both `columns` and `rows`!

$$columns = \frac{e}{errorRate}$$

$$rows = \ln\left(\frac{1}{1 - confidence}\right)$$

So if we wanted an error rate of 0.05% and a confidence of 99%, how many rows and columns would we need? Let's calculate it in JavaScript:

```typescript
function sketchWithBounds({
  errorRate,
  confidence,
}: {
  errorRate: number;
  confidence: number;
}): Sketch {
  const columns = Math.ceil(Math.E / errorRate);
  const rows = Math.ceil(Math.log(1 / (1 - confidence)));
  return createSketch({ rows, columns });
}
```

We try it out:

```typescript
const withBounds = sketchWithBounds({
  errorRate: 0.0005,
  confidence: 0.99,
});
console.log('withBounds', withBounds.columns, withBounds.rows);
```

And we got 5437 columns and 5 rows!

```
withBounds 5437 5
```
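To close the loop, we could rebuild the full-corpus sketch through `sketchWithBounds` instead of hard-coding the numbers. It produces the same 5 x 5437 shape we used earlier, so it should reproduce the estimate we already saw:

```typescript
// Rebuild the Wodehouse sketch from the error rate and confidence we want,
// rather than from hand-picked dimensions.
const boundedSketch = sketchWithBounds({ errorRate: 0.0005, confidence: 0.99 });
for (const word of allWords) {
  add(boundedSketch, word);
}
// With 0.05% error and 99% confidence we expect an overcount of at most ~1850;
// in practice 'beetle' comes out close to its exact count of 59.
console.log('boundedSketch beetle', check(boundedSketch, stem('beetle')));
```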
## Bonus 2: PNGs

Now, you may have wondered: how did we create our cool PNG? For posterity I thought I'd write out the algorithm. Let's start off by installing a library to create PNGs:

```bash
bun add pngjs
bun add -D @types/pngjs
```

Now, we'll take a series of bytes. One pixel can be expressed as RGBA, where each channel is one byte. So we can fit 4 bytes per pixel. Here's a quick function to do that:

```typescript
import { PNG } from 'pngjs';

function createPNG({
  width,
  buffer,
}: {
  width: number;
  buffer: Buffer;
}): Buffer {
  const bytesPerPixel = 4;
  const height = Math.ceil(buffer.length / (width * bytesPerPixel));
  const png = new PNG({
    width,
    height,
    colorType: 6,
  });
  for (let i = 0; i < png.data.length; i++) {
    png.data[i] = buffer[i] ?? 0;
  }
  return PNG.sync.write(png);
}
```

### A PNG for our Sketch

Let's pick up the `allSketch` we created before, and save it as a PNG:

```typescript
const compressedSketch = await Bun.zstdCompress(allSketch.buckets);
fs.writeFileSync(
  'compressedSketch.png',
  createPNG({ width: 150, buffer: compressedSketch }),
);
```

Aand we get our image!

[Image: the compressed sketch rendered as a PNG]

But you may wonder, how would it look if we saved the exact counts?

### A PNG for our exact counts

Let's try that. We can pick up our `allExactCounts`, and save it as a PNG too:

```typescript
const compressedExactCounts = await Bun.zstdCompress(
  JSON.stringify(allExactCounts),
);
fs.writeFileSync(
  'compressedExactCounts.png',
  createPNG({ width: 150, buffer: compressedExactCounts }),
);
```

Load it up, and we see:

[Image: the compressed exact counts rendered as a PNG]

Let's see them side by side:

[Images, side by side: Sketch | Exact Counts]

## Fin

Congratulations, you made it all the way through the bonus too! If you're into this stuff, I'd suggest reading Small Summaries for Big Data. It goes over the Count-Min Sketch, as well as a bunch of other probabilistic data structures. Plus, one of the co-authors invented the Count-Min Sketch!

Thanks to Joe Averbukh, Daniel Woelfel, Predrag Gruevski, Irakli Safareli, Nicole Garcia Fischer, Irakli Popkhadze, Mark Shlick, Ilan Tzitrin, and Drew Harris for reviewing drafts of this post.