Last month I published a blog post about the parallel invention of Prolly Trees, where I observed the repeated independent invention of a certain type of data structure within a relatively short period of time.
In short, I described the concept of a Merkle Tree created by applying a content-defined chunker to a file, hashing the chunks, and then recursively reapplying the chunker to the concatenated list of hashes until only a single chunk remained. Each iteration of this process defined a separate row of the tree, with the leaves containing the chunks of the original file. This process allows for storing the version history of a dataset while achieving a high level of data deduplication.
That post kicked off enough conversation and debate that I felt it was worth taking another look at some of my claims. I definitely recommend at least skimming the original post before continuing here. I'll wait.
Okay, all caught up? Great. Let's continue.
Even More Parallel Invention
In my previous post, I included every implementation of this concept that I was aware of, but acknowledged that it wouldn't surprise me if there were more.
So given the amount of traction the original post got, it's not at all surprising that people chimed in with other examples that they either knew about or had created themselves:
The creator of rdedup pointed out how they independently invented that idea also using recursive content-defined chunking to build a Merkle Tree back in 2016.
compressedgas on HN brought up Jumbo Store, another implementation of nested content defined chunking used for storing files circa 2007, predating bup by two years.
A couple of other people mentioned other previous projects that used a content-defined chunker to build Merkle Trees, such as this proposal to add file chunking to git. However, that proposal is missing the most important bit, which is the idea that this chunking is applied recursively. In fact, the linked proposal by C. Scott Ananian explicitly rejects the idea of applying the process recursively, saying "I propose keeping this only one-level deep: we can only specify chunks, not pieces of files." So I'm not sure that counts.
... continue reading