One reaction GIF. Used constantly in posts, PMs, everywhere. Each use in a different security context creates a new copy. 246,173 copies of Rachel from Friends doing a happy dance.
It started with backup issues. Sites with hundreds of gigabytes of uploads were running out of disk space during backup generation. One site had 600+ GB of uploads and the backup process kept dying.
While looking into reliable large backups, we discovered something wild in one of those sites: the actual unique content was a fraction of the reported size. They were storing the same files over and over again, each with a different filename. The duplication was absurd.
So we shipped an optimization: detect duplicate files by their content hash, then hardlink each duplicate to a single downloaded copy instead of downloading it again. I wrote some new tests, they all passed, and the change was approved and merged. Unfortunately, a fix like this is hard to test fully.
Then someone ran it on a real production backup and hit a filesystem limit I didn't know existed. The culprit? A single reaction GIF, duplicated 246,173 times...
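To make the optimization concrete, here is a minimal sketch of hash-based deduplication with hardlinks. It is not the code that shipped in Discourse. The names `file_digest` and `stage_with_hardlinks` are hypothetical, a local file copy stands in for the download step, and SHA-256 is my choice of content hash. The sketch also assumes that the staging directory lives outside the source directory. Filesystems cap how many hard links a single file can have, so it falls back to a fresh copy when `os.link` fails with `EMLINK`.

```python
import errno
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def stage_with_hardlinks(source_dir: Path, staging_dir: Path) -> dict[str, int]:
    """Copy each unique file once; hard-link every later duplicate to it.

    Hypothetical helper: copying from source_dir stands in for downloading.
    """
    first_copy: dict[str, Path] = {}  # content digest -> staged file to link to
    stats = {"copied": 0, "linked": 0}

    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        dest = staging_dir / src.relative_to(source_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        target = first_copy.get(digest := file_digest(src))

        if target is not None:
            try:
                os.link(target, dest)  # new name for the same data, no extra blocks
                stats["linked"] += 1
                continue
            except OSError as exc:
                # EMLINK means the per-file hard-link limit was reached.
                if exc.errno != errno.EMLINK:
                    raise

        # First time this content is seen, or the link limit was hit:
        # make a real copy and point later duplicates at this one.
        shutil.copy2(src, dest)
        first_copy[digest] = dest
        stats["copied"] += 1

    return stats
```

Switching the link target to the newest real copy after an `EMLINK` keeps the number of full copies low even when one file is duplicated far beyond the limit.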
The Problem
Discourse has a feature called secure uploads. When a file moves between security contexts (say, from a private message to a public post), the system creates a new copy with a randomized SHA1. The original content is identical, but Discourse treats it as a new file.
This happens constantly with reaction GIFs and popular images. Users share them across posts, embed them in PMs, repost in different categories. Each context creates another copy.
This is mostly fine for normal operation. But for backups, it's a disaster.
One customer had 432 GB of uploads. Unique content? 26 GB. The rest was duplicates. That is roughly a 16x inflation factor, all going into the backup archive.
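The inflation factor is just total size divided by unique size: 432 GB over 26 GB works out to about 16.6. A rough way to measure it on a local directory might look like the sketch below. The function name `duplication_report` is hypothetical, SHA-256 is an assumed choice of content hash, and this is not how Discourse itself reports upload sizes.

```python
import hashlib
from pathlib import Path


def duplication_report(root: Path, chunk_size: int = 1 << 20) -> dict[str, float]:
    """Compare total bytes under root with the bytes of distinct file contents."""
    total_bytes = 0
    unique_bytes = 0
    seen: set[str] = set()  # digests of contents already counted

    for path in root.rglob("*"):
        if not path.is_file():
            continue
        size = path.stat().st_size
        total_bytes += size

        # Hash the file in chunks so large uploads do not need to fit in memory.
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)

        digest = h.hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_bytes += size

    # Guard against division by zero for empty directories or all-empty files.
    factor = total_bytes / unique_bytes if unique_bytes else 1.0
    return {
        "total_bytes": total_bytes,
        "unique_bytes": unique_bytes,
        "inflation_factor": factor,
    }
```

Running something like this before a backup gives an early hint of whether deduplication is worth the extra hashing work.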