A summary of the techniques in place to protect my git forge
In August 2024, one of my roommates and partners messaged the apartment group chat to say she had noticed the internet was slow again at our place, and that my Forgejo instance could not render any page in under 15 seconds.
i investigated, thinking it would be a trivial little problem to solve. Soon enough, however, i would uncover hundreds of thousands of queries a day from thousands of individual IPs, fetching seemingly random pages on my forge around the clock.
This post summarizes the practical issues that arose as a result of the onslaught of scrapers eager to download millions of commits off of my forge, and the measures i put in place to limit the damage.
# Why the forge?
In the year 2025, on the web, everything is worth scraping. Everything that came out of a human mind is liable to be snatched up in the vastest labor-theft scheme in the history of mankind. This very article, the second it is published on any indexable page, will be added to countless datasets meant to train foundation large language models. My words, your words, have contributed to infinitesimal shifts in the neural-network weights underpinning the largest, most grotesque accumulation of wealth seen over the lifetimes of my parents, my grandparents, and their grandparents before them.
Oh, and forges have a lot of commits. See, if you expose a public repository, every file in every folder at every commit gets a link of its own. Add the other views, such as a git blame on a file, and multiply by the number of files and commits. Add the raw download link, also multiplied by the number of commits.
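To put rough numbers on that multiplication, here is a minimal sketch. The function name `estimate_urls`, the default of three views per file (file page, blame, raw download) and the sample figures are all assumptions for illustration, not measurements from any real forge, and a real forge exposes more kinds of pages than these.

```shell
# Rough estimate of crawlable URLs for one repository.
# $1: number of commits
# $2: average number of files per commit (hypothetical figure)
# $3: views exposed per file (assumed default 3: file page, blame, raw)
estimate_urls() (
    commits=$1
    files_per_commit=$2
    views_per_file=${3:-3}
    echo $(( commits * files_per_commit * views_per_file ))
)

# A toy repository: 1000 commits averaging 200 files each.
estimate_urls 1000 200   # prints 600000
```

Even a small repository ends up with far more distinct URLs than it has commits, which is why a crawler that follows every link keeps finding new pages.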
Say, hypothetically, you have a Linux repository available, with only the commits in the master branch up to the v6.17 tag from 2025-09-18. That's 1,383,738 commits in the range 1da177e4c3f4..e5f0a698b34e. How many files is that? Well:
count=0; while read -r rev; do point=$(git ls-tree -tr "$rev" | wc -l); count=$(( count + point )); printf "[%s] %s: %d (tot: %d)\n" "$(git log -1 --pretty=tformat:%cs "$rev")" "$rev" "$point" "$count"; done < <(git rev-list "1da177e4c3f4..e5f0a698b34e"); printf "Total: %d\n" "$count"