Humanely Dealing with Humungus Crawlers

I host a bunch of hobby code on my server. I would think it’s really only interesting to me, but it turns out every day, thousands of people from all over the world are digging through my code, reviewing years-old changesets. On the one hand, wow, thanks, this is very flattering. On the other hand, what the heck is wrong with you?

This has been building up for a while, and I’ve been intermittently developing and deploying countermeasures. It’s been a lot like solving a sliding block puzzle. Lots of small moves and changes, and eventually it starts coming together.

My primary principle is that I’d rather not annoy real humans more than strictly intended. If there’s a challenge, it shouldn’t be too difficult, but ideally, we want to minimize the number of challenges presented. You should never suspect that I suspected you of being an enemy agent.

The first measure is that we only challenge on the deep URLs. So, for instance, I can link to the anticrawl repo no problem, or even the source for anticrawl.go, and that’ll be served immediately. All the pages any casual browser would visit make up less than 1% of the possible URLs that exist, but probably contain 99% of the interesting content.
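
A stand-in for that split, just to make the shape concrete. The pattern below is a guess based on the examples in this post, not the actual anticrawl rules:

    package main

    import (
        "fmt"
        "regexp"
    )

    // deep is a guessed pattern for changeset-style URLs. Repo front pages and
    // file sources fall outside it and are served without a challenge.
    var deep = regexp.MustCompile(`^/r/[^/]+/v/[0-9a-f]+$`)

    func isDeep(path string) bool {
        return deep.MatchString(path)
    }

    func main() {
        fmt.Println(isDeep("/r/anticrawl"))              // false: repo front page
        fmt.Println(isDeep("/r/vertigo/v/b5ea481ff167")) // true: changeset view
    }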

Also, these pages get cached by the reverse proxy first, so anticrawl doesn’t even evaluate them. We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.
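
The ordering is the part that matters. In real life the cache is a separate reverse proxy sitting in front, not Go middleware, but the shape is roughly this, with all the names made up:

    package main

    import (
        "fmt"
        "net/http"
    )

    // withCache stands in for the reverse proxy. Cache hits are answered
    // immediately and never reach the challenge layer. (Populating the cache
    // on a miss is omitted here.)
    func withCache(next http.Handler) http.Handler {
        cached := map[string][]byte{}
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if body, ok := cached[r.URL.Path]; ok {
                w.Write(body) // already rendered once: serve it, no challenge
                return
            }
            next.ServeHTTP(w, r) // only misses fall through to anticrawl
        })
    }

    // withAnticrawl is where the challenge and verification logic would live.
    func withAnticrawl(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // decide whether to challenge, verify responses, etc.
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        render := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "rendered page")
        })
        http.ListenAndServe(":8080", withCache(withAnticrawl(render)))
    }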

The next step is that anybody loading style.css gets marked friendly. Big Basilisk doesn’t care about my artisanal styles, but most everybody else loves them. So if you start at a normal page, and then start clicking deeper, that’s fine, still no challenge. (Sorry lynx browsers, but don’t worry, it’s not game over for you yet.)
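
A minimal sketch of that heuristic, assuming tracking by client IP with an arbitrary one-hour expiry, neither of which is necessarily how anticrawl does it:

    package main

    import (
        "net"
        "net/http"
        "sync"
        "time"
    )

    var (
        mu       sync.Mutex
        friendly = map[string]time.Time{} // client IP -> trust expiry (assumed scheme)
    )

    func markFriendly(ip string) {
        mu.Lock()
        defer mu.Unlock()
        friendly[ip] = time.Now().Add(time.Hour)
    }

    // isFriendly is what the challenge code would consult before bothering anyone.
    func isFriendly(ip string) bool {
        mu.Lock()
        defer mu.Unlock()
        return time.Now().Before(friendly[ip])
    }

    func main() {
        http.HandleFunc("/style.css", func(w http.ResponseWriter, r *http.Request) {
            ip, _, _ := net.SplitHostPort(r.RemoteAddr)
            markFriendly(ip) // crawlers rarely fetch the stylesheet; humans do
            http.ServeFile(w, r, "style.css")
        })
        http.ListenAndServe(":8080", nil)
    }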

And then let’s say somebody directly links to a changeset like /r/vertigo/v/b5ea481ff167. The first visitor will probably hit a challenge, but then we record that URL as in use. The bots are shotgun crawling all over the place, but if a single link is visited more than once, I’ll assume it’s human traffic, and bypass the challenge. No promises, but clicking that link will most likely just return content, no challenge.
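
In sketch form, with the bookkeeping simplified. Exactly when a URL gets recorded, and whether counts ever expire, is glossed over here:

    package main

    import (
        "fmt"
        "sync"
    )

    var (
        mu   sync.Mutex
        seen = map[string]int{} // deep URL -> visit count
    )

    // shouldChallenge returns true only for the first visit to a given URL.
    // Once a link has been followed more than once, it's presumed human traffic.
    func shouldChallenge(path string) bool {
        mu.Lock()
        defer mu.Unlock()
        seen[path]++
        return seen[path] == 1
    }

    func main() {
        fmt.Println(shouldChallenge("/r/vertigo/v/b5ea481ff167")) // true: first visitor
        fmt.Println(shouldChallenge("/r/vertigo/v/b5ea481ff167")) // false: link is in use
    }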

The very first version of anticrawl relied on a weak POW challenge (find a SHA hash with first byte 42), just to get something launched, but this does seem counterintuitive. Why are we making humans solve a challenge optimized for machines? Instead I have switched to a much more diabolical challenge. You are asked how many Rs in strawberry. Or maybe something else. To be changed as necessary. But really, the key observation is that any challenge, anything at all, easily sheds like 99.99% of the crawling load.
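
For reference, the retired proof of work amounted to roughly this. SHA-256 and the hash input format are assumptions; the only fixed detail is the first byte being 42:

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // solve grinds nonces until the digest's first byte is 42, which takes about
    // 256 attempts on average. Hashing challenge+nonce is an assumed format.
    func solve(challenge string) int {
        for nonce := 0; ; nonce++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
            if sum[0] == 42 {
                return nonce
            }
        }
    }

    func main() {
        fmt.Println("nonce:", solve("some challenge"))
        // The replacement is a plain question, checked server-side against the
        // expected answer, with nothing on the page that solves it for you.
    }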

Notably, because the challenge does not include its own javascript solver, even a smart crawler isn’t going to solve it automatically. If you include the solution on the challenge page, at least some bots are going to use it. All anticrawl challenges now require some degree of contemplation, not just blind interpretation.

It took a few iterations because the actual deployment involves a few pieces. I had to reduce the style.css cache time, so that visitors would periodically refresh it (and thus their humanity). And then exclude it from the caching proxy, so that the request would be properly observed. Basically, a few minutes tinkering now and then while I wait for my latte to arrive, and now I think I’ve gotten things to the point where it’s unlikely to burden anybody except malignant crawlers.
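
Sketching the style.css tweak as headers: the ten-minute lifetime and the use of Cache-Control: private to keep the shared cache out of it are guesses, and the real exclusion may simply be proxy configuration:

    package main

    import "net/http"

    func main() {
        http.HandleFunc("/style.css", func(w http.ResponseWriter, r *http.Request) {
            // short browser lifetime, and ask shared caches not to store it at all
            w.Header().Set("Cache-Control", "private, max-age=600")
            // mark the visitor friendly here, as in the earlier sketch
            http.ServeFile(w, r, "style.css")
        })
        http.ListenAndServe(":8080", nil)
    }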
