
Faking a JPEG


25th March 2025


I've been wittering on about Spigot for a while. It's a small web application which generates a fake hierarchy of web pages, on the fly, using a Markov chain to make gibberish content for aggressive web crawlers to ingest.
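The gibberish generation can be sketched with a toy word-level Markov chain: build a table mapping each word to the words observed to follow it, then walk the table, picking a random successor at each step. This is an illustrative sketch, not Spigot's actual code; the corpus and chain order here are made up.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the sample text."""
    chain = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, length=20, seed=None):
    """Walk the chain, choosing a random successor at each step."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (a word with no recorded successor): jump to a random word.
        word = rng.choice(followers) if followers else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)

sample = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(sample)
print(generate(chain, length=12, seed=1))
```

Because each step only consults a small lookup table, generating a page of this stuff costs almost nothing, which is the whole point of the exercise.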

Spigot has been sitting there, doing its thing, for a few months now, serving over a million pages per day. I've not really been keeping track of what it's up to, but every now and then I look at its logs to see what crawlers are hitting it.

Sadly, two of the hardest-hitting crawlers go to extreme lengths to hide their identity, generating random, and unlikely, browser signatures (e.g. 64-bit Firefox version 134.0 on Windows 98!) and accessing from random addresses. It seems quite likely that this is being done via a botnet, illegally abusing thousands of people's devices. Sigh.

Where I can identify a heavy hitter, I add it to the list on Spigot's front page so I can track crawler behaviour over time.

Anyway... a couple of weeks ago, I noticed a new heavy hitter, "ImageSiftBot". None of Spigot's output contained images, but ImageSiftBot was busily hitting it with thousands of requests per hour, desperately looking for images to ingest. I felt sorry for its thankless quest and started thinking about how I could please it.

My primary aim, for Spigot, is that it should sit there, doing its thing, without eating excessive CPU on my server. Generating images on the fly isn't trivial in terms of CPU load. If I want to create a bunch of pixels, in a form that a crawler would believe, I pretty much have to supply compressed data. And compressing on the fly is CPU-intensive. That's not going to be great for Spigot, and is a complete waste when we're just generating throw-away garbage in any case.

I got to thinking: compression tends to increase the entropy of a bit stream. If a file's contents don't look random, then it's compressible, and an optimally compressed set of data would be more or less indistinguishable from random data. JPEGs are pretty well compressed. So the compressed data in a JPEG will look random, right?
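That intuition is easy to check: measure the Shannon entropy, in bits per byte, of some structured data before and after compression, and compare it with genuinely random bytes. A quick illustrative experiment (not from the original post):

```python
import math
import os
import zlib
from collections import Counter

def byte_entropy(data):
    """Shannon entropy in bits per byte; 8.0 is indistinguishable from random."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = " ".join(str(i) for i in range(5000)).encode()  # structured, repetitive
compressed = zlib.compress(text, 9)
noise = os.urandom(len(compressed))

print(f"plain:      {byte_entropy(text):.2f} bits/byte")
print(f"compressed: {byte_entropy(compressed):.2f} bits/byte")
print(f"random:     {byte_entropy(noise):.2f} bits/byte")
```

The plain text scores low (it only uses a handful of byte values), while the compressed stream should land close to the random stream, which is exactly the point: well-compressed data looks like noise.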

If I had a template for a JPEG file, which contained the "structured" parts (info on size, colour depth, etc) and tags indicating where highly compressed data goes, I could construct something that looks like a JPEG by just filling out the "compressed" areas with random data. That's a very low-CPU operation. The recipient would see something that looks like a JPEG and would treat the random data as something to decompress.
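As a sketch of that idea (the template format and names here are hypothetical; the post doesn't show Spigot's implementation): a template is a list of chunks, each either literal header bytes copied verbatim or a count of random filler bytes. One JPEG wrinkle worth respecting even in a fake: in entropy-coded scan data, 0xFF introduces a marker, so the filler avoids that byte value entirely.

```python
import random

def random_scan_bytes(n, rng=random):
    """n bytes of filler for a fake entropy-coded segment. 0xFF is special in
    JPEG scan data (it starts a marker), so we simply never emit it, rather
    than bothering with the 0xFF 0x00 byte-stuffing a real encoder would use."""
    return bytes(rng.randrange(0xFF) for _ in range(n))  # 0x00..0xFE

def fill_template(template):
    """Assemble a fake file: copy literal chunks, random-fill the rest."""
    out = bytearray()
    for kind, value in template:
        if kind == "literal":
            out += value
        else:  # "random": value is a byte count
            out += random_scan_bytes(value)
    return bytes(out)

# Toy template: SOI marker, a stand-in for the structured header segments
# (JFIF/DQT/SOF/SOS in a real JPEG), fake scan data, EOI marker.
template = [
    ("literal", b"\xff\xd8"),           # SOI
    ("literal", b"<headers go here>"),  # placeholder, not valid JPEG segments
    ("random", 1024),                   # "compressed" data: just noise
    ("literal", b"\xff\xd9"),           # EOI
]
fake = fill_template(template)
```

Filling a template this way is just copies and random-byte generation, so the per-request cost stays tiny, in keeping with Spigot's low-CPU goal.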
