
A valid HTML zip bomb


Many sites have been hammered by the aggressive web crawlers that harvest training data for LLMs.

I've been relatively spared so far, but ever since the phenomenon started, I've been looking for a countermeasure to deploy.

Today, I present a zip bomb, in both gzip and brotli form, that is valid HTML.

The initial problem is the aggressiveness of LLM web crawlers that don't respect robots.txt. The first idea that comes to mind is IP blocking. However, crawlers circumvent this restriction by spreading requests across many individual IPs via specialized botnets.

Another approach is therefore to exhaust the harvesters' resources. With a zip bomb, we attempt to exhaust their RAM.

We’re exploiting the asymmetry of the resources needed to serve the zip bomb versus those needed to detect it. Naturally, I’m going to try to minimize the resources needed to distribute the zip bomb.
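To make that asymmetry concrete, here is a minimal sketch of serving a pre-compressed payload with `Content-Encoding: gzip`, so the server only ships the small compressed bytes while the client's decoder does all the inflation work. This is an illustration using Python's standard library, not the author's actual setup; the 1 MiB stand-in payload replaces the real bomb file.

```python
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

# Small stand-in payload: 1 MiB of zeros, compressed once at startup.
# In practice you would serve a large pre-compressed file from disk.
PAYLOAD = gzip.compress(b"\x00" * (1 << 20), compresslevel=9)

class BombHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Send the already-compressed bytes as-is; the crawler's gzip
        # decoder performs all the expansion on its own side.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):  # keep the demo quiet
        pass

# To run: HTTPServer(("", 8080), BombHandler).serve_forever()
```

Serving the compressed bytes verbatim costs the server almost nothing per request; only clients that actually decode the response pay the memory cost.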

The most basic gzip bomb consists of zeros.

$ dd if=/dev/zero bs=1M count=10240 | gzip -9 > 10G.gzip

That's not bad: DEFLATE's theoretical ratio is 1032:1 (roughly 1030:1 in practice), so our 10 GiB of zeros compresses down to a file of ~10 MiB.
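The ratio can be checked in-process without writing a 10 GiB file; this sketch compresses 10 MiB of zeros at the maximum level and measures the result (the exact figure varies slightly with the zlib build):

```python
import gzip

# Reproduce the dd pipeline's ratio in memory: compress 10 MiB of zeros
# at the maximum level and compare raw vs. compressed size.
raw = b"\x00" * (10 * 1024 * 1024)
compressed = gzip.compress(raw, compresslevel=9)
ratio = len(raw) / len(compressed)
print(f"~{ratio:.0f}:1")  # close to DEFLATE's theoretical 1032:1 limit
```

The ratio is capped at 1032:1 because DEFLATE's longest back-reference covers 258 bytes and its shortest symbol encoding is 2 bits: 258 × 8 / 2 = 1032.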

The problem is that web browsers parse the page incrementally as it streams in, and quickly detect that it isn't a valid HTML page.
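The article is truncated here, but one plausible way to reconcile the two constraints can be sketched (this is an illustration, not necessarily the author's construction): stream a valid HTML prologue, then megabytes of repeated, syntactically valid markup, which compresses almost as well as zeros, and close the tags at the end.

```python
import gzip

def valid_html_bomb(out_path="bomb.html.gz", body_mib=64):
    """Sketch: a gzip stream that inflates to a syntactically valid HTML
    document. Repeated markup compresses near DEFLATE's limit, just like
    /dev/zero, but a streaming parser sees well-formed HTML throughout."""
    chunk = b"<p>A</p>" * (1024 * 1024 // 8)  # 1 MiB of valid, repetitive markup
    with gzip.open(out_path, "wb", compresslevel=9) as f:
        f.write(b"<!DOCTYPE html><html><head><title>hi</title></head><body>")
        for _ in range(body_mib):
            f.write(chunk)
        f.write(b"</body></html>")

# Usage: valid_html_bomb()  # writes a small file that inflates to ~64 MiB
```

The file name, sizes, and the repeated `<p>A</p>` filler are all arbitrary choices for the sketch; the point is only that "highly compressible" and "valid HTML" are not mutually exclusive.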
