
A Better Zip Bomb


A better zip bomb
2019-07-02; updated 2019-07-03, 2019-07-05, 2019-07-06, 2019-07-08, 2019-07-18, 2019-07-20, 2019-07-22, 2019-07-24, 2019-08-05, 2019-08-19, 2019-08-22, 2019-10-14, 2019-10-18, 2019-10-30, 2019-11-28, 2020-07-28, 2021-01-21, 2021-02-02, 2021-05-03, 2021-07-29, 2023-05-18

Summary

This article shows how to construct a non-recursive zip bomb that achieves a high compression ratio by overlapping files inside the zip container. "Non-recursive" means that it does not rely on a decompressor recursively unpacking zip files nested within zip files: it expands fully after a single round of decompression. The output size increases quadratically in the input size, reaching a compression ratio of over 28 million (10 MB → 281 TB) at the limits of the zip format. Even greater expansion is possible using 64-bit extensions. The construction uses only the most common compression algorithm, DEFLATE, and is compatible with most zip parsers.

zbsm.zip — 42 kB → 5.5 GB
zblg.zip — 10 MB → 281 TB
zbxl.zip — 46 MB → 4.5 PB (Zip64, less compatible)

Source code: git clone https://www.bamsoftware.com/git/zipbomb.git (zipbomb-20210121.zip)
Data and source for figures: git clone https://www.bamsoftware.com/git/zipbomb-paper.git
Presentation video. Russian translation by @m1rko. Chinese translation by 北岸冷若冰霜.

There are two versions of 42.zip, an older version of 42 374 bytes and a newer version of 42 838 bytes. The difference is that the newer version requires a password before unzipping. We compare only against the older version. Here is a copy if you need it: 42.zip.

|                        | zipped size | unzipped size (non-recursive) | ratio        | unzipped size (recursive) | ratio        |
|------------------------|-------------|-------------------------------|--------------|---------------------------|--------------|
| Cox quine              | 440         | 440                           | 1.0          | ∞                         | ∞            |
| Ellingsen quine        | 28 809      | 42 569                        | 1.5          | ∞                         | ∞            |
| 42.zip                 | 42 374      | 558 432                       | 13.2         | 4 507 981 343 026 016     | 106 billion  |
| this technique         | 42 374      | 5 461 307 620                 | 129 thousand | 5 461 307 620             | 129 thousand |
| this technique         | 9 893 525   | 281 395 456 244 934           | 28 million   | 281 395 456 244 934       | 28 million   |
| this technique (Zip64) | 45 876 952  | 4 507 981 427 706 459         | 98 million   | 4 507 981 427 706 459     | 98 million   |

I would like to know/credit the maker of 42.zip but haven't been able to find a source—let me know if you have any info. There's a reference to 42.zip already in a vuln-dev post from 2001-06-19.

On 2023-05-16 there appeared https://42.zip/, using the then-new .zip TLD. The web server there naturally serves a copy of 42.zip. The Wayback Machine has a copy timestamped 2023-05-16 16:51:08. I made a copy here: 42.zip. This 42.zip is a little different from the one I compared against. Its total compressed size is 42 790 bytes rather than 42 374 bytes. I suspect it is not the original; the evidence is the timestamps. The timestamps increase as you go from the bottom level to the top, but in the 42 790-byte file, the top "lib" level jumps 8 hours backwards. In fact, it is exactly 8 hours behind the 42.zip I used, which makes me suspect that at some point someone uncompressed and re-compressed the original in a different time zone. By 2023-05-26 the zip bomb was gone and the server response had changed to just "hello".
| layer   | 42 374 bytes        | 42 790 bytes        |
|---------|---------------------|---------------------|
| lib     | 2000-03-28 21:40:54 | 2000-03-28 13:40:54 |
| book    | 2000-03-28 21:38:30 | 2000-03-28 21:38:30 |
| chapter | 2000-03-28 21:36:28 | 2000-03-28 21:36:28 |
| doc     | 2000-03-28 21:34:08 | 2000-03-28 21:34:08 |
| page    | 2000-03-28 19:49:08 | 2000-03-28 19:49:08 |
| 0.dll   | 2000-03-28 18:03:14 | 2000-03-28 18:03:14 |

Compression bombs that use the zip format must cope with the fact that DEFLATE, the compression algorithm most commonly supported by zip parsers, cannot achieve a compression ratio greater than 1032. For this reason, zip bombs typically rely on recursive decompression, nesting zip files within zip files to get an extra factor of 1032 with each layer. But the trick only works on implementations that unzip recursively, and most do not. The best-known zip bomb, 42.zip, expands to a formidable 4.5 PB if all six of its layers are recursively unzipped, but a trifling 0.6 MB at the top layer. Zip quines, like those of Ellingsen and Cox, which contain a copy of themselves and thus expand infinitely if recursively unzipped, are likewise perfectly safe to unzip once.

This article shows how to construct a non-recursive zip bomb whose compression ratio surpasses the DEFLATE limit of 1032. It works by overlapping files inside the zip container, in order to reference a "kernel" of highly compressed data in multiple files, without making multiple copies of it. The zip bomb's output size grows quadratically in the input size; i.e., the compression ratio gets better as the bomb gets bigger. The construction depends on features of both zip and DEFLATE—it is not directly portable to other file formats or compression algorithms. It is compatible with most zip parsers, the exceptions being "streaming" parsers that parse in one pass without first consulting the zip file's central directory.

We try to balance two conflicting goals:

Maximize the compression ratio.
We define the compression ratio as the sum of the sizes of all the files contained in the zip file, divided by the size of the zip file itself. It does not count filenames or other filesystem metadata, only contents.

Be compatible. Zip is a tricky format and parsers differ, especially around edge cases and optional features. Avoid taking advantage of tricks that only work with certain parsers. We will remark on certain ways to increase the efficiency of the zip bomb that come with some loss of compatibility.
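As a concrete illustration of the compression-ratio definition above, the ratio can be computed from a zip file's central directory alone, without decompressing anything. A small Python sketch (the helper name is mine, not from the zipbomb source):

```python
import os
import zipfile

def compression_ratio(path):
    # Sum of the uncompressed sizes recorded in the central directory,
    # divided by the size of the zip file itself. Filenames and other
    # metadata are not counted, only contents. Nothing is extracted.
    with zipfile.ZipFile(path) as z:
        total = sum(info.file_size for info in z.infolist())
    return total / os.path.getsize(path)
```

Because it reads only metadata, this is safe to run even on a zip bomb: the file_size fields are the claimed uncompressed sizes, so no expansion takes place.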

The first insight: overlapping files

By compressing a long string of repeated bytes, we can produce a kernel of highly compressed data. By itself, the kernel's compression ratio cannot exceed the DEFLATE limit of 1032, so we want a way to reuse the kernel in many files, without making a separate copy of it in each file. We can do it by overlapping files: making many central directory headers point to a single file, whose data is the kernel.

Let's look at an example to see how this construction affects the compression ratio. Suppose the kernel is 1000 bytes and decompresses to 1 MB. Then the first MB of output "costs" 1078 bytes of input:

31 bytes for a local file header (including a 1-byte filename)
47 bytes for a central directory header (including a 1-byte filename)
1000 bytes for the kernel itself

But every 1 MB of output after the first costs only 47 bytes—we don't need another local file header or another copy of the kernel, only an additional central directory header. So while the first reference of the kernel has a compression ratio of 1 000 000 / 1078 ≈ 928, each additional reference pulls the ratio closer to 1 000 000 / 47 ≈ 21 277. A bigger kernel raises the ceiling.

The problem with this idea is a lack of compatibility. Because many central directory headers point to a single local file header, the metadata—specifically the filename—cannot match for every file. Some parsers balk at that. Info-ZIP UnZip (the standard Unix unzip program) extracts the files, but with warnings:

$ unzip overlap.zip
  inflating: A
B:  mismatching "local" filename (A),
         continuing with "central" filename version
  inflating: B
...

And the Python zipfile module throws an exception:

$ python3 -m zipfile -e overlap.zip .
Traceback (most recent call last):
...
__main__.BadZipFile: File name in directory 'B' and header b'A' differ.

Next we will see how to modify the construction for consistency of filenames, while still retaining most of the advantage of overlapping files.
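The cost arithmetic above generalizes to any number of central directory references. A quick model in Python, using the same header sizes as the example (1-byte filenames); this is my own sketch, not code from the article's repository:

```python
def overlap_ratio(n, kernel_in=1000, kernel_out=10**6, lfh=31, cdh=47):
    # Full-overlap construction: one local file header and one copy of
    # the kernel shared by all n files, plus one central directory
    # header per file.
    zipped = lfh + kernel_in + n * cdh
    unzipped = n * kernel_out
    return unzipped / zipped
```

overlap_ratio(1) reproduces the ≈928 figure, and as n grows the ratio approaches 1 000 000 / 47 ≈ 21 277 from below.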

The second insight: quoting local file headers

We need to separate the local file headers for each file, while still reusing a single kernel. Simply concatenating all the local file headers does not work, because the zip parser will find a local file header where it expects to find the beginning of a DEFLATE stream. But the idea will work, with a minor modification. We'll use a feature of DEFLATE, non-compressed blocks, to "quote" local file headers so that they appear to be part of the same DEFLATE stream that terminates in the kernel. Every local file header (except the first) will be interpreted in two ways: as code (part of the structure of the zip file) and as data (part of the contents of a file).

A DEFLATE stream is a sequence of blocks, where each block may be compressed or non-compressed. Compressed blocks are what we usually think of; for example, the kernel is one big compressed block. But there are also non-compressed blocks, which start with a 5-byte header with a length field that means simply, "output the next n bytes verbatim." Decompressing a non-compressed block means only stripping the 5-byte header. Compressed and non-compressed blocks may be intermixed freely in a DEFLATE stream. The output is the concatenation of decompressing all the blocks in order. The "non-compressed" notion only has meaning at the DEFLATE layer; the file data still counts as "compressed" at the zip layer, no matter what kind of blocks are used.

It is easiest to understand this quoted-overlap construction from the inside out, beginning with the last file and working backwards to the first. Start by inserting the kernel, which will form the end of file data for every file. Prepend a local file header LFH N and add a central directory header CDH N that points to it. Set the "compressed size" metadata field in the LFH N and CDH N to the compressed size of the kernel.
Now prepend a 5-byte non-compressed block header (colored green in the diagram) whose length field is equal to the size of LFH N. Prepend a second local file header LFH N−1 and add a central directory header CDH N−1 that points to it. Set the "compressed size" metadata field in both of the new headers to the compressed size of the kernel plus the size of the non-compressed block header (5 bytes) plus the size of LFH N.

2019-08-22: There's an additional minor optimization possible that I didn't originally think of. Instead of only quoting the immediately following local file header, quote as many local file headers as possible—including their own quoting blocks—up to the limit of 65535 bytes per non-compressed block. The advantage is that the quoting blocks between local file headers now additionally become part of the output file, gaining 5 bytes of output for each one we manage to include. It's a small optimization, gaining only 154 380 bytes in zbsm.zip, or 0.003%. (Far less than extra-field quoting gains.) The --giant-steps option in the source code activates this feature. The giant-steps feature only pays when you are not constrained by maximum output file size. In zblg.zip, we actually want to slow file growth as much as possible so that the smallest file, containing the kernel, can be as large as possible. Using giant steps in zblg.zip actually decreases the compression ratio. I credit Kevin Farrow for sparking the idea for this enhancement during a dc303 talk. Carlos Javier González Cortés (Lethani) also hit on the idea in his article (in Spanish) on overlapped zip bombs.

At this point the zip file contains two files, named "Y" and "Z". Let's walk through what a zip parser would see while parsing it. Suppose the compressed size of the kernel is 1000 bytes and the size of LFH N is 31 bytes. We start at CDH N−1 and follow the pointer to LFH N−1. The first file's filename is "Y" and the compressed size of its file data is 1036 bytes.
Interpreting the next 1036 bytes as a DEFLATE stream, we first encounter the 5-byte header of a non-compressed block that says to copy the next 31 bytes. The next 31 bytes are LFH N, which we copy verbatim and append to file "Y". Moving on in the DEFLATE stream, we find a compressed block (the kernel), which we decompress and append to file "Y". Now we have reached the end of the compressed data and are done with file "Y".

Proceeding to the next file, we follow the pointer from CDH N to LFH N and find a file named "Z" whose compressed size is 1000 bytes. Interpreting those 1000 bytes as a DEFLATE stream, we immediately encounter a compressed block (the kernel again) and decompress it to the file "Z". Now we have reached the end of the final file and are done. The output file "Z" contains the decompressed kernel; the output file "Y" is the same, but additionally prefixed by the 31 bytes of LFH N.

We complete the construction by repeating the quoting procedure until the zip file contains the desired number of files. Each new file adds a central directory header, a local file header, and a non-compressed block to quote the immediately succeeding local file header. Compressed file data is generally a chain of DEFLATE non-compressed blocks (the quoted local file headers) followed by the compressed kernel. Each byte in the kernel contributes about 1032 N to the output size, because each byte is part of all N files. The output files are not all the same size: those that appear earlier in the zip file are larger than those that appear later, because they contain more quoted local file headers. The contents of the output files are not particularly meaningful, but no one said they had to make sense.

This quoted-overlap construction has better compatibility than the full-overlap construction of the previous section, but the compatibility comes at the expense of the compression ratio. There, each added file cost only a central directory header; here, it costs a central directory header, a local file header, and another 5 bytes for the quoting header.
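The quoting mechanism itself can be demonstrated with zlib's raw DEFLATE mode. In this sketch (the helper name and the stand-in header bytes are mine), arbitrary bytes are wrapped in a non-compressed block, and prepending that block to a compressed stream makes the quoted bytes part of the decompressed output:

```python
import struct
import zlib

def quote(data):
    # 5-byte header of a DEFLATE non-compressed block: a flags byte
    # (BFINAL=0, BTYPE=00), then the 16-bit length LEN and its one's
    # complement NLEN, both little-endian. Stored blocks are
    # byte-aligned, so writing the 3 header bits as a whole byte works.
    assert len(data) <= 0xffff
    return struct.pack("<BHH", 0, len(data), len(data) ^ 0xffff) + data

# Stand-in kernel: a raw DEFLATE stream (wbits=-15) of repeated bytes.
c = zlib.compressobj(9, zlib.DEFLATED, -15)
kernel = c.compress(b"a" * 1000) + c.flush()

# Stand-in for a 31-byte local file header being quoted.
lfh = b"PK\x03\x04" + b"\x00" * 27

out = zlib.decompressobj(-15).decompress(quote(lfh) + kernel)
assert out == lfh + b"a" * 1000  # quoted header first, then the kernel's output
```

In the real construction the quoted bytes are the actual local file headers, and each file's "compressed size" field counts the quoting headers and quoted headers as part of its file data.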

Optimization

Now that we have the basic zip bomb construction, we will try to make it as efficient as possible. We want to answer two questions:

For a given zip file size, what is the maximum compression ratio?

What is the maximum compression ratio, given the limits of the zip format?

Kernel compression

It pays to compress the kernel as densely as possible, because every decompressed byte gets magnified by a factor of N. To that end, we use a custom DEFLATE compressor called bulk_deflate, specialized for compressing a string of repeated bytes. All decent DEFLATE compressors will approach a compression ratio of 1032 when given an infinite stream of repeating bytes, but we care more about specific finite sizes than asymptotics. bulk_deflate compresses more data into the same space than the general-purpose compressors: about 26 kB more than zlib and Info-ZIP, and about 15 kB more than Zopfli, a compressor that trades speed for density. The price of bulk_deflate's high compression ratio is a lack of generality. bulk_deflate can only compress strings of a single repeated byte, and only those of specific lengths, namely 517 + 258 k for integer k ≥ 0. Besides compressing densely, bulk_deflate is fast, doing essentially constant work regardless of the input size, aside from the work of actually writing out the compressed string.

Filenames

Every byte spent on a filename is 2 bytes not spent on the kernel. (2 because each filename appears twice: in the central directory header and in the local file header.) A filename byte results in, on average, only (N + 1) / 4 bytes of output, while a byte in the kernel counts for 1032 N. For our purposes, filenames are mostly dead weight. While filenames do contribute something to the output size by virtue of being part of quoted local file headers, a byte in a filename does not contribute nearly as much as a byte in the kernel. We want filenames to be as short as possible, while keeping them all distinct, subject to compatibility considerations. The first compatibility consideration is character encoding.
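Before turning to filename encoding, the 1032 asymptote from the kernel-compression discussion is easy to observe with zlib. This is a sketch for intuition only; the article's bulk_deflate is a custom encoder, not zlib:

```python
import zlib

# Compress a run of one million identical bytes as a raw DEFLATE
# stream (negative wbits means no zlib header or trailer).
c = zlib.compressobj(9, zlib.DEFLATED, -15)
compressed = c.compress(b"a" * 10**6) + c.flush()

ratio = 10**6 / len(compressed)
# The ratio lands close to, but can never exceed, the DEFLATE limit of 1032.
assert ratio <= 1032
```

A dedicated encoder like bulk_deflate closes most of the remaining gap to 1032 at each specific finite size, which is exactly what matters for a fixed-size kernel.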
The zip format specification states that filenames are to be interpreted as CP 437, or UTF-8 if a certain flag bit is set (APPNOTE.TXT Appendix D). But this is a major point of incompatibility across zip parsers, which may interpret filenames as being in some fixed or locale-specific encoding. So for compatibility, we must limit ourselves to characters that have the same encoding in both CP 437 and UTF-8; namely, the 95 printable characters of US-ASCII. One thing I didn't consider is Windows reserved filenames like "PRN" and "NUL".

We are further restricted by filesystem naming limitations. Some filesystems are case-insensitive, so "a" and "A" do not count as distinct names. Common filesystems like FAT32 prohibit certain characters like '*' and '?'. As a safe but not necessarily optimal compromise, our zip bomb will use filenames consisting of characters drawn from a 36-character alphabet that does not rely on case distinctions or use special characters:

0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Filenames are generated in the obvious way, cycling each position through the possible characters and adding a position on overflow: "0", "1", "2", …, "Z",
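That naming scheme ("0" through "Z", then two characters, and so on) amounts to counting in bijective base 36. A sketch of a generator (the function name is mine, not from the zipbomb source):

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def filename(n):
    # n-th name in the scheme above: cycle the last position through
    # the 36-character alphabet and add a position on overflow, so the
    # sequence runs "0"..."Z", "00", "01", ..., "ZZ", "000", ...
    s = ""
    n += 1
    while n > 0:
        n -= 1
        s = ALPHABET[n % 36] + s
        n //= 36
    return s
```

All names are distinct, use no case distinctions or special characters, and grow only logarithmically in the number of files, keeping filename overhead small.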
