Lossy compression always changes data permanently. Common formats (JPG, MP3, MP4) make their changes slowly and gently: it usually takes hundreds of cycles of saving, sharing, and re-uploading before the tool marks, called compression artifacts, become apparent. Re-save a JPG enough times and it goes blocky and washed out; re-encode an MP3 enough times and metallic tones bleed through the music; re-upload a YouTube video a thousand times and you end up with a blobby mess over unintelligible audio.
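You can watch those tool marks accumulate yourself. Here's a minimal sketch of generation loss in Python, assuming Pillow is installed; the file names, quality setting, and cycle count are all illustrative. The tiny resize each cycle stands in for the cropping and rescaling that real sharing pipelines apply, which is what keeps the artifacts compounding rather than settling.

```python
# A minimal sketch of JPEG generation loss. Assumes Pillow (pip install Pillow);
# the file names, quality setting, and cycle count are illustrative.
from PIL import Image

img = Image.open("original.png").convert("RGB")
for _ in range(100):
    img.save("cycle.jpg", quality=75)             # lossy encode
    img = Image.open("cycle.jpg").convert("RGB")  # decode what survived
    # A one-pixel resize each cycle mimics real-world sharing (screenshots,
    # re-crops, re-uploads); shifting the 8x8 block grid keeps the artifacts
    # compounding instead of converging to a stable image.
    w, h = img.size
    img = img.resize((w - 1, h - 1), Image.LANCZOS).resize((w, h), Image.LANCZOS)

img.save("after_100_cycles.jpg", quality=75)
```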
A poorly designed codec can go catastrophically wrong. In 2013, David Kriesel scanned a building floor plan on a Xerox WorkCentre and noticed that a room marked 21.11 m² had become 14.13 m². Xerox’s implementation of the JBIG2 compression format saves space by quilting scans together from common, repeated elements; in Kriesel’s scan, it had silently replaced the original numbers with ones from another part of the document it deemed visually similar enough. After Kriesel published, reports surfaced of the same silent substitution affecting building plans, invoices, and medical records.
From the very first studies of compression (at Bell Labs in the 1940s), researchers knew they’d have to accept a tradeoff: you can achieve smaller file sizes if you’re willing to accept some loss of the original data. This seems counterproductive, since the whole idea is to reproduce the data, but scientists found ways to discard only the information that is imperceptible to humans.
The information age has been defined by bandwidth. The internet is limited by how much data we can squeeze into the narrow pipes of transmission infrastructure. So we invented compression: ways of representing the same object — a website, a picture, a song, a movie — within ever smaller digital footprints. YouTube, Spotify, Instagram, and the algorithms that make them work wouldn’t be possible without it.
If you know what to look for and how to look for it, you can learn a lot about the path that data took to get to you. That’s because compression artifacts are themselves meta-information; you can learn something new about a document by identifying and cataloging its algorithm-induced flaws. Digital forensics uses this meta-information to explore the provenance of documents, photos, and videos. Compression leaves breadcrumbs that betray whether or not a document has been edited (and often who, or what, edited it).
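One classic trick from that forensic toolbox is error level analysis: recompress a suspect image at a known quality and see which regions change. What follows is a minimal sketch of the idea, not any particular tool's implementation; it assumes Pillow, and the file names and quality setting are illustrative.

```python
# A minimal sketch of error level analysis (ELA). Assumes Pillow;
# file names and the quality setting are illustrative.
from PIL import Image, ImageChops

suspect = Image.open("suspect.jpg").convert("RGB")
suspect.save("resaved.jpg", quality=90)  # recompress at a known quality
resaved = Image.open("resaved.jpg").convert("RGB")

# Regions that already went through JPEG once change very little when
# re-saved; regions pasted in later (or never JPEG-compressed) change more.
diff = ImageChops.difference(suspect, resaved)

# Stretch the differences so they're visible: bright patches in the output
# map hint at regions with a different compression history.
peak = max(channel_max for _, channel_max in diff.getextrema())
scale = 255.0 / max(peak, 1)
ela = diff.point(lambda value: min(255, int(value * scale)))
ela.save("ela_map.png")
```

Real forensic suites layer more on top, such as quantization-table fingerprinting and block-grid alignment checks, but the principle is the same: the encoder's scars record the document's history.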
Compression artifacts can even become an aesthetic of their own. Deep-fried memes dress images up in the aesthetics of pictures that have been shared and re-shared thousands of times. Datamoshing manipulates compression algorithms to create entirely new video aesthetics. Glitch music stretches and squashes audio files, making the tool marks of audio compression audible and even musical.
Compression has spawned entire fields of art and science (and jokes) all in service of the ideal compromise between fidelity and file size.
Three years ago, Ted Chiang described ChatGPT as a blurry JPEG of the web. LLMs are a lossy compression of their training data, which is itself a lossy sample of all the data available. But the artifacts we see in AI slop aren’t in the compression. They’re in the decompression.
Every AI-generated output is an extrapolation from that blurry source, vectored toward your prompt, filling in plausible detail where the compression threw information away. The output gets inflated into blog posts and LinkedIn thoughtspam, software platforms, omnichannel advertising campaigns, and movie cameos from dead actors. Chiang compared the gaps and confabulations to compression artifacts.
I think they’re expansion artifacts.