Former Cloudflare executive John Graham-Cumming recently announced that he launched a website, lowbackgroundsteel.ai, that treats pre-AI, human-created content like a precious commodity—a time capsule of organic creative expression from a time before machines joined the conversation. "The idea is to point to sources of text, images and video that were created prior to the explosion of AI-generated content," Graham-Cumming wrote on his blog last week. The reason? To preserve what made non-AI media uniquely human.
The archive's name comes from a Cold War-era scientific phenomenon. After nuclear weapons testing began in 1945, radioactive fallout in the atmosphere contaminated newly produced steel worldwide. For decades, scientists who needed radiation-free metal for sensitive instruments had to salvage steel from pre-war shipwrecks, which they called "low-background steel." Graham-Cumming sees a parallel with today's web, where AI-generated content increasingly mingles with human-created material and contaminates it.
With the advent of generative AI models like ChatGPT and Stable Diffusion in 2022, it has become far more difficult for researchers to ensure that media found on the Internet was created by humans without using AI tools. ChatGPT in particular triggered an avalanche of AI-generated text across the web, forcing at least one research project to shut down entirely.
That casualty was wordfreq, a Python library created by researcher Robyn Speer that tracked word frequencies across more than 40 languages by analyzing millions of sources, including Wikipedia, movie subtitles, news articles, and social media. The tool was widely used by academics and developers to study how language evolves and to build natural language processing applications. The project announced in September 2024 that it would no longer be updated because "the Web at large is full of slop generated by large language models, written by no one to communicate nothing."
Some researchers also worry about AI models training on their own outputs, potentially leading to quality degradation over time—a phenomenon sometimes called "model collapse." But recent evidence suggests this fear may be overblown under certain conditions. Research by Gerstgrasser et al. (2024) suggests that model collapse can be avoided when synthetic data accumulates alongside real data, rather than replacing it entirely. In fact, when properly curated and combined with real data, synthetic data from AI models can actually assist with training newer, more capable models.
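To make the replace-versus-accumulate distinction concrete, here is a toy simulation. It is only a sketch under simplifying assumptions and not the setup used by Gerstgrasser et al. The "model" is a one-dimensional Gaussian that is refit each generation, and the function names (`fit_gaussian`, `simulate`) and parameter values are hypothetical, chosen for illustration. In "replace" mode, each generation trains only on the previous generation's synthetic samples. In "accumulate" mode, synthetic samples are added to everything seen so far, including the original real data.

```python
import math
import random


def fit_gaussian(data):
    """Return the sample mean and sample standard deviation of data."""
    n = len(data)
    mu = math.fsum(data) / n
    var = math.fsum((x - mu) ** 2 for x in data) / (n - 1)
    return mu, math.sqrt(var)


def simulate(accumulate, generations=500, n=10, seed=0):
    """Refit a Gaussian each generation and draw n synthetic points from it.

    accumulate=False: each generation trains only on the previous
    generation's synthetic samples (the real data is replaced).
    accumulate=True: synthetic samples are appended to all earlier data,
    including the original real samples.
    Returns the fitted standard deviation after the final generation.
    """
    rng = random.Random(seed)
    # Stand-in for "real" human-made data: draws from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        mu, sigma = fit_gaussian(data)
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        data = data + synthetic if accumulate else synthetic
    return fit_gaussian(data)[1]


if __name__ == "__main__":
    print("replace:   ", simulate(accumulate=False))
    print("accumulate:", simulate(accumulate=True))
```

In this toy setting, the fitted spread in replace mode tends to shrink toward zero over many generations, a simple analogue of a model losing the diversity of its original data. In accumulate mode, it tends to stay near the spread of the original samples. Real language and image models are far more complex, so this illustrates the intuition and nothing more.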