Tech News
← Back to articles

ChatGPT Has Already Polluted the Internet So Badly That It's Hobbling Future AI Development

read original related products more articles

The rapid rise of ChatGPT — and the cavalcade of competitors' generative models that followed suit — has polluted the internet with so much useless slop that it's already kneecapping the development of future AI models.

As the AI-generated data clouds the human creations that these models are so heavily dependent on amalgamating, it becomes inevitable that a greater share of what these so-called intelligences learn from and imitate is itself an ersatz AI creation.

Repeat this process enough, and AI development begins to resemble a maximalist game of telephone in which not only is the quality of the content being produced diminished, resembling less and less what it's originally supposed to be replacing, but in which the participants actively become stupider. The industry likes to describe this scenario as AI "model collapse."

As a consequence, the finite amount of data predating ChatGPT's rise becomes extremely valuable. In a new feature, The Register likens this to the demand for "low-background steel," or steel that was produced before the detonation of the first nuclear bombs, starting in July 1945 with the US's Trinity test.

Just as the explosion of AI chatbots has irreversibly polluted the internet, so did the detonation of the atom bomb release radionuclides and other particulates that have seeped into virtually all steel produced thereafter. That makes modern metals unsuitable for use in some highly sensitive scientific and medical equipment. And so, what's old is new: a major source of low-background steel, even today, is WW1 and WW2 era battleships, including a huge naval fleet that was scuttled by German Admiral Ludwig von Reuter in 1919.

Maurice Chiodo, a research associate at the Centre for the Study of Existential Risk at the University of Cambridge called the admiral's actions the "greatest contribution to nuclear medicine in the world."

"That enabled us to have this almost infinite supply of low-background steel. If it weren't for that, we'd be kind of stuck," he told The Register. "So the analogy works here because you need something that happened before a certain date."

"But if you're collecting data before 2022 you're fairly confident that it has minimal, if any, contamination from generative AI," he added. "Everything before the date is 'safe, fine, clean,' everything after that is 'dirty.'"

In 2024, Chiodo co-authored a paper arguing that there needs to be a source of "clean" data not only to stave off model collapse, but to ensure fair competition between AI developers. Otherwise, the early pioneers of the tech, after ruining the internet for everyone else with their AI's refuse, would boast a massive advantage by being the only ones that benefited from a purer source of training data.

Whether model collapse, particularly as a result of contaminated data, is an imminent threat is a matter of some debate. But many researchers have been sounding the alarm for years now, including Chiodo.

... continue reading