Wikipedia Is Making a Dataset for Training AI Because It’s Overwhelmed by Bots
Published on: 2025-08-22 08:15:35
It seems that AI developers have essentially blackmailed Wikipedia into offering up its data for training. On Wednesday, the Wikimedia Foundation announced it is partnering with Google-owned Kaggle, a popular data science community platform, to release a version of Wikipedia optimized for training AI models. Starting with English and French, the foundation will offer stripped-down versions of raw Wikipedia text, excluding any references or markdown code.
As a non-profit, volunteer-led platform, Wikipedia is funded through donations and does not own the content it hosts, so anyone can use and remix that content. It is fine with other organizations using its vast corpus of knowledge for all sorts of purposes. Kiwix, for example, is an offline version of Wikipedia that has been used to smuggle information into North Korea.
But a flood of bots constantly trawling its website for AI training needs has led to a surge in non-human traffic to Wikipedia, something it was intere