OpenAI’s models ‘memorized’ copyrighted content, new study suggests
Published on: 2025-05-13 02:42:11
A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.
OpenAI is embroiled in suits brought by authors, programmers, and other rights-holders who accuse the company of using their works — books, codebases, and so on — to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there isn’t a carve-out in U.S. copyright law for training data.
The study, which was co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data “memorized” by models behind an API, like OpenAI’s.
Models are prediction engines. Trained on a lot of data, they learn patterns — that’s how they’re able to generate essays, photos, and more. Most of the outputs aren’t verbatim copies of the training data, but owing to the way models “learn,” some inevitably are. Image models have
...