Tech News

AIs can generate near-verbatim copies of novels from training data


The world’s top AI models can be prompted to generate near-verbatim copies of bestselling novels, raising fresh questions about the industry’s claim that its systems do not store copyrighted works.

A series of recent studies has shown that large language models from OpenAI, Google, Meta, Anthropic, and xAI memorize far more of their training data than previously thought.

AI and legal experts told the FT this “memorization” ability could have serious ramifications for AI groups’ battle against dozens of copyright lawsuits around the world, as it undermines their core defense that LLMs “learn” from copyrighted works but do not store copies.

“There’s growing evidence that memorization is a bigger thing than previously believed,” said Yves-Alexandre de Montjoye, a professor of applied mathematics and computer science at Imperial College London.

AI groups have long argued that memorization does not happen. In a 2023 letter to the US Copyright Office, Google said “there is no copy of the training data—whether text, images, or other formats—present in the model itself.”

The AI industry also claims that training models on copyrighted books is “fair use,” arguing that the technology transforms the original work into something meaningfully new.

But in a study published last month, researchers at Stanford and Yale universities strategically prompted LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.

When asked to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.