
Elsevier vs Meta: first science publisher sues over scraped research papers

Why This Matters

Elsevier's lawsuit against Meta marks a significant moment in the ongoing debate over copyright and AI training data, highlighting the legal and ethical challenges faced by AI companies that use proprietary content. The case could set important precedents for how copyrighted works may be used in AI development, affecting both the tech industry and content creators, and it underscores the need for clearer regulations and fair-use standards in AI training.


Elsevier is one of several publishers alleging that their copyrighted works were used to train AI models. Credit: Kristoffer Tripplaar/Alamy

A scientific publisher has joined the dozens of firms and individuals suing artificial intelligence companies over their alleged use of copyrighted works in training AI models.

Elsevier — which publishes thousands of journals, including Cell and The Lancet — was part of a class-action lawsuit filed on 5 May against technology company Meta and its chief executive Mark Zuckerberg in the Southern District of New York. Also named as plaintiffs on the lawsuit are book-publishing giants Hachette and Macmillan, and the US fiction author and lawyer Scott Turow. The publishers allege that Meta obtained and reproduced copyrighted works in developing its large language model (LLM) Llama.

“This case is the first AI action brought by major publishing houses, who have their own story to tell about Meta’s flagrant violation of their rights,” said the Association of American Publishers, in a statement.


The case mirrors those of authors and media companies — including The New York Times — suing AI firms on similar grounds. Some cases have been settled but, overall, they have yet to establish a clear precedent on whether it is legal to use copyrighted works to train an LLM. A Meta spokesperson has said the company would “fight this lawsuit aggressively”.

Although AI firms are cagey about their training data, it is widely assumed that paywalled research papers, as well as open-access ones, formed part of the billions of web pages that models were trained on.

Training data

The lawsuit alleges that, to train Llama, Meta used the Common Crawl data set, a sample of billions of web pages compiled by trawling the Internet, which the plaintiffs say is likely to have included unauthorized copies of copyrighted works, such as scientific abstracts and paywalled papers.

The publishers also allege that Meta downloaded and torrented (obtained through a peer-to-peer file-sharing method) works from sites including LibGen, a database of books, research papers and textbooks, and Sci-Hub, a repository that gives free access to millions of research articles and books regardless of copyright. Both sites have been the subject of legal challenges. Much of the evidence relies on e-mails between Meta employees that were revealed during a separate case, Kadrey v. Meta, in which several book authors sued Meta last year.
