Two authors accuse Apple of illegally training AI models on pirated books

A proposed class action lawsuit was filed today in federal court in Northern California, accusing Apple of illegally using pirated books to train its AI models. Here are the details.

Authors base the accusation on Apple’s own documents

As reported by Reuters, authors Grady Hendrix and Jennifer Robertson are accusing Apple of using a pirated dataset that included their work. From the lawsuit:

“But Apple is building part of this new enterprise using Books3, a dataset of pirated copyrighted books that includes the published works of Plaintiffs and the Class. Apple used Books3 to train its OpenELM language models. Apple also likely trained its Foundation Language Models using this same pirated dataset.”

The accusation is based on details Apple provided in its paper about OpenELM, an open-source model the company made available on Hugging Face last year.

The paper lists RedPajama as one of the datasets used to train the model. RedPajama, in turn, includes a dataset called Books3, which, according to the lawsuit, is “a known body of pirated books.”

The authors are asking the court to allow the lawsuit to proceed as a class action against Apple, and they are seeking the following remedies after a jury trial:

Allowing this action to proceed as a class action, with Plaintiffs serving as Class Representatives, and with Plaintiffs’ counsel as Class Counsel;

Awarding Plaintiffs and the Class statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity;

Permanently enjoining Defendant from the unlawful, unfair, and infringing conduct alleged herein;
