In the past week, big AI companies have — in theory — chalked up two major legal wins. But things are not quite as straightforward as they may seem, and copyright law hasn’t been this exciting since last month’s showdown at the Library of Congress.
First, Judge William Alsup ruled it was fair use for Anthropic to train on a series of authors’ books. Then, Judge Vince Chhabria dismissed another group of authors’ complaint against Meta for training on their books. Yet far from settling the legal conundrums around modern AI, these rulings might have just made things even more complicated.
Both cases are indeed qualified victories for Meta and Anthropic. And at least one judge — Alsup — seems sympathetic to some of the AI industry’s core arguments about copyright. But that same ruling railed against the startup’s use of pirated media, leaving it potentially on the hook for massive financial damages. (Anthropic even admitted it did not initially purchase a copy of every book it used.) Meanwhile, the Meta ruling asserted that because a flood of AI content could crowd out human artists, the entire field of AI system training might be fundamentally at odds with fair use. And neither case addressed one of the biggest questions about generative AI: when does its output infringe copyright, and who’s on the hook if it does?
Alsup and Chhabria (incidentally both in the Northern District of California) were ruling on relatively similar sets of facts. Meta and Anthropic both pirated huge collections of copyright-protected books to build training datasets for their respective large language models, Llama and Claude. Anthropic later did an about-face and started legally purchasing books, tearing the covers off to “destroy” the original copy, and scanning the text.
The authors argued that, in addition to the initial piracy, the training process constituted an unlawful and unauthorized use of their work. Meta and Anthropic countered that building these datasets and training their LLMs was fair use.
Both judges basically agreed that LLMs meet one central requirement for fair use: they transform the source material into something new. Alsup called using books to train Claude “exceedingly transformative,” and Chhabria concluded “there’s no disputing” the transformative value of Llama. Another big consideration for fair use is the new work’s impact on a market for the old one. Both judges also agreed that, based on the arguments the authors made, the impact wasn’t serious enough to tip the scales.
Add those things together, and the conclusions were obvious… but only in the context of these cases, and, in Meta’s case, only because the authors pushed a legal strategy that the judge found totally inept.
Put it this way: when a judge says his ruling “does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful” and “stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one” — as Chhabria did — AI companies’ prospects in future lawsuits before him don’t look great.
Both rulings dealt specifically with training — or media getting fed into the models — and didn’t reach the question of LLM output, or the stuff models produce in response to user prompts. But output is, in fact, extremely pertinent. A huge legal fight between The New York Times and OpenAI began partly with a claim that ChatGPT could regurgitate large sections of Times stories verbatim. Disney recently sued Midjourney on the premise that it “will generate, publicly display, and distribute videos featuring Disney’s and Universal’s copyrighted characters” with a newly launched video tool. Even in pending cases that weren’t output-focused, plaintiffs can adapt their strategies if they now think an output-focused approach is a better bet.
The authors in the Anthropic case didn’t allege Claude was producing directly infringing output. The authors in the Meta case argued Llama was, but they failed to convince the judge — who found it wouldn’t spit out more than around 50 words of any given work. As Alsup noted, dealing purely with inputs changed the calculations dramatically. “If the outputs seen by users had been infringing, Authors would have a different case,” wrote Alsup. “And, if the outputs were ever to become infringing, Authors could bring such a case. But that is not this case.”