
AI and Copyright: Expanding Copyright Hurts Everyone–Here's What to Do Instead


You shouldn't need a permission slip to read a webpage–whether you do it with your own eyes, or use software to help. AI is a category of general-purpose tools with myriad beneficial uses. Requiring developers to license the materials needed to create this technology threatens the development of more innovative and inclusive AI models, as well as important uses of AI as a tool for expression and scientific research.

Threats to Socially Valuable Research and Innovation

Requiring researchers to license fair uses of AI training data could make socially valuable research based on machine learning (ML) and even text and data mining (TDM) prohibitively complicated and expensive, if not impossible. Researchers have relied on fair use to conduct TDM research for a decade, leading to important advancements in myriad fields. However, licensing the vast quantity of works that high-quality TDM research requires is frequently cost-prohibitive and practically infeasible.

Fair use protects ML and TDM research for good reason. Without fair use, copyright would hinder important scientific advancements that benefit all of us. Empirical studies back this up: research using TDM methodologies is more common in countries that protect TDM research from copyright control; in countries that don't, copyright restrictions stymie beneficial research. It's easy to see why: it would be impossible to identify and negotiate with millions of different copyright owners to analyze, say, text from the internet.
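
To make the scale concrete, here is a minimal, purely illustrative sketch of the simplest kind of TDM: counting term frequencies across a corpus. The three-document corpus is a hypothetical placeholder; real TDM research runs analyses like this over millions of works, which is exactly why negotiating per-work licenses is infeasible.

```python
# Minimal TDM sketch, for illustration only. The corpus below is a
# hypothetical stand-in; real text and data mining runs analyses like
# this over millions of documents.
from collections import Counter
import re

corpus = [
    "Machine learning accelerates drug discovery.",
    "Copyright law shapes how researchers share data.",
    "Researchers mine text to track how language about discovery changes.",
]

def term_frequencies(documents):
    """Tokenize each document and tally word counts across the whole corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

print(term_frequencies(corpus).most_common(5))
```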

The stakes are high, because ML is critical to helping us interpret the world around us. It's being used by researchers to understand everything from space nebulae to the proteins in our bodies. When the task requires crunching a huge amount of data, such as the data generated by the world’s telescopes, ML helps rapidly sift through the information to identify features of potential interest to researchers. For example, scientists are using AlphaFold, a deep learning tool, to understand biological processes and develop drugs that target disease-causing malfunctions in those processes. The developers released an open-source version of AlphaFold, making it available to researchers around the world. Other developers have already iterated upon AlphaFold to build transformative new tools.
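
To illustrate that sifting pattern (this is a toy sketch, not any observatory's actual pipeline), the snippet below uses scikit-learn's IsolationForest to flag rare records in a synthetic dataset; the data, model choice, and contamination rate are all assumptions for demonstration.

```python
# Toy example of ML-based sifting: train on bulk measurements, then flag
# the rare records worth a researcher's attention. All data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
routine = rng.normal(loc=0.0, scale=1.0, size=(10_000, 4))  # ordinary readings
unusual = rng.normal(loc=6.0, scale=1.0, size=(10, 4))      # injected oddities
readings = np.vstack([routine, unusual])

model = IsolationForest(contamination=0.005, random_state=0).fit(readings)
flags = model.predict(readings)  # returns -1 for anomalies, 1 for inliers
print(f"{(flags == -1).sum()} of {len(readings)} records flagged for review")
```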

Threats to Competition

Requiring AI developers to get authorization from rightsholders before training models on copyrighted works would limit competition to companies that have their own trove of training data, or the means to strike a deal with such a company. This would result in all the usual harms of limited competition—higher costs, worse service, and heightened security risks—as well as reducing the variety of expression used to train such tools and the expression allowed to users seeking to express themselves with the aid of AI. As the Federal Trade Commission recently explained, if a handful of companies control AI training data, “they may be able to leverage their control to dampen or distort competition in generative AI markets” and “wield outsized influence over a significant swath of economic activity.”

Legacy gatekeepers have already used copyright to stifle access to information and the creation of new tools for understanding it. Consider, for example, Thomson Reuters v. Ross Intelligence, widely considered to be the first lawsuit over AI training rights ever filed. Ross Intelligence sought to disrupt the legal research duopoly of Westlaw and LexisNexis by offering a new AI-based system. The startup attempted to license the right to train its model on Westlaw’s summaries of public domain judicial opinions and its method for organizing cases. Westlaw refused to grant the license and sued its tiny rival for copyright infringement. Ultimately, the lawsuit forced the startup out of business, eliminating a would-be competitor that might have helped increase access to the law.

Similarly, shortly after Getty Images—a billion-dollar stock images company that owns hundreds of millions of images—filed a copyright lawsuit asking the court to order the “destruction” of Stable Diffusion over purported copyright violations in the training process, Getty introduced its own AI image generator trained on its own library of images.

Requiring developers to license AI training materials benefits tech monopolists as well. For giant tech companies that can afford to pay, pricey licensing deals offer a way to lock in their dominant positions in the generative AI market by creating prohibitive barriers to entry. To develop a “foundation model” that can be used to build generative AI systems like ChatGPT and Stable Diffusion, developers need to “train” the model on billions or even trillions of works, often copied from the open internet without permission from copyright holders. There’s no feasible way to identify all of those rightsholders—let alone execute deals with each of them. Even if these deals were possible, licensing that much content at the prices developers are currently paying would be prohibitively expensive for most would-be competitors.
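
As a purely hypothetical back-of-envelope illustration (neither figure below is a real market price), even a token per-work fee adds up at training-corpus scale, and actual negotiated rates would be far higher:

```python
# Hypothetical arithmetic only: both numbers are assumptions, not data.
works = 1_000_000_000  # a billion works, the low end described above
fee_per_work = 0.01    # an assumed flat fee of one cent per work
print(f"${works * fee_per_work:,.0f} in fees alone")  # -> $10,000,000
```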
