The current state of the theory that GPL propagates to AI models

When GitHub Copilot was launched in 2021, the fact that its training data included a vast amount of Open Source code publicly available on GitHub attracted significant attention, sparking lively debates regarding licensing. While there were issues concerning conditions such as attribution required by most licenses, there was a particularly high volume of discourse suggesting that the conditions of copyleft licenses, such as the GNU General Public License (GNU GPL), would propagate to the model itself, necessitating that the entire model be released under the same license. The propagation of the GPL is a concept that many modern software engineers have naturally accepted; thus, for an engineer with a straightforward sensibility, it is a perfectly natural progression to think that if GPL code is included in some form, copyleft applies and the license propagates.

However, as of 2025, the theory that the license of the source code propagates to AI models trained on Open Source code is not seen as frequently as it was back then. Although some ardent believers in software freedom still advocate for such theories, it appears they are being overwhelmed by the benefits of AI coding, which has overwhelmingly permeated the programming field. Amidst this trend, even I sometimes succumb to the illusion that such a theory never existed in the first place.

Has the theory that the license of training code propagates to such AI models been completely refuted?

Actually, it has not. This issue remains an indeterminate problem where lawsuits are still ongoing and the judgments of major national governments have not been made clear. In this article, I will explain the current situation of this license propagation theory, namely “GPL propagates to AI models trained on GPL code,” and connect it to points of discussion such as the legal positioning of models and the nature of the freedom we pursue in the AI domain.

Note: This article is an English translation of a post originally written in Japanese. While it assumes a Japanese reader, I believe it may also be useful for an English-speaking audience.

The Current Standing in Two Lawsuits

First, let us organize what the “GPL propagation theory to AI models” entails. This is the idea that when an AI model ingests GPL code as training data, the model itself constitutes a derivative work (derivative) of the GPL code; therefore, when distributing the model, the copyleft conditions of the GPL, such as the obligation to disclose source code, apply. In other words, it is not a question of whether the output of the model is similar to the GPL code, but a theory that “since the model itself is a derivative containing GPL code, the GPL extends to the model.” While there were many voices supporting this theory around 2021, as mentioned earlier, it is no longer the mainstream of the discussion today. However, two major ongoing lawsuits can be cited as grounds that this theory has not been completely denied. These are Doe v. GitHub (the Copilot class action) filed in the United States and GEMA v. OpenAI filed in Germany. I will explain the history and current status of each lawsuit below.

Doe v. GitHub (Copilot Class Action): The Persisting Claim of Open Source License Violation

In the Copilot class action filed at the end of 2022 in relation to GitHub Copilot, anonymous developers became plaintiffs and argued that GitHub, Microsoft, and OpenAI trained their models on source code from public repositories without permission, inviting massive license violations through Copilot. Specifically, they viewed it as problematic that when Copilot reproduces part of the code that served as the training source in its output, it does not perform the author attribution or copyright notice required by licenses such as MIT or Apache-2.0 at all, and furthermore, it indiscriminately trains on and outputs code under licenses that impose copyleft conditions like the GPL, thereby trampling on license clauses. The plaintiffs claimed this was a contractual violation of open source licenses and also sought damages and injunctions, asserting that it constituted a violation of the Digital Millennium Copyright Act (DMCA) under copyright law.

In this case, several decisions have already been handed down by the United States District Court for the Northern District of California, and many of the plaintiffs’ claims have been dismissed. What were dismissed were mainly peripheral claims such as DMCA clause violations, privacy policy violations, unjust enrichment, and torts, but some DMCA violations and the claim of “violation of open source licenses” (breach of contract) are still alive. Regarding the latter specifically, the argument is that despite the plaintiffs’ code being published under licenses like GPL or MIT, the defendants failed to comply with the author attribution or the obligation to publish derivatives under the same license, which constitutes a contractual violation. Although the court did not recognize claims for monetary damages because the plaintiffs could not demonstrate a specific amount of damage, it determined that there were sufficient grounds for the claim for injunctive relief against the license violation itself. As a result, the plaintiffs are permitted to continue the lawsuit seeking an order prohibiting the act of Copilot reproducing others’ code without appropriate license indications.

... continue reading