FSF on Anthropic Copyright Settlement & LLM Training Data

by Priyanka Patel

The Free Software Foundation (FSF) doesn’t typically find itself embroiled in copyright lawsuits. But a recent class action against Anthropic, the AI company behind the Claude chatbot, has drawn the organization into the debate over the legal and ethical implications of training large language models (LLMs) on copyrighted material. The FSF, a staunch advocate for software freedom, received notice regarding the settlement in Bartz v. Anthropic, and although it’s a small organization, it’s prepared to fight for its principles, and for user freedom, if necessary. This case highlights a growing tension between the rapid advancement of artificial intelligence and the rights of creators, and the FSF’s stance offers a distinctive perspective on the future of AI development.

The lawsuit, Bartz v. Anthropic, alleges that Anthropic infringed copyright by using datasets containing works from websites like Library Genesis and Pirate Library Mirror to train its LLMs. These datasets, readily available online, contain vast collections of books and other written materials. The district court initially ruled that the *use* of the copyrighted works to train the models constituted fair use, but left open the question of whether the *downloading* of those works was legal. Rather than proceed to trial on that point, Anthropic and the plaintiffs reached a settlement, and are now notifying potential copyright holders, including the FSF, about potential compensation. The details are available on the settlement website.

A Matter of Principle: Free Software and AI Training

The FSF’s involvement stems from the fact that it holds copyrights to numerous programs within the GNU Project, as well as several books available through its online shop. Crucially, the FSF publishes all its copyrighted works under “free” licenses – meaning licenses that grant users the freedom to run, study, share, and modify the software or text. One such work, Sam Williams and Richard Stallman’s Free as in Freedom: Richard Stallman’s Crusade for Free Software, was identified as being present in the datasets used to train Anthropic’s LLMs. The book is published under the GNU Free Documentation License (GNU FDL), which explicitly permits use for any purpose, even commercial, without requiring payment.

However, the FSF’s concern extends beyond simply receiving compensation for copyright infringement. The organization believes that true freedom in the age of AI requires more than just acknowledging copyright. “Obviously, the right thing to do is protect computing freedom,” the FSF stated in a recent announcement. “Share complete training inputs with every user of the LLM, together with the complete model, training configuration settings, and the accompanying software source code.” This demand reflects the FSF’s core philosophy: that users should have complete control and understanding of the technology they use. The current practice of “black box” AI, where the inner workings of a model are opaque, is fundamentally at odds with this principle.

The Broader Implications for LLM Development

The FSF isn’t alone in raising concerns about the datasets used to train LLMs. Researchers and legal experts have been debating the ethical and legal implications of scraping vast amounts of data from the internet for AI training for some time. The Bartz v. Anthropic case is one of the first major legal challenges to this practice. The outcome could set a precedent for how AI developers approach copyright and data usage in the future. Some argue that training LLMs on copyrighted material constitutes transformative use, falling under fair use provisions. Others contend that downloading and using copyrighted works without permission, even for training purposes, is a violation of copyright law.

The FSF’s position is particularly nuanced. It acknowledges the potential benefits of LLMs but insists that those benefits should not come at the expense of user freedom. The organization is urging Anthropic and other LLM developers to adopt a more open and transparent approach, sharing not only the models themselves but also the data and code used to create them. This would allow users to understand how the models work, verify their accuracy, and contribute to their improvement. It would also align with the principles of free software, fostering a more collaborative and equitable AI ecosystem.

A Strategic Approach to Legal Battles

The FSF recognizes that it must be strategic in choosing its fights. “We are a small organization with limited resources and we have to pick our battles,” it stated. However, it made clear that if it were to participate in a lawsuit similar to Bartz v. Anthropic and find its copyright or license violated, it would “certainly request user freedom as compensation.” This suggests the FSF is willing to pursue legal action if necessary, but its primary goal is not financial compensation; it’s establishing a legal framework that protects user freedom and promotes the development of truly open and accessible AI.

The organization’s stance is a call for a fundamental shift in how AI is developed and deployed. It’s a challenge to the current closed-source, proprietary model, advocating instead for a more open, collaborative, and user-centric approach. The FSF believes that AI has the potential to be a powerful tool for good, but only if it’s developed and used in a way that respects user freedom and promotes the common good. The outcome of the Bartz v. Anthropic settlement, and future legal challenges, will likely play a significant role in shaping that future.

The FSF will continue to monitor the Bartz v. Anthropic settlement and assess its impact on the free software community. The organization encourages individuals and organizations interested in protecting user freedom to learn more about its work and to support its efforts. The next step in the settlement process involves the distribution of funds to eligible copyright holders, a process expected to unfold over the coming months, according to the settlement administrator’s website.

What are your thoughts on the ethical implications of AI training data? Share your perspective in the comments below, and pass this article along to your network to continue the conversation.
