AI “Reading” Gets a Legal Green Light: A US court has ruled that using legally purchased books for AI training without the author's consent qualifies as fair use.
AI companies may now use published books as training data without the original authors' consent.
In a recent ruling, a US federal court decided that Anthropic, the company behind Claude, may use legally purchased published books to train its AI without the authors' permission.
The court invoked the "Fair Use" doctrine in US copyright law and held that AI training is a "Transformative Use": the new use does not substitute for the market for the original work, and it benefits technological innovation and the public interest.
This is the first time a US court has sided with an AI company on this question, shielding it from copyright claims over using lawfully acquired text to train LLMs. It significantly reduces the copyright risk attached to AI training data.
Many netizens argue that since it is uncontroversial for humans to read and digest books, it should be equally reasonable for AI to do so.
What's going on?
The lawsuit against Anthropic was filed by three writers in August 2024.
Notably, the Anthropic case concerns not only the use of published books to train AI but also where those books came from:
In 2021, Anthropic co-founder Ben Mann downloaded 196,000 copyrighted books from pirate websites.
In 2022, Anthropic downloaded at least 5 million copies from LibGen and 2 million copies from PiLiMi to build a digital library.
Although Anthropic was aware of the legal risks of piracy at the time (internal messages described being "not so gung ho about pirated books for legal reasons"), it kept all of the pirated copies.
In March 2023, Anthropic selected a subset of books from the digital library to train the Claude model, and the first version of Claude was released.
In February 2024, Anthropic hired Turvey, the former head of Google's book-scanning project, and shifted to legally purchasing and scanning books, buying millions of physical copies.
Turvey sent "an email or two" to publishers but did not follow up ("let those conversations wither").
Based on the US court's ruling on Anthropic, the following points can be noted:
1. The central dispute is that Anthropic used books, both legitimately purchased and pirated, to train Claude without the creators' permission.
2. The plaintiffs accused Anthropic of unlawfully copying their works (both pirated and scanned versions) for AI training, infringing their copyrights.
3. The court ruled that Anthropic may use scanned copies of legally purchased books as data for AI training, finding AI training "highly transformative": it does not directly substitute for the market for the original works, and Claude's outputs do not infringe the plaintiffs' works.
4. The court also ruled that using pirated books does not qualify as fair use, and that the piracy itself is infringing; liability and damages for the piracy will be decided at trial.
Some netizens summed it up simply: the key question is whether the books used for training came from pirated sources.
In other words, AI companies can use legally purchased books to train AI without the original authors' permission.
Some netizens called it the right decision, as natural as a person going to the library or reading books they bought.
At the same time, the ruling faces controversy: can AI be equated with a human reader? And how should creators protect their work?
Similar Cases
Similar cases have appeared in lawsuits involving other AI companies.
Google Books in 2015: An appeals court recognized it as "Fair Use"
In 2004, Google launched the Google Books Library Project, partnering with major libraries to scan and digitize over 20 million books so Google users could search them directly. The scanned books included both public-domain works whose copyright had expired and works still under copyright protection.
Google Books offered free full-text browsing and PDF downloads for public-domain works; for works still under copyright, it showed only the title, an introduction, and short excerpts, along with links to purchase legitimate e-book or print editions.
In 2005, the Authors Guild and other organizations sued Google, claiming that its unauthorized scanning of entire books constituted copyright infringement, on the following grounds:
- Full-text digital copying infringes the authors' reproduction rights;
- The snippet-browsing function may substitute for the market for the original works;
- There is a commercial motive (revenue derived from the search business);
- Storing digital copies creates a risk of leaks by hackers;
- Distributing copies to partner libraries may harm copyright holders' interests.
In 2013, a US federal district court issued the first ruling, dismissing the plaintiffs' claims. It found that Google's search and snippet-browsing functions merely "transformed" the use of the original works (from reading to information retrieval), provided no substantial substitute content, and could promote academic research and book discovery, thus meeting the conditions for fair use.
In 2015, the appeals court upheld the original judgment; the Supreme Court later declined to review the case, leaving the fair-use finding in place.
GitHub Copilot in 2022: Prompted GitHub to launch a "code referencing" feature
GitHub Copilot is an AI programming assistant developed by GitHub, a Microsoft subsidiary. It was originally based on OpenAI's Codex model and generates code suggestions learned from public code repositories (such as open-source projects on GitHub).
In 2022, multiple open - source developers and organizations accused GitHub Copilot of:
- License violation: Copilot was trained on code under copyleft ("viral") open-source licenses such as the GPL, but its generated code does not follow the original licenses' requirements (e.g., retaining copyright notices).
- Copyright infringement: generated code is at times highly similar to existing open-source code, suggesting direct copying.
- Commercial abuse: Microsoft transformed free open-source code into a paid tool (Copilot Enterprise Edition), violating the open-source spirit.
Based on public reports and the progress of the lawsuit, the key conclusions are as follows:
- The court determined that using open-source code for AI training is a "Transformative Use" and does not constitute direct infringement (following the logic of the Google Books case);
- The plaintiffs failed to prove that Copilot systematically outputs infringing code, and occasional similar fragments do not amount to large-scale violations;
- The court required GitHub to strengthen its filtering mechanisms to avoid outputting code governed by strong copyleft licenses such as the GPL, or to clearly mark the source and license requirements, and to provide tools for users to check the similarity between generated code and open-source libraries.
In February 2023, GitHub officially launched its "code referencing" feature, integrated into Copilot by default, to help users identify links between generated code and open-source projects.
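As a rough illustration of what such similarity checking involves (a hypothetical sketch using Python's standard-library `difflib`, not GitHub's actual matching system, which operates at far larger scale), one could compare a generated snippet against a corpus of known open-source code and flag close matches:

```python
import difflib


def similarity(generated: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between two code snippets."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()


def flag_matches(generated: str, corpus: dict, threshold: float = 0.9) -> list:
    """Return names of corpus entries (name -> code) the snippet closely matches."""
    return [name for name, code in corpus.items()
            if similarity(generated, code) >= threshold]


# Hypothetical corpus entry standing in for an indexed GPL-licensed file.
corpus = {"gpl_project/util.py": "def add(a, b):\n    return a + b\n"}

suggestion = "def add(a, b):\n    return a + b\n"
print(flag_matches(suggestion, corpus))  # → ['gpl_project/util.py']
```

A production system would index billions of files and match normalized token sequences rather than raw text, but the core idea is the same: suggestions that closely match licensed code get flagged or filtered before they reach the user.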
OpenAI & Meta in 2023: The cases are still pending
In 2023, multiple writers, actors, and the Global Publishers Alliance accused OpenAI and Meta of training AI on pirated data, including books from "shadow libraries" (such as Bibliotik, LibGen, and Z-Library) that host unauthorized copies of copyrighted content.
The plaintiffs argue that ChatGPT can accurately summarize their books, evidence that the model "memorized" the protected text, and that Meta CEO Mark Zuckerberg and his AI team knew LibGen was a pirated source but decided to use its data to train Llama 3 anyway, in order to overtake OpenAI more quickly.
The cases against OpenAI and Meta are still under review, and no clear ruling has been made.
Anthropic's victory in this case is not an isolated incident. It reflects the US judicial system's lean toward innovation in the tug-of-war between technological progress and copyright protection, and it is the first time a US court has applied the fair use doctrine to shield an AI company training LLMs on copyrighted text.
In other words, AI may now study content that is legally purchased, but not content obtained from pirate websites.
Some netizens believe this ruling may influence how US courts handle the pending cases against OpenAI and Meta.
This article is from the WeChat official account "QbitAI," author: Buyuan. It is published by 36Kr with permission.