NVIDIA sued: Is using pirated data to train large models an unspoken industry rule?
Recently, NVIDIA became the defendant in a class-action lawsuit over the copyright of AI training data.
The plaintiffs in this lawsuit are five writers who hold multiple registered copyrights. The complaint alleges that, when developing its next-generation large language models with the NeMo Megatron framework, NVIDIA used a dataset drawn from pirate libraries that contained the plaintiffs' copyrighted works. These pirate libraries are also known as "shadow libraries".
NeMo Megatron is an end-to-end framework developed by NVIDIA for building, training, and deploying large language models.
The plaintiffs filed the lawsuit in the United States District Court for the Northern District of California. On January 31, 2026, NVIDIA filed a motion asking the court to dismiss the complaint, arguing that the plaintiffs had failed to provide sufficient evidence of infringement and that its conduct constituted "fair use". The court has scheduled a hearing for April 2, 2026, to review the motion.
Internal records cited in the complaint show that NVIDIA was under competitive pressure from OpenAI. To showcase leading technology at its 2023 Developer Conference, it resorted to obtaining millions of pirated books from "shadow libraries" to train its large language models.
The complaint also alleges that NVIDIA provided tools and scripts to its customers, encouraging and helping them to download pirated datasets.
Amid the large-model boom, NVIDIA is not the only company caught in copyright disputes over training data. AI giants such as OpenAI, xAI, Anthropic, and Meta have faced a string of lawsuits as well. In one infringement case, Anthropic agreed to pay at least $1.5 billion in a settlement that may set a record for copyright compensation.
01
Did NVIDIA's senior management approve the cooperation with pirate libraries?
The quality and quantity of training data are crucial to developing large models. Books offer ample volume and are regarded in the industry as high-quality training data. For large-model developers, data from "shadow libraries" is more accessible and meets the demand for book data in training.
The complaint shows that NVIDIA released multiple large models in the NeMo Megatron series. According to its descriptions on the Hugging Face website, these models were trained on The Pile, a dataset released by the non-profit research institute EleutherAI.
The Pile contains a subset called Books3, which is sourced from the "shadow library" Bibliotik and includes approximately 190,000 books.
Beyond using The Pile, NVIDIA is also accused of collaborating directly with "shadow libraries", including Anna’s Archive, the world's largest, to train large models on pirated book resources.
Anna’s Archive was established in November 2022, after the well-known e-book library Z-Library was taken down in a large-scale action by the US government and its founders were arrested. It aims to integrate the resources of several shadow libraries, including Z-Library, Library Genesis (LibGen), Open Library, and Sci-Hub, to create a "permanent backup" of knowledge. In January 2026, a US federal court in Ohio issued a permanent injunction ordering it to delete all the data it had scraped from WorldCat, the world's largest library catalog database.
The complaint discloses the full course of NVIDIA's communication and negotiation with Anna’s Archive. Internal documents show that the most direct reason NVIDIA obtained pirated books was fierce industry competition. In September 2022, NVIDIA released the NeMo Megatron series of large models. The following year, OpenAI's ChatGPT proved a great success, drawing investors' attention to artificial intelligence. NVIDIA therefore regarded its annual Developer Conference in the fall of 2023 as a critical milestone: releasing a large language model with leading performance there would help it hold its own in the fierce competition.
The complaint shows that, when obtaining data for projects internally code-named "NextLargeLLM", "NextLLMLarge", and "Next Generation LLM" (hereinafter collectively "NextLargeLLM"), NVIDIA focused heavily on book corpora. In August 2023, NVIDIA negotiated with several book publishers to obtain book data quickly, but the request was rejected and no data-licensing agreement was reached.
To meet its urgent need for books, NVIDIA wrote to Anna’s Archive to ask about the specific form of its "high-speed access". In its reply, Anna’s Archive noted that its pirated resources had been obtained illegally and suggested that NVIDIA first determine internally whether it could cooperate, then notify it and proceed.
Within a week of contacting Anna’s Archive, NVIDIA's management approved the cooperation plan. Anna’s Archive then provided NVIDIA with access to millions of pirated books, totaling approximately 500 TB of data.
The complaint states that, in addition to Anna’s Archive and The Pile, NVIDIA also downloaded book resources from other "shadow libraries", including Z-Library, LibGen, and Sci-Hub.
Z-Library rose rapidly thanks to its extremely fast book updates and good user experience. In November 2022, the US Federal Bureau of Investigation seized more than 200 of Z-Library's core domains. Two Russian founders were arrested in Argentina and face criminal charges of money laundering and copyright infringement; the US government is seeking their extradition. Courts in the United States, Austria, Germany, India, and other countries have also repeatedly ordered domain registrars to cancel its domain names.
Library Genesis is known as the originator of "shadow libraries". In 2017, a court in New York ordered Library Genesis to pay the publisher Elsevier $15 million in damages. In 2023, several US textbook publishers sued LibGen again, demanding that it hand over its domain names or be removed from the Internet entirely.
Sci-Hub focuses on academic papers. Courts in the UK, France, Germany, and other countries have ordered major Internet service providers (ISPs) to block it. Since the end of 2020, Sci-Hub has largely stopped uploading new papers.
In February 2024, four months after reaching the cooperation with Anna’s Archive, NVIDIA released its most powerful large model at the time, Nemotron-4 15B. Public information shows that Nemotron-4 15B has 15 billion parameters and was pre-trained on 8 trillion tokens of text. NVIDIA did not disclose the sources of the model's training data, but it has publicly stated that 70% of the training data came from an "English natural language" dataset, of which 4.6% was book content. The complaint argues that, by this calculation, NVIDIA's training data must have included millions of books, a volume the company could not have obtained without pirated resources.
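A rough back-of-the-envelope sketch of the complaint's arithmetic is shown below in Python; the average-tokens-per-book figure is an assumption introduced here for illustration, not a number from the complaint.

```python
# Rough reconstruction of the complaint's arithmetic; illustrative assumptions only.
total_tokens = 8e12        # 8 trillion pre-training tokens (per NVIDIA's public description)
english_share = 0.70       # share of "English natural language" data in the training mix
book_share = 0.046         # book content within that English natural-language data
tokens_per_book = 100_000  # assumed average tokens per book (hypothetical figure)

book_tokens = total_tokens * english_share * book_share  # ~2.6e11 tokens of book text
implied_books = book_tokens / tokens_per_book            # ~2.6 million books

print(f"Book tokens: {book_tokens:.2e}; implied number of books: {implied_books:,.0f}")
```

Under these assumptions the training mix would contain roughly 260 billion tokens of book text, on the order of a few million books, which is the scale the complaint points to.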
In addition, the complaint shows that, through the NeMo Megatron framework and the BigNLP platform, NVIDIA provided customers with scripts to automatically download and pre-process The Pile dataset. NVIDIA also provided similar assistance to its customers Persimmon AI Labs and Amazon for downloading and processing The Pile.
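For readers unfamiliar with this kind of tooling, the sketch below shows roughly what a "download The Pile" helper script does; it is an illustrative example with a placeholder mirror URL and paths, not NVIDIA's actual code.

```python
# Illustrative sketch of a "download The Pile" helper; not NVIDIA's code.
# The mirror URL and output directory are placeholders.
import os
import urllib.request

BASE_URL = "https://example.org/pile/train"  # hypothetical mirror of The Pile
OUT_DIR = "pile_shards"
NUM_SHARDS = 30  # The Pile's training split was distributed as 30 compressed JSONL shards

os.makedirs(OUT_DIR, exist_ok=True)
for i in range(NUM_SHARDS):
    name = f"{i:02d}.jsonl.zst"
    dest = os.path.join(OUT_DIR, name)
    if not os.path.exists(dest):  # skip shards that were already downloaded
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", dest)

# A real pipeline would then decompress each shard and tokenize the text into the
# binary format expected by the training framework before pre-training begins.
```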
02
Does the demand for large-model training support the piracy business?
"Shadow libraries" illegally store and distribute a large amount of high - quality copyrighted content and are willing to provide paid "priority download channels" for large - model developers.
Anna’s Archive states on its official website: "Large language models thrive on high-quality data. We have the world's largest collection of books, papers, journals, and more, the highest-quality text resources. We offer high-speed enterprise-level access in exchange for donations in the tens of thousands of dollars."
This business model has given "shadow libraries" a glimmer of hope. Anna’s Archive states on its website that not long ago "shadow libraries" were on the verge of extinction; under litigation pressure, Sci-Hub, which holds a large volume of pirated academic papers, stopped accepting new works. "With the rise of artificial intelligence, almost every company developing large language models has contacted us about training data. We have provided high-speed access to around 30 companies."
However, using pirated books exposes large-model companies to significant risk of infringement litigation. The "Copyright and Artificial Intelligence" series of reports issued by the US Copyright Office in May 2025 pointed out that the data collection and pre-processing stages involve downloading, converting, and modifying large numbers of copyrighted works. Regardless of whether the data source is a public website, these acts may constitute multiple infringements of the rights of reproduction, compilation, and adaptation, and the risk is especially pronounced in commercial use cases.
In 2025, US courts handed down judgments in two cases in which copyright holders sued large-model companies over the use of pirated book resources.
On June 23, 2025, the US District Court for the Northern District of California ruled on fair use in a copyright infringement lawsuit brought by writers including Andrea Bartz against Anthropic, finding that using copyrighted works for AI training was fair use. However, downloading more than 7 million "known pirated" e-books from sites such as Library Genesis and Pirate Library Mirror "essentially and irredeemably constitutes infringement" and cannot be excused by fair use. In September of the same year, media reported that Anthropic had agreed to pay at least $1.5 billion to settle the case, and a California court has preliminarily approved the agreement. It would be the largest publicly reported copyright settlement in history.
On June 25, 2025, also in the US District Court for the Northern District of California, summary judgment was entered in a copyright infringement case in which writers including Richard Kadrey sued Meta Platforms for using pirated books to train the large model Llama. That judgment likewise found Meta's conduct to be fair use. Unlike in the Anthropic case, however, the court held that Meta's acquisition and use of pirated works did not separately constitute infringement, because Meta used the works to train an AI large model, a transformative use.
Specifically, Meta used the plaintiffs' books to train its large model Llama, which can generate diverse text and perform a wide range of functions, whereas the original copyrighted works were mainly read for entertainment or education. Meta's use of the plaintiffs' books therefore had a "further purpose" and a "different character", that is, it was highly transformative. Given the connection between the copying and Meta's transformative purpose, the amount copied was also reasonable and necessary. As for market impact, the plaintiffs offered no effective evidence that their market was harmed or diluted.
It is worth noting that the court limited the scope of the ruling, stating that the case "does not constitute a class-action lawsuit" and "does not constitute a precedent for Meta's legal use of copyrighted materials to train language models".
03
Copyright issues over training data have triggered more lawsuits
After the Anthropic case, more writers and copyright holders have filed lawsuits. On December 22, 2025, journalist and author John Carreyrou, together with five other writers, filed suit in federal court in California against six companies, Google, OpenAI, xAI, Anthropic, Meta, and Perplexity, accusing them of using copyrighted books to train artificial intelligence systems without permission.
The plaintiffs stated explicitly that they were not seeking a larger class action because it would benefit the defendants, who might try to reach a unified settlement with a large body of plaintiffs to resolve many claims at once. As the complaint put it, "Large-model companies should not be allowed to easily settle thousands of high-value claims at extremely low prices."
As early as December 2023, The New York Times, together with eight other media organizations, accused Microsoft and OpenAI of using articles published by the media to train artificial intelligence models, infringing their copyrights.
In March 2025, the US District Court for the Southern District of New York rejected OpenAI's motion to dismiss the core claims outright, allowing the key disputes to proceed: whether OpenAI's use of The New York Times' news content to train its models constitutes copyright infringement will now undergo substantive judicial review. The industry views this procedural ruling as highly favorable to the plaintiffs, signaling that the court considers The New York Times' allegations to have a sufficient legal basis and to merit full evidentiary review and judgment on the merits. In November 2025, OpenAI was ordered to produce user logs, which are of great value in proving how training data was specifically used and how similar the model's outputs are.
A lawyer who asked not to be named said that US courts are currently very cautious about such AI copyright issues, avoiding prematurely establishing universally binding rules on the basis of a single case. He believes that as more copyright cases over large-model training data emerge, outcomes will turn on how disputed facts are determined and how technical details are characterized.
At the same time, in May 2025, The New York Times reached a licensing agreement with Amazon, authorizing Amazon to use its news content for AI product enhancement and model training. The aforementioned lawyer believes that resolving disputes through licensed cooperation rather than litigation may become one of the industry's common solutions.
Publishers also began asserting their rights in 2025. Several large publishers, including Condé Nast, The Atlantic, Politico, and Vox, sued the AI startup Cohere, accusing it of using more than 4,000 copyrighted works without authorization to train large language models and of serving users large portions of articles, or entire articles, without their visiting the publishers' websites.
In China, iQiyi's lawsuit against MiniMax has also drawn attention; it is the first suit over AI training data brought by a domestic video platform. In January 2025, media reported that iQiyi had filed the case in the Xuhui District People's Court in Shanghai, accusing MiniMax of infringing its copyrights during model training and content generation, and claiming compensation of approximately 100,000 yuan. iQiyi responded that the case was still in the legal process and that no further information could be disclosed.
Meanwhile, MiniMax faced a class-action lawsuit from Hollywood giants in 2025. Disney, Universal, and Warner Bros. sued over its Hailuo ("Conch") AI for copyright infringement in federal court in California, claiming up to $75 million (approximately 528 million yuan) in damages. MiniMax first denied the copyright allegations in its prospectus at the end of 2025, arguing that using the relevant copyrighted content for AI training was fair use and that the $75 million claim was "significantly overestimated". The company contended that the number of independent works eligible for statutory damages was far lower than the plaintiffs claimed and emphasized that, after receiving the complaint, it had taken technical measures to prevent the relevant infringing outputs. The lawsuit is still in progress.
This article is from the WeChat official account "Caixin Magazine", author: Fan Shuo, editor: Zhu Tao, published by 36Kr with authorization.