The New York Times and 400 other media outlets sue Microsoft: Who is the "accomplice" in AI copyright infringement?

Intellectual Property Power (知产力), 2026-06-29 11:49
The AI copyright battle is shifting from the question of "who trained the model" to "who helped the model complete its training".

Until now, OpenAI has been the model developer in the spotlight. This time Microsoft is being pushed to center stage, which raises a deeper question: when a company provides supercomputing power, customized systems, commercial entry points, and product distribution capabilities, can it still claim to be a "neutral infrastructure provider"?

On June 25, 2026, The New York Times sought leave to file a third amended complaint in its copyright lawsuit against OpenAI and Microsoft. The new complaint takes sharper aim at Microsoft. It alleges that Microsoft did not merely benefit passively from OpenAI's use of copyrighted content to train AI models, but actively induced, assisted, and facilitated large-scale copyright infringement by building a customized supercomputing system.

According to reports, this supercomputing system contains more than 285,000 CPU cores and 10,000 GPUs. The New York Times cites it to argue that Microsoft is not an ordinary cloud service provider, but supplied key infrastructure and technical conditions that made OpenAI's large-model training capability possible.

At almost the same time, publishers representing nearly 400 local US newspapers also sued Microsoft and OpenAI, accusing them of scraping and copying large numbers of news articles without permission to train AI products such as ChatGPT and Copilot. The plaintiffs further claim that when OpenAI used text extraction tools such as "Dragnet" and "Newspaper," it deliberately stripped author bylines, copyright notices, and terms of use from the articles, which may violate the copyright management information provisions of the US Digital Millennium Copyright Act (DMCA).

Together, these cases send a clear signal: AI copyright lawsuits no longer target model companies alone, but are beginning to spread along the technical, business, and liability chains.

I. This time, the question is "who facilitated the infringement?"

In the past, when discussing AI training copyright disputes, the core issues usually focused on model developers like OpenAI. Does the training data contain copyrighted works? Does the training behavior constitute copying? Does the model output replace the original work? Can fair use be applied?

This time, however, The New York Times' amended complaint casts Microsoft in a more active role.

The plaintiffs attempt to prove that Microsoft is neither an investor far removed from the training process nor a simple cloud service provider selling computing power, but a party that supplied key conditions for OpenAI's large-scale training through a deeply customized supercomputing system.

This pushes the focus of the case further towards the boundaries of "contributory infringement," "inducement of infringement," or "joint infringement."

In other words, the question a court may eventually need to answer is this: if a party does not itself scrape articles or train the model, but knows or should know that the training relies on large-scale copyrighted content and still provides purpose-built computing power, systems, and business support, can it become part of the AI infringement liability chain?

That is the real meaning of "accomplice" in the title: liability for AI infringement may not always stop at the model developer.

II. Microsoft's trouble lies in its integration with OpenAI's industrial chain

The relationship between Microsoft and OpenAI is not an ordinary supplier-customer relationship.

Microsoft is both an important investor in OpenAI and a provider of cloud computing and computing power infrastructure. At the same time, Microsoft has integrated OpenAI's model capabilities into products and services such as Copilot, Bing, Office, and Azure.

This makes it difficult for Microsoft to describe itself as a completely neutral, external infrastructure provider with no knowledge of downstream uses.

For a general-purpose cloud service, the liability boundary is relatively clear. The cloud provider offers servers, and customers upload data, train models, and deploy applications on their own. Cloud providers usually do not bear liability for infringement that customers commit using cloud resources.

However, if the infrastructure is highly customized for specific model training, if the service provider is deeply involved in the design of the training architecture, if it knows that the training behavior requires a large amount of text content, and if it continuously obtains commercial benefits from downstream AI products, then the defense of being a "neutral tool" will become difficult.

This is exactly the line of attack in The New York Times' complaint.

The complaint does not simply say that Microsoft is "rich," "technologically capable," or "has a partnership." It tries to prove that Microsoft made a substantial contribution to building OpenAI's training capability.

III. The news industry has begun a collective counterattack

If The New York Times' lawsuit represents a leading outlet forcefully asserting its rights, then the collective lawsuit by nearly 400 local newspapers represents the broader survival anxiety of the news industry.

Local newspapers are neither technology giants nor traffic platforms. Their value comes from long-term news gathering and editing, local investigations, community reporting, fact-checking, and public record maintenance.

If this content is scraped, copied, and used for training by AI systems without compensation and then repackaged and output through products like ChatGPT and Copilot, local media will face not only copyright losses but also losses in entry points, traffic, subscriptions, and advertising.

That is to say, AI not only "learns" news content but may also change the way users access news.

In the past, users had to visit newspaper websites and read the original articles, which generated clicks, subscriptions, and advertising revenue. Now, users can put their questions directly to an AI and get summaries, answers, and organized information. Content producers bear the cost, AI products take over the entry points, and platforms capture the commercial value, while news organizations are squeezed out of the distribution chain.

This is exactly what content providers cannot accept.

The dispute over AI training data is, at bottom, not an abstract conflict between "technological innovation and traditional copyright" but a very concrete question of how interests are distributed.

Who produces the content? Who bears the cost? Who takes the data? Who gets the valuation? Who is displaced in the market?

IV. The force of the DMCA claim lies in its focus on "removing the copyright label"

In the local newspapers' lawsuit, the DMCA claims deserve special attention. The plaintiffs claim not only that their articles were copied, but also that OpenAI removed copyright management information, such as author bylines, copyright notices, and terms of use, when it processed the content with text extraction tools.

If this allegation is proven, its significance will go beyond ordinary copyright infringement.

Ordinary infringement asks "whether you took the work." The DMCA's copyright management information rules ask "whether you removed the rights label before taking the work."

This is especially sensitive in the context of AI training.

Large-scale data processing usually goes through steps such as scraping, cleaning, deduplication, slicing, annotation, and vectorization. Many engineering systems treat the author, source, copyright notice, website terms, and license restrictions as "noise" and delete them during cleaning.

From the perspective of copyright compliance, however, this information is not noise but the boundary of rights. AI training is not about making the data as clean as possible. Some of the "dirty information" is exactly what marks out the rights holder's protection.
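
To make the engineering point concrete, here is a minimal sketch of a cleaning step that separates rights information from the body text instead of deleting it. The field names (`byline`, `copyright_notice`, `terms_url`, `source_url`), the `CleanedDocument` type, and the `clean_article` function are all hypothetical and chosen only for illustration. They are not taken from the complaints or from any real pipeline.

```python
import re
from dataclasses import dataclass, field

# Hypothetical names for rights-related fields on a scraped article.
RIGHTS_FIELDS = ("byline", "copyright_notice", "terms_url", "source_url")


@dataclass
class CleanedDocument:
    text: str                                   # normalized body text
    rights: dict = field(default_factory=dict)  # rights information kept with the text


def clean_article(raw: dict) -> CleanedDocument:
    """Normalize whitespace in the body, but carry rights fields forward."""
    body = re.sub(r"\s+", " ", raw.get("body", "")).strip()
    # Keep only the rights fields that are present and non-empty.
    rights = {key: raw[key] for key in RIGHTS_FIELDS if raw.get(key)}
    return CleanedDocument(text=body, rights=rights)


sample = {
    "body": "City council  approves\n new budget.",
    "byline": "Jane Doe",
    "copyright_notice": "(c) 2026 Example Gazette",
    "tracking_id": "abc123",  # unrelated field, dropped during cleaning
}
doc = clean_article(sample)
```

In this sketch the whitespace cleanup still happens, but the byline and copyright notice travel with the cleaned text as structured metadata rather than being discarded as noise.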

V. Fair use remains the main battlefield, but it is no longer a magic card

Microsoft and OpenAI are likely to continue to claim fair use.

This is also one of the central defenses in US AI copyright lawsuits: large-model training does not aim to copy the original text but to learn language rules and knowledge associations; the training is transformative; and prohibiting the use of public Internet content for AI training would hinder technological innovation.

There is still room for this defense.

However, content providers are also strengthening their counter-argument: news works involve high investment, high timeliness, and high market value; AI products may generate summaries, substitute for reading, and intercept traffic; and if the training data contains paywalled or access-restricted content, or copyright management information has been stripped, the case for "fair use" will be weakened.

AI copyright cases are evolving from debates over values into evidence-driven engineering exercises.

Intellectual Property Power's Judgment

This set of cases provides a very direct reminder to Chinese enterprises.

Today, many enterprises do not train large models from scratch on their own but jointly develop vertical AI applications with model companies, cloud providers, data suppliers, and industry customers. The more complex the cooperation chain, the more likely responsibilities are to be misaligned.

Technical teams care about model performance, and business teams care about how fast the product launches. If the legal team merely adds a clause at the end of the contract stating that "the other party guarantees the legality of the data," that is far from enough.

Enterprises should at least ensure five things in advance.

First, the data source should be explainable.

Second, the scope of authorization should be clearly stated.

Third, copyright information should be retained.

Fourth, the technical process should be documented.

Fifth, the cooperation responsibilities should be clearly defined.
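
One possible way to make these five items checkable in practice is to attach a small provenance record to each dataset. The sketch below is only an illustration: the `DatasetRecord` fields and the `missing_items` helper are hypothetical names, and the checks merely test whether each item has been filled in, not whether it is legally sufficient.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    source_url: str = ""                                 # 1. where the data came from
    license_scope: str = ""                              # 2. what the authorization covers
    rights_info: dict = field(default_factory=dict)      # 3. retained copyright information
    processing_log: list = field(default_factory=list)   # 4. documented technical steps
    responsible_party: str = ""                          # 5. who is responsible under the contract


def missing_items(record: DatasetRecord) -> list:
    """Return the names of the five items this record has not filled in."""
    checks = {
        "source": bool(record.source_url),
        "license_scope": bool(record.license_scope),
        "rights_info": bool(record.rights_info),
        "processing_log": bool(record.processing_log),
        "responsibility": bool(record.responsible_party),
    }
    return [name for name, ok in checks.items() if not ok]


record = DatasetRecord(
    source_url="https://example.com/archive",
    license_scope="internal model evaluation only",
)
gaps = missing_items(record)
```

A record like this can be reviewed before data enters a training pipeline, so that gaps surface early rather than at product launch.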

AI compliance is not a layer of packaging applied before product launch; it is groundwork that must be laid before data enters the system.

Future AI competition will turn not only on model parameters but also on data sources; not only on computing scale but also on the rights chain; not only on product speed but also on the compliance foundation.

AI can be trained on the whole world, but it cannot treat others' content as ownerless fuel. Whoever uses content as fuel should be ready to answer: who lit this fire?

This article is from the WeChat official account "Intellectual Property Power" (ID: zhichanli), author: Shawn/MCP, published by 36Kr with authorization.