New Copyright Concerns in Retrieval-Augmented Generation (RAG): What You Need to Know!

From "model training" to "retrieval-augmented generation": AIGC enters the 2.0 phase.

I. AIGC Enters the 2.0 Phase: Retrieval-Augmented Generation

In May and July 2025, Amazon reached cooperation agreements in succession with The New York Times and with media groups such as Hearst and Condé Nast, enabling its AI products to display real-time summaries and snippets of The New York Times. 1 Amazon's deal with The New York Times surprised the industry, because the newspaper had previously taken a tough stance on AI copyright issues. In December 2023, it sued OpenAI for copyright infringement in the U.S. District Court for the Southern District of New York, becoming the first mainstream media outlet in the United States to publicly sue a large model provider. 2

Notably, OpenAI also announced a partnership with The Washington Post in April 2025. As a result, ChatGPT's output can now embed article summaries and links to original reporting from The Washington Post. OpenAI stated that this is just one example of its cooperation with more than 20 publishers, and that the parties share a commitment to providing users with more reliable and accurate information, especially on topics that are highly complex and time-sensitive. 3

[Figure: Cooperating copyright holders shown on OpenAI's official website]

The cooperation between overseas large model providers and news publishers reflects a significant shift in generative artificial intelligence. In the earlier "AIGC 1.0 phase," models relied solely on "model training" (pre-training, fine-tuning, etc.) to acquire the capabilities stored in their parameters, and generated answers to user questions probabilistically. In the current "AIGC 2.0 phase," they integrate and embed information from authoritative third-party sources to improve the accuracy, timeliness, and professionalism of the final generated content.

Technically, this is called "Retrieval-Augmented Generation" (RAG), which is essentially a combination of language generation models and information retrieval technologies. Since 2025, large model providers in China have also added retrieval-augmented generation functions. From the user's perspective, the model now runs a "reference material retrieval" step before answering, and the final integrated answer is accompanied by its "information sources."

II. Why is "Retrieval-Augmented Generation" on the Rise?

"Retrieval-Augmented Generation" was first proposed by the Facebook AI Research team in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" in 2020. Retrieval-Augmented Generation emphasizes combining the internal knowledge storage (parameter memory) of pre-trained models with external knowledge base retrieval (non-parameter memory) to address the inherent flaws in content generation of traditional large models - "model hallucination" and "timeliness gap."

It is widely acknowledged that large models often "hallucinate": they output unreliable information, focusing on "telling a good story" rather than "verifying facts." The resulting distrust has led people to avoid large models in many serious, high-stakes scenarios. Large model user agreements also commonly carry statements such as "the model output may not always be accurate... using our service may result in output that does not accurately reflect real people, places, or facts."

As early as June 2023, in response to a query from journalist Frederick Riehl, a ChatGPT "hallucination" fabricated claims that Georgia radio host Mark Walters had committed "fraud and misappropriation of foundation funds," which led to OpenAI being sued for defamation for the first time. 4 In March 2025, the European Center for Digital Rights (Noyb) filed a complaint about ChatGPT's hallucinations with the Norwegian data protection authority, arguing that generating inaccurate content about individuals violated the principle of "accuracy of personal data" in Article 5(1)(d) of the GDPR. 5

Because a large model's answers are limited to the data used during training, there is also a "timeliness gap" problem. As the term "pre-training" suggests, the model is trained in advance: once training is complete, its parameters are fixed and do not update automatically. The model's knowledge is therefore limited to what the training data covered at that time. If the training data does not include the latest information, the large model cannot generate relevant answers. For example, although ChatGPT was released in November 2022, its training corpus ran only up to September 2021; Gemini 2.0 was released in December 2024, but its training corpus ran up to June 2024.

Retrieval-Augmented Generation enables large models to provide accurate answers using real-time external data without retraining the model parameters. It only requires that the knowledge sources be properly matched and kept up to date. The Facebook AI Research team described Retrieval-Augmented Generation as "like an open-book exam, where students bring the most comprehensive reference materials they have organized and answer the questions on the test paper based on their memorized knowledge." This also explains why large model providers are actively pursuing the content partnerships with news institutions described at the beginning.

The entire process of Retrieval-Augmented Generation can be divided into two stages: "data retrieval and collection" and "content integration and display." In the first stage, after receiving the user's instruction, the system performs semantic processing on the question and then searches an external knowledge base. The knowledge base may be built in advance or may be the result of a real-time search of the whole web. In the second stage, the retrieved information is sent to the large model as "augmented context," and the large model uses this timely "augmented prompt" to generate the final answer. Because this process involves collecting and using a large number of copyrighted works, related copyright disputes have already emerged in China and abroad.
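The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's actual implementation: the in-memory `KNOWLEDGE_BASE`, the example URLs, and the bag-of-words similarity are all hypothetical stand-ins for a real index, real sources, and a real embedding model, and the final call to a large model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; a real system would use a pre-built index
# or a live web search instead of three hard-coded snippets.
KNOWLEDGE_BASE = [
    {"source": "https://example.com/a",
     "text": "The city council approved the new budget on Monday."},
    {"source": "https://example.com/b",
     "text": "Dolphins are marine mammals that live in oceans."},
    {"source": "https://example.com/c",
     "text": "The council budget includes funds for road repairs."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, k=2):
    """Stage 1: rank documents by similarity to the question, keep the top k."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in KNOWLEDGE_BASE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_augmented_prompt(question, docs):
    """Stage 2: place the retrieved text, with its sources, into the prompt."""
    context = "\n".join(
        f"[{i}] {d['text']} (source: {d['source']})"
        for i, d in enumerate(docs, 1)
    )
    return (
        "Answer using only the context below and cite the sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "What did the council decide about the budget?"
docs = retrieve(question)
prompt = build_augmented_prompt(question, docs)
print(prompt)  # this augmented prompt would then be sent to the large model
```

Note that the retrieved text itself travels into the prompt alongside its source link, which is why both the collection step and the output step matter for the copyright analysis that follows.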

III. Real-World Copyright Disputes in "Retrieval-Augmented Generation"

As early as October 21, 2024, the first copyright infringement lawsuit targeting "Retrieval-Augmented Generation" was filed in the United States: "Dow Jones & Company and New York Post Holdings v. Perplexity AI." The defendant, Perplexity AI, is an AI startup founded in 2022. After receiving a user's question, it retrieves external information online and replies with summaries and web links. The plaintiffs claim that the defendant used retrieval tools to crawl hundreds of thousands of copyrighted articles from The Wall Street Journal and The New York Post and stored them in its "Retrieval-Augmented Generation" database. It then summarized and rewrote them in response to user questions, and sometimes even copied them verbatim, so that users could obtain high-quality paid content without visiting the original news websites. In the plaintiffs' view, this clearly constitutes copyright infringement. 6

On February 13, 2025, fourteen leading global news publishers, including The Atlantic and The Guardian, sued the Canadian AI company Cohere in the U.S. District Court for the Southern District of New York. They accused Cohere of relying on "Retrieval-Augmented Generation" technology to search, filter, and crawl the plaintiffs' content in real time through its "Web Search Connector," and of directly outputting the complete original text of the plaintiffs' copyrighted works, as well as substitutive summaries, in its generated answers, constituting copyright infringement. 7

Similarly, the first generative artificial intelligence copyright case accepted by the Court of Justice of the European Union (CJEU), on April 3, 2025, also arose in the field of Retrieval-Augmented Generation. The case originated from a copyright dispute between the news publisher Like Company and Google over the Gemini large model, heard by a court in Budapest, Hungary, which referred it to the CJEU because of its complexity. The publicly reported facts rule out the possibility that the plaintiff's articles were used to train Gemini. Instead, Gemini used Retrieval-Augmented Generation to obtain the plaintiff's news article most relevant to the user's question (Can you provide in Hungarian the report on the "Kozsó plan to introduce dolphins into Lake Balaton?" that appears on the balatonkornyeke.hu website?) and generated a real-time summary for the user. The plaintiff accused Google of infringing its neighboring rights as a news publisher. 8

Disputes over Retrieval-Augmented Generation are also emerging in China. According to reports, in August 2024 CNKI sent a 28-page infringement notice to a domestic AI retrieval platform, accusing it of using CNKI's content data without permission in its generated content. The platform responded that it only collected publicly visible titles and abstracts of academic literature, not the full text, and that users still had to follow the source link to CNKI to read the full text, so no damage was caused. In the end, the platform said that, after weighing all the considerations, it had decided to respect CNKI's wishes and stop citing CNKI content. 9

IV. Issues Related to Work Collection in "Retrieval-Augmented Generation"

In the "data retrieval and collection" stage, whether a provider builds an offline database in advance or crawls data online in real time, part or all of a work is stored on some medium in some form. This raises the question of whether the right of reproduction under copyright law is infringed. Discussion of the "right of reproduction" in the digital environment covers two issues: "long-term reproduction" and "temporary reproduction." The current consensus is that unauthorized long-term reproduction constitutes copyright infringement, whereas the treatment of temporary reproduction remains controversial in practice.

"Long-term reproduction" in the digital environment generally includes situations such as "fixing works on tangible carriers such as hard drives and CDs through various technical means," "uploading works to network servers," and "downloading works from network servers to local devices." "Temporary reproduction" in the digital environment refers to the automatic appearance of copies of works during the use of works, but these copies do not exist for a long time and "disappear after use." For example, when we enjoy digital music online, the server will first read and store the song information before converting it into data for transmission and playback; however, after the playback ends and the user exits, the copies will disappear immediately. 10

In Retrieval-Augmented Generation, building the database usually involves converting external works into vector representations and storing them locally. Relevant information is then selectively provided to the large model according to the user's question. Unlike automatic storage or a browser cache, building a retrieval-augmented database generally involves relatively stable storage of works, so there is a real possibility that it constitutes long-term reproduction. In the aforementioned case of "Dow Jones & Company and New York Post Holdings v. Perplexity AI," the plaintiffs argued that "Perplexity AI copied a large number of their articles without permission when constructing the retrieval-augmented database. This large-scale copying behavior in the 'input stage' itself already constitutes copyright infringement, regardless of the final output content." 11
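To make the storage step concrete, here is a minimal sketch of building such a database. It is only an illustration under stated assumptions: the hashing-based `embed` function is a toy stand-in for a real embedding model, and the file layout and record fields (`source`, `text`, `vector`) are hypothetical. The point it shows is that an index of this kind typically keeps the text chunk next to its vector on disk, so that the text can later be handed to the model.

```python
import hashlib
import json
import re

DIM = 16  # toy vector size; real embedding models use far more dimensions


def embed(text):
    """Toy hashing embedding: count each token into one of DIM buckets."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(tok.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return vec


def build_index(documents, path):
    """Convert each document to a vector and write the records to a file.

    The stored record contains the vector AND the original text chunk.
    """
    records = [
        {"source": d["source"], "text": d["text"], "vector": embed(d["text"])}
        for d in documents
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return len(records)


def load_index(path):
    """Read the stored records back, e.g. in a later session."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

In this sketch the copy written by `build_index` remains on disk until someone deletes it, which is the kind of "relatively stable storage" the paragraph above contrasts with a browser cache.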

In the real-time retrieval scenario, some argue that there is no infringement if the search engine's information processing is based on "temporary reproduction," it only plays the role of a "centralized information administrator" or an "intermediary for Internet information dissemination," and users are still taken to the original website when they click on the search results. The report "The Development of Generative Artificial Intelligence from a Copyright Perspective," released by the European Union Intellectual Property Office (EUIPO) in May 2025, pointed out that in the dynamic retrieval scenario RAG usually stores content only temporarily, which is closer to the exceptions for text and data mining or temporary reproduction. 12 However, this still depends on the specific technical implementation chosen by the large model provider. If the retrieved content is also stored locally after real-time retrieval, it may still be recognized as "long-term reproduction."
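The dependence on the "specific technical implementation" can be illustrated with a short sketch. The function below is hypothetical: `fetch` stands for whatever live search a provider uses (here replaced by a fake that makes no network call), and `cache_path` is an invented option. With `cache_path=None`, the fetched text exists only in memory while the prompt is built; with a path set, the same text is also written to a local file that outlives the request.

```python
import json


def answer_with_live_retrieval(question, fetch, cache_path=None):
    """Build an augmented prompt from freshly fetched pages.

    fetch: hypothetical callable returning a list of
           {"source": ..., "text": ...} dicts for the question.
    cache_path: if given, fetched pages are also appended to this file.
    """
    pages = fetch(question)  # at this point the content is held only in memory
    context = "\n".join(f"{p['text']} (source: {p['source']})" for p in pages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    if cache_path is not None:
        # Optional design choice: keep a lasting local copy of what was fetched.
        with open(cache_path, "a", encoding="utf-8") as f:
            for p in pages:
                f.write(json.dumps(p) + "\n")
    return prompt


def fake_fetch(question):
    """Stand-in for a real-time web search; returns fixed example data."""
    return [{"source": "https://example.com/news", "text": "Example article text."}]


# Transient use: nothing is written to disk.
print(answer_with_live_retrieval("What happened?", fake_fetch))
```

Whether a given product behaves like the first mode or the second is a factual question about its engineering, which is why the legal characterization cannot be settled in the abstract.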

V. Issues Related to Technical Protection in "Retrieval-Augmented Generation"

In Retrieval-Augmented Generation, crawling copyrighted works by bypassing IP restrictions or defeating dynamic-loading restrictions may violate the provision in the "Copyright Law" that "intentionally avoiding or destroying technical measures is prohibited." The current "Copyright Law" of China defines "technical measures" as "effective technologies, devices, or components used to prevent or restrict unauthorized browsing, appreciation of works, performances, audio-visual recordings, or the provision of works, performances, audio-visual recordings to the public through the information network."

It is worth noting that "technical measures" are not a specific exclusive right under copyright, like the right of reproduction or the right of information network dissemination. Instead, they are a means the law gives copyright holders, from the perspective of "regulating illegal acts," to protect their own rights and interests. In practice, technical measures can be divided into "access control measures" and "utilization control measures." The former prevent others from accessing works without authorization; the latter prevent others from reproducing, disseminating, or otherwise using works without authorization.

In the aforementioned dispute between CNKI and the AI retrieval platform, although some content on CNKI can be browsed publicly, CNKI has also restricted access to its literature database through technical means such as login verification. Therefore, if third-party model providers bypass the access restrictions set by CNKI to obtain academic literature when building their own retrieval databases, the question arises whether they have unlawfully circumvented "technical measures."

In the case of "Dow Jones & Company and New York Post Holdings v. Perplexity AI," the long-standing "paywall" set up by The Wall Street Journal and The New York Post is a typical "access control measure." If Perplexity AI intentionally circumvented this technical measure to crawl the plaintiffs' paid news, it may also have violated the prohibition on circumventing "technical measures." In the United States, Section 1201 of the Digital Millennium Copyright Act gives copyright holders a "dual protection system for technical measures": on the one hand, it prohibits others from directly circumventing the "access control measures" set by copyright holders; on the other hand, it prohibits others from providing tools and means to circumvent the "technical measures" of copyright holders.

VI. Issues Related to Work Utilization in "Retrieval-Augmented Generation"

In the "content integration and display" stage, the question is whether the way Retrieval-Augmented Generation uses works constitutes "direct infringement" or "indirect infringement" under copyright law. Direct copyright infringement means that a person directly performs an act covered by an exclusive right under copyright law, such as uploading infringing works to a website server and disseminating them to others. Indirect copyright infringement means that a person does not perform such an act but provides facilitating conditions or assistance for it, such as a platform intentionally using algorithmic recommendation technology to help users expand the dissemination of infringing content.

At the level of direct infringement, the content output by large models may infringe on the right of reproduction, the right of adaptation, and the right of information network dissemination. For example, in the case of The New York Times v. OpenAI, the plaintiff not only accused OpenAI of using its news content without authorization to train the GPT series of models but also claimed that the "Browse with Bing" plugin developed in cooperation with Microsoft directly quoted a large amount of content from The New York Times' Wirecutter review website in the synthesized results through real-time search, constituting copyright infringement. 13

For the distinction between the infringement determination of the right of reproduction and the right of adaptation, we can refer to the "Guidelines for the Trial of Copyright Infringement Cases by the Beijing Higher People's Court." "Using the expression of the original work in the alleged infringing work without permission but not creating a

The views in this article are solely those of the author; the 36Kr platform only provides information storage services.