The "Exit Mechanism" for AI Copyright: An Unfulfillable Promise
Editor's Note
With the rapid development of generative artificial intelligence, how to lawfully and compliantly use massive amounts of data to train models has become a core controversy. Against this backdrop, a solution called the "opt-out mechanism" has been proposed: AI companies may use all publicly available data by default, excluding a work only when its copyright owner actively objects. This mechanism appears to give creators a choice, but a closer analysis reveals that it is more like a "false choice" that cannot be meaningfully exercised against the technological tide.
Its "falsity" is rooted in several inherent trends in AI development: Data scraping has expanded from simple web crawlers to real-world capture devices such as smart glasses, rendering traditional "opt-out" methods based on URLs or metadata completely ineffective; The one-time, forward-looking nature of model training and the reuse of synthetic data mean that post-hoc "opt-out" cannot erase the contribution of the work in the previous training, which is actually an acquiescence to historical infringement; More importantly, this mechanism shifts the "authorization" responsibility that should be borne by AI companies back to a large number of scattered individual creators, requiring them to complete an "impossible task" of tracking countless AI systems.
Today, as artificial intelligence penetrates ever more industries, a governance framework that truly respects the sources of innovation and encourages fair cooperation should not be built on such a fragile foundation. Promoting transparent authorization and cooperation premised on opting in, rather than relying on an ineffective opt-out, is the right way to guide the healthy and sustainable development of the AI industry.
I. A Shaky Legal Foundation: "Opt-Out" Subverts the Essence of Copyright Authorization
According to the Copyright Law, the copyright owner has the exclusive right to use their copyrighted works and to authorize others to use them; this is, in essence, an opt-in mechanism. Unless an applicable legal exception (such as fair use) permits use without the copyright owner's permission, users must obtain prior authorization from the copyright owner (i.e., the owner must opt in) before using the work.
Usually, authorization exists in the form of a license agreement between the copyright owner and the user. The responsibility for seeking permission should fall on the user, and then the copyright owner decides whether to grant the user permission to use the work. Therefore, as copyright users, according to the law, AI companies must obtain the permission of the copyright owner before using copyrighted works.
However, many AI companies have not done so.
In effect, the opt-out mechanism proposed by AI companies grants them the right to use copyrighted works at any time and in any way unless and until the owner affirmatively objects. In other words, AI companies want to subvert the voluntary, exclusive nature of copyright because seeking permission is inconvenient for them.
II. Difficult to Cross the Technological Gap: Existing Tools Cannot Prevent the "Invisible" Use of Works
There are a wide variety of AI models and systems, and it is simply impossible for ordinary copyright owners to understand the opt-out mechanism of every one of them. Requiring copyright owners to identify all opt-out mechanisms and enable the opt-out for each work in their catalog is an enormous burden, especially for prolific creators. As Ed Newton-Rex has put it, if you wanted to design a system that most people would (intentionally or not) fail to use, you would run an opt-out mechanism.
Even if copyright owners take opt-out measures to prevent future generative AI from scraping their copyrighted works, a complete opt-out is unachievable: copyrighted works usually exist in many locations across the Internet, and owners cannot realistically attach opt-out identifiers to every copy.
For example, a song may be streamed on digital platforms while downstream derivative works adapt, transform, and recreate it, making it extremely difficult to apply opt-out measures to the original work. In most cases, copyright owners cannot opt out in a way that correctly attaches the opt-out signal to every downstream use, so they cannot prevent generative AI from scraping and using the work. The ubiquity with which copyrighted works are enjoyed and shared online shows how impractical it is to apply an opt-out mechanism even to lawful copies.
In addition, it is almost impossible for copyright owners to opt out of illegally copied works. It is well known that AI companies scrape, copy, and use pirated creative works obtained from illegal sources to train their general AI models. Even if copyright owners could opt out of every legal copy of their works, the pirated copies would still exist and carry no opt-out option. Unless AI companies change course, they will continue to scrape these pirated works from illegal websites even after copyright owners opt out. The opt-out mechanism cannot solve or even alleviate any of these problems, and it certainly cannot give copyright owners any real control.
III. Numerous Realistic Dilemmas: Massive Systems and Derivative Copies Render Opt-Out Meaningless
Currently available technical tools and those under development can theoretically help copyright owners prevent AI robots and crawlers from accessing and scraping their copyrighted works. However, as detailed below, these existing technical tools have significant limitations for the following reasons:
(i) They can only be effective if the user opt-out mechanism is recognized, respected, and not circumvented;
(ii) These tools were not created to solve the problem of AI data scraping, so they may do more harm than good in actual use.
In fact, in many cases, the robots and crawlers deployed by AI companies, developers, and other users often bypass or ignore these technical tools. Take Common Crawl as an example. It often ignores and bypasses paywall mechanisms and other technical tools, scrapes entire websites containing copyrighted works, and archives them for large AI companies to use in training AI models.
The robots.txt protocol is one of the technical tools most often mentioned in discussions of the opt-out mechanism. Although robots.txt can signal to web crawlers not to scrape a site, its effect is very limited. Part of the reason is that it only works if crawlers recognize and respect it. Another major problem is that robots.txt was originally designed to manage search-engine indexing, not to block crawlers gathering data for generative AI (GAI) training. A site owner can in principle write per-crawler rules, but doing so requires knowing and tracking the user-agent string of every AI crawler, while blocking crawlers wholesale also blocks search engines from indexing the site. Most copyright owners do not want their works scraped for AI training, yet they do want their works indexed by search engines, so that the works can be found online and the owners can profit from their creativity. As commonly deployed, robots.txt cannot reliably separate these two cases. A copyright owner who tries to stop AI scraping by blocking all crawlers effectively removes their entire online presence from Internet search, which would likely ruin their business.
Another limitation of robots.txt is that it does not target the copyrighted works themselves but operates at the URL or website level. This means that even if web crawlers or scraping tools comply with the robots.txt rules of a specific website, they cannot prevent the scraping or use of copies of copyrighted works that exist elsewhere on the Internet for AI purposes. For example, if copies of copyrighted works exist on pirate websites that are not under the control of the copyright owner and these websites do not use robots.txt, these copies will still end up in the training set.
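The per-agent, per-site, and purely voluntary character of robots.txt can be illustrated with a minimal sketch using Python's standard-library `urllib.robotparser` (the crawler names and URL below are illustrative). A compliant crawler checks the rules before fetching; nothing in the protocol forces a non-compliant scraper to perform this check at all.

```python
from urllib import robotparser

# A hypothetical robots.txt: block one AI crawler by its user-agent
# string while leaving a search-engine crawler free to index the site.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler asks before fetching; a scraper that ignores
# robots.txt simply never makes this call.
print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that even full compliance here only governs this one site: identical copies of the same work hosted elsewhere, on sites with no such rules, remain freely scrapable, which is exactly the URL-level limitation described above.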
It is worth noting that although the industry is collaborating on better tools aimed specifically at AI-bot scraping, these solutions are still in their infancy and address only a small part of the larger problem: copyright owners' ability to effectively enforce and protect their rights in the digital environment.
IV. Imbalanced Interest Scale: "Binary Choice" Kills Cooperation
Most opt-out mechanisms are limited by their inherent binary nature: the work can either be used or not used. Unless the copyright owner and the AI company reach an agreement on the use of the creative work, there is no opportunity for the two parties to negotiate the terms of use.
However, under current copyright law, the parties can already achieve this through a license agreement. Industry-led technological solutions are being developed and discussed to build a more flexible mechanism that ties the opt-out to other terms, such as payment. In essence, though, that makes the mechanism no longer an opt-out at all but simply a license agreement.
As shown on our AI licensing webpage, the AI copyright licensing market is booming, with many creative solutions, partnerships, and agreements emerging between the AI and creative industries. The licensing market has also given rise to a number of small and medium-sized AI companies that build their businesses entirely on commitments, partnerships, and license agreements reached with copyright owners within the framework of copyright law. Opt-in, license-based arrangements have led to an increase, not a decrease, in partnerships between the AI and creative industries. By contrast, the opt-out mechanism stifles innovation because it assumes that AI training is necessarily a zero-sum game, killing creativity while also hindering AI innovation itself.
V. The Opt-Out Mechanism Violates International Treaty Obligations
Enshrining an opt-out mechanism in law for general artificial intelligence, especially as part of a copyright exception, may violate the international treaty obligations set out in the Berne Convention.
The Berne Convention is a foundational international copyright treaty with 182 contracting parties. Article 5(2) of the convention provides that the enjoyment and exercise of copyright shall not be subject to any formality. An opt-out mechanism is precisely such a formality: under it, especially when implemented as part of a GAI-related copyright exception, copyright owners can exercise and enjoy their exclusive rights only after fulfilling the impermissible formal obligation of opting out.
The copyright system of any country should not allow such a mechanism to exist.
VI. Strictly Enforce Transparency and Accountability Mechanisms, Otherwise the Opt-Out Mechanism Is Invalid
Whatever utility the opt-out mechanism may have in the collection of training data for general artificial intelligence, it will be completely ineffective without corresponding transparency standards and obligations to enforce the opt-out and hold AI companies accountable.
AI companies that offer an opt-out mechanism currently owe copyright owners no real obligation to ensure that these systems operate effectively. Imposing transparency obligations on AI companies is crucial to an AI ecosystem that is developed and used responsibly, respectfully, and ethically. If AI companies offer opt-out mechanisms in order to show that their models are developed and used in this way and that rights are respected, they should not refuse to disclose the copyrighted works they use for training. Transparency measures ensure that any opt-out mechanism, whether offered voluntarily or required by legislation, actually respects the rights of creators and copyright owners.
For the above reasons, the opt-out mechanism should not be mandatory. However, a voluntary notification mechanism may work, allowing copyright owners to notify AI companies that they do not want the company to use their works. When copyright owners notify AI companies (or other users) that their works shall not be used for general artificial intelligence (GAI) training, regardless of the form of notification, AI companies (and users) must respect these objections. If an AI company ignores the opt-out notice and scrapes, imports, or otherwise uses copyrighted works in violation of the notice, the AI company or developer shall be liable for intentional infringement and shall be subject to higher damages under the Copyright Law.
The Way Out Lies in Authorization: From "False Opt-Out" to "Transparent Licensing"
Ed Newton-Rex, a former executive of Stability AI and the current CEO of Fairly Trained, listed ten reasons why the opt-out mechanism offers only a false "choice" (see Annex). In short, opt-out mechanisms do not work. They undermine the basic principles of copyright law and suppress real creativity and innovation in both the creative and technological fields.
Opting out is not the solution. The real solution lies in respecting the rights of creators and copyright owners, including whether and how they choose to exercise those rights. Rather than eroding these rights as the opt-out mechanism does, copyright licensing encourages the development and training of generative AI models on authorized works. Licensing, not opting out, is the best way to ensure that the AI industry prospers and complements the creative economy.
Annex:
Ed Newton-Rex, a former executive of Stability AI and the current CEO of Fairly Trained, in his article "The Insurmountable Problems of Generative AI Opt-Out", set out ten reasons supporting the same view as the author of this article:
1. Unable to Control Derivative Copies of Works: The opt-out mechanism based on URLs or metadata is completely ineffective for derivative copies (such as social media screenshots, advertisements embedded with works, etc.) that are scattered all over the Internet and are not under the control of the creator.
2. Most People Will Miss the Opt-Out Opportunity: Empirical data shows that the usage rate of the opt-out mechanism is extremely low, not because creators agree, but because they simply do not know about the existence of the opt-out mechanism or miss the short application window.
3. The "Black-and-White" Binary Choice Damages Rights: The opt-out mechanism is binary (allow or prohibit), which forces copyright holders to be unable to distinguish between "allowing AI search indexing" and "allowing AI to train models". If they choose to opt out of training, it may mean that their content cannot be found by search engines, thus damaging their traffic and income.
4. Emerging Technologies Render the Opt-Out Mechanism Ineffective: Devices such as smart glasses capture data from the real world for training. Such data never passes through a web crawler and carries no metadata to which an opt-out signal could be attached, making any existing opt-out mechanism meaningless.
5. The Rapid Turnover of Web Crawlers Is Hard to Keep Up With: AI companies are constantly launching new crawler tools, and copyright holders struggle to keep pace. There is always a lag, so they cannot promptly block every company they do not want using their data.
6. Implicitly Pardons Historical Infringement and Continues to Benefit: The opt-out mechanism is usually introduced after the AI model training is completed, which is equivalent to acquiescing to the previous unauthorized use. Moreover, even if AI companies no longer directly use the opted-out works in the future, they can still use the synthetic data generated based on these works to train new models, thus continuing to benefit.
7. The Administrative Burden of Opting Out of All Works Is Huge: Creators' works are usually scattered on many platforms over a long period of time. Performing the opt-out operation for each work one by one is an extremely heavy task that is almost impossible to complete.
8. The Opt-Out Deadline Puts Undue Pressure on Copyright Holders: Opt-out mechanisms usually have a deadline, forcing copyright holders to make hasty decisions on incomplete information. If they miss the window, their works are treated as available for use until the next training cycle.
9. Difficult to Understand the Consequences of Opt-Out, Reducing the Willingness to Use: Due to the lack of understanding of the consequences of the opt-out mechanism (such as the impact on search engine rankings), many copyright holders will hesitate, thus reducing the opt-out rate.
10. More Unfair to Small Creators: Small creators and individual artists with limited resources are even less able to track and deal with various opt-out mechanisms, putting them in a more disadvantaged position.
This article is from the WeChat official account "Internet Law Review", author: Rachel King, published by 36Kr with authorization.