How can enterprises regain control of their data in the race for AI data? Insights from Reddit v. Anthropic
As the competition for real-time data access intensifies, enterprises face an increasingly serious legal and operational challenge: web data scraping.
Data scraping began as a fringe tactic used by hobbyists. It has since grown into a complex, multibillion-yuan ecosystem driven by commercial data aggregators. Automated bots cast a wide net across publicly accessible websites, harvesting price data, product listings, reviews, and more, usually faster than any human can click "refresh." These operators now routinely sidestep traditional access barriers: rather than breaking into a platform directly, they borrow the access privileges of legitimate users to circumvent technical and contractual restrictions.
Understanding how web scraping works, and how aggregators exploit contractual workarounds, is essential for enterprises and organizations. Only then can they regain control of their data through well-constructed agreements and properly implemented and configured technology, especially application programming interfaces (APIs) and direct data licensing.
I. How do data aggregators obtain enterprise data?
1. Web scraping
On February 9, 2025, the Organisation for Economic Co-operation and Development (OECD) released a special report titled "Intellectual Property Issues of Artificial Intelligence Based on Scraped Data." The report defines "data scraping" as "the act of extracting information from third-party websites, databases, or social media platforms through automated tools." Its core processes include data collection, pre-processing, storage, and model training.
Data shows that about 70% of AI training data sets currently lack clear source licensing information. For example, more than 80% of the training data for large language models (such as GPT-3) comes from publicly available web-scraped data sets like Common Crawl. An audit of 1,800 commonly used data sets in 2023 found that some contained pirated content.
Data scraping is not inherently malicious; it can serve legitimate purposes such as academic research, digital archiving, or competitive benchmarking. By automatically collecting and organizing data scattered across different websites and platforms, it helps users break down the silos created by the Internet's distributed architecture and integrate otherwise fragmented data resources.
In fact, a successful Internet enterprise may play both roles at once: scraper and scraped. Its intelligent agents sit on both the inflow and the outflow of data. In an era of widely deployed general-purpose AI and society-wide digital transformation, web data scraping has only grown in importance.
2. End-user consent
Facing lawsuits and strong public opposition, many large data aggregators now avoid direct scraping. Instead, they take a subtler route: contracting directly with a platform's end users and asking those users to hand over access to their accounts.
For example, a financial aggregator may ask bank customers to log in to their online banking interfaces to "link accounts." Once linked, the aggregator collects transaction histories, balances, or other account data, either by scraping the bank's website with the customer's credentials or through an authorized API connection. Even though the platform itself (here, the bank) never granted permission, the aggregator's access can be treated as lawful because the customer consented.
This workaround lets aggregators sidestep many direct enforcement tools. Because they never intrude into the platform's systems, they rely on the cover of user consent and use the customer's access privileges to do what they could not lawfully do themselves. As a result, the remedies offered by traditional cybersecurity rules may be of limited use.
II. Why is it important? Risks faced by platforms and data hosts
When data scraping is put to commercial use, it raises a host of legal issues. Unauthorized scraping may breach terms of service, exceed the scope of authorized access under the Anti-Unfair Competition Law and the Regulations on the Management of Network Data Security, or infringe intellectual property rights.
Beyond legal risk, scraping also strains servers, distorts website analytics, and weakens an enterprise's ability to control or monetize its own information. What starts as a technical nuisance can quickly escalate into a commercial and legal dispute.
Unauthorized web scraping and end - user access workarounds can cause serious damage to the platform hosting the data, such as:
- Loss of control: Aggregators decide how data is stored, used, and monetized, while the platform loses control over how its proprietary or sensitive data is distributed, reformatted, or resold. If an organization's revenue depends on its data, the aggregator's copying and reuse can undermine its business model and erode the value of its content.
- Security risks and operational costs: Credential sharing (especially when aggregators scrape instead of using API access) opens cybersecurity vulnerabilities and raises the risk of breaches or unauthorized transactions. It can also drive up operating costs, overload servers, and degrade performance for legitimate users.
- Erosion of brand and trust: If an aggregator misuses data or suffers a breach, customers usually blame the original platform, even if the platform is not involved.
- Regulatory risks: In industries such as finance, healthcare, or insurance, if customer data is accessed or transmitted in a way that violates privacy laws (even indirectly), the platform may face compliance risks.
III. Lessons from the Reddit v. Anthropic case: Contracts as the new AI law
On June 4, 2025, Reddit's lawsuit against the artificial intelligence startup Anthropic shocked the tech world. Reddit accused Anthropic of illegally harvesting user data to train its AI, casting itself as a defender of users' rights and digital consent.
Reddit sued for breach of contract, trespass to chattels, tortious interference, and unfair competition. Its core allegation is that Anthropic, developer of the Claude AI models, scraped Reddit content at scale without authorization, in violation of Reddit's user agreement. This is not a typical copyright dispute; it goes to the enforceability of online terms of service and the ownership of digital public resources.
Reddit claims that Anthropic has scraped its content more than 100,000 times since July 2024 and kept scraping even after being explicitly told to stop. This raises fundamental questions about how AI companies obtain training data and what rights belong to the platforms whose content is used.
We are witnessing a shift: contract terms, rather than traditional copyright law, may become the primary legal framework governing who can use publicly available data to train AI models. AI developers will therefore need to carefully review and comply with the terms of service of the platforms they source data from. The case may also accelerate the move toward general licensing for AI data access, rather than bespoke agreements with individual companies.
Notably, Reddit announced a partnership with OpenAI in May 2024 that allows OpenAI to use Reddit content to train its AI models, and it signed a similar agreement with Google. Seen in that light, this lawsuit may be less a straightforward legal battle than a strategic move by Reddit. Litigation is often powerful leverage for forcing negotiations and redefining industry norms, and suing Anthropic may be aimed at pushing the startup into a licensing deal like the one struck with OpenAI. The case underscores the evolving role of litigation as a commercial strategy, not merely a dispute-resolution tool.
IV. Solutions: Control through API agreements and direct licensing
On June 27, 2025, China passed the revised Anti-Unfair Competition Law, which takes effect on October 15, 2025. For the first time, the revision expressly prohibits obtaining or using data held by other operators without authorization through improper means, such as circumventing technical protection measures.
In practice, if an enterprise or organization holds valuable or sensitive user data, it has very likely already been targeted by commercial data aggregators, whether it knows it or not. Even where anti-scraping measures are in place, aggregators continue to siphon data at scale through indirect access channels.
However, unfair-competition protection of data rights presupposes no particular model of legal interests, and the tests for weighing competitive means against competitive effects are difficult to apply in practice. Unfair-competition claims by the scraped party (usually the plaintiff) are therefore bound to remain a transitional option rather than a final solution.
Enterprises consequently need proactive measures to reduce the risk of commercial web scraping. By requiring aggregators to contract directly with the platform, the platform can impose restrictions, track data usage, and avoid the risks of downstream scraping or shadow access.
1. Strengthen terms of use: Guide access through API agreements, which provide a secure, structured gateway that lets third parties access specific data fields under specified conditions, with built-in security, usability, and compliance safeguards. Review terms of service and data-sharing policies to ensure they clearly prohibit unauthorized scraping and downstream use:
- Specify permitted uses and storage restrictions
- Require regular security audits and sound data-retention practices
- Prohibit sub-licensing or resale of data
- Include indemnity and enforcement clauses
- Allow termination of the contract if the terms are violated
- Ensure users clearly accept such terms
2. Evaluate access controls and use technical barriers: Assess how users share or delegate their access privileges and whether that access effectively circumvents the platform's controls. Consider technical measures that make large-scale automated access harder, including rate limiting to throttle floods of requests, bot-detection tools that analyze traffic patterns, and CAPTCHAs to distinguish human users from bots (see the sketch below).
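To make the rate-limiting idea concrete, here is a minimal sketch assuming a Python/Flask service and an in-memory fixed-window counter keyed by client IP; the endpoint name, window length, and request ceiling are illustrative only. A production deployment would typically move the counters into a shared store such as Redis and layer bot detection and CAPTCHA challenges on top.

```python
# Minimal sketch: per-client fixed-window rate limiting in Flask.
# Assumptions (not from the article): Flask app, in-memory counters, limits below.
import time
from collections import defaultdict

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

WINDOW_SECONDS = 60           # length of each counting window
MAX_REQUESTS_PER_WINDOW = 30  # illustrative ceiling per client per window

# client IP -> (window start timestamp, request count in that window)
_counters = defaultdict(lambda: (0.0, 0))

@app.before_request
def throttle_scrapers():
    """Reject clients that exceed the per-window request ceiling."""
    now = time.time()
    ip = request.remote_addr or "unknown"
    window_start, count = _counters[ip]
    if now - window_start >= WINDOW_SECONDS:
        # Start a new window for this client.
        _counters[ip] = (now, 1)
        return None
    if count + 1 > MAX_REQUESTS_PER_WINDOW:
        # 429 Too Many Requests tells automated clients to back off.
        abort(429, description="Rate limit exceeded")
    _counters[ip] = (window_start, count + 1)
    return None

@app.route("/listings")
def listings():
    # Placeholder endpoint standing in for the data a scraper would target.
    return jsonify({"items": []})
```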
3. Control potential data leaks: Consider an API licensing model that preserves the platform's security, business model, and legal rights while still offering structured access. This includes restricting access to high-value data, closing off leakage through unauthenticated APIs, and delaying the loading of key content where appropriate (a minimal sketch follows).
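As one possible sketch of such an API licensing model (again Python/Flask; the partner keys, tiers, field names, and endpoint are hypothetical), the snippet below authenticates callers with an API key and returns only the fields covered by each partner's licensing tier, so high-value or internal fields are never exposed through an unauthenticated route. A real deployment would back the key registry with a database and signed tokens rather than static keys.

```python
# Minimal sketch: authenticated, field-scoped data API tied to licensing tiers.
# All keys, tiers, and fields below are hypothetical examples.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# Each licensed partner gets a key tied to the fields its agreement permits.
LICENSED_PARTNERS = {
    "partner-key-123": {"fields": {"product_id", "price"}},                 # basic tier
    "partner-key-456": {"fields": {"product_id", "price", "reviews"}},      # full tier
}

# Full record, including high-value fields never exposed without a license.
CATALOG = [
    {"product_id": "A1", "price": 19.99, "reviews": ["great"], "margin": 0.42},
]

@app.route("/v1/catalog")
def catalog():
    license_ = LICENSED_PARTNERS.get(request.headers.get("X-API-Key", ""))
    if license_ is None:
        # Unauthenticated callers get nothing, closing off silent data leakage.
        abort(401, description="A data-licensing agreement and API key are required")
    allowed = license_["fields"]
    # Return only the fields this partner's agreement covers; internal fields
    # such as "margin" are filtered out for every tier.
    return jsonify([
        {k: v for k, v in item.items() if k in allowed}
        for item in CATALOG
    ])
```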
4. Actively safeguard rights: Web scrapers typically argue that "the scraped party holds a data monopoly, the parties are not competitors, there was no subjective malice in the collection, no damage occurred, and the data rights have never been legally established." Once scraping is detected, and before issuing a cease-and-desist notice, a deletion demand, or a breach-of-contract claim, enterprises should therefore consult legal counsel on the lawful and proportionate remedies available, so as to avoid unnecessary legal and public-relations crises.
References:
https://www.jdsupra.com/legalnews/web-scraping-and-the-rise-of-data-5313726/
https://sghexport.shobserver.com/html/baijiahao/2025/07/29/1617514.html
https://www.lexology.com/library/detail.aspx?g=dfd6f12d-8ad4-4725-8e35-cb1cfca7acd7
https://opentools.ai/news/reddit-vs-anthropic-the-ai-data-showdown-of-2025
https://natlawreview.com/article/beyond-copyright-reddits-lawsuit-against-anthropic
This article is from the WeChat official account "Internet Law Review." Author: Internet Law Review. Republished by 36Kr with permission.