
Cloudflare, the "Cyber Bodhisattva," Is Now the Sternest Father of AI Crawlers

GeekPark 2026-07-05 12:07
The biggest "security guard" of the internet is now poised to become the biggest "cashier".

On July 1st, Cloudflare published a blog post with a rather mild title, "Your Website, Your Rules." The content was anything but mild: starting September 15th, all websites using Cloudflare will block mixed-purpose AI crawlers by default. As long as there are ads on your page, AI training crawlers and Agent crawlers will not be able to access it unless you manually allow them in the backend.

Note this logical reversal: previously, it was "allowed by default, and you could choose to block," but now it's "blocked by default, and you can choose to allow."

This is the first time the Internet infrastructure layer has systematically "legislated" on the way AI acquires data.

The background to this decision is a landmark event: bot traffic on the Internet has exceeded human traffic.

Cloudflare CEO Matthew Prince said that this milestone came earlier than everyone expected. It was originally predicted to happen in 2027. In other words, for most web pages you open today, the main "viewers" are not humans but machines.

How AI traffic is regulated may determine the future of every website, as well as the trajectory of the Internet's gatekeeper, Cloudflare itself.

01 The Strictest "Crawler Policy"

According to the official introduction, Cloudflare has divided AI crawlers into three categories.

The first category is called "Search." These are traditional crawlers that build indexes for search services, like what Google has been doing for over two decades.

The second category is called "Agent." These are AI agents that visit web pages on behalf of users in real time. For example, when you ask ChatGPT to look up information or fill out a form for you, there's an Agent crawler behind it doing the work.

The third category is called "Training." These are crawlers that scrape content on a large scale for model training.

The three categories are labeled separately, and website owners can set "allow" or "block" for each one. Do you want search engines to find you? That's fine. Do you want AI agents to help your users look up information? That's fine too. But if you don't want AI companies to use your content for free to train their models, you can turn off the Training category on its own.

This classification itself is like a knife, stabbing directly at Google.

Google's Googlebot is a typical "mixed crawler": it simultaneously builds indexes for Google Search and collects data for Google's AI features (such as AI Overviews). Google does offer a tool called Google-Extended that allows websites to opt out of AI training. The problem is that the core Googlebot crawler itself still collects data for the AI features built into the search engine.

The data requirements for search and AI have never been truly separated in Google's architecture.

What does this mean? Cloudflare's data makes it clear: because websites want to remain visible in Google Search, they have to allow Googlebot in. Once Googlebot is in, the data for AI training is also taken away. As a result, Google has about twice as much access to web content as other AI companies.

Cloudflare has also added a "strictest rule prevails" principle this time. If a crawler performs both search and training functions, all applicable rules take effect at the same time, and the strictest one wins. That is to say, as long as you choose to block Training crawlers, mixed crawlers like Googlebot, Applebot, and Bingbot will all be blocked.
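The "strictest rule prevails" logic can be sketched in a few lines of Python. This is an illustration only, not Cloudflare's implementation: the `Category` enum, the `is_allowed` function, and the example policy are hypothetical names invented for this sketch.

```python
from enum import Enum


class Category(Enum):
    """The three crawler categories described in the article."""
    SEARCH = "search"
    AGENT = "agent"
    TRAINING = "training"


def is_allowed(purposes: set, policy: dict) -> bool:
    """Return True only if every purpose the crawler serves is allowed.

    A single blocked purpose blocks the whole crawler, which is one way
    to read "the strictest rule prevails". A crawler that declares no
    purpose at all is treated as blocked (an assumption of this sketch).
    """
    return bool(purposes) and all(policy.get(p, False) for p in purposes)


# Hypothetical site policy: welcome search and agents, refuse training.
policy = {
    Category.SEARCH: True,
    Category.AGENT: True,
    Category.TRAINING: False,
}

search_only_bot = {Category.SEARCH}
mixed_bot = {Category.SEARCH, Category.TRAINING}

print(is_allowed(search_only_bot, policy))  # search-only crawler
print(is_allowed(mixed_bot, policy))        # search + training crawler
```

Under this policy the search-only crawler gets through, while the mixed crawler is refused because one of its purposes is blocked.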

This "knife" cuts at the bundling: if you want to appear in search, you have to accept being used for AI training. Cloudflare says this bundling is unfair and that the two must be separated.

One set of figures shows how badly the old "social contract" has broken down. Cloudflare has published the crawl-to-referral ratios of various AI companies: Google's is about 14:1, meaning that for every 14 pages crawled there is 1 referral click. OpenAI's is 1,700:1, and Anthropic's is 73,000:1.
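As a rough illustration of what these ratios imply, the sketch below converts each crawl-to-referral ratio quoted above into expected referral clicks per one million pages crawled. The ratios are the ones cited in this article; the one-million baseline and the function name are assumptions chosen for readability.

```python
# Pages crawled per referral click, as quoted in the article.
RATIOS = {
    "Google": 14,
    "OpenAI": 1_700,
    "Anthropic": 73_000,
}


def referrals_per_million(pages_per_referral: int) -> int:
    """Expected referral clicks for every 1,000,000 pages crawled, rounded."""
    return round(1_000_000 / pages_per_referral)


for company, ratio in RATIOS.items():
    clicks = referrals_per_million(ratio)
    print(f"{company}: about {clicks:,} referral clicks per million pages crawled")
```

On these figures, a million crawled pages would send back roughly 71,000 visits from Google, fewer than 600 from OpenAI, and about 14 from Anthropic.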

In the era of search engines, the deal was "I crawl your content, and you get traffic." In the AI era, this equation no longer adds up.

02 From "Security Guard" to "Cashier"

If Cloudflare only helps website owners block AI crawlers, the significance of this would end at "defense." But Cloudflare clearly isn't satisfied with just being a security guard.

In July last year, Cloudflare launched "Pay Per Crawl," which charges AI companies based on the number of crawls. This year, it upgraded the model to "Pay Per Use." The difference is that instead of a charge being levied every time a crawler visits, publishers receive payment only when their content actually generates value in an AI system, such as being used to generate an answer or appearing in an AI search result.
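The difference between the two billing models can be shown with a toy calculation. Everything in this sketch is hypothetical: the `Event` type, the event kinds, and the prices are invented for illustration and say nothing about Cloudflare's actual rates or APIs.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str  # "crawl": a bot fetched the page; "use": the content surfaced in an AI answer
    url: str


# Hypothetical prices in cents, chosen only to make the arithmetic visible.
PRICE_PER_CRAWL_CENTS = 1
PRICE_PER_USE_CENTS = 5


def pay_per_crawl(events: list) -> int:
    """Bill every fetch, whether or not the content is ever used."""
    return sum(PRICE_PER_CRAWL_CENTS for e in events if e.kind == "crawl")


def pay_per_use(events: list) -> int:
    """Bill only when the content actually appears in an AI result."""
    return sum(PRICE_PER_USE_CENTS for e in events if e.kind == "use")


# A page that is crawled three times but surfaces in an answer only once.
events = [
    Event("crawl", "https://example.com/post"),
    Event("crawl", "https://example.com/post"),
    Event("crawl", "https://example.com/post"),
    Event("use", "https://example.com/post"),
]

print(pay_per_crawl(events), "cents under the per-crawl model")
print(pay_per_use(events), "cents under the per-use model")
```

Which model pays more depends entirely on the prices and on how often crawled content is actually used, and the article gives no figures for either.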

The ambition behind this shift from "charging per crawl" to "charging per value" is considerable. It means that Cloudflare wants to build not a wall but a market.

The initial partners are two AI search companies, Ceramic.ai and You.com. Publishers who opt in will receive payment when their content appears in Ceramic's AI search results or is accessed by You.com's Agent. Big publishers have shown their support: the CEO of Condé Nast called it a "game-changer," and the co-founder of Reddit said "the entire ecosystem will benefit."

It sounds like a perfect story. But I think it's necessary to mention a less-than-perfect detail.

In March this year, Cloudflare itself released a crawler API. You give it a URL, and it can scrape an entire website at once, returning HTML, Markdown, or structured JSON. This made some publishers quite uneasy: the company that has always helped them block crawlers had built a crawler of its own?

Even more embarrassing, when some publishers tried to block Cloudflare's own crawler, they found that the settings didn't work. Although Cloudflare later fixed the problem, the jibe had already spread online: "We protect websites from crawlers... unless it's our own crawler."

Cloudflare's explanation is that its crawler is a "compliant crawler." It respects robots.txt and follows its own AI Crawl Control rules. If website owners choose to block AI crawlers, Cloudflare's own crawler will also be blocked. In the words of a developer, this is a strategy of "hedging bets to always win."
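For readers unfamiliar with robots.txt, the sketch below uses Python's standard `urllib.robotparser` module to show how a compliant crawler might check a site's rules before fetching a page. The bot names `ExampleTrainingBot` and `ExampleSearchBot`, the `may_fetch` helper, and the example.com URL are hypothetical; this does not depict Cloudflare's crawler or its AI Crawl Control rules.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: refuse one named bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Parse the rules above and ask whether this agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleTrainingBot", "https://example.com/post"))
print(may_fetch("ExampleSearchBot", "https://example.com/post"))
```

The first check is refused and the second is permitted. Note that robots.txt is only a request: it works when the crawler chooses to honor it, which is why blocking at the infrastructure layer is a different kind of control.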

This raises a fundamental question: Is Cloudflare a neutral infrastructure referee or a new kind of middleman?

The answer may be the latter.

It plays three roles simultaneously: rule-maker (defining three types of crawlers), rule-enforcer (blocking crawlers at the infrastructure layer), and market participant (operating its own crawler and content trading platform).

This doesn't mean that what it does has no value. Pulling AI crawlers from "disordered plunder" into a framework of "clear classification and required permission" is indeed a step forward. But it would be naive to regard Cloudflare as the "savior" of content creators.

It is building an "AI content tax station" centered around itself.

03 Can Ordinary People Get a Share of the Pie?

This may be the most sobering part of the whole thing.

Condé Nast, Dotdash Meredith, Reddit: those who have come out in support of Cloudflare are all large publishers and platforms. They have content scale, legal teams, and bargaining chips. These companies can sign licensing agreements with AI companies without Cloudflare; in fact, more than 50 large-scale content licensing agreements have been signed globally in the past year. For them, Cloudflare is an additional tool, not the only way out.

But what about individual bloggers? What about an independent developer writing technical tutorials on WordPress? What about an independent writer publishing in-depth analyses on WeChat Official Accounts?

Theoretically, Cloudflare's infrastructure allows small content owners to set permissions and get compensation without having to negotiate with each AI company individually. But the word "theoretically" is the key. So far, Pay Per Use has only two partners, Ceramic.ai and You.com, both of which are small players. None of the big companies that consume content on a large scale, such as OpenAI, Google, and Anthropic, has joined.

And there is a more practical contradiction: for small creators, exposure itself is the scarcest resource. Blocking AI crawlers may mean fewer chances of being discovered. When big media block crawlers, Google Search will still index them; but when small blogs block crawlers, they may really disappear in the noise of the Internet.

Here is a set of figures that is even more eye-opening.

The referral traffic brought by AI chatbots is about 96% lower than that of traditional search. The probability that users click on a citation source in an AI answer is only about 1%. Publishers have lost between 20% and 90% of their traffic and revenue to AI search features in the past year. A study found that Google's AI Overviews have reduced the click-through rate of external links by about 40%.

This means that even if Pay Per Use is fully rolled out, the payments may be far from enough to make up for the advertising revenue that publishers have already lost. This is less a major change than an attempt to stop the bleeding, and it may not even succeed.

Cloudflare reports that more than 50% of AI crawler traffic is spent repeatedly scraping pages that have not changed. Solving this inefficiency is indeed valuable. But solving the efficiency problem and making sure creators actually earn money are two different things.

04 Even the "Bodhisattva" Has Its Own Temple

Cloudflare has always been praised by users as the "Cyber Bodhisattva" because it is indeed doing something valuable: bringing the data plunder of the AI era into the open and forcing AI companies to state clearly "what I want your data for." On an Internet where bot traffic has exceeded human traffic, someone willing to stand up and say "there must be rules" deserves credit.

But even the "Bodhisattva" has its own temple.

Cloudflare manages about 20% of global network traffic. This number is both large and not large enough. The other 80% of websites are not under its protection. AI companies can simply shift the focus of their data collection to non-Cloudflare sites.

Google and Apple already offer formal opt-out tools for their crawlers, which may be used to sidestep Cloudflare's blocking. The UK Competition and Markets Authority (CMA) is putting pressure on Google from a regulatory perspective, requiring it to allow publishers to opt out of AI training without affecting their search rankings.

The policy of an infrastructure company won't bring an end to this redistribution of content rights.

But it reveals a deeper trend: the "toll booths" on the Internet are shifting from search engines to the infrastructure layer.

Over the past two decades, Google was the one standing in the middle of the road, deciding who could be seen. Now Cloudflare wants to set up a barrier at a more fundamental level: if you want to pass, first state what you're here for, and then follow the rules.

The toll booth has changed. But the one collecting the toll may not have changed.

This article is from the WeChat official account “GeekPark” (ID: geekpark), author: Jing Yu. Republished by 36Kr with permission.