Rust takes the heat: 53 days after the rewrite, Cloudflare made its biggest mistake in six years, taking ChatGPT and Claude offline together.
Half of the internet is down again.
Cloudflare has just suffered a multi-hour outage that knocked many popular websites and AI services offline. According to reports, the disruption lasted about five and a half hours. OpenAI's ChatGPT and Sora were among the affected applications, and the websites of Claude, Shopify, and the public transportation system in New Jersey, USA, also experienced glitches.
Mysterious traffic surge leads to widespread outage
According to foreign media reports, Cloudflare first noticed abnormal traffic on its platform around 5:20 a.m. Eastern Time on November 18. About an hour and a half later, the company posted an update on its status page to inform customers of the outage, which showed up as error messages and increased latency. "Cloudflare's internal services are experiencing issues, and some services may be intermittently affected," Cloudflare said in an announcement published shortly before 7 a.m. Eastern Time.
The outage did not affect only the CDN service for websites. It also hit Cloudflare's application services product suite, which provides CDN capabilities for cloud and on-premises workloads and protects those workloads' application programming interfaces from malicious traffic.
Cloudflare noted in a blog post this July that about 20% of the world's websites rely on it to manage and protect their traffic. According to DownDetector, the outage affected many services, including X, Spotify, OpenAI's ChatGPT, Trump's social media site Truth Social, the online design platform Canva, and the movie-rating app Letterboxd. Even DownDetector's own website was briefly affected.
The outage also affected at least two other services. While troubleshooting, Cloudflare engineers shut down the WARP virtual private network (VPN) service in the London area, and some users were unable to use the company's Cloudflare Access zero-trust network access (ZTNA) tool. The ZTNA product is similar to a VPN but offers better security and performance.
At 8:09 a.m. Eastern Time on November 18, the company said the problem "had been identified and a fix was being implemented," but recovery was not smooth. Around 8:13 a.m., Cloudflare restarted the WARP service in the London area. According to Cloudflare, the control panel service was restored at 9:34 a.m., and at 9:42 a.m. the company announced on its status page that engineers had fixed the root cause of the outage. Over the following hours, Cloudflare kept monitoring the recovery and "looked for ways to accelerate the full recovery." The disruption finally ended at 11:44 a.m.
A Cloudflare spokesperson confirmed to foreign media that, before issuing its first status update, the company had noticed "an abnormal traffic surge in one of its services" that "caused errors in some traffic flowing through the Cloudflare network." "We mobilized all our resources to ensure all traffic is served without errors; we will then focus on investigating the cause of the abnormal traffic surge," Cloudflare said in a statement.
Notably, some users on X commented that "Cloudflare's Rust-rewritten version didn't stand the test of time." On September 26, Cloudflare had rewritten its core code in the "memory-safe" Rust language, saying that thanks to Rust's features the refactored version was "faster and more secure."
Cloudflare's outage report specifically pointed out the line of Rust code that caused this outage.
"A single line of Rust code crashed, paralyzing half of the world's traffic." Many people think that those who have written in Rust know that using unwrap casually is not a good habit. Some also pointed out, "unwrap only fails when there is a problem with the configuration file."
Someone claiming to have "a friend who works at Cloudflare" said, "The outage happened because an engineer tried to modify an old configuration file and deleted a bunch of seemingly outdated lines. It turned out those lines were what kept the routing system stable. Once the configuration file was deployed, half of the monitoring systems went red and alarmed, and the whole network started showing anomalies that even their internal documentation couldn't fully explain. To fix it, they had to dig out a long-forgotten backup, roll back a series of automatic reload operations, and figure out how to bring a thoroughly confused server cluster back to normal operation."
The same person added, "At the time the Cloudflare office was littered with Red Bull cans. Everyone was quietly panicking, and a senior developer kept repeating, 'Don't touch anything.'"
Official disclosure: The root cause of the outage
Cloudflare operates a content delivery network (CDN) that about 20% of the world's websites rely on. This platform works by creating multiple copies of website content and distributing them across data centers around the world. When a user visits a web page, Cloudflare loads the content from the data center closest to the user. The company said that this architecture can provide a latency of 50 milliseconds or less for 95% of the world's population.
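As a rough illustration of the "load from the closest copy" idea (Cloudflare's real steering relies on anycast routing rather than per-user latency probes, and the cities and latencies below are invented), a nearest-data-center selection might look like this:

```rust
// A toy illustration of "serve from the nearest copy"; not Cloudflare's
// routing logic, and the candidate data centers are made up.
struct DataCenter {
    city: &'static str,
    latency_ms: u32, // hypothetical round-trip time from this user
}

// Pick the candidate data center with the lowest latency to the user.
fn nearest(centers: &[DataCenter]) -> Option<&DataCenter> {
    centers.iter().min_by_key(|dc| dc.latency_ms)
}

fn main() {
    let candidates = [
        DataCenter { city: "Frankfurt", latency_ms: 18 },
        DataCenter { city: "London", latency_ms: 9 },
        DataCenter { city: "Newark", latency_ms: 82 },
    ];
    if let Some(dc) = nearest(&candidates) {
        println!("serving the cached copy from {} ({} ms)", dc.city, dc.latency_ms);
    }
}
```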
In addition to improving website speed, Cloudflare's platform has other uses. Offloading traffic-handling tasks to the CDN can reduce the server load on website operators, thereby improving operational efficiency. Moreover, Cloudflare also provides network security features that can filter out malicious bots and other threats.
Regarding the cause of the traffic surge, Cloudflare Chief Technology Officer Dane Knecht said that night in a post on X that the outage was triggered by the company's own malicious-bot filtering function, not by an attack. "A latent bug in a service our bot protection relies on began to crash after a routine configuration change, leading to a broad degradation of our network and other services," he said.
Meanwhile, a Cloudflare spokesperson gave foreign media a more detailed update: "The root cause of this outage was an automatically generated threat-traffic management configuration file whose number of entries exceeded the expected size, causing the software system that handles traffic for a number of Cloudflare services to crash." The spokesperson added, "To be clear, there is currently no evidence that this was caused by an attack or malicious activity. We expect traffic to spike naturally after the incident and some Cloudflare services to see brief degradation, but all services should return to normal within the next few hours."
In a subsequent blog post, Cloudflare walked through the full course of the failure, the affected systems, and the handling procedures: "The issue was triggered by a permission change in one of our database systems, which caused the database to output multiple entries into a feature file used by the bot management system. That feature file then doubled in size. The larger-than-expected feature file was propagated to all the machines that make up our network. The traffic-routing software running on these machines reads the feature file to keep the bot management system responsive to changing threats. The software has a limit on the size of the feature file, and because the file had doubled, that limit was exceeded, causing the software to fail."
Specifically, the "bot management" module was the root cause of this outage. It is reported that Cloudflare's bot management module consists of multiple systems. One of the machine - learning models generates a bot score for each request flowing through its network. Customers use these scores to decide whether to allow specific bots to access their websites. The input data for this model is a "feature" configuration file. This feature file is updated every few minutes and synchronized across the entire network so that it can adapt to changes in internet traffic.
It was a change in underlying ClickHouse query behavior that caused a large number of duplicate "feature" rows to appear in the generated file. This changed the size of the previously fixed-size feature configuration file, triggering an error in the bot module. As a result, the core proxy system responsible for handling customer traffic returned HTTP 5xx error codes for all traffic that relied on the bot module. The problem also affected the Workers KV and Access services, which depend on the core proxy.
The change was meant to let all users obtain accurate metadata for the tables they can access. The problem was that the existing code carried an implicit assumption: the column list returned by such queries would only contain entries from the default database, so the query did not filter by database name. As the explicit permission was gradually rolled out to users on the target ClickHouse cluster, those queries began returning "duplicate" columns originating from the underlying tables stored in the r0 database. Unfortunately, the bot management module's feature-file generation logic builds each input "feature" in the file mentioned above from exactly these queries.
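To illustrate the effect of that assumption, here is a hypothetical sketch (the table and column names are invented, and this is not Cloudflare's pipeline code): once the same columns become visible through both the default database and the underlying r0 store, a query that does not filter on the database name returns twice as many rows, and each extra row becomes an extra feature.

```rust
// Column metadata rows as the query now sees them: the same logical columns
// appear once under the "default" database and again under the r0 store.
struct ColumnMeta {
    database: &'static str,
    table: &'static str,
    column: &'static str,
}

fn visible_columns() -> Vec<ColumnMeta> {
    vec![
        ColumnMeta { database: "default", table: "http_requests_features", column: "feature_a" },
        ColumnMeta { database: "default", table: "http_requests_features", column: "feature_b" },
        // After the permission change, the same columns also become visible
        // via the underlying r0 database and show up a second time.
        ColumnMeta { database: "r0", table: "http_requests_features", column: "feature_a" },
        ColumnMeta { database: "r0", table: "http_requests_features", column: "feature_b" },
    ]
}

fn main() {
    let rows = visible_columns();

    // The old implicit assumption: every returned row belongs to the default
    // database, so every row becomes one feature in the generated file.
    println!("feature rows without a database filter: {}", rows.len());

    // The missing guard: filter (or deduplicate) by database name so the
    // feature count stays stable even when extra schemas become visible.
    let filtered: Vec<&ColumnMeta> = rows.iter().filter(|c| c.database == "default").collect();
    println!("feature rows filtered to the default database: {}", filtered.len());
    for c in &filtered {
        println!("  keeping {}.{}.{}", c.database, c.table, c.column);
    }
}
```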
Because users had gained additional permissions, the query response now included all the metadata of the r0 database schema, more than doubling the number of rows returned and, in turn, the number of rows (that is, features) in the output file. At first, Cloudflare misread the symptoms as a large-scale distributed denial-of-service (DDoS) attack, but it then pinned down the real problem, stopped the propagation of the unexpectedly large feature file, and replaced it with an earlier version.
Link to the detailed report: https://blog.cloudflare.com/18-november-2025-outage/
The most serious outage in six years: the official "truth" draws ridicule
During the widespread outage, Cloudflare's stock price dropped by about 3%.
"Given the importance of Cloudflare's services, any outage is unacceptable. The network was unable to route traffic properly for a while, which deeply saddened every member of our team. We know that we let everyone down today," Cloudflare also said in a blog post.
The company also laid out follow-up steps to harden its systems against this kind of failure, including:
Hardening the ingestion of Cloudflare-generated configuration files with the same validation applied to user-generated input;
Adding more global kill switches for features;
Preventing core dumps and other error reports from consuming excessive system resources;
Reviewing failure modes across error conditions in all core proxy modules.
Cloudflare acknowledged that this was its most serious outage since 2019. "We have had outages before, such as the console becoming inaccessible or some new features being temporarily unavailable, but in the past six-plus years there has not been another incident in which the majority of core traffic stopped flowing through our network."
The company's previous major outage occurred this past June, when more than six of its services went offline for about two and a half hours; that incident was triggered by a failure in the Workers KV data-storage platform.
Some commenters wrote, "Cloudflare messed this up themselves. A small glitch became the first domino." Others argued, "The outage itself was a minor thing, but it exposed excessive coupling between Cloudflare's own services, which made the control panel inaccessible. If the control panel had stayed available, many services could have partially recovered much faster."
Others questioned whether the internet really needs to rely so heavily on a single provider, while critics said such outages fully expose the fragility of an internet in which everyone depends on the same service provider.
Reference links:
https://siliconangle.com/2025/11/18/cloudflare-outage-briefly-takes-chatgpt-claude-services-offline/
https://arstechnica.com/tech-policy/2025/11/widespread-cloudflare-outage-blamed-on-mysterious-traffic-spike/
This article is from the WeChat official account "AI Frontline". Compiled by Hua Wei. Republished by 36Kr with authorization.