The "Token Crusher" Poses Five Major Challenges: How Can AI Infra Handle OpenClaw?
On the surface, people are "raising lobsters." Beneath the water, a tough battle is raging around the underlying AI Infra.
"Lobster" (OpenClaw) has become the hottest phenomenon-level keyword of the moment. In just over a month, its WeChat index soared from 0 on January 29th to 165.6 million on March 10th, an almost explosive trajectory. As of March 20th, OpenClaw had earned 325,000 stars on GitHub, ranking first on the platform. Meanwhile, a Qi'anxin report shows that daily new deployment instances worldwide have jumped from 5,000 to 90,000, an 18-fold increase. The United States and China have become the two most important battlefields, together accounting for more than 65%.
From the tech circle to everyday life, "raising lobsters" is quickly going mainstream, with everyone from septuagenarians to children joining this nationwide craze.
Data source: Qi'anxin report
Major companies have dived in to compete, launching Claws of every kind, and some have more than one: cloud-deployed, locally deployed, online-hosted, with new versions emerging one after another. Government agencies and research institutions have launched Claws of their own, such as the government-affairs Claw of Futian District, Shenzhen, the operations-and-maintenance Claw of Beijing Mobile, and the teaching Claw of Tsinghua University, spanning a wide range of applications.
Major enterprises have also quickly packaged their flagship capabilities into skills and plugged them into the open ecosystem: the Baidu Search skill, the walking skill of Unitree's humanoid robot, the McDonald's ordering skill, and more. As of mid-March, there were more than 25,000 skills on GitHub, and the count on ClawHub, OpenClaw's official skill platform, was close to 28,000, many of them contributed by Chinese enterprises. Baidu Search, for example, ranks first globally in downloads among "search engine skills."
Beneath this popularity, however, a hard fight is underway out of sight: behind the rise of OpenClaw, a battle is raging over the underlying AI Infra.
One "Lobster" Stirring up the Global Agent Ecosystem
As OpenClaw spreads across industries, more people are realizing that it is not just a popular product. It may mark the turning point of an era, with an all-round impact on how Agents are deployed in practice.
In fact, over the past two or three years, Agent adoption has not gone smoothly: enterprises needed time to accept the framework and its logic. The emergence of OpenClaw changed that by offering an Agent model that is open source, ties users to no particular model or channel, and fully opens up skills, allowing industries to reach consensus quickly.
Through OpenClaw, people have solved complex and even long-tail problems that no one used to bother with. "In the future, software may become very fragmented, but the way problems get solved will become highly unified, namely by integrating skills through the OpenClaw framework," observed Shen Dou, executive vice president of Baidu Group and president of Baidu Smart Cloud Business Group. "Those who understand the business and can turn problem-solving into skills will capture the greatest benefits in the entire ecosystem."
On this trajectory, the "traffic islands" of the major mobile-Internet platforms may be joined into a continent by OpenClaw. After all, no application vendor dares ignore this prospective super-entrance.
Beyond software, OpenClaw is also moving quickly into everyday hardware: Xiaodu speakers, Unitree robots, Huawei phones, Raspberry Pi boards, Lenovo PCs. Seen from this grander vantage point, OpenClaw may break down the old barriers between devices and form a larger, more unified intelligent ecosystem.
Hence NVIDIA founder Huang Renxun stated plainly at this week's GTC conference that "OpenClaw is the operating system for personal AI," and proposed that every company's CEO must ask: "What is your OpenClaw strategy?"
Obviously, Agents represented by OpenClaw have brought us into a new era.
A "Token Crusher" Born of the Lobster's Unique Model
However, while this nationwide "lobster-raising" craze spreads, a real problem has surfaced: OpenClaw is a genuine "token crusher."
Over the past month, OpenClaw's share of global token calls has soared to 17%; the industry describes it as "devouring more than one-sixth of the world's computing power."
Why does OpenClaw "consume so many tokens"? It stems from its three unique models: the popularization of traffic, the intelligentization of interaction, and the community-based ecosystem.
First is the popularization of traffic. The user scale and request volume of intelligent agents like OpenClaw surge in unpredictable tides, with no fixed peak pattern.
Traditional large-model conversations are "use and leave," so total traffic stays relatively stable. In the future, everyone may have a 24/7 dedicated AI assistant; when tens of millions of users "raise lobsters" at once, the once-predictable traffic model fails completely. The pressure comes not only from more people but from machines that never rest: an operations-and-maintenance Claw, for instance, monitors, troubleshoots, and schedules tasks around the clock on people's behalf. Such irregular, all-day, high-density traffic is a qualitative change from the past.
Second is the intelligentization of interaction. A single user operation will trigger multiple rounds of thinking, tool calls, and logical verifications, forming a request amplification effect.
Let's take OpenClaw completing a task as an example to understand the inference call chain and computing power consumption behind it.
When a user sends OpenClaw the instruction "Help me plan a trip to Shanghai Disneyland with my 6-year-old this Saturday on a budget of 2,000 yuan, avoiding peak crowds and returning to the city before 8 p.m.," OpenClaw immediately constructs a huge initial input. It contains not only the user's brief instruction but also the Agent's preset role documents, usage instructions for tools such as the browser, command line, and file read/write, plus the memory of past conversations. Consuming tens or even hundreds of thousands of tokens in a single request is normal.
In this example, the initial input runs about 15,000 tokens, driving the large model through a first round of planning inference that breaks the task into sub-goals: checking ticket prices, checking real-time queues, calculating travel time, and finding restaurant recommendations.
Next, OpenClaw enters the ReAct cycle. ReAct means "think, act, reflect while acting, and correct mistakes until the result is deliverable"; it is anything but a single call.
In the first round of action, the model calls the browser tool to crawl that day's queue data from the Disneyland official site. When the network returns, the system injects the page content into the context, triggering a second round of inference: the queue for "Soarin' Around the World" is 120 minutes, so it recommends buying a premium pass or reordering the visit, then calls the calculator tool to check whether the budget is exceeded. Every round of "decision - execution - reflection" requires a full pass through the large model, and the context keeps expanding, growing rapidly from the initial 15,000 tokens to more than 35,000.
The entire task accumulates 8-12 large-model inference passes and finally outputs an itinerary of several thousand tokens, for a total consumption of roughly 300,000 tokens. By contrast, a traditional large model needs only a single call and a few hundred tokens to produce "a few Disneyland guides." In short, large models used to be invoked per call; OpenClaw is invoked per process, and this hands-on capability magnifies token consumption by at least dozens to hundreds of times.
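The loop described above can be sketched in a few lines. This is a minimal illustration, not OpenClaw's actual implementation: `call_model` and `run_tool` are hypothetical stand-ins, and `count_tokens` is a crude character-based proxy. What it shows is the key cost mechanism: every round re-reads the entire (ever-growing) context, so one instruction fans out into many full inference passes.

```python
# Minimal ReAct-loop sketch. call_model and run_tool are hypothetical
# stand-ins for a real LLM call and a real browser/calculator tool.

def count_tokens(text: str) -> int:
    # Crude proxy: roughly 1 token per 4 characters.
    return max(1, len(text) // 4)

def call_model(context: str) -> dict:
    # Stand-in for one large-model inference pass. Here we script two
    # rounds: act once, then finish after seeing the observation.
    if "OBSERVATION" not in context:
        return {"thought": "Need queue data first", "action": "fetch_queue_times", "done": False}
    return {"thought": "Budget fits, write the itinerary", "action": None, "done": True}

def run_tool(action: str) -> str:
    # Stand-in for a tool call injected back into the context.
    return "OBSERVATION: Soarin' Around the World queue = 120 min"

def react_agent(instruction: str, system_prompt: str, max_rounds: int = 12):
    context = system_prompt + "\n" + instruction
    tokens_consumed = 0
    for round_no in range(1, max_rounds + 1):
        # Every round re-processes the whole accumulated context.
        tokens_consumed += count_tokens(context)
        step = call_model(context)
        if step["done"]:
            return round_no, tokens_consumed
        observation = run_tool(step["action"])
        # Thoughts and observations are appended; the context only grows.
        context += "\n" + step["thought"] + "\n" + observation
    return max_rounds, tokens_consumed
```

Even in this toy version, the second round is more expensive than the first because the observation has been appended; with real 15,000-token role documents and web-page dumps, the per-round cost climbs far faster.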
In fact, Huang Renxun has said that Agents like OpenClaw consume roughly 1,000 times more tokens on complex tasks than traditional generative large models, rising to a million times for continuous-monitoring Agents. Industry insiders told Shuzhi Frontline that heavy OpenClaw users burn 30 million to 100 million tokens per day. Priced at top international models, that is 90-1,000 US dollars daily; at mid-tier models, 42-140 US dollars.
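The cost range quoted by those insiders is easy to reproduce. Note the per-million-token prices below are back-solved from the article's own figures ($90/30M tokens and $1,000/100M tokens imply $3-$10 per million; the mid-tier figures imply about $1.4 per million), not any vendor's official pricing.

```python
# Back-of-the-envelope check of the daily-cost figures. Prices per million
# tokens are inferred from the article's numbers, not vendor price lists.

def daily_cost_usd(tokens_per_day: float, price_per_million_usd: float) -> float:
    return tokens_per_day / 1_000_000 * price_per_million_usd

heavy_low, heavy_high = 30_000_000, 100_000_000   # tokens/day for heavy users

print(daily_cost_usd(heavy_low, 3.0))             # 90.0  -> lower bound, top models
print(daily_cost_usd(heavy_high, 10.0))           # 1000.0 -> upper bound, top models
print(round(daily_cost_usd(heavy_low, 1.4), 2))   # ~42 -> lower bound, mid-tier models
```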
Finally, the community-based ecosystem is OpenClaw's third distinctive model. Intelligent agents autonomously initiate conversations, collaborate, and respond in chains, forming a self-exciting interaction loop that needs no human intervention.
Some users got creative: they connected "little lobsters" from different vendors to Feishu, pulled them into one group chat, assigned each a task, and let them run. The lobsters began initiating conversations and collaborating on their own, one crawling market information, one analyzing investment decisions, one checking work quality, forming an "AI team."
This model shifts traffic from human-machine dialogue to machine self-circulation, and the interaction frequency between agents grows exponentially, further intensifying the tidal surges in computing power demand.
Broadly, if Agents like OpenClaw reach mass adoption, three forces will superimpose and resonate, turning "1" into countless "N"s: N concurrent tasks, N chained calls, N AI teams. Each N pushes against the throughput limits, scheduling efficiency, and cost boundaries of AI Infra. The harsher reality is that inference infrastructure may iterate far more slowly than the Agent ecosystem explodes, leaving AI Infra facing enormous challenges.
Five Challenges Faced by the AI Infra Inference System
This leap by OpenClaw confronts the underlying AI Infra inference stack with five challenges it has never faced before:
Challenge 1: Withstanding the peak, from "single short links" to the extreme reconstruction of self-excited bursts
Traditional AI services follow a short-link logic of "request - inference - done," but OpenClaw's ReAct mode requires repeated cycles of "request - judgment - action - reflection," each round an independent inference request. In human-machine interaction, one user instruction can be amplified into several to dozens of inference requests. Once multi-Agent collaboration kicks in, Agents shuttle among nodes at machine speed with none of the buffering of human rhythm; requests per second can be magnified dozens or hundreds of times in an instant, forming a "self-excited traffic peak" within a millisecond-level window that traditional services could never foresee.
This demands infrastructure with extreme throughput: ultra-high concurrency, low latency, and avalanche resistance. It must also handle the mixed scenario of high-frequency short inferences coexisting with long conversations, cure the chronic ills of queue congestion, link saturation, and low GPU utilization, and sustain exponentially amplified request volumes.
Challenge 2: Computing power scheduling, from "whoever is free takes it" to precise matching across the whole life cycle
OpenClaw tasks are naturally serial and chained, like a relay race: Agent A opens the browser and takes a screenshot, hands it to Agent B to parse the page, and only after B's analysis can Agent C be triggered to generate the final report. The three steps must run in sequence and cannot be parallelized; if any link stalls, the whole chain stops and waits, with every Agent still holding its GPU memory in the meantime. Requests also mix light and heavy tasks with multi-level hops, so the coarse-grained "schedule whoever is idle" model fails completely.
The infrastructure must therefore evolve into an intelligent orchestration system. For serial chained calls, once Agent A finishes its output, the GPU memory it occupies should be released or downgraded-and-retained immediately, to be reactivated once Agent B receives the upstream result, rather than letting every link in the chain hold resources while idle-waiting. Beyond that, light intents should match light computing power, complex inferences high computing power, and high-priority tasks should have guaranteed resources. This upgrades AI Infra's computing power scheduling from load balancing to fine-grained, whole-link management of the resource life cycle.
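The two scheduling ideas above, weight-matched placement and parking idle state instead of pinning GPU memory, can be sketched as follows. This is an illustrative toy, not a real scheduler; the class and pool names are invented for the example.

```python
# Toy orchestration sketch: route light tasks to a small-GPU pool and heavy
# ones to a big-GPU pool, and "park" (offload) an agent that is waiting on
# an upstream result instead of letting it hold GPU memory while idle.
import heapq
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass(order=True)
class Task:
    priority: int                                # lower value = scheduled first
    name: str = field(compare=False)
    weight: str = field(compare=False)           # "light" or "heavy"
    waiting_on: Optional[str] = field(default=None, compare=False)

class Scheduler:
    def __init__(self):
        self.queue = []                          # priority heap of runnable tasks
        self.parked = {}                         # name -> note about offloaded state

    def submit(self, task: Task) -> None:
        if task.waiting_on:
            # Dependency unmet: offload its state (e.g. KV cache to host RAM)
            # rather than pinning GPU memory during the wait.
            self.parked[task.name] = f"offloaded until {task.waiting_on} completes"
        else:
            heapq.heappush(self.queue, task)

    def upstream_done(self, task: Task) -> None:
        # Reactivate a parked task once its upstream delivers its output.
        self.parked.pop(task.name, None)
        task.waiting_on = None
        heapq.heappush(self.queue, task)

    def next_assignment(self) -> Optional[Tuple[str, str]]:
        if not self.queue:
            return None
        task = heapq.heappop(self.queue)
        pool = "small-GPU pool" if task.weight == "light" else "big-GPU pool"
        return task.name, pool
```

In a real system the "parked" note would be an actual KV-cache offload to host memory or storage, and reactivation would prefetch it back before the downstream inference starts.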
Challenge 3: Memory extension, from "clear after use" to breaking through the memory wall under dynamic interaction
The KV Cache is the model's short-term working memory: it stores computed context for reuse on the next call, saving both time and GPU memory. Under traditional service logic this is simple: one user, one conversation, one cache, cleared after use. But in OpenClaw's multi-round interactions, tool calls, and multi-Agent collaboration, fragmented intermediate results are inserted continuously and the working memory grows exponentially; traditional cache-reuse logic barely hits at all. The result is either sharply rising latency or the collapse of the entire task chain.
The infrastructure therefore needs multi-role session isolation, dynamic KV pruning, and cache optimization and reuse to break the memory wall created by long contexts and the dynamic interaction of intelligent agents.
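To make the reuse-and-isolation idea concrete, here is a toy prefix cache: KV state is cached per session in fixed-size token blocks, a lookup reuses only the longest cached prefix, and one session can never hit another session's blocks. Real serving stacks manage this at the GPU block level (e.g. paged-attention designs); the dict below merely stands in for cached KV blocks.

```python
# Toy prefix-style KV cache with per-session isolation. A dict entry stands
# in for the real KV tensors of one block of tokens.

class PrefixKVCache:
    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        # Key: (session_id, full token prefix up to a block boundary).
        self.blocks = {}

    def insert(self, session_id: str, tokens: list) -> None:
        # Cache every block-aligned prefix of this request's context.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.blocks[(session_id, tuple(tokens[:end]))] = "KV"

    def lookup(self, session_id: str, tokens: list) -> int:
        """Return how many leading tokens can skip recomputation."""
        hit = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if (session_id, tuple(tokens[:end])) in self.blocks:
                hit = end                    # longest cached prefix so far
            else:
                break                        # a miss ends prefix reuse
        return hit
```

The session-id in the key is the isolation boundary the article calls for: two agents with identical contexts still keep separate cache entries, so one role's intermediate results cannot leak into another's.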
Challenge 4: Elastic expansion, from "adding machines to save the day" to seamless handover in seconds
At midnight on Double Eleven, hundreds of thousands of users simultaneously issue "help me snap up the limited items," and traffic surges within 3 seconds. A traditional service responds by adding machines and diverting requests. But OpenClaw's Agent remembers which page it opened, which button it clicked, and which Agent it is waiting on; all of that context lives in memory, bound to a specific server. Migrate it carelessly and the context snaps, the task fails, and the failure propagates along the Agent collaboration chain, triggering a cascading avalanche.
OpenClaw therefore requires the infrastructure to scale out in seconds while migrating context intact and reconnecting seamlessly, with full-link circuit breaking, rate limiting, and degradation. Either requirement alone is hard; achieving both at once is a problem traditional architectures never had to consider.
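One common way to make agent state migratable is to checkpoint it into a portable record that a newly added replica can restore, re-attaching live resources (browser sessions, sockets) from the identifiers in the record. The sketch below is a minimal illustration of that pattern; the field names are invented for the example and come from no real framework.

```python
# Sketch of context migration during scale-out: serialize the agent's
# working state, ship it to the new replica, and resume there. Field names
# (conversation, open_page_url, pending_upstream) are illustrative only.
import json

def checkpoint_agent(state: dict) -> str:
    # Only serializable context travels. Live handles (browser sessions,
    # sockets) must be re-established on the new node from these records.
    return json.dumps({
        "conversation": state["conversation"],
        "open_page_url": state["open_page_url"],
        "pending_upstream": state["pending_upstream"],
    })

def restore_agent(blob: str) -> dict:
    state = json.loads(blob)
    # The new replica marks itself resumed and re-attaches tools using the
    # URLs/IDs carried in the checkpoint instead of restarting the task.
    state["resumed"] = True
    return state
```

The hard part the article points to is doing this in seconds, mid-flight, without the waiting upstream/downstream Agents noticing; that is what separates "adding machines" from seamless handover.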
Challenge 5: Model adaptation, from "runs on NVIDIA first by default" to zero-lag adaptation on domestic chips
OpenClaw requires a matrix of cutting-edge models working together, and models iterate almost daily, like software versions, with enormous differences between them. This combination of speed and complexity is unlike anything traditional adaptation has faced.
The infrastructure must be ready at any moment for the next new model with a changed format. The open-source community's rule is blunt: when a new model ships, developers run it on NVIDIA GPUs first by default. Domestic chips need secondary development, operators must be re-adapted, and precision must be aligned; sometimes just getting a model to run takes extra weeks. This is not for lack of effort by domestic chipmakers but because the ecosystem debt is still being repaid. The result is that model adaptation on domestic chips always lags a step behind, dragging down OpenClaw's capability iteration. This is a hurdle domestic chips must clear.
Getting Ahead of the Intelligent-Agent Wave: How to Reconstruct AI Infra
Facing the wave of intelligent agents, Shen Dou, the executive vice president of Baidu Group and the president of Baidu Smart Cloud Business Group, clearly pointed out the fundamental change in industry demand in a public speech in August last year and judged that the underlying technology would undergo a major iteration.
He mentioned that large models