
On the eve of AI awakening, find a brain that understands the physical world better.

时氪分享 2026-03-23 17:48
AI is the future infrastructure, and embodied intelligence and data centers are the keys.

AI is one of the most powerful forces shaping the world today. The public tends to see AI as an intelligent application or a single model. In fact, like electricity and the Internet, it is indispensable infrastructure for the future. The future of AI is a question each of us needs to think about. To understand it, and to examine the fundamental changes taking place in computing, the best approach is to return to first principles.

As an investor who has long been engaged in cutting-edge technologies at home and abroad, I am keenly aware that we are on the eve of the AI awakening. What we are witnessing is not just the evolution of algorithms but a revolution in how AI is implemented in the real world.

So what fundamentally determines AI's transition from the laboratory to trillion-level real-world industries?

The answer is embodied intelligence. In this article, I start from three layers of Jensen Huang's (Huang Renxun's) five-layer architecture, namely the application layer, the model layer, and the infrastructure layer, return to the essence of AI and of the product itself, and discuss how algorithms are evolving and how AI becomes a medium that enters thousands of households.

The "cerebellum" of embodied intelligence has achieved leadership

The "application layer" is where real economic value is generated: embodied intelligence, self-driving cars, industrial robots, and so on.

2026 officially marks the "big year of listings" for embodied intelligence. Until general intelligent algorithms mature, building a good physical body is the foundation of embodied intelligence. Embodied intelligence has two clear development directions. One is the "brain": the ability to decompose tasks and carry out operations. The other is the "cerebellum": motor ability. The Spring Festival Gala showcased the motor ability of one of China's top humanoid robot products, including running, jumping, maintaining balance, and adapting to complex terrain. There is no doubt that the motor ability of the G1 robot has reached the top international level. The high-end precision manufacturing of its body, mechanical arms, and actuators has been recognized globally.

From an investment perspective, within the motor-ability side of the embodied intelligence sector, the dexterous hand still leaves a lot of room for imagination. The dexterous hand is also known as the "entrance" of embodied intelligence. The core difficulty lies in guiding the hand to achieve the desired function. For example, a mug and a glass, or a raw egg and a cooked egg, each call for different pressure. Even hands of the same model may collect different data. Fine, targeted operations therefore depend heavily on the stability and consistency of the hardware.

Currently, we recommend paying attention to vision-based tactile sensors. Their signals share the same form as visual data, which makes them well suited to integration on the model side, and they offer higher resolution, providing a new path for fused perception. This is the direction in dexterous hands that we as investors are most optimistic about. Across the entire industry, from hardware to software, there is huge room for breakthroughs.

The "cerebellum" of embodied intelligence has achieved global leadership. What follows is the ability of the "brain", that is, the joint breakthrough of large AI models and embodied intelligence, which is a relatively clear source of dividends over the next 3-5 years.

The brain understands the world: the paradigm shift from LLM to VLA and the world model

A robot's ability to dance and perform kung fu comes down to motion control and physical hardware. For robots to truly do work on their own, everything still depends on the "brain": the AI model.

The watershed in the development of AI models is clearly visible. It can be said that OpenAI's large language model route has only gone halfway. The end point of AI models is by no means just letting AI converse with users on a screen. The text-processing ability of large language models deserves recognition, but it also exposes their fundamental limitation: they are detached from the physical world. Language models only need to learn the statistical associations within text and do not need to understand the physical world. The dividends of large language models are almost exhausted, which has given rise to the current mainstream, the VLA model: let AI grow hands and feet and work physically through robots to take over the world.

VLA stands for Vision-Language-Action. The VLA model breaks down the barriers between "what is seen", "what is understood", and "what is executed". In practice, however, the shortcomings of large VLA models are quickly exposed. In essence, a VLA model still relies on training data from large-scale scenarios. It does not have human wisdom, it lacks generalization ability, and real-interaction data is insufficient. It is a genius executor, but only an executor.

From an investment perspective, models that rely on piling up data and computing power have reached their ceiling. In the short term, large language models remain the commercial protagonists, with strong monetization ability and clear demand. However, the spatial-intelligence world model has become a new investment focus. The potential value of the world model far exceeds that of LLM and VLA, and it represents a trillion-level real-industry opportunity for global investors.

The core of the world model is causal thinking. The model first runs an internal inference: if a certain action is taken, what result will follow? This endows the model with the ability to think. It is the biggest difference from the VLA model, which issues instructions triggered by the scene.
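The "if this action, then what result" loop can be illustrated with a toy planner. This is only a sketch under stated assumptions: `world_model` is a hand-written stand-in for a learned dynamics model, and the one-dimensional state, the action set, and the `plan` function are hypothetical names chosen for illustration, not anything published by the companies mentioned in this article.

```python
from itertools import product


def world_model(state: float, action: float) -> float:
    """Hypothetical stand-in for a learned dynamics model.

    Predicts the next state if `action` is taken in `state`.
    Here each action simply moves the state by half its value.
    """
    return state + 0.5 * action


def plan(state: float, goal: float, actions=(-1.0, 0.0, 1.0), horizon: int = 3):
    """Imagine every action sequence of length `horizon` inside the model.

    Returns the first action of the sequence whose predicted end state is
    closest to `goal`, together with that predicted distance.
    """
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s = state
        for a in seq:  # roll the sequence forward in imagination only
            s = world_model(s, a)
        cost = abs(goal - s)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0], best_cost


if __name__ == "__main__":
    first_action, predicted_error = plan(state=0.0, goal=1.5)
    print(first_action, predicted_error)
```

The point of the sketch is the order of operations: outcomes are predicted before anything is executed, whereas a purely reactive policy maps the current scene straight to an action.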

This year, Fei-Fei Li gave ordinary people an intuitive understanding of what a world model is. Staff used only a mobile phone camera to scan an office and then generated an identical high-fidelity 3D model on a computer, a real digital world that one can enter by wearing VR glasses. The 3D world-generation model Marble shows that, given a photo of a window, AI can imagine the "spring flowers in full bloom" outside the window. These are all human cognitive abilities. The Real-Time Frame Model generates frames in real time while the user is operating, with almost zero delay per frame. This means that most people can play games in an infinite world generated by AI: wherever they go, AI generates the corresponding scene. It is currently one of the real-time interactive world models with the lowest memory requirements. The large-scale implementation of embodied intelligence requires a simulated world to practice in, and the 100 million physically consistent 3D worlds generated by World Labs are like a top-tier school. The emergence of World Labs marks the point at which AI begins to try to understand physics, which is the necessary path to general embodied intelligence and robots capable of physical work.

World Labs reached a valuation of $5 billion just 2 years after its founding and is a benchmark enterprise in the global spatial-intelligence and world-model track. Meanwhile, Google, a leader in global Internet technology, is also accelerating its layout for the AI era. Google's strategy is to build a general AI brain platform so that the AI brain can generalize and be deployed. Simply put, it is building an Android-like platform and putting it into robots. With a world model on the platform, the robot has a simulator in its "brain" and can first try to make the bed, open the refrigerator, take a glass, and pour water 10,000 times in a digital twin world. AI and embodied intelligence are trained quickly on massive data in virtual simulation, and sim-to-real transfer then carries the learned strategies seamlessly from the virtual world to the real one. This will address the famous Moravec's paradox in the embodied intelligence industry: it is easy for AI to perform adult-level reasoning, but difficult for AI to run like a one-year-old child.

A new question arises: are end-to-end, VLA, and the world model contradictory and opposed? The answer is to integrate the three.

For the large-scale implementation of embodied intelligence, end-to-end has become an industry consensus. Its core is imitation learning, which leads to two problems: data is scarce and cannot cover low-frequency scenarios, and imitation is difficult and lacks causal reasoning ability.

To solve the data problem, the world model can be used. One approach keeps the end-to-end backbone network, uses a world simulator to generate virtual low-frequency data, and then applies reinforcement learning for post-training. Another builds a virtual world, lets the model train in that world to solve all problems, and finally achieves video input and control-signal output on the device. In essence, both are world models, but it is not easy to generate virtual data that matches the real world. Time and cost are major obstacles. 3D Gaussian Splatting is an excellent choice at present.

3D GS builds the real scene into a trainable, renderable, and fine-tunable 3D scene, and then makes adjustments based on real data. For example, when a person opens the refrigerator, the world model can make an apple suddenly roll out or a cup fall, generating events that have a small probability in the real world. This addresses the problems of scarce data, difficult generation, and inaccuracy in low-frequency scenarios.
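The idea of topping up a real dataset with simulated low-frequency events can be sketched in a few lines. Everything here is an assumption made for illustration: `simulate_episode` is a placeholder for whatever a reconstructed scene would actually render, the event labels are taken from the refrigerator example above, and the 20% target share is arbitrary.

```python
import random

# Rare events from the refrigerator example; the label strings are hypothetical.
RARE_EVENTS = ("apple_rolls", "cup_falls")


def simulate_episode(event: str, rng: random.Random) -> dict:
    """Placeholder for a simulator render: returns a labelled synthetic episode."""
    return {"event": event, "synthetic": True, "seed": rng.randrange(10**6)}


def augment(real_episodes, target_rare_fraction: float = 0.2, seed: int = 0):
    """Append synthetic rare-event episodes until they reach the target share."""
    if not real_episodes:
        raise ValueError("need at least one real episode")
    if not 0.0 <= target_rare_fraction < 1.0:
        raise ValueError("target_rare_fraction must be in [0, 1)")
    rng = random.Random(seed)
    data = list(real_episodes)
    rare = sum(e["event"] in RARE_EVENTS for e in data)
    while rare / len(data) < target_rare_fraction:
        data.append(simulate_episode(rng.choice(RARE_EVENTS), rng))
        rare += 1
    return data


if __name__ == "__main__":
    real = [{"event": "normal", "synthetic": False} for _ in range(99)]
    real.append({"event": "cup_falls", "synthetic": False})
    mixed = augment(real)
    print(len(mixed), sum(e["synthetic"] for e in mixed))
```

In practice the hard part is the step this sketch skips: making the synthetic episodes faithful enough to the real scene that training on them helps rather than hurts.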

The key for AI to reach AGI lies in data processing. Deviations in the quality and quantity of input data will cause the model to have "hallucinations" and the output will be inaccurate. Sort and clean the data first, and then put it into the model for training. Only in this way can the data fed to the model be valuable. In the field of data sorting, it is believed that several mainstream players will gradually emerge in the future.

However, this only solves data scarcity, not reasoning ability. That brings us back to the core of the VLA model's reasoning architecture: the problem is decomposed and handled piece by piece, with tokens continuously stacked and passed layer by layer, so that the model gradually forms understanding during derivation. The whole process, from 3D GS to VLA reasoning, takes place in an integrated native network and runs in the cloud, while the device still uses the end-to-end mode to cope with the large scale of the architecture. The two core problems of end-to-end imitation learning, low-frequency data and reasoning ability, are thus neatly resolved, and this is one of the recognized optimal solutions in the self-driving and embodied intelligence industries. Being general and having predictive ability is the real moat of embodied intelligence, and only then can embodied intelligence enter thousands of households.

Global capital giants have entered the AI industry. In 2025, new AI startups attracted 48% of total global venture capital, with money rushing toward world models that have core technological barriers. The world model is the "preferred track" for global AI investment. When the moment of technological breakthrough in embodied intelligence arrives, more than 50% of the world's resources will also pour in, and AI will be the biggest driver of the global economy. An investor in cutting-edge technology should be good at seeing the development trends in hard technology over the next 5-10 years or even longer.

Investing in the world model is essentially investing in AI's "intuition" about the physical world. Behind this intuition, what we as investors need to do is look for every opportunity to advance with the country, conduct on-the-spot investigations and follow up all the way, deepen our understanding of the target industry, and grasp every detail. We must truly recognize the importance of the world model: it is the core of understanding the physical world and realizing general-purpose robots, and a key direction toward AGI. Sudden market risks will not hinder the progress and breakthroughs of AI technology, nor will they change the country's support for the development of new quality productive forces through a prosperous capital market.

The computing power base of the AI world: the data center

The "ChatGPT moment" of embodied intelligence faces two bottlenecks. The biggest pain point is the world model described above, and the second is data. Training embodied intelligence requires a huge amount of training data, and collecting data on real machines is inefficient. To build a world model that can hold the whole world, a computer room that can accommodate the world is needed.

This computer room is the third layer of the AI 5-Layer Cake, the "infrastructure layer", which includes land, power transmission, network connections, and so on. It is a system that integrates countless processors into one machine: the AI factory, that is, the data center.

The world model needs to model, predict, and deduce the physical world in real time and with high precision, which places extremely high demands on computing power, storage, and bandwidth. The data center is the base on which it runs. There are several reasons why the world model must use a larger and more powerful data center.

First, traditional large language models are mainly based on text, and the existing Internet corpus is large and easy to annotate. The world model, by contrast, works with video-scale data and requires a super-large storage cluster and a high-speed read-write architecture: 1 second of high-definition video is equivalent to tens of thousands of words of text. Second, both training and inference require enormous computing power. The world model needs a cluster of GPUs or AI chips on the scale of ten thousand cards to model physical rules, predict multi-step future states, and model high-dimensional spaces. Third, the requirements for real-time performance and parallelism are extremely high, calling for ultra-high-bandwidth, low-latency networks and distributed scheduling systems. Typical scenarios are self-driving and embodied intelligence. Fourth, the model itself keeps expanding. The world model combines large language models, large vision models, time-series prediction models, physics engines, and world memory. A single card or a small cluster simply cannot run it.
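The comparison between video and text in the first point can be checked with rough arithmetic. The sketch below assumes a compressed high-definition stream of 4 Mbit/s and about 6 bytes per English word (five letters plus a space). Both figures are illustrative assumptions rather than numbers from this article, and the result moves with them.

```python
def video_bytes_per_second(bitrate_bits_per_s: int) -> int:
    """Bytes produced by one second of compressed video at the given bitrate."""
    return bitrate_bits_per_s // 8


def raw_video_bytes_per_second(width: int = 1920, height: int = 1080,
                               bytes_per_pixel: int = 3, fps: int = 30) -> int:
    """Bytes in one second of uncompressed video (assumed 1080p, 24-bit, 30 fps)."""
    return width * height * bytes_per_pixel * fps


def words_equivalent(num_bytes: int, bytes_per_word: int = 6) -> int:
    """How many words of plain text occupy the same number of bytes."""
    return num_bytes // bytes_per_word


if __name__ == "__main__":
    compressed = video_bytes_per_second(4_000_000)  # assumed 4 Mbit/s stream
    print(words_equivalent(compressed))
    print(words_equivalent(raw_video_bytes_per_second()))
```

Under these assumptions, one second of compressed video occupies as many bytes as roughly 83,000 words of text, and uncompressed 1080p video at 30 frames per second is several hundred times larger again, which is why storage capacity and read-write throughput come first on the list.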

Taken together, these four points show that only ultra-large-scale data centers can provide what the world model needs.

According to the latest industry data, by the end of 2025 the number of Internet data center racks serving the public in China had reached 938,000, and the figure is expected to exceed 1 million in 2026. A trillion-level market is taking shape, and its new growth engine must be AI computing power.

Unlike large language models, which are asset-light and change rapidly, the world model depends on larger and more powerful data centers, so its commercialization will be slower and steadier, following a heavy-industry route. "Computing-power and electricity coordination" was also explicitly listed as a new infrastructure project for the first time this year, raising its strategic status from technological exploration to the country's top-level design.

When investigating data center targets, investors need to focus on each company's "AI factory efficiency" to evaluate its core competitiveness. They should look for high-quality enterprises that can trigger a Davis double play, strictly assess each enterprise's margin of safety, moat, fundamentals, and industrial logic, identify technological breakthroughs, study feedback from upstream and downstream manufacturers, and look for performance inflection points as a basis for safety. They should position themselves from the perspective of the era and judge how much room the enterprise has to grow. An investment decision can be made only after comprehensively understanding both the industry research and the enterprise's logic.

In addition, AI models and data centers are characterized by long cycles, complex deployment, and high verification costs, which requires us as investors to be resilient. The path of the AI revolution will not be smooth sailing. We must constantly break ourselves down and rebuild. Be grateful for every setback. Resilience forges character, and character achieves greatness.

Conclusion: Seize the opportunity and wait for the dawn of the full-stack AI ecosystem

OpenClaw has opened the era of intelligent agent computing. Enterprises are moving toward intelligent agents, and embodied intelligence is being deployed at scale. Self-driving cars, industrial robots, and humanoid robots together constitute the next major opportunity for Physical AI. The era urges us to focus on high-certainty cutting-edge fields such as Physical AI, world models, embodied intelligence, computing power, electricity, chips, biopharmaceuticals, and innovative drugs.

The Internet revolution and the mobile cloud revolution each gave birth to a group of epoch-making enterprises. In the era of AI transformation, developers are constantly creating new scenarios and making breakthroughs. A group of highly influential companies is also building a mutually beneficial, win-win ecosystem and using its power to accelerate the arrival of the AI era.

When the world model matures and embodied intelligence wakes up overnight, the AI ecosystem will explode in full, with epoch-spanning significance. By then, global GDP will grow at high speed, AI will completely liberate humans, people will be able to do what they want to do, labor costs will approach zero, complex problems that humans cannot solve will be resolved, productivity will be unlimited, truly high income for all will be achieved, and the full-stack AI ecosystem will prosper.

This article is a contributed piece. Author: Zhou Xin