Just because it can "carry bricks", physical AI has shot to fame overnight.
Since the beginning of 2026, a new buzzword has emerged in the field of artificial intelligence: "Physical AI".
Jensen Huang said several times at CES at the beginning of the year that "the next wave of AI will be the AI that operates in the physical world." Justin Sun also recently declared emphatically: "The utility of virtual AI is exhausted. Physical AI is the greatest opportunity in the next three years."
In industry, the star company Figure AI captured the world's attention with a five-day livestream of robots sorting packages. The Chinese company Zhipu Robotics announced that its 10,000th general-purpose physical AI robot had rolled off the production line...
The statements of industry giants and the real progress in physical AI have drawn the industry's attention to this grand narrative of moving from virtual intelligence to physical execution. However, many people wonder whether this so-called "physical AI" is really an inevitable turning point in technological development or just a cleverly packaged new concept.
01 From "Conversational" to "Doing"
Before answering the above question, let's first break down this somewhat rigid technical term.
Taken literally, physical AI means artificial intelligence that is deeply integrated with the physical world. Looking more closely, virtual AI is responsible for "thinking and communicating", while physical AI has to "perceive and act". It is therefore no longer just an intelligent agent on a screen; it has to enable machines in the real physical world to perceive, understand and execute complex operations.
In other words, physical AI is a technology that "enables autonomous machines (such as robots, self-driving cars, etc.) in the real physical world to perceive, understand and execute complex operations." Wang Xiang, an executive member of the China Computer Federation, explained the concept systematically at the third China International Supply Chain Expo: "Physical AI means that the AI system has a closed loop of 'perceive, reason, act, feedback' in the real world."
Simply put, AI used to be "conversational", and now physical AI is "doing". What happens when AI steps out of the ChatGPT dialogue box and into the factories, warehouses and households of the real world is the problem that physical AI has to solve.
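The "perceive, reason, act, feedback" loop can be pictured as an ordinary feedback-control loop. The following is a minimal sketch, not any vendor's implementation: the one-dimensional `Gripper`, the `perceive`, `reason` and `act` functions, and the gain value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gripper:
    """Hypothetical 1-D actuator: position in centimetres along a rail."""
    position: float = 0.0


def perceive(gripper: Gripper) -> float:
    """Perception: read the current state (a perfect sensor, for simplicity)."""
    return gripper.position


def reason(observed: float, target: float, gain: float = 0.5) -> float:
    """Reasoning: decide a correction proportional to the remaining error."""
    return gain * (target - observed)


def act(gripper: Gripper, command: float) -> None:
    """Action: apply the commanded displacement to the actuator."""
    gripper.position += command


def run_closed_loop(
    target: float, tolerance: float = 0.01, max_steps: int = 100
) -> tuple[float, int]:
    """Repeat perceive -> reason -> act until the observed error is small.

    The feedback step is the next call to perceive(): each iteration
    re-measures the result of the previous action instead of assuming it.
    """
    gripper = Gripper()
    for step in range(max_steps):
        observed = perceive(gripper)
        if abs(target - observed) <= tolerance:
            return gripper.position, step
        act(gripper, reason(observed, target))
    return gripper.position, max_steps


if __name__ == "__main__":
    final_position, steps = run_closed_loop(target=10.0)
    print(f"reached {final_position:.3f} cm after {steps} steps")
```

A scripted robot would correspond to calling `act` with fixed commands and never calling `perceive` again; the loop above differs only in that every action is checked against a fresh observation.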
This difference is particularly evident in the recent developments of two star robot companies.
One is the American company Figure AI, which used a five-day livestream to show that "robots can actually work". The livestream started on May 14th and showed three Figure 03 humanoid robots taking turns sorting express packages on a production line. The robots' task was to scan barcodes, pick up packages, reorient them and place them on the conveyor belt with the barcode facing down.
During the live broadcast, a robot worked continuously for over 33 hours and processed more than 40,000 packages. The founder Brett Adcock explained that the robots use the company's latest Helix 02 model and work in "fully autonomous mode".
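As a rough sanity check, the reported totals can be turned into an average rate. The sketch below treats "over 33 hours" and "more than 40,000 packages" as exactly 33 hours and 40,000 packages and assumes both figures describe the same robot and time window; the exact totals are not given, so the result is only an order-of-magnitude estimate. The function name `average_throughput` is made up for this example.

```python
def average_throughput(packages: int, hours: float) -> tuple[float, float]:
    """Return (packages per hour, seconds per package) for a reported total."""
    if packages <= 0 or hours <= 0:
        raise ValueError("packages and hours must be positive")
    per_hour = packages / hours
    seconds_each = hours * 3600 / packages
    return per_hour, seconds_each


if __name__ == "__main__":
    # Lower-bound figures quoted for the livestream, taken at face value.
    per_hour, seconds_each = average_throughput(packages=40_000, hours=33)
    print(f"{per_hour:.0f} packages/hour, {seconds_each:.2f} s per package")
```

Under these assumptions the average works out to roughly 1,200 packages per hour, or about three seconds per package.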
The significance of Figure AI's livestream lies not only in demonstrating its technological capabilities, but also in showing the world, in real time, that physical AI has moved past the "laboratory demonstration" stage. For a company to show live that its robots can work on a production line for several days without major problems is in itself a strong technological statement.
The Chinese company Zhipu Robotics conducted a similar livestream. Its robot Zhipu Spirit G2 was sent to the MMIT (Multimedia-Integration) tablet production line at the Nanchang Longqi Technology Industrial Park to work alongside humans. The data measured live show that the robot worked continuously for 8 hours without major disruptions, with an overall task success rate of over 99.5%. A single operation took only 18-20 seconds, and the robot could produce 310 products per hour. One robot can take over two process steps.
Beyond what Figure AI has shown, Zhipu Robotics also held the global premiere and delivery of its 10,000th general-purpose physical AI robot in March. Going from 5,000 to 10,000 robots took only a little over three months, from December 2025 to March 2026.
Besides delivery volume, Zhipu Robotics announced that it aims for revenue of 10 billion yuan in 2027. Measured against the development of other leading industries such as new energy, autonomous driving or chips, it is remarkable for a company that has existed for less than two years to have reached mass production in the thousands of units and to set a revenue target of 10 billion yuan. In hard tech, this can be described as phenomenal.
The two companies mentioned above have shown with concrete data and scenarios that physical AI no longer relies on remote control or preset scripts to "perform", but can autonomously carry out complex tasks in a real environment.
Most importantly, Zhipu Robotics was the first to cross the threshold of 10,000 delivered units and has tied its mass-production capacity to existing orders. This shows that the field has moved from "technological validation" to "commercial implementation". In other words, the "feasibility" of physical AI is no longer in question, and the real competition has shifted to "usability" and "economic efficiency".
02 The technological drivers for the breakthrough of physical AI
Now the question is, why did physical AI suddenly explode this year? Looking back now, besides the real commercial demand, a series of technological breakthroughs are the biggest drivers.
First of all, the Large Language Model (LLM) has given robots the "ability to understand". Traditional robots rely on deterministic code and rules, which means that engineers have to write a "script" in advance, and every movement of the robot strictly follows that script. This approach has a big problem: if the robot's working environment changes even slightly, the code has to be rewritten. Robustness is poor, which makes it difficult to cross the commercial threshold.
However, after Google combined LLMs with the physical execution of robots and launched embodied multimodal large models such as PaLM-E and RT-2 in 2023, robots became able to break a complex task down into several steps from a natural-language command and then execute them. The Large Language Model has thus made the leap from "dialogue understanding" to "physical execution".
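The difference between a fixed script and language-driven task decomposition can be sketched as follows. This is an illustration only: `SKILLS`, `plan_with_stub`, `execute` and the skill names are hypothetical, and the stub planner is a hand-written parser standing in for a language model. It does not reflect how PaLM-E or RT-2 work internally, nor any of their APIs.

```python
from typing import Callable


# Hypothetical low-level skills a robot controller might expose.
def locate(obj: str) -> str:
    return f"locate {obj}"


def grasp(obj: str) -> str:
    return f"grasp {obj}"


def place(obj: str, where: str) -> str:
    return f"place {obj} on {where}"


SKILLS: dict[str, Callable[..., str]] = {
    "locate": locate,
    "grasp": grasp,
    "place": place,
}


def plan_with_stub(command: str) -> list[tuple[str, tuple[str, ...]]]:
    """Stand-in for a language model: turn a command into (skill, args) steps.

    A real system would query a model here; this stub only parses commands
    of the form "put the <object> on the <surface>".
    """
    words = command.lower().rstrip(".").split()
    if len(words) == 6 and words[:2] == ["put", "the"] and words[3:5] == ["on", "the"]:
        obj, where = words[2], words[5]
        return [("locate", (obj,)), ("grasp", (obj,)), ("place", (obj, where))]
    raise ValueError(f"stub planner cannot parse: {command!r}")


def execute(command: str) -> list[str]:
    """Run each planned step through the matching skill and log the result."""
    return [SKILLS[name](*args) for name, args in plan_with_stub(command)]


if __name__ == "__main__":
    for line in execute("Put the cup on the shelf"):
        print(line)
```

The point of the design is that the skills stay fixed while the plan is produced per command; in a scripted robot, the sequence of skill calls itself would be hard-coded.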
Jensen Huang pointed to the essence of this technological evolution in his speech at CES 2026: physical AI is actually a handover of control at the lowest level. Once physical AI passes the critical point of technological evolution, control is transferred from deterministic code written by humans to a neural network that can generalize and understands physical laws.
From this point on, robots can not only "execute code", but also "understand commands and plan their own movements".
While the Large Language Model solves the problem of "understanding", the world model solves the problem of "acting in the physical world". The core of a world model is that the AI learns an internal understanding of how the physical world works.
The release of NVIDIA's Cosmos, a world foundation model platform for physical AI, at CES last year was a notable event. Its core ability is to generate motion data that conforms to physical laws from text or images. Developers can use Cosmos to accelerate the development of physical AI for intelligent cars, robots and video analysis agents.
According to NVIDIA, Cosmos was trained on over 20 million hours of real data, which significantly lowers the difficulty of simulation and model training. With a world model, an AI system can rehearse a wide range of scenarios in a virtual environment and then transfer what it has learned to the real physical world.
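The idea of rehearsing in simulation before acting can be illustrated with a toy example. The sketch below is not Cosmos or its API: `simulate_throw` is a hand-written ideal-projectile formula standing in for a learned world model, and the candidate angles, launch speed and target distance are made-up values.

```python
import math

GRAVITY = 9.81  # m/s^2


def simulate_throw(speed: float, angle_deg: float) -> float:
    """Toy 'world model': predicted landing distance of an ideal projectile
    launched from ground level with no air resistance."""
    angle = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * angle) / GRAVITY


def choose_angle(target_m: float, speed: float, candidates: list[float]) -> float:
    """Try every candidate angle in simulation and keep the one whose
    predicted landing point is closest to the target."""
    if not candidates:
        raise ValueError("need at least one candidate angle")
    return min(candidates, key=lambda a: abs(simulate_throw(speed, a) - target_m))


if __name__ == "__main__":
    angles = [float(a) for a in range(5, 50, 5)]  # 5, 10, ..., 45 degrees
    best = choose_angle(target_m=5.0, speed=10.0, candidates=angles)
    predicted = simulate_throw(10.0, best)
    print(f"best angle: {best} deg, predicted distance: {predicted:.2f} m")
```

Only the chosen action would be executed on real hardware; the rejected candidates cost nothing but computation. Any mismatch between the model's prediction and what actually happens is the sim-to-real gap.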
The ultimate ability of a robot is not to "see" or "understand", but to "act correctly". Vision-Language-Action (VLA) models let robots handle visual input, language understanding and motion control together, closing the loop from "seeing" to "doing".
DeepMind launched the new multimodal embodied intelligence large model Gemini Robotics 1.5 last September, which is described as the world's first thinking model optimized for physical reasoning. NVIDIA launched the open-source model Isaac GR00T N1.6, which was developed specifically for humanoid robots and enables whole-body control.
At the same time, the Beijing Humanoid Robot Innovation Center open-sourced the embodied cerebellum large model XR-1. This model is the first in China to meet the national standards for physical AI. It was trained on over one million pieces of data and can perform complex two-arm operations such as grasping, pushing, pulling and turning.
Thus, physical AI has "collected" the necessary basic technological capabilities for implementation. The LLM enables machines to "understand" human intentions, the world model enables machines to "predict" physical consequences, and the VLA closes the last mile from "seeing" to "doing". The combination of these three factors enables robots for the first time to autonomously execute tasks in an open environment.
Of course, there are still bottlenecks in fine manipulation. The fine control of two arms and hands still has many problems to be solved. In other words, physical AI has got the "ticket" to work in the factory, but to actually "enter households and serve tea", it has to cross the qualitative threshold from "coarse movements" to "fine manipulations".
03 From technological vision to delivery ability
It is important to understand the past and present of physical AI. The question the industry now faces is which core aspects the coming competition will revolve around.
We can learn from the development of autonomous driving. The data war was unavoidable there, and physical AI, which follows a similar logic, cannot avoid it either. Generally speaking, whoever has the better training data has the final say.
Today, NVIDIA has built a world-model barrier with Cosmos: its training on over 20 million hours of real data is difficult to replicate. Zhipu Robotics has mass-produced and delivered 10,000 robots, which means it can collect real data from closed-loop feedback in deployment. The industry regards this as a kind of data moat.
It should be noted that the data needed to compete in physical AI is not measured by volume alone; it requires synthetic data and real data to work together.
Relying solely on real data runs into problems of scale and hardware wear-and-tear costs. Relying too heavily on synthetic data leaves a gap when transferring from simulation to reality (sim2real). The "Cross-Data-Source-Learning" program of the Beijing Humanoid Robot Innovation Center is a product of this thinking. It lets robots train on a wide variety of human videos, which significantly reduces training costs and improves training efficiency.
So it is easy to see that whoever can truly close the loop of "synthetic data training, real data fine-tuning, real-time scenario feedback" will have the upper hand in this competition.
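One simple way to picture the mix of synthetic and real data is a sampler that draws each training batch from both pools at a fixed ratio. This is a sketch under assumptions, not any company's pipeline: the function name `mixed_batch`, the 25% real-data share and the placeholder sample names are all hypothetical.

```python
import random


def mixed_batch(
    synthetic: list[str],
    real: list[str],
    batch_size: int,
    real_fraction: float,
    rng: random.Random,
) -> list[str]:
    """Draw one batch containing a fixed share of real samples.

    Both pools are sampled with replacement, so a small real-data pool
    can still contribute to every batch; the remainder of the batch is
    filled from the synthetic pool.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    if not synthetic or not real:
        raise ValueError("both pools must be non-empty")
    n_real = round(batch_size * real_fraction)
    batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=batch_size - n_real)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    synthetic_pool = [f"sim_{i}" for i in range(1000)]
    real_pool = [f"real_{i}" for i in range(50)]
    batch = mixed_batch(
        synthetic_pool, real_pool, batch_size=8, real_fraction=0.25, rng=random.Random(0)
    )
    print(batch)
```

In practice the ratio would likely be tuned, or shifted toward real data in a later fine-tuning stage, which corresponds to the sequence of synthetic training followed by real-data fine-tuning described above.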
After the data problem is solved, the efficient integration of physical AI and virtual AI is the key to the further development of physical AI.
When we talk about physical AI, we often forget that physical AI and virtual AI are not contradictory. From an architectural perspective, a complete physical AI system can be roughly divided into three layers: the lowest layer is the perception layer (sensors, visual recognition), the middle layer is the cognitive decision-making layer (AI reasoning), and the top layer is the action execution layer (mechanical control).
Virtual AI is mainly responsible for the middle layer, while physical AI has to close the entire chain from perception to execution.
NVIDIA's "Chip + Model + Tool" full-stack package implements this idea. The Jetson Thor edge computing platform provides the computing power, the GR00T model provides the intelligence, and the Isaac platform provides the development toolchain. By this logic, whoever achieves deep integration of hardware and software will not only close the loop from the "brain" to the "body" of physical AI, but also build its own technological moat.
The last point is the commercialization of physical AI. Three years ago, investors valued the robot sector largely on "technological vision". Now the capital market applies a more pragmatic standard: delivery ability.
According to media reports, the total financing volume in the field of physical AI in China in 2025 was 73.5 billion yuan, and there were 744 investment and financing events. Since 2026, more than 3.7 billion yuan has been added, and the sum has exceeded 110 billion yuan. But behind...