
What's behind the Physical AI collaboration between Alibaba and NVIDIA?

Zimu Bang · 2025-09-26 09:25
Alibaba announced that its AI platform will integrate the complete NVIDIA Physical AI software stack into its developer options menu.

At the Yunqi Conference, Alibaba announced that its AI platform would include the complete NVIDIA Physical AI software stack in its developer options menu. This seemingly technical announcement marks an important turning point in the development of artificial intelligence: NVIDIA CEO Jensen Huang stated clearly at CES 2025 that the next frontier of AI is Physical AI, which holds great potential and opportunity.

According to market research data, the global industrial robot market is expected to grow from 154.4 billion yuan in 2024 to 300 billion US dollars in 2025. Within that market, the application of AI technology in industrial robots is expanding rapidly at a compound annual growth rate of 21.9%.

However, most current industrial robots are still traditional automation equipment, performing fixed actions according to pre-set programs. Once the environment changes, such as a slight shift in the position or shape of parts, manual reprogramming is required. Physical AI robots can autonomously adapt to these changes and complete tasks through real-time perception and decision-making.

The growth potential brought about by the upgrade from traditional industrial robots to Physical AI is the fundamental reason for the cooperation between Alibaba and NVIDIA. But before that, we need to understand one question: What is Physical AI?

A

If we have to summarize what Physical AI is in one sentence, it is a technology that enables artificial intelligence to step out of the screen and truly enter the physical world.

Take a simple example: traditional AI can recognize a cup and tell you what it is, while Physical AI can not only recognize the cup but also judge its weight and material, calculate the force required to grasp it, and figure out how to avoid spilling the liquid inside. This difference means that their application scenarios are completely different.

Jensen Huang emphasized that the core of Physical AI lies in combining physical laws with artificial intelligence technology. By integrating real-world physical rules, it optimizes the content generated by AI to make it more in line with the logic and laws of the real world. As the name suggests, Physical AI is physics + AI, which means the content fed back by artificial intelligence should conform to physical laws.

The concept of Physical AI did not emerge overnight but is the result of NVIDIA's years of technological accumulation and strategic layout. As early as 2021, NVIDIA began to mention the concept of Physical AI at the GTC conference. However, it was not until the GTC 2024 conference in March 2024 that it was officially launched as a core strategy. At that conference, Jensen Huang systematically elaborated on the vision of Physical AI for the first time and released relevant technology platforms and toolchains.

In Jensen Huang's view, the development of AI has gone through three distinct stages: first came Perceptual AI, which can understand images, text, and sounds, represented by computer vision and speech recognition; then came Generative AI, which can create text, images, and sounds, represented by ChatGPT, DALL-E, and the like. Now we are entering the era of Physical AI, in which AI can not only understand the world but also reason, plan, and act like a human being.

The technical foundation of Physical AI is built on three key components: the World Model, the Physics Simulation Engine, and the Embodied Intelligence Controller. The World Model is the cognitive core of Physical AI. Different from traditional language models or image models, it needs to build a complete understanding of three-dimensional space, including the geometric shape, material properties, motion state, and interrelationships of objects. Technically, this is usually achieved through methods such as Neural Radiance Fields (NeRF), 3D Gaussian Splatting, or voxel grids for spatial representation. The model needs to learn the implicit representation of physical laws, such as parameters like gravitational acceleration, friction coefficient, and elastic modulus, and be able to predict future physical evolution based on the current state.
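To make the idea of "predicting future physical evolution from the current state" concrete, here is a minimal, illustrative sketch in Python: a toy object state rolled forward under gravity and sliding friction. It is not NVIDIA's world model; the class, constants, and parameters are hypothetical and chosen only to show what such a forward prediction involves.

```python
# Toy world-model sketch: roll an object's state forward under gravity
# and simple ground friction. Purely illustrative, not a real world model.
from dataclasses import dataclass

import numpy as np


@dataclass
class ObjectState:
    position: np.ndarray   # (3,) metres
    velocity: np.ndarray   # (3,) metres per second
    mass: float            # kilograms
    friction: float        # sliding-friction coefficient (dimensionless)


GRAVITY = np.array([0.0, 0.0, -9.81])  # m/s^2


def predict(state: ObjectState, dt: float) -> ObjectState:
    """Roll the state forward one step under gravity plus ground friction."""
    accel = GRAVITY.copy()
    if state.position[2] <= 0.0:                      # resting on the floor
        accel[2] = 0.0                                # normal force cancels gravity
        speed = np.linalg.norm(state.velocity[:2])
        if speed > 1e-6:                              # sliding: friction decelerates it
            accel[:2] -= state.friction * 9.81 * state.velocity[:2] / speed
    velocity = state.velocity + accel * dt
    position = state.position + velocity * dt
    if position[2] < 0.0:                             # crude floor contact
        position[2], velocity[2] = 0.0, 0.0
    return ObjectState(position, velocity, state.mass, state.friction)


# Predict 10 ms ahead for a cup sliding off a table edge (hypothetical values).
s = ObjectState(np.array([0.0, 0.0, 0.8]), np.array([0.2, 0.0, 0.0]), 0.3, 0.5)
s = predict(s, dt=0.01)
```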

The Physics Simulation Engine is responsible for real-time calculation of physical interactions. This is not a simple set of pre-set rules but a dynamic computing system based on partial differential equation solvers, which needs to handle complex physical phenomena such as rigid-body dynamics, fluid mechanics, and soft-body deformation. In terms of technical implementation, the Finite Element Method (FEM), particle systems, or deep-learning-based differentiable physics simulators are usually used. The key lies in the balance between computational efficiency and accuracy: the system needs to complete complex physical calculations within milliseconds while ensuring sufficient accuracy to support accurate decision-making.
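As a rough illustration of what a physics engine computes on every tick, the sketch below advances a toy particle system one timestep with semi-implicit Euler integration. Real engines based on FEM or differentiable simulation are far more sophisticated; the function and parameters here are assumptions made for the example.

```python
# Toy particle-system step (semi-implicit Euler). Illustrative only.
import numpy as np


def step_particles(pos, vel, masses, forces, dt=1e-3):
    """Advance N particles one timestep.

    pos, vel, forces: (N, 3) arrays; masses: (N,) array.
    Semi-implicit Euler updates velocity first, then position, which is
    cheap and noticeably more stable than explicit Euler at the same dt.
    """
    accel = forces / masses[:, None]          # a = F / m
    vel = vel + accel * dt                    # v_{t+1} = v_t + a * dt
    pos = pos + vel * dt                      # x_{t+1} = x_t + v_{t+1} * dt
    return pos, vel


# Example: 1000 particles under gravity, advanced by one millisecond.
n = 1000
pos = np.random.rand(n, 3)
vel = np.zeros((n, 3))
masses = np.ones(n)
gravity = np.tile([0.0, 0.0, -9.81], (n, 1)) * masses[:, None]
pos, vel = step_particles(pos, vel, masses, gravity)
```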

The Embodied Intelligence Controller is the bridge connecting virtual reasoning and physical execution. It receives the prediction results from the World Model and the calculation output from the physics simulation and generates specific control instructions. Technically, this is usually based on Model Predictive Control (MPC) or Deep Reinforcement Learning (DRL) algorithms. The controller needs to handle high-dimensional state and action spaces while considering the physical limitations, delays, and noise of the actuators.
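The controller's role can be illustrated with a minimal sampling-based model-predictive-control loop ("random shooting"): simulate candidate action sequences with a model, score them against a cost, execute only the first action of the best sequence, then re-plan. The point-mass dynamics and cost function below are toy placeholders, not a real robot model.

```python
# Minimal sampling-based MPC ("random shooting") sketch. Illustrative only.
import numpy as np


def dynamics(state, action, dt=0.05):
    """Toy point-mass model: state = [position, velocity], action = force."""
    pos, vel = state
    vel = vel + action * dt
    pos = pos + vel * dt
    return np.array([pos, vel])


def cost(state, target):
    """Penalize distance to the target position and residual velocity."""
    pos, vel = state
    return (pos - target) ** 2 + 0.1 * vel ** 2


def mpc_action(state, target, horizon=15, samples=256, seed=0):
    """Return the first action of the lowest-cost sampled action sequence."""
    rng = np.random.default_rng(seed)
    best_cost, best_first = np.inf, 0.0
    for _ in range(samples):
        seq = rng.uniform(-1.0, 1.0, size=horizon)   # candidate force sequence
        s, total = state.copy(), 0.0
        for a in seq:
            s = dynamics(s, a)
            total += cost(s, target)
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first


state = np.array([0.0, 0.0])
action = mpc_action(state, target=1.0)   # force to apply for this control tick
```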

From the perspective of the system architecture, Physical AI adopts a hierarchical design. The perception layer integrates a multi-modal sensor array, including RGB-D cameras, lidars, IMUs, force/torque sensors, etc. The key technical challenge lies in sensor fusion and real-time processing: the system needs to unify the data from different sensors into the same coordinate system and handle time synchronization, calibration errors, and data noise. Technically, methods such as Kalman filtering, particle filtering, or deep-learning-based sensor fusion networks are usually used.
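A one-dimensional Kalman filter is the simplest version of the fusion step described above: each new measurement is folded into a running estimate, weighted by how much the filter trusts it. The sketch below fuses a hypothetical lidar range with a noisier camera-derived range; real systems work in 3D with many more sensors.

```python
# One-dimensional Kalman filter: fuse noisy measurements into one estimate.
def kalman_update(x, p, z, r, q=1e-4):
    """One predict/update cycle for a scalar state.

    x, p : prior estimate and its variance
    z, r : new measurement and its noise variance
    q    : process noise added during the predict step
    """
    # Predict: state assumed constant, uncertainty grows by the process noise.
    p = p + q
    # Update: the Kalman gain balances trust between prediction and measurement.
    k = p / (p + r)
    x = x + k * (z - x)
    p = (1.0 - k) * p
    return x, p


# Fuse a lidar range (accurate) with a camera-derived range (noisier).
x, p = 0.0, 1.0
x, p = kalman_update(x, p, z=2.02, r=0.01)   # lidar reading
x, p = kalman_update(x, p, z=1.90, r=0.10)   # camera reading
```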

The cognitive layer runs the World Model and the Physics Simulation Engine. This layer is extremely computationally intensive and requires specialized hardware acceleration. NVIDIA's solution is to use GPU clusters for parallel computing, and it has developed specialized CUDA kernels to optimize physical simulation algorithms. Memory management is also a key technical point: the system needs to maintain large-scale 3D scene representations and physical states within limited GPU memory.

The execution layer is responsible for motion planning and control. The core technology is inverse kinematics solving and trajectory optimization. For multi-degree-of-freedom robot systems, complex constrained optimization problems need to be solved in real time. Modern methods usually combine analytical solutions and numerical optimization, use the pseudo-inverse of the Jacobian matrix to handle redundant degrees of freedom, and adopt Quadratic Programming (QP) or Sequential Quadratic Programming (SQP) to handle constraints.
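The sketch below shows a damped Jacobian pseudo-inverse IK step for a planar two-link arm, the simplest instance of the approach just described. Link lengths, damping, and the target point are illustrative; a real arm would add joint limits and QP- or SQP-based constraint handling.

```python
# Damped Jacobian pseudo-inverse IK for a planar 2-link arm. Illustrative only.
import numpy as np


def fk(q, l1=0.4, l2=0.3):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])


def jacobian(q, l1=0.4, l2=0.3):
    """2x2 Jacobian of end-effector position with respect to joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])


def ik_step(q, target, damping=1e-2):
    """Move the joints a small step toward the Cartesian target."""
    error = target - fk(q)
    J = jacobian(q)
    # Damped least-squares pseudo-inverse: J^T (J J^T + lambda I)^-1
    J_pinv = J.T @ np.linalg.inv(J @ J.T + damping * np.eye(2))
    return q + J_pinv @ error


q = np.array([0.3, 0.5])
for _ in range(100):                       # iterate toward the target point
    q = ik_step(q, target=np.array([0.5, 0.2]))
```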

When NVIDIA released Physical AI, it also launched a corresponding complete technology ecosystem, including the Omniverse simulation platform, the Isaac robot development kit, the Cosmos world foundation model, etc.

This is because the training of Physical AI requires a large amount of physical interaction data, but the cost of collecting that data in the real world is extremely high. The solution is simulation-based data generation, so NVIDIA uses the Omniverse and Cosmos platforms to generate large-scale synthetic training data covering various physical scenarios, material properties, and interaction modes. However, models trained in a simulation environment often perform poorly in the real world, which is known as the "reality gap." What NVIDIA is currently doing is using Sim-to-Real Transfer techniques to bridge the gap between virtual and real data.
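One common way to narrow that reality gap is domain randomization: vary the simulator's physical parameters across training episodes so a policy cannot overfit to one idealized world. The sketch below shows only the idea; the parameter names and ranges are hypothetical and are not taken from Omniverse or Cosmos.

```python
# Domain randomization sketch: perturb physics parameters per episode.
import numpy as np

rng = np.random.default_rng(42)


def sample_sim_params():
    """Draw a random set of physics parameters for one training episode."""
    return {
        "friction":     rng.uniform(0.4, 1.2),    # surface friction coefficient
        "object_mass":  rng.uniform(0.05, 0.50),  # kilograms
        "motor_delay":  rng.uniform(0.00, 0.03),  # seconds of actuation latency
        "sensor_noise": rng.uniform(0.00, 0.02),  # std-dev added to observations
        "light_level":  rng.uniform(0.3, 1.0),    # rendering brightness factor
    }


# Hypothetical training loop: each episode runs in a differently perturbed world.
for episode in range(3):
    params = sample_sim_params()
    print(f"episode {episode}: {params}")
    # simulator.reset(**params); policy.train_on(simulator)  # pseudo-step
```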

Physical AI has much higher requirements for computing resources than traditional AI applications. A single Physical AI system may require hundreds of GPU cores to run in real time. NVIDIA has specifically developed the RTX PRO server and the DGX Cloud platform to support this computing demand. The system architecture adopts distributed computing, assigning different computing tasks to specially optimized hardware. This technical architecture enables Physical AI to achieve real-time perception, reasoning, and action in complex real-world environments, truly realizing the leap of AI from the virtual world to the physical world.

Another point is that, different from traditional AI systems that mainly process digital information such as text and images, Physical AI is driven by large models. It enables machines not only to process data but also to understand the spatial relationships and physical laws of the three-dimensional world. This technology endows AI systems with spatial perception abilities similar to those of living organisms, enabling them to perform complex physical operations in the real environment.

Take a specific example to illustrate this difference: if an AI generates a video of a robot grasping an object, traditional generative AI may create scenes where the object floats in the air, the robotic arm passes through solid obstacles, or the law of gravity is violated, because it only imitates at the pixel level based on training data. Physical AI, on the other hand, ensures that the generated content fully conforms to how the physical world operates: objects fall under the influence of gravity, the robotic arm must bypass obstacles, and the grasping force must match the weight of the object.

The profound significance of this technological innovation is that it transforms AI from a pure information-processing tool into an intelligent system that can truly understand and operate in the physical world. Traditional AI is like a scholar who only reads books and has never practiced, possessing rich theoretical knowledge but lacking practical experience; Physical AI is like an engineer with both theoretical knowledge and practical experience, who not only knows what and why but, more importantly, knows how, and can transform abstract knowledge into concrete action.

B

Jensen Huang is extremely optimistic about the future of Physical AI. He once said at CES that Physical AI will trigger an industry transformation worth over $50 trillion, involving 10 million factories, 200,000 warehouses, billions of humanoid robots in the future, and 1.5 billion cars and trucks. This figure is shocking, but there is solid logic behind it.

"There are one billion knowledge workers in the world, and AI agents may be the next big thing in the robotics industry, likely a trillion-dollar opportunity," Jensen Huang said at CES 2025. He believes that Physical AI means AI is no longer confined to the virtual world but is starting to enter the real world and will become the mainstream application in various industries such as robotics, logistics, automotive, and manufacturing.

In Jensen Huang's plan, there will be two high-volume robot products in the future: the first is self-driving cars, and the second is most likely humanoid robots. Both types of machines need to have human-like perception abilities, be able to cope with rapidly changing environments, and make instant reactions with almost no room for error. He is particularly excited about the potential of humanoid robots because they are most likely to adapt to environments designed for humans.

Jensen Huang also predicted that the era of robots has arrived, and in the future, all moving objects will operate autonomously. Behind this prediction is a deep judgment on the maturity and application potential of Physical AI technology. From a technological development perspective, with the improvement of computing power, the reduction of sensor costs, and the optimization of algorithms, Physical AI is approaching the critical point of moving from a laboratory concept to commercial application.

NVIDIA's positioning in the field of Physical AI can be traced back to its investment in robotics technology many years ago. The core of the concept it proposes remains, as noted above, to combine physical laws with artificial intelligence so that AI-generated content better follows the logic and laws of the real world.

However, NVIDIA cannot move too aggressively. Different from traditional AI applications, Physical AI systems interact directly with the physical world, and errors in these systems may lead to serious safety consequences. This requires Physical AI systems to meet higher reliability and safety standards.

NVIDIA's current solution is the Halos safety system. It is a full - stack safety system that can unify the hardware architecture, AI models, software tools, and safety standards to ensure the stable operation of Physical AI systems in various environments. From data collection, model training to deployment and application, every step requires strict safety verification.

For Alibaba, the decision to include NVIDIA's Physical AI software stack in its developer options reflects deep-seated strategic considerations. Current large-scale AI model applications are mainly concentrated in online scenarios, while Physical AI aims to bring the entire real world into AI. This leap from the virtual to the real is the high ground that Alibaba Cloud needs to seize in the AI era.

Wu Yongming, the Chairman and CEO of Alibaba Cloud Intelligence Group, said at the Yunqi Conference: "The greatest potential of generative AI is not to create one or two new super apps on the mobile phone screen but to take over the digital world and change the physical world." This statement clearly shows Alibaba's understanding of the importance of Physical AI.

Zhou Jingren, the CTO of Alibaba Cloud, once said that Tongyi Qianwen has open-sourced more than 300 models, with a cumulative download volume of over 600 million.

However, in the face of the development trend of Physical AI, the Tongyi large-scale model also faces the challenge of transforming from two-dimensional understanding to three-dimensional interaction. Traditional large language models are good at processing text and images but have inherent limitations in understanding the spatial relationships and physical laws of the physical world. This is the fundamental reason why Alibaba needs to introduce the Physical AI technology stack.

But this is also Alibaba's bottleneck: most of its data comes from the Internet rather than from offline sources, which forces it to find a new way to help Tongyi make the transition from the virtual world to the physical one.

Fei-Fei Li has expressed a similar view: she believes that if AI cannot build a three-dimensional world model, it cannot truly understand, operate in, or reconstruct the real world.

By integrating NVIDIA's Physical AI software stack, Alibaba can endow the Tongyi large-scale model with spatial understanding and physical interaction capabilities. This integration is not just a layering of technologies but a strategic shift from language intelligence to spatial intelligence. Developers can use Alibaba Cloud's infrastructure and the language capabilities of the Tongyi large-scale model, combined with NVIDIA's physical simulation and robot control technology, to build AI systems that can truly work in the physical world.

On the other hand, the development of Physical AI is not isolated. It needs to be deeply integrated with the existing AI technology ecosystem. Large language models provide powerful language understanding and reasoning capabilities, computer vision technology provides environmental perception capabilities, and robotics technology provides physical execution capabilities. Physical AI is the product of the integration of these technologies.

In this integration process, the data flow and processing architecture are crucial. Physical AI systems need to process massive amounts of data from multiple sensors in real - time, make quick decisions, and control actuators to complete actions. This places extremely high requirements on the computing architecture and algorithm optimization.

Cloud-edge collaboration is an important deployment mode for Physical AI. Complex AI inference can be carried out in the cloud, while real-time control decisions are made on edge devices. This architecture both leverages the powerful computing power of the cloud and meets real-time requirements.
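The sketch below illustrates that split: a fast local control loop runs on the edge device and always acts on the most recent plan it has, while a slower "cloud" planner delivers new plans asynchronously. The thread standing in for the cloud service, and all timings, are hypothetical.

```python
# Cloud-edge split sketch: slow asynchronous planning, fast local control loop.
import queue
import threading
import time

plan_queue: "queue.Queue[str]" = queue.Queue()


def cloud_planner():
    """Stand-in for a cloud service: slow, heavyweight reasoning producing plans."""
    while True:
        time.sleep(2.0)                              # simulated network + inference latency
        plan_queue.put(f"plan@{time.time():.1f}")


def edge_control_loop(hz=50, steps=200):
    """Fast on-device loop: always act on the latest plan received so far."""
    current_plan = "default-plan"
    for _ in range(steps):
        try:
            current_plan = plan_queue.get_nowait()   # pick up a new plan if ready
        except queue.Empty:
            pass                                     # otherwise keep using the old plan
        # actuate(current_plan) would run here at a strict 1/hz period
        time.sleep(1.0 / hz)
    return current_plan


threading.Thread(target=cloud_planner, daemon=True).start()
edge_control_loop()
```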

So, in a sense, Alibaba also provides fertile ground for the development of Physical AI.

C

If the first-generation Perceptual AI enabled machines to see and hear, and the second-generation Generative AI enabled machines to create, then Physical AI enables machines to truly learn to act.

However, the development of Physical AI also faces many challenges. First, there are technical challenges: how to make AI systems operate stably in complex physical environments, and how to reduce the huge computational cost so the technology can become widespread, are urgent problems to be solved. In addition, the "reality gap" between simulation training and real-world application remains a major problem: although simulation can provide large amounts of data, ensuring that this data holds up in the real world is a key issue.

Physical AI may not disrupt all industries as rapidly as some predictions suggest, but it will gradually change our work and life. It is not only a technological innovation but also a subversion and reshaping of traditional industries. With the continuous development of technology and the expansion of application scenarios, Physical AI will become an important driving force for global economic growth and social progress.

This article is from the WeChat official account “Zimu Bang” (ID: wujicaijing), author: Miao Zheng. Republished by 36Kr with permission.