Physical AI Answers: "How Many Steps Does It Take to Put an Elephant into a Refrigerator?"
"How many steps are needed to put an elephant into a refrigerator?" In the past, the standard steps were: open the refrigerator door, put the elephant in, and close the refrigerator door. Then, if a robot were to carry out this instruction in an engineering practice, how many steps would it need? In the current era of rapid development of physical AI technology, we are not aiming to reproduce this scenario in reality. Instead, we use it as a concrete example to explore the technical capabilities of physical AI in the entire link of virtual simulation, logical reasoning, and real - world deployment, and verify how this technology can break the boundary between the information world and the physical world, providing new paths for solving complex engineering tasks.
When a robot needs to understand the physical properties of an elephant and the spatial structure of a refrigerator, and then plan a coherent sequence of actions, behind it stands an end-to-end technical stack: virtual environment construction, large-model reasoning and training, and real-world deployment. NVIDIA, drawing on its deep integration of computer graphics, physical simulation, and AI, has built a complete bridge from the virtual to the real for physical AI, with Omniverse + Cosmos at its core, making an engineering implementation of "putting an elephant into a refrigerator" possible.
Step 1: Build an "Elephant and Refrigerator" Scene Model in the Virtual World
In the engineering practice of robots performing complex tasks, the virtual environment is the "proving ground" for technical verification. Without elephant and refrigerator models that obey physical laws, the subsequent AI training and reasoning for the task of "putting an elephant into a refrigerator" has no reliable foundation.
NVIDIA's core advantage is using Omniverse to build a digital twin space that reproduces physical laws, then layering generative modeling capabilities on top through Cosmos, making the virtual elephant and refrigerator both physically faithful and flexible to create.
NVIDIA Omniverse is not an ordinary 3D modeling tool. It is a real-time collaboration and simulation platform built on the OpenUSD (Universal Scene Description) standard, whose core goal is to reproduce the physical world down to the millimeter, keeping the virtual environment highly consistent with real-world laws. When constructing a physical scene, Omniverse's physics engine calculates every detail. For the elephant, it simulates physical properties such as weight, the movement inertia of muscle, and skin elasticity, and can even reproduce the force distribution across the four limbs as the elephant walks, so that the force feedback the robot receives when interacting with the elephant matches reality. For the refrigerator, it breaks down the hinge mechanics of the door, the friction of the sealing strip, and the volume limits of the interior space, and can even simulate failure cases such as a jammed door or a door that no longer closes because the sealing strip has aged, giving subsequent tests comprehensive scene coverage.
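To make this concrete, here is a minimal sketch of how such physical properties can be authored with the OpenUSD Python API (pxr), the scene description layer Omniverse is built on. The file name, prim paths, mass, and hinge limits are illustrative assumptions, not values from any real Omniverse asset:

```python
# Minimal sketch: authoring physics properties with the OpenUSD Python API
# (pxr). File name, prim paths, mass, and hinge limits are illustrative
# assumptions, not values from a real Omniverse asset.
from pxr import Usd, UsdGeom, UsdPhysics

stage = Usd.Stage.CreateNew("elephant_fridge.usda")

# The elephant: a rigid body carrying a plausible adult-elephant mass.
elephant = UsdGeom.Xform.Define(stage, "/World/Elephant").GetPrim()
UsdPhysics.RigidBodyAPI.Apply(elephant)
UsdPhysics.MassAPI.Apply(elephant).CreateMassAttr(4000.0)  # kg

# The refrigerator: a cabinet plus a door attached by a revolute (hinge)
# joint with a limited swing range.
UsdGeom.Xform.Define(stage, "/World/Fridge/Cabinet")
door = UsdGeom.Xform.Define(stage, "/World/Fridge/Door").GetPrim()
UsdPhysics.RigidBodyAPI.Apply(door)

hinge = UsdPhysics.RevoluteJoint.Define(stage, "/World/Fridge/DoorHinge")
hinge.CreateBody0Rel().SetTargets(["/World/Fridge/Cabinet"])
hinge.CreateBody1Rel().SetTargets(["/World/Fridge/Door"])
hinge.CreateAxisAttr("Z")          # the door swings about the vertical axis
hinge.CreateLowerLimitAttr(0.0)    # degrees: fully closed
hinge.CreateUpperLimitAttr(110.0)  # degrees: fully open

stage.GetRootLayer().Save()
```

Because these physics schemas live in the USD file itself, any Omniverse-connected tool that opens the stage sees the same rigid bodies and joint limits.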
More importantly, Omniverse supports multi-tool collaboration and real-time rendering. Designers can create the elephant's appearance model in Maya and adjust the refrigerator's structural details in Blender; all modifications are synchronized to the Omniverse platform in real time. This avoids the file-format incompatibilities and version chaos of traditional modeling workflows and significantly improves the efficiency of virtual scene construction.
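As a rough illustration of this composition model, the sketch below assumes two hypothetical asset files, one exported from Maya and one from Blender, and assembles them by reference on a shared USD stage:

```python
# Minimal sketch: composing assets authored in different tools onto one
# shared USD stage by reference. The .usd file names are hypothetical
# exports, not real assets.
from pxr import Usd, UsdGeom

stage = Usd.Stage.CreateNew("shared_scene.usda")
UsdGeom.Xform.Define(stage, "/World")

# Reference the elephant authored in Maya and the fridge authored in Blender.
elephant = stage.DefinePrim("/World/Elephant")
elephant.GetReferences().AddReference("assets/elephant_maya.usd")

fridge = stage.DefinePrim("/World/Fridge")
fridge.GetReferences().AddReference("assets/fridge_blender.usd")

# Because both prims are references, re-exporting from either tool
# updates this stage without touching the other asset.
stage.GetRootLayer().Save()
```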
NVIDIA Cosmos, a platform of generative world foundation models for physical AI, lowers the threshold for virtual scene construction, letting engineers quickly generate training environments that meet their requirements. The generated scenes stay within what is technically feasible, with no exaggerated designs that depart from reality.
As NVIDIA's generative world foundation model platform for physical AI, Cosmos has fundamentally changed how virtual scenes are built. Traditionally, engineers had to model scenes by hand and tune parameters, which could take weeks or even months. With Cosmos, simply entering a text prompt (e.g., "an adult African elephant and a double-door refrigerator 2.5 meters high, placed in an indoor space of 20 square meters") or a reference image is enough to automatically generate a virtual scene that obeys physical laws.
This generative ability rests on two pillars. The first is common-sense understanding learned from large amounts of physical data: the model can, for example, automatically recognize the basic ordering that "the elephant is larger than the refrigerator door, so the door must be opened first and the elephant guided in afterward," keeping the scene logic consistent with real-world cognition. The second is deep integration with the Omniverse physics engine: a generated elephant model automatically picks up Omniverse's force-feedback parameters, and the refrigerator's door-opening logic connects directly to the simulation system with no extra debugging. For new scenarios, engineers therefore do not need to rebuild scenes from scratch; they can generate fresh training environments from text instructions, significantly lowering the development threshold for physical AI.
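The sketch below does not use any real Cosmos API; generate_scene() is a hypothetical stand-in meant only to illustrate the workflow pattern described above: parameterized text prompts in, physics-ready scene files out.

```python
# Illustrative workflow only: generate_scene() is a hypothetical stand-in
# for a Cosmos-style world model call, not a real NVIDIA API.
import itertools

PROMPT = (
    "an adult {species} elephant and a double-door refrigerator "
    "{height} m high, placed in an indoor space of {area} square meters"
)

def generate_scene(prompt: str) -> str:
    """Hypothetical stand-in: a real call would return a generated USD scene."""
    return f"scenes/{abs(hash(prompt)) % 10**8}.usda"

# Sweep the prompt parameters to mass-produce training-scene variants
# instead of hand-modeling each one.
for species, height, area in itertools.product(
    ["African", "Asian"], [2.0, 2.5], [20, 40]
):
    prompt = PROMPT.format(species=species, height=height, area=area)
    print(prompt, "->", generate_scene(prompt))
```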
Step 2: Teach AI to Understand Elephants and Refrigerators
With the virtual scene in place, the next step is enabling the robot to recognize its targets and work out the steps. This requires a large model with physical understanding and logical reasoning. NVIDIA's Cosmos Reason was developed precisely for this: it lets robots think through a task the way humans do, rather than mechanically executing preset instructions.
The virtual task of "putting an elephant into a refrigerator" is essentially a simulation of a large object interacting with an enclosed space, which involves multiple dimensions of decision-making: the AI must recognize the positional relationship between object and space, judge the operating state of the equipment, plan its own movement path, control the force it applies to avoid causing faults, and guide the object while avoiding obstacles. These requirements map closely onto real engineering scenarios such as industrial equipment handling and large-appliance installation, making the task a useful simulated training ground for engineering applications of AI.
Cosmos Reason is an open, customizable, commercially usable vision-language model (VLM) with 7 billion parameters, designed specifically for physical AI. By integrating physical understanding, prior knowledge, and common-sense reasoning, the model enables robots, autonomous vehicles, and visual AI agents to operate intelligently in real-world environments.
With Cosmos Reason, a robot can interpret its environment, decompose a complex command into subtasks on receiving it, and execute those subtasks using common sense, even in unfamiliar environments.
From visual input, Cosmos Reason can analyze the size of the "elephant" and the capacity of the "refrigerator" in real time and judge whether the elephant can fit inside. It can also break the complex task down into an executable action script: "Move in front of the refrigerator → Detect the state of the door → Start the door-opening motor → Stop when the door reaches 90 degrees → Move to the elephant's side → Send a guiding signal → Adjust position as the elephant moves → Confirm the elephant is completely inside → Close the refrigerator door." If the refrigerator door gets stuck in the virtual scene, Cosmos Reason does not repeatedly apply excessive force (which could damage the motor). Instead, it first locates the jam (e.g., a foreign object in the sealing strip) and then adjusts the door-opening angle, lifting the door slightly. This behavior draws on prior knowledge of mechanical fault handling rather than a single preset action instruction.
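As a minimal sketch of what such a decomposed plan with a fault-handling branch might look like on the robot side, the Python below models each step with an optional recovery action. The step names, the simulated jam, and the recovery logic are illustrative assumptions, not Cosmos Reason's actual output format:

```python
# Minimal sketch of a decomposed plan with a fault-handling branch.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    execute: Callable[[], bool]                   # returns True on success
    recover: Optional[Callable[[], bool]] = None  # optional fault handler

door_jammed = True  # simulated world state: debris in the sealing strip

def open_door() -> bool:
    # Drive the door motor and stop at 90 degrees on joint-angle feedback.
    return not door_jammed

def clear_jam() -> bool:
    # Prior-knowledge recovery: locate the jam, lift the door slightly,
    # and retry, instead of forcing the motor and risking damage.
    global door_jammed
    door_jammed = False
    return open_door()

plan = [
    Step("move_to_fridge", lambda: True),
    Step("open_door", open_door, recover=clear_jam),
    Step("guide_elephant_in", lambda: True),
    Step("close_door", lambda: True),
]

for step in plan:
    ok = step.execute() or (step.recover is not None and step.recover())
    print(f"{step.name}: {'ok' if ok else 'failed'}")
    if not ok:
        break  # escalate: replan or hand off to a human operator
```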
A robot typically needs two AI models: a VLM that understands instructions and plans actions, and a vision-language-action model (VLA) responsible for fast reaction and action execution. With Cosmos Reason as the VLM, robots can better understand vague instructions and derive concrete action plans from them.
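A minimal sketch of this division of labor follows; both classes are hypothetical stand-ins for the deliberative VLM and the reactive VLA, not real NVIDIA interfaces:

```python
# Sketch of the two-model split: a slow deliberative VLM plans,
# a fast VLA executes. Both classes are hypothetical stand-ins.
class PlannerVLM:
    """Deliberative model (a Cosmos Reason-class VLM): turns a vague
    instruction plus camera frames into a list of subtasks. Runs slowly."""
    def plan(self, instruction: str, observation: dict) -> list:
        return ["move_to_fridge", "open_door", "guide_elephant_in", "close_door"]

class ExecutorVLA:
    """Reactive vision-language-action model: maps a subtask plus the
    current observation to low-level motor commands. Runs at high rate."""
    def act(self, subtask: str, observation: dict) -> list:
        return [0.0] * 7  # e.g., a 7-DoF joint velocity command

planner, executor = PlannerVLM(), ExecutorVLA()
observation = {"rgb": None, "depth": None}  # stubbed sensor frame

for subtask in planner.plan("put the elephant into the refrigerator", observation):
    command = executor.act(subtask, observation)
    print(subtask, "->", command[:3], "...")
```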
Step 3: Transition the Robot from Virtual Training to Real-World Deployment
How can AI capabilities trained in the virtual world be applied in reality? NVIDIA's answer is the concept of the "three computers," which provides complete technical support for physical AI from training to deployment, covering the entire life cycle of robot intelligence: DGX for training the AI, AGX for deploying it, and Omniverse + Cosmos for simulation and synthetic data generation in between.
DGX: Training Physical AI
To teach a robot to "put an elephant into a refrigerator," the model must be trained on a large amount of virtual scene data (elephants of different sizes, refrigerators of different structures, different environmental disturbances). The enormous compute this training requires can only come from dedicated supercomputing infrastructure, so the training computer is crucial. NVIDIA's DGX systems can process this data efficiently: on one hand, they can rapidly iterate the Cosmos Reason model and optimize its task-decomposition logic; on the other, through reinforcement learning they let the robot adjust its strategy in failure scenarios (closing the door before the elephant is inside, or damaging the door by opening it with excessive force), improving robustness.
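As an illustration of how such failure scenarios can be discouraged during training, here is a toy reward function in the style of reinforcement learning reward shaping. The state fields, weights, and force limit are assumptions made for the sketch, not a real training configuration:

```python
# Illustrative reward shaping for the failure modes named above.
def reward(state: dict) -> float:
    r = 0.0
    if state["elephant_inside"] and state["door_closed"]:
        r += 10.0  # task success
    if state["door_closed"] and not state["elephant_inside"]:
        r -= 5.0   # closed the door too early
    if state["door_force"] > state["door_force_limit"]:
        r -= 2.0   # excessive force risks damaging the door
    r -= 0.01      # small step cost: prefer shorter episodes
    return r

# Example: an episode step where the robot forced the door.
print(reward({"elephant_inside": False, "door_closed": False,
              "door_force": 250.0, "door_force_limit": 120.0}))
```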
AGX: Deploying Physical AI
The trained model then needs to be "installed" on real-world robots. NVIDIA's Jetson AGX series (e.g., NVIDIA Jetson Thor) is an edge-computing platform designed for exactly this; it can run a lightweight version of the Cosmos Reason model. In real-world scenarios, AGX receives data from the robot's sensors (cameras, lidar) in real time and quickly outputs action instructions; for example, after detecting a real elephant's position, it can plan a movement path within 0.1 seconds, keeping the robot's actions free of perceptible delay.
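A minimal sketch of such an edge control loop with a 0.1-second planning budget follows; the sensor reader and the model call are stubs, not real Jetson APIs:

```python
# Sketch of an edge control loop with a latency budget, in the spirit of
# the "plan within 0.1 s" claim above. All calls are stand-ins.
import time

CONTROL_PERIOD_S = 0.1  # one planning update per 100 ms

def read_sensors() -> dict:
    # Stand-in for real camera/lidar drivers.
    return {"camera": None, "lidar": None}

def infer_action(obs: dict) -> list:
    # Stand-in for the on-device (e.g., distilled) model.
    return [0.0, 0.0, 0.0]

for _ in range(50):  # bounded here; a real controller loops indefinitely
    t0 = time.monotonic()
    action = infer_action(read_sensors())
    # send `action` to the motor controllers here
    elapsed = time.monotonic() - t0
    if elapsed > CONTROL_PERIOD_S:
        print(f"deadline miss: {elapsed * 1000:.1f} ms")  # degrade gracefully
    time.sleep(max(0.0, CONTROL_PERIOD_S - elapsed))
```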
Omniverse + Cosmos: Simulation and Synthetic Data Generation Platform
This is the core link among the "three computers" and the buffer zone between the virtual and the real. Researchers working on large language models are fortunate to have vast amounts of Internet data for pre-training; no comparable resource exists in the field of physical AI.
In reality, the cost of collecting training data for "putting an elephant into a refrigerator" is extremely high (it could damage the robot or harm the elephant), and it is hard to cover all the extreme cases (a sudden power failure, a slippery floor). Data collection is also time- and labor-intensive, making it expensive and hard to scale. In Omniverse, engineers can simulate thousands of scenarios, extreme cases included, and harvest large volumes of data for training physical AI.
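One common way to generate such coverage is domain randomization: sampling scenario parameters, including rare failure events, across thousands of episodes. The sketch below is illustrative; the parameter ranges and event probabilities are assumptions, not values from any NVIDIA pipeline:

```python
# Sketch of domain randomization for synthetic data: sample thousands of
# scenario variants, including rare failures, before any real-world trial.
import random

def sample_scenario(rng: random.Random) -> dict:
    return {
        "elephant_mass_kg": rng.uniform(2500, 6500),
        "door_width_m": rng.uniform(0.8, 1.4),
        "floor_friction": rng.uniform(0.1, 0.9),  # 0.1 ~ slippery floor
        "power_failure": rng.random() < 0.02,     # rare extreme event
        "door_jammed": rng.random() < 0.05,
    }

rng = random.Random(42)  # seeded for reproducible datasets
dataset = [sample_scenario(rng) for _ in range(10_000)]
print(sum(s["power_failure"] for s in dataset), "power-failure episodes")
```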
Rev Lebaredian, NVIDIA's vice president of Omniverse and simulation technology, has emphasized that physical AI is the bridge connecting the information world and the physical world, extending the reach of computing from the $5 trillion information industry to the $100 trillion physical-world market. "If you want to build a robot system that can operate safely in the real world, the only way is simulation. We must repeatedly test all possible extreme scenarios in simulation before deployment; real-world testing is too slow, too expensive, and too dangerous."
Beyond "Putting an Elephant into a Refrigerator": Physical AI Reconstructs All Industries
When a robot can actually "put an elephant into a refrigerator," physical AI will have taken the crucial step from a closed technical loop to real-world application. But this is just the beginning. NVIDIA's physical AI, with Omniverse + Cosmos at its core, is already moving into industry, logistics, healthcare, and other sectors, extending the influence of computing from the $5 trillion information industry to the $100 trillion physical-world market.
The virtual case of "putting an elephant into a refrigerator" is essentially a microcosm of NVIDIA's physical AI technology - it proves that through the closed - loop of virtual scene generation (Omniverse + Cosmos) → model reasoning training (Cosmos Reason + DGX) → real - world deployment optimization (AGX), AI can truly understand and transform the physical world. Now, NVIDIA is collaborating with partners such as Accenture, Avathon, Belden, DeepHow, Milestone Systems, and Telit Cinterion to strengthen global operations through perception and reasoning based on physical AI, integrating this technology into the global industrial ecosystem.
The virtual case of "putting an elephant into a refrigerator" is not about realizing an absurd real - world scenario. Instead, it marks the starting point of human technical exploration to break the boundary between the information world and the physical world with physical AI. And NVIDIA is at the forefront of this revolution.
This article is from the WeChat official account "Semiconductor Industry Insights" (ID: ICViews), written by Pengcheng, and is published by 36Kr with permission.