
Google releases a local VLA model. Is the "Android system" of the robotics world coming?

Friends of 36Kr | 2025-06-27 08:22
In the past, limited by bandwidth and computing power, many robot AIs could only serve as demonstrations.

"Previously, limited by bandwidth and computing power, many robot AIs could only serve as demonstrations. Google's progress this time means that the general model can truly run on hardware terminals, and in the future, complex operations can be performed without relying on the Internet."

On June 25, Google DeepMind officially released Gemini Robotics On-Device, its first vision-language-action (VLA) model that can be deployed entirely locally on a robot.

This also means that Embodied AI is reaching a crucial turning point, moving from dependence on cloud computing power toward autonomous local operation and opening a new window of possibility for industrial deployment.

Learning quickly from a handful of demonstrations, with the ability to generalize across robot forms

The deployment of Embodied AI has long faced two major challenges: heavy dependence on cloud computing resources, which limits a robot's ability to operate independently in unstable or offline network environments; and models so large that they are difficult to run efficiently on a robot's limited onboard computing resources.

According to the official introduction, Gemini Robotics On-Device can run locally on robots with limited computing power while demonstrating strong generality and task generalization. Because the model does not rely on a data network, it offers significant advantages for latency-sensitive applications.

More importantly, the model shows a high level of general ability and stability in actual operation. In the demonstration video released by Google DeepMind, a robot completed tasks such as "put a Rubik's Cube into a packaging bag" and "unzip a bag" without an Internet connection, covering perception, semantic understanding, spatial reasoning, and high-precision execution.

DeepMind researchers said the model retains the generality and flexibility of Gemini Robotics and can handle a wide range of complex two-handed tasks out of the box. Moreover, it needs only 50-100 demonstrations to learn a new skill. An engineer in the robotics field told reporters that most robots today must be trained thousands of times to complete a single task, which means Google's new model greatly expands the range of applications and the flexibility of deployment.

It is worth noting that although the model was initially trained for a specific robot, it can generalize to different robot forms, such as dual-arm robots and humanoid robots, greatly expanding its application potential. The demonstration video shows that on a dual-arm Franka robot, the model can follow general instructions, handle previously unseen objects and scenes, fold clothes, and perform industrial belt-assembly tasks that demand precision and dexterity.

In addition, this is the first time Google has opened one of its VLA models for fine-tuning, which means engineers or robot companies can further train the model on their own data to optimize its performance for specific tasks, scenarios, or hardware platforms, improving its efficiency and practical value in deployment. Google has also launched the Gemini Robotics SDK to help developers evaluate and quickly adapt the model. These moves suggest Google hopes to provide an open, general, and developer-friendly platform for the robotics field, much as the Android system did for the smartphone industry.
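The announcement does not spell out what fine-tuning on 50-100 demonstrations looks like in code, and the sketch below is not the Gemini Robotics SDK API. It is only a minimal behavior-cloning loop in PyTorch, with hypothetical names such as DemoDataset and PolicyHead, to illustrate the general idea of adapting a pretrained policy with a small set of teleoperated demonstrations.

```python
# Illustrative only: a minimal behavior-cloning loop showing, in principle, how a
# pretrained policy might be adapted with ~50-100 teleoperated demonstrations.
# None of these names come from the Gemini Robotics SDK; they are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class DemoDataset(Dataset):
    """Each demonstration step pairs an observation embedding with the action taken."""
    def __init__(self, obs_embeddings: torch.Tensor, actions: torch.Tensor):
        self.obs = obs_embeddings          # shape: (steps, obs_dim)
        self.actions = actions             # shape: (steps, action_dim)

    def __len__(self):
        return len(self.obs)

    def __getitem__(self, i):
        return self.obs[i], self.actions[i]

class PolicyHead(nn.Module):
    """A small action head fine-tuned on top of frozen pretrained features."""
    def __init__(self, obs_dim: int = 512, action_dim: int = 14):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, obs):
        return self.net(obs)

def finetune(dataset: DemoDataset, epochs: int = 20) -> PolicyHead:
    policy = PolicyHead()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for obs, action in loader:
            loss = nn.functional.mse_loss(policy(obs), action)  # imitate the demonstrated action
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# Toy stand-in for 100 demonstration steps (random tensors, just to make the sketch runnable).
demos = DemoDataset(torch.randn(100, 512), torch.randn(100, 14))
policy = finetune(demos)
```

In practice, the demonstration data would come from teleoperating the robot through the new task, and the pretrained VLA backbone would supply the observation features; the point of the sketch is only how small the adaptation dataset can be compared with training from scratch.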

Embodied AI is entering the "edge device era"

"This marks that robots can finally enter the real environment," an expert in the field of Embodied AI told a reporter from LanJing Technology. "Previously, limited by bandwidth and computing power, many robot AIs could only serve as demonstrations. Google's progress this time means that the general model can truly run on hardware terminals, and in the future, complex operations can be performed without relying on the Internet."

Embodied AI has long been seen as the bridge carrying AGI into the real world, and a locally deployable VLA model is a key link needed before that bridge can open to traffic. The aforementioned expert told the LanJing Technology reporter that local VLA models will make robots better suited to sensitive scenarios such as homes, healthcare, and education, addressing core concerns such as data privacy, real-time response, and safety and stability.

Over the past few years, "edge deployment" of large language models has become an important trend. These models have gone from relying entirely on large-scale cloud computing to running locally on edge devices such as phones and tablets, thanks to continuous progress in model compression and optimization, inference acceleration, and hardware-software co-design.

The same evolutionary path is gradually taking shape in Embodied AI. As the core architecture of Embodied AI, a VLA (vision-language-action) model essentially lets a robot understand a task from multi-modal information and act on it. Previously, such models usually depended on powerful cloud resources for reasoning and decision-making; constrained by network bandwidth, compute consumption, and real-time bottlenecks, they were hard to run efficiently in complex real-world environments.
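To make the "understand from multi-modal information and act" description concrete, here is a schematic sense-think-act loop in Python. VLAPolicy, the camera hook, and the action hook are hypothetical placeholders rather than any DeepMind interface; the point is simply that perception, language conditioning, and action output all run in one on-device loop with no network round-trip.

```python
# A schematic sense-think-act loop for a locally deployed VLA policy.
# `VLAPolicy`, `get_frame`, and `send_action` are hypothetical placeholders,
# not DeepMind APIs; they only show that perception, language, and action
# can run in a single on-device loop.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray       # latest RGB frame from the robot's camera
    instruction: str        # natural-language task, e.g. "unzip the bag"

class VLAPolicy:
    """Stand-in for an on-device vision-language-action model."""
    def act(self, obs: Observation) -> np.ndarray:
        # A real VLA model would fuse the image and the instruction and
        # output target joint positions or end-effector commands.
        return np.zeros(14)  # placeholder action for a 14-DoF dual-arm setup

def control_loop(policy: VLAPolicy, get_frame, send_action, instruction: str, steps: int = 1000):
    """Runs entirely on the robot: no network round-trip between perception and action."""
    for _ in range(steps):
        obs = Observation(image=get_frame(), instruction=instruction)
        action = policy.act(obs)
        send_action(action)

# Example wiring with dummy I/O functions so the sketch runs without hardware.
control_loop(VLAPolicy(),
             get_frame=lambda: np.zeros((224, 224, 3), dtype=np.uint8),
             send_action=lambda a: None,
             instruction="put the Rubik's Cube into the packaging bag",
             steps=5)
```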

Google's release of Gemini Robotics On-Device means Embodied AI is entering an "edge device era" similar to that of language models. The model not only runs stably on limited computing power but also retains good generality and transferability, supporting rapid learning and adaptation to different tasks and robot forms. The release may also trigger a chain reaction in the industry: as AI computing power and model architectures keep evolving, "edge intelligence" is moving from the traditional Internet of Things (IoT) toward a more advanced stage represented by Embodied AI.

Local VLA models are likely to become the next battleground. "Currently, differences in body structure, degrees of freedom, and sensor configuration across robots make a unified software architecture difficult to achieve," said an investor focused on the robotics field. "Once hardware standards converge, much as USB ports, keyboards, and screens became common specifications in the smartphone ecosystem, it will greatly accelerate the standardization of algorithms and the realization of local deployment." He believes Google's vision of building a "robot Android ecosystem" signals that a more standardized, developer-friendly, and widely adopted form of Embodied AI is on its way.
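As a rough illustration of the investor's point, the hypothetical Python interface below shows why differing embodiments get in the way of a unified software architecture: the same high-level command has to be translated separately for each robot's joints and degrees of freedom. None of these class names refer to an existing standard.

```python
# Hypothetical sketch: the same high-level command must be adapted per embodiment,
# which is why a unified software stack is hard before hardware standards converge.
from abc import ABC, abstractmethod
from typing import Sequence

class EmbodimentAdapter(ABC):
    """Maps a generic end-effector command onto one robot's joints and sensors."""
    @abstractmethod
    def num_joints(self) -> int: ...
    @abstractmethod
    def to_joint_targets(self, ee_pose: Sequence[float]) -> list[float]: ...

class DualArmAdapter(EmbodimentAdapter):
    def num_joints(self) -> int:
        return 14                              # e.g. two 7-DoF arms
    def to_joint_targets(self, ee_pose):
        return [0.0] * self.num_joints()       # placeholder for inverse kinematics

class HumanoidAdapter(EmbodimentAdapter):
    def num_joints(self) -> int:
        return 30                              # arms, torso, and legs differ by vendor
    def to_joint_targets(self, ee_pose):
        return [0.0] * self.num_joints()

# Until hardware standards converge, every new embodiment needs its own adapter
# (and usually its own calibration and fine-tuning data) behind the same policy.
for adapter in (DualArmAdapter(), HumanoidAdapter()):
    print(type(adapter).__name__, len(adapter.to_joint_targets([0.0] * 6)))
```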

However, the challenges of real-world implementation should not be underestimated. The diversity and complexity of robot hardware remain prominent issues: with so many kinds of robot hardware on the market, even a powerful general model must be carefully adapted and optimized for each specific platform. In addition, deploying at scale across diverse real-world scenarios could make data collection and annotation extremely expensive, especially in industrial or specialized service settings that require professional operating knowledge and equipment.

More importantly, robots must remain robust in extremely complex, dynamic, and unpredictable real-world environments. Changes in lighting, object occlusion, unstructured and cluttered surroundings, and subtle differences in human-robot interaction all put the model's real-time perception and decision-making to a severe test. Ensuring that robots remain stable and safe across real-world scenarios is a hard problem that the development of Embodied AI will have to keep overcoming.

This article is from the WeChat official account "LanJing TMT", author: Wu Jingjing, editor: Chen Ye, published by 36Kr with authorization.