Google launches an on-device VLA model, and the "Android for robots" makes its debut. Robots can learn new actions from as few as 50 demonstrations.
A physical-world model plus an Android for robots? Google launches its most powerful on-device robot model.
According to a report by Zhidx on June 25, early this morning Google launched its first on-device robot model, Gemini Robotics On-Device, bringing the multimodal reasoning and real-world understanding of Gemini 2.0 further into the physical world.
In March this year, Google launched its most powerful VLA (Vision-Language-Action) model, Gemini Robotics. The Gemini Robotics On-Device released today is an optimized version of Gemini Robotics and also the first fine-tunable VLA model; it runs locally on robot hardware and offers strong general dexterity and task generalization.
As shown in the video, Gemini Robotics On-Device brings AI onto the robot itself and can handle a variety of complex bimanual manipulation tasks out of the box, such as folding clothes and opening bags.
At the same time, Google also released the Gemini Robotics SDK to help developers evaluate Gemini Robotics On-Device, including testing it in the MuJoCo physics simulator. With just 50-100 demonstrations, developers can evaluate the model and teach robots new skills.
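The article does not detail the SDK's own APIs, but as a rough, hypothetical illustration of what "testing in the MuJoCo physics simulator" involves, the Python sketch below loads a toy model and steps the physics under a placeholder policy. The minimal XML model and the random policy are assumptions for illustration only, not part of the Gemini Robotics SDK.

```python
# Minimal MuJoCo rollout sketch (illustrative only; not the Gemini Robotics SDK API).
# Requires: pip install mujoco numpy
import mujoco
import numpy as np

# A toy single-joint model stands in for a real robot description (assumption).
XML = """
<mujoco>
  <worldbody>
    <body name="arm">
      <joint name="hinge" type="hinge" axis="0 0 1"/>
      <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.02"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="hinge" ctrlrange="-1 1"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

def policy(obs: np.ndarray) -> np.ndarray:
    """Placeholder policy; a VLA model's predicted action would go here."""
    return np.random.uniform(-1.0, 1.0, size=model.nu)

for step in range(500):
    obs = np.concatenate([data.qpos, data.qvel])  # simple proprioceptive observation
    data.ctrl[:] = policy(obs)                    # apply the action
    mujoco.mj_step(model, data)                   # advance the physics one step

print("final joint angle:", data.qpos[0])
```

In a real evaluation, the placeholder policy would be replaced by the model's action predictions and the rollout would be scored against task success criteria.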
Once released, the model attracted the attention of nearly 300,000 users on the social platform X. Some users commented: "These on-device models have firmly put Gemini Robotics on the path to becoming the 'Android of the robotics world.' Eventually, OEMs (original equipment manufacturers and integrators) will only need to focus on building the best robot hardware, and Gemini can simply serve as the 'brain.'"
01. Designed for dexterous manipulation, enabling robots to open bags and fold clothes
Gemini Robotics On-Device is a foundation model for dual-arm robots designed to minimize computational resource requirements. It builds on the task generalization and dexterity of Gemini Robotics and has the following characteristics:
1. Designed for rapid experimentation with dexterous manipulation.
2. Adaptable to new tasks through fine-tuning to improve performance.
3. Optimized to run locally with low-latency inference.
Gemini Robotics On-Device achieves strong visual, semantic, and behavioral generalization across a wide range of test scenarios. It follows natural-language instructions and smoothly completes highly dexterous tasks such as opening bags and folding clothes, all running directly on the robot.
In Google's evaluation, Gemini Robotics On-Device showed strong generalization while running entirely locally. The figure below compares it with Google's flagship Gemini Robotics model and the previous best on-device model; Gemini Robotics On-Device scored highest across all three tests: visual generalization, semantic generalization, and action generalization.
On more challenging out-of-distribution tasks and complex multi-step instructions, the Gemini Robotics On-Device model also outperforms other on-device alternatives. The figure below shows its instruction-following performance, where it scored higher than the flagship Gemini Robotics model and the previous best on-device model.
For more details, you can read the Gemini Robotics technical report "Gemini Robotics: Bringing AI into the Physical World" published by Google in March this year.
Report link: https://arxiv.org/pdf/2503.20020
02. The first fine-tunable VLA model, suitable for embodiments ranging from robotic arms to humanoid robots
Gemini Robotics On-Device is Google's first fine-tunable VLA model.
Although many tasks work out of the box, developers can also fine-tune the model to get better performance for their applications. Gemini Robotics On-Device adapts quickly to new tasks with only 50-100 demonstrations, which shows how well this on-device model generalizes its foundational knowledge to new tasks.
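The article does not describe the fine-tuning workflow itself, but conceptually, adapting a policy from 50-100 demonstrations resembles supervised behavior cloning on recorded observation-action pairs. The sketch below is a generic illustration under that assumption; the tiny MLP policy and the synthetic "demonstrations" are stand-ins, not the Gemini Robotics On-Device architecture or SDK workflow.

```python
# Generic behavior-cloning sketch: adapt a small policy from ~50-100 demonstrations.
# Illustrative only; not the Gemini Robotics On-Device model or SDK.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, NUM_DEMOS, STEPS_PER_DEMO = 32, 14, 60, 50  # assumed sizes

# Synthetic stand-in for recorded demonstrations: (observation, action) pairs.
obs = torch.randn(NUM_DEMOS * STEPS_PER_DEMO, OBS_DIM)
act = torch.randn(NUM_DEMOS * STEPS_PER_DEMO, ACT_DIM)

policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for epoch in range(20):
    pred = policy(obs)                        # predict actions from observations
    loss = nn.functional.mse_loss(pred, act)  # imitate the demonstrated actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final imitation loss: {loss.item():.4f}")
```

In practice, the observations would be images plus proprioception and the "policy" would be the full VLA model, but the supervised imitation loop is the same basic idea.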
Google demonstrated how Gemini Robotics On-Device outperforms the current best on-device VLA model when fine-tuned for new tasks. The model was tested on seven dexterous manipulation tasks of varying difficulty, including unzipping a lunch box, drawing on a card, and pouring salad dressing.
The figure below shows the task-adaptation performance of Gemini Robotics On-Device when fine-tuned with around 100 demonstrations.
Google further fine-tuned Gemini Robotics On-Device to adapt it to different robots. Although the model was trained only on the ALOHA robot, it could be further adapted to the dual-arm Franka FR3 robot and Apptronik's Apollo humanoid robot.
On the dual-arm Franka, the model can execute general-purpose instructions, including handling previously unseen objects and scenes, completing dexterous tasks such as folding clothes, and performing industrial belt-assembly tasks that demand precision and dexterity.
On the Apollo humanoid robot, Google adapted the model to a completely different embodiment. The same generalist model can follow natural-language instructions and manipulate different objects in a general way, including objects it has never seen before.
Google is developing all Gemini Robotics models in accordance with its AI Principles, applying a holistic safety approach that covers both semantic and physical safety.
03. Conclusion: Large models are accelerating their move into the physical world
Gemini Robotics On-Device marks an important step forward in the accessibility and adaptability of powerful robot models, and is expected to help robot developers address important latency and connectivity challenges.
It is also worth noting that the Gemini Robotics SDK lets developers adapt the model to their own needs, further accelerating innovation. Next, we can expect more robot developers to use these new tools to build robots with innovative applications.
This article is from the WeChat official account "Zhidx" (ID: zhidxcom). Author: Li Shuiqing, Editor: Xinyuan. Republished by 36Kr with permission.