Google unveils its most powerful robot "brain" yet, able to sort garbage from a single sentence. Here is an analysis of the key technologies.
According to a September 26 report by ZDONGXI, Google DeepMind today launched the Gemini Robotics 1.5 series of robot models. Through a chain-of-thought mechanism and collaboration between models, the series further increases robot autonomy, enabling robots to perceive, plan, think, use tools, and act in order to better solve complex multi-step tasks. Google calls it an important step toward bringing AI agents into the physical world.
The two models released by Google DeepMind this time are Gemini Robotics 1.5 and Gemini Robotics-ER 1.5. The former is Google's most capable VLA (vision-language-action) model to date; it converts visual information and text instructions into robot control commands and mainly plays the role of the robot's "cerebellum". The model thinks before acting and exposes its thinking process, and it can learn across different robot bodies to improve learning efficiency.
Gemini Robotics-ER 1.5 is Google's most capable VLM (vision-language model) to date. It can reason about the physical world and acts more like the robot's "brain". It natively calls digital tools and creates detailed multi-step plans to complete tasks. The model achieved state-of-the-art results on spatial-understanding benchmarks, and its embodied reasoning far exceeds that of models such as GPT-5 and Gemini 2.5 Flash.
Robots equipped with these two new models unlock the ability to complete complex, long-horizon tasks. For example, you can ask a robot to look up the local garbage-sorting rules and put the items on the table into the correct bins; the models accurately understand this complex request and drive the robot to complete it.
Developers can access the Gemini Robotics-ER 1.5 model through the Gemini API in Google AI Studio, while Gemini Robotics 1.5 is currently available only to select partners. Google has also released a technical report on the Gemini Robotics 1.5 series of models.
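As a rough illustration of what that access looks like, the sketch below calls the model through the google-genai Python SDK with Google Search grounding enabled, mirroring the garbage-sorting example above. The model ID string, the prompt wording, and the assumption that this model accepts the standard Google Search tool configuration are all illustrative guesses, not details confirmed by the article; check Google AI Studio for the identifier actually exposed to your account.

```python
# Minimal sketch (not official sample code): asking Gemini Robotics-ER 1.5 to
# look up local waste-sorting rules and plan the sorting task described above.
# The model ID below is an assumption; verify it in Google AI Studio.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # key from Google AI Studio

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model ID
    contents=(
        "Look up the local waste-sorting rules for my city, then give an "
        "ordered plan for putting a plastic bottle, a banana peel, and a "
        "soda can into the correct compost, recycling, and trash bins."
    ),
    config=types.GenerateContentConfig(
        # Google Search grounding lets the model fetch the rules it needs.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)  # a step-by-step plan a downstream VLA could execute
```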
Technical Report:
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
Model Link:
https://deepmind.google/models/gemini-robotics/gemini-robotics/
01. Built on the Gemini foundation model, with training data from three types of robots
Most everyday tasks require contextual information and multiple steps to complete, which remains challenging for today's robots. To help robots handle such complex, multi-step tasks, Google DeepMind has the two models, Gemini Robotics 1.5 and Gemini Robotics-ER 1.5, work together within a single agent framework.
The embodied reasoning model Gemini Robotics-ER 1.5 coordinates the robot's activities like a brain. It excels at planning and making logical decisions in physical environments, has advanced spatial understanding, interacts with users in natural language, judges task success and progress, and can call tools such as Google Search to find information or invoke any third-party user-defined functions.
Gemini Robotics-ER 1.5 provides natural language instructions for each step, while Gemini Robotics 1.5 directly executes specific actions using its visual and language understanding. Gemini Robotics 1.5 also helps robots think about their actions to better solve semantically complex tasks and can even explain its thinking process in natural language, making its decision-making more transparent.
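To make this division of labor concrete, here is a minimal, hypothetical sketch of such an agent loop in Python. The function names and canned plan are placeholders standing in for the two models; none of this is the actual framework Google describes.

```python
# Hypothetical sketch of the orchestrator/executor split described above.
# The two model calls are replaced with canned stand-ins so the loop runs
# end to end; none of these function names come from the actual release.

def orchestrator_next_step(goal: str, done: list[str]) -> str | None:
    """Stand-in for Gemini Robotics-ER 1.5: plan the next natural-language step."""
    plan = ["pick up the cup", "move to the sink", "put down the cup"]
    remaining = [s for s in plan if s not in done]
    return remaining[0] if remaining else None

def executor_run_step(step: str) -> bool:
    """Stand-in for Gemini Robotics 1.5: turn one step into motor commands."""
    print(f"executing: {step}")
    return True  # a real system would report success from sensor feedback

def run_task(goal: str) -> None:
    done: list[str] = []
    while (step := orchestrator_next_step(goal, done)) is not None:
        if executor_run_step(step):   # low-level execution of one step
            done.append(step)         # orchestrator tracks progress
        # on failure, the next planning call sees the unchanged scene and retries

run_task("clean up the table")
```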
Both models are built on the Gemini family of models, so they inherit Gemini's general capabilities in multimodal world knowledge, advanced reasoning, and tool use. Each was then fine-tuned on different datasets to specialize in its role; combined, they improve a robot's ability to generalize to long-horizon tasks and diverse environments.
The training dataset shared by the Gemini Robotics 1.5 series spans three modalities: images, text, and robot sensor and action data.
The robot dataset used for training is multi-embodiment, covering thousands of diverse tasks, from grasping and manipulation to dual-arm coordination and humanoid robots performing complex everyday tasks. The data were collected from multiple heterogeneous robot platforms, including ALOHA, Bi-arm Franka, and Apollo humanoid robots.
The Gemini Robotics 1.5 series of models can complete cross-embodiment tasks out of the box
In addition to the robot-specific data, the training set includes public text, image, and video data from the Internet, so the model not only acquires robot-related skills but also gains stronger generalization from large-scale world knowledge.
To ensure the high quality and safety of training, all data must be strictly processed before use. Google DeepMind ensures that the data complies with relevant policies and removes low-quality samples and non-compliant content through multi-stage screening.
Each image in the dataset is paired with both an original caption and a synthetic caption. The synthetic captions are generated by the Gemini and FlexCap models and help the model capture fine details and contextual semantics in images.
Training used the latest generation of hardware, including TPU v4, v5p, and v6e, combined with the JAX and ML Pathways frameworks for efficient parallel training and cross-platform scaling.
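For readers unfamiliar with JAX, a minimal data-parallel training step looks roughly like the sketch below. This is only an illustration of the kind of parallelism JAX provides, not the actual Gemini Robotics training code.

```python
# Illustrative only: a toy data-parallel training step in JAX. This is not
# Gemini Robotics training code; it just shows the pmap-style parallelism
# that frameworks like JAX and ML Pathways build on.
from functools import partial

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]          # toy linear model
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="devices")           # replicate across devices
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")  # average gradients
    return jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

# params and each batch must carry a leading device axis, e.g. replicated
# with jax.device_put_replicated(params, jax.local_devices()).
```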
02. Cross-embodiment via the Motion Transfer mechanism, letting robots "think before acting"
As a VLA model, Gemini Robotics 1.5's mission is to "understand instructions and convert them into actions". To achieve this, researchers introduced a key mechanism, Motion Transfer (MT), during training.
MT breaks down the "barriers" between different robots. With traditional methods, a skill learned on one robot usually requires additional training before it can be transferred to another.
With MT, Gemini Robotics 1.5 can achieve zero-shot transfer between different embodiments. In other words, even if the model only learned to open a drawer on the ALOHA robot platform, it can complete the same task on the Apollo humanoid robot.
This ability comes from MT's unified modeling of motion and physics, which aligns data from different platforms and extracts what they have in common.
Gemini Robotics 1.5 also has an embodied thinking capability. Before performing an action, it generates a "thinking trajectory" expressed in natural language, which helps it break complex tasks into finer-grained steps.
For example, given the instruction "Help me clean up the table", the model may first decompose it into small steps such as "pick up the cup", "move to the sink", and "put down the cup". This not only reduces the difficulty of mapping language directly to actions but also makes execution more robust.
If the cup is dropped mid-movement, the model immediately updates its thinking trajectory to "pick up the cup again" instead of simply declaring the task failed.
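A toy sketch of this behavior, with the model call replaced by a hard-coded stand-in, might look like the following; the function and field names are hypothetical, chosen only to show how re-deriving the thought from the latest observation turns a dropped cup into a new step rather than a failure.

```python
# Hypothetical stand-in for the VLA's "think before acting" loop. Re-deriving
# the thought from the latest observation means a dropped cup simply yields a
# new step ("pick up the cup again") instead of a failed task.
def think_and_act(instruction: str, observation: dict) -> tuple[str, str]:
    """Stand-in for Gemini Robotics 1.5: return (thought, motor_command)."""
    if observation.get("cup_on_floor"):
        return "the cup fell, pick it up again", "grasp(cup)"
    return "carry the cup to the sink", "move_to(sink)"

thought, command = think_and_act("help me clean up the table",
                                 {"cup_on_floor": True})
print(thought)   # -> the cup fell, pick it up again
print(command)   # -> grasp(cup)
```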
Unlike Gemini Robotics 1.5, Gemini Robotics-ER 1.5 does not directly control the robot's actions; it focuses on embodied reasoning and handles high-level task planning and decision-making.
During training, Gemini Robotics-ER 1.5 was specially optimized for the key abilities required for robot tasks. First, it can complete complex task planning and break down long-term goals into a series of reasonable subtasks.
Second, it has strong spatial reasoning ability and can combine visual and temporal information to understand the relative positions and motion trajectories of objects. Finally, it can estimate task progress, judging in real time whether a task has succeeded and how complete it is, and adjust subsequent actions accordingly.
Some tasks that Gemini Robotics-ER 1.5 can complete
Gemini Robotics-ER 1.5 achieved the best overall performance across 15 academic embodied-reasoning benchmarks, surpassing models such as Gemini Robotics-ER 1.0 and GPT-5.
It can accurately map language descriptions to visual targets, such as "Point to the blue cup in the lower left corner of the table", and can use multi-view information to judge in real time whether the robot's actions have achieved the goal, which is crucial for the stable execution of long-horizon tasks.
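A hedged sketch of such a grounding query through the Gemini API is shown below. The model ID and the convention of asking for points as JSON with coordinates normalized to a 0-1000 range are assumptions for illustration; consult the official documentation for the exact output schema.

```python
# Sketch of a language-to-visual-target ("pointing") query. The model ID and
# the 0-1000 normalized [y, x] point format are assumptions, not confirmed
# details from the article.
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("table.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Point to the blue cup in the lower left corner of the table. "
        'Reply only with JSON like [{"point": [y, x], "label": "blue cup"}], '
        "with coordinates normalized to the range 0-1000.",
    ],
)
points = json.loads(response.text)  # may need cleanup if the reply adds prose
print(points)
```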
In the entire system, Gemini Robotics-ER 1.5 is positioned as an orchestrator. It receives human instructions and environmental feedback, formulates overall plans, and then converts these plans into specific action instructions that Gemini Robotics 1.5 can execute. It also has the ability to call external tools (such as web search) to ensure that robots can still respond flexibly in complex scenarios.
However, robots with greater autonomy and ability to act also bring safety risks. For this reason, Google DeepMind has developed new safety and alignment methods, including a high-level safety judgment mechanism and lower-level safety subsystems (such as collision avoidance).
Google DeepMind also released an upgraded version of its robot safety benchmark, ASIMOV, a comprehensive dataset for evaluating and improving semantic safety, with better coverage of edge cases, improved annotations, new categories of safety questions, and a new video modality.
On the ASIMOV benchmark, Gemini Robotics-ER 1.5 showed state-of-the-art performance. Its thinking ability greatly improves its understanding of semantic safety and its compliance with physical safety constraints.
03. Conclusion: a consensus around cross-embodiment robot models is taking shape
Unlike traditional training approaches that rely on single-source data and a specific platform, the Gemini Robotics 1.5 series combines multi-embodiment data, the Motion Transfer mechanism, and an embodied thinking-and-reasoning paradigm. This lets robots transfer skills across platforms and show human-like adaptability in complex environments, expanding the generality of robot models.
This has become a goal for many manufacturers building robot models. Unitree's recently open-sourced robot world model UnifoLM-WMA-0, though architecturally different, can likewise adapt to multiple robot bodies. Cross-embodiment may be gradually becoming an industry consensus and a new competitive track.
This article is from the WeChat public account "ZDONGXI" (ID: zhidxcom), author: Chen Junda, editor: Yun Peng, published by 36Kr with authorization.