Google has released its latest "brain" model for robots. It achieves top-tier reasoning performance and can even learn "across species boundaries".
Google's newly released Gemini Robotics 1.5 series of models enables robots to genuinely learn to "think" and to carry skills across different embodiments. This means future robots will become intelligent partners that collaborate with humans and proactively complete complex tasks.
Google has once again "upgraded the brains" of robots!
Just now, DeepMind released the Gemini Robotics 1.5 series, a new generation of "brains" designed specifically for robots and embodied intelligence.
The Gemini Robotics 1.5 series includes Gemini Robotics 1.5 and Gemini Robotics-ER 1.5.
- Gemini Robotics 1.5, Google's most advanced vision-language-action (VLA) model, converts visual information and instructions into robot motion commands to execute tasks.
- Gemini Robotics-ER 1.5, Google's most capable vision-language model, can reason about the physical world, directly invoke digital tools, and create detailed multi-step plans to complete tasks.
Combined, they form a powerful intelligent agent framework.
In the following 1-minute-40-second video, Google research scientists had two robots complete two different tasks.
The first task was garbage sorting.
The researchers asked Aloha to sort items into compost (green bin), recycling (blue bin), and trash (black bin) according to San Francisco's waste-sorting rules.
Aloha completed the task by consulting the rules and visually inspecting each item.
The second task was packing luggage.
They then asked Apollo to help pack luggage for a trip to London, making sure to include a knitted hat.
Apollo went further: it checked the weather on its own, pointed out that rain was forecast in London for several days, and thoughtfully added an umbrella to the bag.
Overall, powered by this latest series of models, robots increasingly resemble the ones in science-fiction movies!
Bringing the agentic experience to physical tasks
Imagine a robot that can not only understand the clutter in your living room but also plan, think, and clean it up with its own "hands".
Gemini Robotics 1.5 is a crucial step towards this goal.
It gives robots the ability to "think before acting": to understand, reason, and complete multi-step tasks in complex environments, much as humans do.
This breakthrough is expected to open a new era for general-purpose robots.
Gemini Robotics-ER 1.5 excels at planning and logical decision-making in physical environments. It has best-in-class spatial understanding, supports natural-language interaction, can assess task progress and success, and can directly invoke tools such as Google Search to fetch information, or call any third-party user-defined function.
Gemini Robotics-ER 1.5 then issues a natural-language instruction for each step to Gemini Robotics 1.5, which uses its visual and language understanding to execute the corresponding actions directly.
Gemini Robotics 1.5 also helps the robot reflect on its own actions to better solve semantically complex tasks, and it can even explain its thinking process in natural language, which makes its decision-making more transparent.
Both models are built on the core Gemini model family and are fine-tuned with different datasets to specialize in their respective functions.
When they work together, they significantly improve a robot's ability to generalize to long-horizon tasks and diverse environments.
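To make the division of labor concrete, here is a minimal Python sketch of the orchestrator-plus-executor loop described above. All class and function names (Orchestrator, VLAExecutor, call_tool, and so on) are invented for illustration; this is not the actual Gemini Robotics API, only the shape of the agentic handoff: the ER model plans and calls tools, then passes each step as natural language to the VLA model.

```python
from dataclasses import dataclass

@dataclass
class Step:
    instruction: str   # natural-language instruction handed to the VLA model
    done: bool = False

class Orchestrator:
    """Stand-in for Gemini Robotics-ER 1.5: plans, calls tools, tracks progress."""

    def plan(self, goal: str, observation: str) -> list[Step]:
        # The real ER model reasons over camera images and can consult external
        # tools (e.g. a web search) before emitting a step-by-step plan.
        if "pack" in goal:
            forecast = self.call_tool("weather", "London")  # hypothetical tool
            steps = [Step("put the knitted hat in the bag")]
            if "rain" in forecast:
                steps.append(Step("put the umbrella in the bag"))
            return steps
        return [Step(goal)]

    def call_tool(self, name: str, query: str) -> str:
        return "rain expected for several days"  # stubbed tool result

class VLAExecutor:
    """Stand-in for Gemini Robotics 1.5: turns one instruction into motion."""

    def execute(self, step: Step) -> None:
        print(f"[motor commands] {step.instruction}")  # real model emits actions
        step.done = True

orchestrator, executor = Orchestrator(), VLAExecutor()
for step in orchestrator.plan("pack a bag for London", observation="camera frame"):
    executor.execute(step)  # each step crosses the model boundary as plain text
```

The point the sketch tries to capture is that, per the handoff described above, the interface between the two models is plain natural language.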
Understand the "environment" before "acting"
Gemini Robotics-ER 1.5 is the first thinking model optimized for embodied reasoning.
It has achieved state-of-the-art performance in both academic and internal benchmark tests.
The following demonstrations show some of Gemini Robotics-ER 1.5's capabilities, including object detection and state estimation, segmentation masks, pointing, trajectory prediction, and task-progress evaluation with success detection.
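As one concrete, hedged example of probing these capabilities, the sketch below asks an embodied-reasoning model to point at objects in an image via the google-genai Python SDK. The model ID gemini-robotics-er-1.5-preview and the requested JSON point format are assumptions based on Google's announcement, not a verified contract; the image file name is a placeholder.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Placeholder image of the robot's workspace.
with open("workbench.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        'Point to every mug in the image. Answer as a JSON list of'
        ' {"point": [y, x], "label": "<name>"} entries,'
        " with coordinates normalized to 0-1000.",
    ],
)
print(response.text)  # expected: a JSON list of labelled 2D points
```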
Think twice before "acting"
Traditionally, vision-language-action models directly convert instructions or language plans into robotic movements.
However, Gemini Robotics 1.5 can not only translate instructions or plans but also think before acting.
This means it can generate internal reasoning and analysis sequences in natural language to execute tasks that require multi-step or deeper semantic understanding.
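The following toy trace illustrates what such interleaved reasoning could look like. Every string here is invented for illustration; real model traces will differ.

```python
# A toy rendering of "think before acting": a short natural-language thought
# precedes each motor-level action chunk. All strings are invented examples.

trace = [
    ("think", "The instruction says sort by color; the banana is yellow."),
    ("act",   "pick up the banana"),
    ("think", "Grasp succeeded; the yellow plate is to the left."),
    ("act",   "place the banana on the yellow plate"),
]

for kind, content in trace:
    tag = "THOUGHT" if kind == "think" else "ACTION"
    print(f"{tag:7s} | {content}")
```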
In the following 3-minute-40-second video, Google scientists demonstrate how robots complete more complex tasks.
In the first segment, for example, the robot sorts fruits of different colors onto matching plates. It must perceive the scene, analyze the colors, and carry out the actions step by step.
In the second segment, Apollo is asked to help sort laundry and pack items. It thinks for itself, chaining task planning with reactions during execution, for example adjusting the basket to pick up clothes more easily or responding immediately to unexpected changes.
Cross-embodiment learning for robots of different forms
Robots come in various shapes and sizes, with different sensing capabilities and degrees of freedom, which makes it difficult to transfer actions learned from one robot to another.
Gemini Robotics 1.5 demonstrates excellent cross-embodied learning capabilities.
It can transfer motions learned on one robot to another without tuning the model for each new embodiment.
This breakthrough accelerates the learning process of new behaviors and helps robots become more intelligent and practical.
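One way to picture why a single checkpoint can drive very different robots is a shared, embodiment-agnostic policy plus a thin per-robot mapping onto each machine's joints. The sketch below is purely illustrative: the names, the DOF numbers, and the pad-and-truncate adapter are assumptions, not Google's architecture.

```python
from dataclasses import dataclass

@dataclass
class Embodiment:
    name: str
    dof: int  # degrees of freedom of this robot (illustrative values below)

def shared_policy(instruction: str) -> list[float]:
    """Stand-in for the shared VLA backbone: emits an abstract motion intent."""
    return [0.42, -0.10, 0.55]  # illustrative (x, y, z) end-effector target

def adapt(intent: list[float], robot: Embodiment) -> list[float]:
    """Hypothetical per-robot adapter: fit the intent to this robot's DOF."""
    return (intent + [0.0] * robot.dof)[: robot.dof]

for robot in (Embodiment("Aloha", dof=14), Embodiment("Apollo", dof=28)):
    cmd = adapt(shared_policy("open the wardrobe door"), robot)
    print(f"{robot.name}: {len(cmd)}-dim command {cmd[:4]} ...")
```

In the real system the transfer is learned rather than hand-coded; the sketch only fixes the idea of one shared model feeding bodies with different action spaces.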
In the following 2-minute video, Google scientists demonstrate how robots of different "species" can generalize what they have learned.
With Gemini Robotics 1.5, a single model can drive multiple robots.
For example, Aloha has experience in a wardrobe scenario; Apollo has never seen that scene, yet through transfer it can complete new actions such as opening the door and picking up clothes.
This shows the potential of cross-embodiment learning.
In the future, robots in different scenarios (such as logistics and retail) can learn from each other, greatly accelerating the research and development process of general-purpose robots.
Reference materials:
https://deepmind.google/discover/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/
This article is from the WeChat public account "New Intelligence Yuan", author: Ding Hui. It is published by 36Kr with authorization.