Generalist discovers the Scaling Law of embodied intelligence and enables the model to think and act simultaneously
Generalist is an embodied intelligence model company founded by Pete Florence, formerly a senior research scientist at Google DeepMind. It recently released a new embodied foundation model called GEN-0, which scales predictably with the growth of physical interaction data, rather than text, image, or simulation data. In training this model, the team also confirmed, to some extent, a Scaling Law for embodied intelligence.
Early investors in Generalist include investment institutions such as Spark Capital, NVIDIA, Boldstart Ventures, Bezos Expeditions, and NFDG, but the investment amounts are undisclosed.
Experts from DeepMind and Boston Dynamics Explore the Scaling Law of Embodied Intelligence Together
Generalist was co-founded by Pete Florence, formerly a senior research scientist at Google DeepMind. At Google he led the development of vision-language and embodied models such as PaLM-E and RT-2, and his papers have been cited more than 19,000 times on Google Scholar.
Andrew Barry (CTO) and Andy Zeng (Chief Scientist) co-founded Generalist AI with Pete Florence. Andrew Barry previously worked at Boston Dynamics, and Andy Zeng collaborated with Pete Florence on projects such as PaLM-E at Google. Generalist's core team also includes senior researchers from top companies such as OpenAI and Waymo, all with strong research track records.
What Generalist aims to build is general-purpose robots. Founder Pete Florence said, "Our goal is unwavering: to create robots capable of everything. So, imagine a world where the marginal cost of physical labor drops to zero."
Currently, Generalist first focuses on the dexterity of robots and continuously explores the frontiers in aspects such as models and data.
Generalist's first-stage result is a brand-new embodied foundation model, GEN-0. The model is trained multimodally on high-fidelity raw physical interaction data. Its architecture draws on the strengths of vision and language models while going beyond them, and it is natively designed to capture human-level reflexes and physical common sense.
"Harmonic Reasoning"
One of the core features of GEN-0 is "Harmonic Reasoning", which means the model is trained to think and act seamlessly at the same time. For language models, it is feasible to spend more time thinking before responding. However, for physical systems acting in the real world, the model must provide immediate feedback, and the shorter the reaction time, the better.
For example, if you toss a glass to a robot and its reaction time is too long, the glass simply shatters. Likewise, a logistics robot that moves sluggishly through traffic or a crowd is likely to collide with something.
Many solutions have been proposed for fast reasoning (reaction) by robots in the physical world. Figure's Helix, for example, uses a "System 1 (fast thinking) + System 2 (slow thinking)" architecture, but that still requires explicitly designed switching logic.
"Harmonic Reasoning" can achieve thinking and acting in continuous time. The model can maintain two asynchronous, continuous time streams simultaneously:
Perception stream: continuously receives sensor data
Action stream: continuously outputs control instructions
These two streams are "harmoniously" intertwined in the continuous time domain without explicit synchronization points. This allows the model to avoid using the more complex "System 1 (fast thinking) + System 2 (slow thinking)" architecture and can scale to a very large size.
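The two-stream idea can be sketched with Python's `asyncio`. This is an illustrative toy, not Generalist's implementation: a perception loop keeps overwriting the latest observation at its own rate, while an action loop keeps emitting commands from whatever observation is freshest, with no handshake or mode-switching logic between them.

```python
import asyncio
import random
import time

# Shared state: perception overwrites it in place; action reads it freely.
latest_obs = {"t": 0.0, "value": 0.0}
commands = []  # control outputs, collected for inspection

async def perception_stream(steps: int, period: float) -> None:
    """Continuously ingest (simulated) sensor data at its own rate."""
    for _ in range(steps):
        latest_obs["t"] = time.monotonic()
        latest_obs["value"] = random.random()
        await asyncio.sleep(period)

async def action_stream(steps: int, period: float) -> None:
    """Continuously emit control commands from the freshest observation."""
    for i in range(steps):
        # Act immediately on whatever perception most recently wrote;
        # there is no explicit synchronization point between the streams.
        commands.append(("cmd", i, latest_obs["value"]))
        await asyncio.sleep(period)

async def main() -> None:
    # The two streams run concurrently at different periods.
    await asyncio.gather(
        perception_stream(steps=20, period=0.005),
        action_stream(steps=10, period=0.01),
    )

asyncio.run(main())
print(len(commands))  # 10 commands emitted while perception kept running
```

The point of the sketch is the absence of switching logic: neither loop waits for the other, which is the property the article attributes to "Harmonic Reasoning".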
The "Phase Transition" Point of the Intelligence Scale of Embodied Intelligence Models
Generalist's scaling experiments show that the GEN-0 model must be large enough to absorb a large amount of physical interaction data.
During this training and scaling process, they discovered the "phase transition" point in the model's intelligence capacity.
The 1B (1 billion) parameter model struggles to absorb complex, diverse sensorimotor data during pre-training: over time its weights saturate and stop taking in new information.
The 6B (6 billion) parameter model begins to benefit from pre-training and shows strong multi-tasking capabilities.
Models with more than 7B (7 billion) parameters can internalize large-scale robot pre-training data and only need thousands of steps of post-training to transfer their capabilities to downstream tasks.
Scaling up GEN-0 improves the model's performance on a completely unseen (i.e., zero-shot) long-horizon downstream task. The performance metric is next-action validation prediction error (y-axis, lower is better).
This is the first time ossification has been observed in embodied intelligence. Ossification was previously reported in the large-language-model literature in high-data regimes, but for far smaller models, on the order of tens of millions of parameters rather than billions. In embodied intelligence, this phase transition occurs at a much larger parameter scale than in language models. The observation also echoes Moravec's Paradox: perception and dexterous action, which humans find easy, are computationally far harder than abstract reasoning.
After that, Generalist scaled the size of GEN-0 to more than 10B (10 billion) parameters and observed that the model could quickly adapt to new tasks with less and less post-training data.
The Scaling Law of Embodied Intelligence Models
During training, the GEN-0 model exhibits a fairly clear Scaling Law: more pre-training data and compute continuously and predictably improve the model's downstream post-training performance across many tasks.
Specifically, once the model is large enough, a strong power-law relationship emerges between pre-training data scale and downstream post-training performance. This holds across a variety of robot test tasks, spanning application scenarios and workflows in industries such as apparel, manufacturing, logistics, automotive, and electronics.
Generalist also fitted a prediction formula in the paper:
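The formula itself appears only as an image in the original; reconstructed from the variable definitions that follow, the standard power-law form it describes is (the actual constants are Generalist's empirical fits):

```latex
% Power-law fit of downstream validation error vs. pre-training data volume.
% D_c (characteristic data scale) and \alpha_D (scaling exponent) are
% constants fitted by Generalist; this is the standard form such fits take.
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}
```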
Where:
L(D) is the validation error of the downstream task given the pre-training data volume D
Dc is the characteristic data scale constant
αD is the scaling exponent
With this formula, one can answer key questions such as "How much pre-training data is needed to reach a specific next-action prediction error?" or "How much task-specific post-training data can be saved by increasing the pre-training data volume?"
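As a worked example, the first question reduces to inverting the power law. The sketch below uses made-up placeholder constants, not Generalist's published fits; substitute the fitted values from the paper.

```python
import math

# Power-law fit L(D) = (Dc / D)**alpha_D.
# Dc and alpha_D are HYPOTHETICAL placeholder values for illustration only.
Dc = 1_000.0       # characteristic data scale (hours), hypothetical
alpha_D = 0.25     # scaling exponent, hypothetical

def loss(D: float) -> float:
    """Predicted downstream validation error at pre-training data volume D."""
    return (Dc / D) ** alpha_D

def data_needed(target_loss: float) -> float:
    """Invert the power law: D = Dc / L**(1/alpha_D)."""
    return Dc / target_loss ** (1.0 / alpha_D)

# "How much pre-training data do we need for a target error of 0.5?"
D_required = data_needed(0.5)
assert math.isclose(loss(D_required), 0.5)
print(f"{D_required:,.0f} hours")  # -> 16,000 hours under these constants
```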
The paper points out that combined with the Scaling Law, these results can predict the optimal computing and data allocation for any downstream post-training task.
Since a Scaling Law for embodied intelligence has been demonstrated, data quantity and quality matter enormously. GEN-0 is trained on a massive proprietary dataset of 270,000 hours of real-world manipulation trajectories, collected from diverse activities in thousands of homes, warehouses, and workplaces around the world, and that number is still growing at an accelerating rate.
The real-world manipulation data used to train GEN-0 is orders of magnitude larger than the largest existing robot datasets to date.
Through large-scale experiments, Generalist found that data quality and diversity are more important than pure data quantity. A carefully constructed data mixture can produce pre-trained models with different characteristics.
Thanks to the design of the data and the GEN-0 architecture, the model transfers across robots: it has been tested successfully on 6-degree-of-freedom (DoF) and 7-DoF platforms and on semi-humanoid robots with more than 16 degrees of freedom.
Embodied Intelligence Models are Still in the Early Stage of Development, but Every Breakthrough Brings Them Closer to Real-World Applications
Many top startups have explored robot foundation models before. Among them, Physical Intelligence follows a foundation-model-plus-fine-tuning path similar to Generalist's; its model has iterated to π0.6. The new model can make espresso end to end, handling the whole process of pouring, grinding, and wiping, and can keep making it from morning to night, demonstrating both long-horizon task completion and robustness.
Skild AI's model emphasizes generalization: it works across robot morphologies (humanoids, quadrupeds, arms, etc.) and, in demonstrations, handles tasks such as climbing stairs, recovering balance, and grasping in cluttered environments.
The two companies' models also share a technical trait: they evolve autonomously from the "experience" accumulated during real-world operation.
As mentioned before, Figure's Helix uses a "System 1 (fast thinking) + System 2 (slow thinking)" architecture, which can support robots to complete complex operations in the actual factory environment and achieve multi-robot collaboration.
Clearly, although many top companies have invested in embodied intelligence foundation models, the field's technical approaches have not converged, data remains scarce, and actual commercial, real-world deployments of embodied intelligence are still few.
However, the dawn is drawing closer. A Scaling Law for embodied intelligence has, to some extent, been discovered, and problems such as cross-morphology generalization, action latency, and long-horizon task completion have been or are being solved.
Every problem solved will unleash greater potential in the entire embodied intelligence industry and bring better prospects for future commercialization and real-world applications.
Chinese entrepreneurs have advantages in embodied intelligence. China has a more mature hardware supply chain, rich application scenarios, and great potential data sources. Founders who develop hardware and software (including but not limited to models) in tandem and continuously create value in one or two specific scenarios may stand out.
This article is from the WeChat official account "Alpha Startups" (ID: alphastartups). Author: Alpha Startups. Republished by 36Kr with permission.