BeingBeyond releases the world's most powerful model, and the embodied industry ushers in the "Brain Era"
What does it mean when 200,000 hours of human videos are compressed into an edge-side chip?
Perhaps it marks the arrival of the first commercially viable embodied world model that can be deployed on edge hardware.
It is also the answer given by BeingBeyond, the pioneer of the human-video route, at a watershed moment for embodied world models.
When robots move from demo performances into real environments, they struggle to genuinely understand the environment and the task, and find it even harder to make continuous judgments as conditions change. More and more people are realizing that the way robots learn to act is worth rethinking.
As an embodied intelligent world model trained on large-scale human behavior data, capable of migrating between different robots and performing complex tasks in real environments, Being-H0.7 is BeingBeyond's latest answer to this question.
Being-H0.7 abandons the video-generation approach, which demands heavy compute, suffers from high latency, and is hard to deploy. Instead, it uses a latent-space reasoning method closer to human physical intuition, judging future states and action outcomes directly inside the model.
This allows Being-H0.7 to compress the world model onto edge hardware and into real-time operation scenarios, making it the industry's first commercially viable world model deployable on the edge.
As the first player in the industry to propose human-video pre-training, BeingBeyond has, in a relatively short period, built a full-stack technical system spanning human-video pre-training, model deployment, and data collection.
This closed-loop system is enabling embodied intelligence to leap from a general foundation to expert capabilities, opening an opportunity for large-scale deployment. For an industry that has long remained at the demonstration stage, BeingBeyond, as a representative player in embodied brain models, is showing real commercial value.
Paper link:
https://research.beingbeyond.com/projects/being-h07/being-h07.pdf
Another Way to Understand the World
As embodied intelligence has matured, the industry has settled on a fairly mainstream logic of technical progress: first make the robot move, then make it move accurately, and finally approach more complex task understanding and execution.
Following this logic, several mainstream methods have emerged in recent years: VLA models, world models, and direct collection of real-robot data through teleoperation. They correspond to three different expectations: VLA hopes to solve "understanding", the world model hopes to solve "prediction", and teleoperation hopes to solve "deployment".
These methods are all valuable and have promoted the rapid progress of robot capabilities. The problem is that most of them are based on the relatively limited premise that the training data mainly comes from the robot itself. This means that the capabilities learned by the model are easily locked in specific hardware, specific tasks, and specific scenarios.
Especially in the case of the world model, the problems become more obvious at the real deployment stage.
Solutions such as NVIDIA's Cosmos Policy and DreamZero still rely on predicting the next video frame, hoping to inform current action decisions by imagining future footage. But video generation itself demands heavy compute and is difficult to run in real time on edge hardware. Moreover, images are ultimately two-dimensional, so their ability to express three-dimensional dynamics (fluids, deformable objects, complex contacts) is very limited. In many cases they can only generate actions that look plausible but cannot support real manipulation.
At this point, BeingBeyond offers another perspective. In their view, if robots are ultimately going to face the human world, the data used to train them should not only come from the robots themselves, but rather from large-scale human behavior data that is more representative of the real world.
Rather than having the robot repeatedly learn "how a particular hand grasps a particular object", it may be more crucial to first let it understand how humans perform actions, organize tasks, and handle interactions in the real world.
This is why BeingBeyond chose to start with human videos. Compared with relying on real machines and teleoperation, human videos are larger in scale, cover more scenarios, and involve more diverse tasks, providing the model with a behavior prior closer to the real distribution. Along this path, robots have the opportunity to learn action capabilities that can be transferred across scenarios, tasks, and robot bodies.
Based on this idea, Being-H0.7 does not continue to develop along the video generative world model. Instead, it turns to a path closer to human physical intuition. Being-H0.7 introduces a latent space within the model to compress the current observations, task goals, and judgments about future changes, and then uses this intermediate representation to directly guide action generation.
This approach is more similar to the way humans react in reality. When playing table tennis, athletes don't first generate a complete picture of the next second in their minds and then decide how to swing the racket. More often, they rely on the quick judgments accumulated from long-term experience, knowing how objects will move, what will happen after being subjected to force, and which actions are likely to fail. What Being-H0.7 tries to make the model learn is this kind of "subconscious" physical intuition.
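To make the contrast with frame-prediction models concrete, the sketch below shows the general shape of latent-space reasoning: observation and goal are compressed into a latent state, the latent is rolled forward without ever decoding pixels, and an action is read out directly. All layer names, dimensions, and the random "weights" are invented for illustration; this is not the Being-H0.7 architecture, which the paper describes in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_weights(in_dim, out_dim):
    # Hypothetical layer: random weights stand in for trained parameters.
    return rng.standard_normal((in_dim, out_dim)) * 0.1

OBS, GOAL, LATENT, ACT = 32, 8, 16, 6  # invented dimensions

W_enc = toy_weights(OBS + GOAL, LATENT)  # encoder: obs + goal -> latent
W_dyn = toy_weights(LATENT, LATENT)      # latent dynamics model
W_act = toy_weights(LATENT, ACT)         # policy head: latent -> action

def step(obs, goal, horizon=3):
    # Compress current observation and task goal into a latent state.
    z = np.tanh(np.concatenate([obs, goal]) @ W_enc)
    # Roll the latent forward: the "imagination" happens entirely in
    # latent space; no future video frames are ever generated.
    for _ in range(horizon):
        z = np.tanh(z @ W_dyn)
    # Decode an action directly from the predicted future latent.
    return np.tanh(z @ W_act)

action = step(rng.standard_normal(OBS), rng.standard_normal(GOAL))
print(action.shape)  # (6,)
```

The key property is what is absent: there is no image decoder in the loop, which is why this style of model can be orders of magnitude cheaper to run than frame-by-frame video prediction.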
To make this judgment truly valid, BeingBeyond added an additional foundation: pre-training on over 200,000 hours of human videos. The significance of the massive human behavior data lies not only in its large scale but also in the fact that it naturally contains a large number of implicit physical laws and task structures. What the model learns from this data is not just the actions themselves, but also the conditions, results, and constraints behind the actions.
In the experimental results, Being-H0.7 placed in the top tier globally on six benchmark leaderboards (topping four of them), making it one of the most comprehensive embodied world models in coverage.
Finally, Being-H0.7 compresses the world model by at least a hundredfold, allowing it to truly enter edge hardware and real-time operation scenarios. It can run in real time on the edge computing platform Orin NX (about 75 TOPS), making BeingBeyond the first team in the industry to deploy a real-time world model on a chip of that class.
The Next Evolution of Robots
In the highly engineering-oriented field of embodied intelligence, divergence in paths often stems from a non-technical source: how the team defines the problem.
The robot body is the starting point for most Chinese teams, as it is a path that Chinese teams are more proficient in and easier to implement. Starting from this point, teams often optimize control strategies around specific hardware, accumulate data through teleoperation, and then refine the model capabilities on a single robot body.
This approach is both a continuation of existing strengths and an easier path to follow, and it has driven rapid improvement in robot capabilities for a long time. However, it also implicitly reinforces a premise: the data comes from the robot itself, and the capabilities are therefore locked into specific hardware and scenarios.
BeingBeyond's starting point is different from most Chinese teams. This difference largely stems from the way Lu Zongqing, the founder, views problems. Different from many teams that repeatedly refine control strategies around specific robot bodies, as a scientist, Lu Zongqing is more accustomed to first asking a more fundamental question: If the goal is to achieve general capabilities, what kind of data should the model learn from?
For most robot teams, data comes mainly from teleoperation and real robots, and is strongly bound to specific hardware. But for him, since robots will ultimately face the physical world humans live in, data closer to the real task distribution may exist not only in the robots themselves but also in human behavior.
Based on this understanding, BeingBeyond was the first in the industry to propose pre-training models with human videos and built a closed-loop technical system for model training, deployment, and data collection.
Following this idea, the team gradually developed a training paradigm centered around human behavior. On the one hand, it constructs a behavior prior through large-scale human videos, so that the model doesn't have to learn actions from scratch. On the other hand, by unifying the action space, it maps different robot bodies to the same expression system, enabling these priors to be transferred between different hardware. Combined with multi-modal modeling capabilities, it unifies vision, language, and actions into the same sequence for training, forming the so-called human-centric learning path.
The Being-H series of models is a natural extension of this cognitive path.
The earlier Being-H0.5 verified a key assumption: given a sufficient volume of human behavior data combined with multi-embodiment robot data, the model can transfer between different robots while maintaining stable performance on complex tasks. For the first time, a general model approached the capability ceiling of dedicated models across robot embodiments.
Being-H0.7 starts to enhance stability and task completion in real environments, including continuous operation capabilities in more complex scenarios, error control in multi-step tasks, and more efficient adaptation capabilities between different robot bodies.
H0.5 proves that the human-centric learning approach is feasible, while H0.7 proves that this approach can be truly implemented in real scenarios.
In this system, the Being-H series addresses the top-level problem: how robots acquire general capabilities. Being-Dex handles a more business-oriented layer: how those capabilities can be quickly deployed in specific scenarios. And U1 takes the problem one step further, answering where high-quality data comes from and how to obtain it continuously.
The three correspond to a relatively clear structure: the model layer provides the foundation for general embodied intelligence, the adaptation layer shortens the learning cycle of new tasks to the 30-minute level, and the data layer advances the data paradigm from the past gripper operations to a more human-like expression through the dexterous hand data collection system. BeingBeyond has built a production chain from data collection to model training and then to task deployment.
Such a closed-loop has been rare for a long time. The reason is that the three key elements of embodied intelligence have long been fragmented: data is difficult to obtain on a large scale, the model capabilities are insufficient to support cross-scenario generalization, and deployment is highly dependent on specific robot bodies.
Opportunities in the New Industrial Structure
In recent years, an obvious trend in the industry is that the robot body and the embodied brain are starting to diverge, and the attention of the entire market, including capital, is increasingly focusing on the embodied brain segment.
This trend is based on several premises:
First, there is a change in data. Massive data represented by human videos have provided the embodied model with a continuously expandable training source for the first time. Second, there is a change in model capabilities. The progress of large models in multi-modal modeling has made it possible to unify the modeling of vision, language, and actions. Third, there is a change in the engineering system. Data, training, and deployment are gradually forming a closed-loop and can be iterated repeatedly in real environments.
This further leads to a change: more and more robot body companies are choosing to outsource intelligence.
From a business perspective, the cost of self-developing a model is still high. A complete embodied model system means continuous data investment, computing power consumption, and team building, with an annual cost often exceeding tens of millions. Once an external model has general capabilities, it can be reused in multiple scenarios, and the marginal cost is significantly lower.
From an efficiency perspective, the more realistic need of robot body companies is to quickly launch new tasks, reuse capabilities in different scenarios, and control R & D investment, rather than training models from scratch repeatedly.
When the robot body and the brain no longer have to be bound together, there is room for division of labor. A subsequent question is, what kind of embodied brain companies have real value? In the current industry where more attention is being paid to implementation feasibility, it is obvious that the closer a company is to large-scale commercialization, the more its value can be recognized.
Currently, there is an industry consensus that "general capabilities as the foundation, expert capabilities for specialization" is the most feasible path to large-scale deployment.
The human-video foundation built by BeingBeyond underpins generalization across scenarios and configurations, that is, the general capabilities. For expert capabilities in vertical deployment scenarios, U1 fills the last piece of the puzzle, real-scenario data collection, supplying the model with large-scale, high-quality expert data from real environments.
This closed-loop from the human video route to data collection has made the industry value of BeingBeyond visible. As one of the few companies with full-stack self-developed capabilities in human video pre-training, model deployment, and data collection, BeingBeyond has established cooperative relationships with several leading domestic embodied robot body companies.
Changes are taking place. In the past, every embodied company tried to handle the robot body, data, and model simultaneously, which required heavy investment, a long chain, and was difficult to produce quick results. In the future, a clearer industrial structure for embodied intelligence may gradually take shape, with one type of company focusing on robot bodies and scenario implementation, and another type focusing on providing general intelligent capabilities.
From this perspective, the emergence of Being-H0.7 is more like a signal that embodied intelligence is moving from a fragmented state to a more clearly defined division of labor system.