
Galaxy Universal, together with NVIDIA, has exposed the biggest lie about humanoid robots.

锦缎 · 2026-04-30 07:53
The moat that embodied intelligence has yet to finish building may be about to change course.

Open any technology media page, and you'll be bombarded with news about humanoid robot financing. The year 2026 has been dubbed the "Year of Embodied Intelligence," and capital is lining up to invest in it.

But step into the R&D center of an embodied intelligence company, and you'll see a different picture.

There's no autonomous action like in science-fiction movies. There's no elegant human-machine dialogue. Operators wear VR headsets and motion-capture gear and hold remote-control handles, repeatedly guiding robotic arms to pick up cups and fold clothes. If it doesn't work once, they try ten times; if ten times don't work, a hundred. Behind every piece of training data stands a real person.

This is the rawest reality of current embodied intelligence: it is built on labor-intensive data collection. Every movement of every robot has to be "taught" by humans, hand in hand.

Capital is in a frenzy. But there's a stubborn thorn hidden within the industry: if the intelligence of machines can only be built up with human labor, this cost structure will never support the dream of "entering every household."

During the 2026 CCTV Spring Festival Gala, an embodied intelligence company called Galaxy Universal briefly appeared on stage, then returned to the quiet of the laboratory. Its latest paper, "LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion," puts forward a thesis that could rewrite the industry's underlying logic: break the worship of "perfect data"; understand physics first, then learn to manipulate. The co-authors include NVIDIA, Tsinghua University, and Peking University.


01

Copying a cat won't create a real tiger

The vast majority of robot large models on the market follow the same path: behavior cloning. Simply put, it is drawing a tiger by copying a cat. Human experts leave behind tens of thousands of perfect teleoperation demonstrations, and the AI extracts features from the images to predict what action the human took in each frame. The approach is intuitive and effective, and it quickly became the mainstream.
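At its core, behavior cloning is ordinary supervised learning on (observation, action) pairs. A minimal sketch, with a synthetic linear "expert" standing in for real images and a neural policy (everything below is invented for illustration):

```python
import numpy as np

# Toy behavior cloning: regress expert actions directly from observations.
# The "expert" is a hidden linear policy; real systems use images and
# neural networks, but the supervised-learning structure is the same.
rng = np.random.default_rng(0)

W_true = rng.normal(size=(2, 4))            # hidden expert policy
obs = rng.normal(size=(500, 4))             # 500 demonstration frames
actions = obs @ W_true.T                    # expert's recorded actions

# Behavior cloning = fit observation -> action by least squares
W_hat, *_ = np.linalg.lstsq(obs, actions, rcond=None)

new_obs = rng.normal(size=(1, 4))
pred = new_obs @ W_hat                      # what the clone would do
expert = new_obs @ W_true.T                 # what the expert would do
print(np.allclose(pred, expert, atol=1e-6))  # True: it copies the expert
```

On familiar inputs the clone can at best reproduce the demonstrator, which is exactly why its ceiling is the demonstrator's own skill.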

But it has a fatal flaw: a very low ceiling. By construction, imitation caps the model's capability at the level of the demonstrator. If the goal of AGI is to surpass the average human, this road never gets there.

What's even more troublesome isn't the ceiling, but the floor.

The tech circle calls this covariate shift. It sounds abstract, but the principle is simple: motors age, gears have backlash, lighting changes, and all of it is noise to a robot. A robot trained purely by imitation will, through tiny execution errors, drift until what its camera sees falls outside the training-data distribution. The model has never seen such a state and doesn't know how to correct it; the errors snowball, and the behavior collapses. The robot that suddenly veered into the crowd at a recent robot marathon is a public example of covariate shift.
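The snowballing can be shown with a toy simulation (the dynamics below are invented for illustration): each step adds a tiny execution error, and because the policy keeps acting from the slightly-wrong state, the deviation compounds instead of averaging out.

```python
import numpy as np

# Toy covariate shift: small per-step noise plus mildly unstable
# closed-loop dynamics push the state ever further from the
# demonstrated (zero) trajectory.
rng = np.random.default_rng(1)

def rollout(steps, action_noise):
    """Return distance from the demonstrated trajectory at each step."""
    state = np.zeros(2)
    drift = []
    for _ in range(steps):
        # existing error is amplified by 5% before new noise is added
        state = 1.05 * state + rng.normal(scale=action_noise, size=2)
        drift.append(float(np.linalg.norm(state)))
    return drift

drift = rollout(steps=100, action_noise=0.01)
print(f"drift after 10 steps:  {drift[9]:.3f}")
print(f"drift after 100 steps: {drift[99]:.3f}")  # errors snowball
```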

Galaxy Universal's paper chooses another path: abandon reflex-like imitation and take the world-model route.

The reason large language models underwent a qualitative leap is that they grasped the underlying laws of language from a vast amount of text. Robots need the same level of understanding: grasp the causality of the physical world before acting. LDA no longer just predicts the next action; it jointly predicts future images. Before issuing a command, the model must first run a simulation in its digital brain: if it pushes, how will the cup move? What will gravity and friction do?

The essence of this shift is: First, have knowledge (understand the laws of the world), and then have applications (learn how to operate). The causal order cannot be reversed.

02

Don't get bogged down in pixels

To predict the future, you have to figure out what to predict.

Sora and the various image- and video-generation models seem to offer the industry a ready-made answer, but the direction is actually the opposite. You may have noticed that in AI-generated pictures and videos, text always comes out as distorted gibberish. The reason is simple: these models essentially piece pixels together by probability. They don't "understand" the text; they have merely memorized that a certain color tends to sit next to another color in a certain position.

A glass of water or an apple, once photographed, becomes a flat arrangement of RGB color blocks. Early world models took a wrong turn here: they tried to predict future pixels. Making the robot's brain guess what every pixel in the next frame will look like wastes enormous compute on meaningless detail, such as how the arm's shadow moves, how the reflection on the cup changes, how many textures the background wallpaper has. All of it is high-frequency noise, an overreaction to the environment.

LDA chooses to leave this pixel space.

It uses the visual foundation model DINO to strip away irrelevant light, shadow, and background before the input image enters the prediction network, extracting a highly abstract semantic space. It no longer obsesses over the colors of millions of pixels in the next frame but tries to understand an equation: "The semantics of the cup" plus "the pushing action" equals "the cup moving to the right."

"Ignore the details; attend only to the semantics." It sounds counterintuitive, but it works. At the same model scale, the old pixel-prediction approach succeeds 14.2% of the time; after switching to the semantic space, the number jumps to 55.4%. The commercial implication is blunter: expensive compute clusters no longer burn electricity simulating light and shadow, cutting costs significantly while markedly improving cross-environment stability.
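The "semantics plus action" idea can be sketched as a tiny latent-dynamics model: rather than predicting next-frame pixels, predict the next semantic vector from the current semantics and the action. The latent states and ground-truth dynamics below are synthetic, an illustration of the principle rather than the paper's actual architecture.

```python
import numpy as np

# Predict in semantic space: z_next = f(z, action), no pixels involved.
# Ground-truth dynamics (A, B) are synthetic stand-ins for the real world.
rng = np.random.default_rng(2)
LATENT, ACT = 16, 4

A = 0.9 * np.eye(LATENT)                   # hidden latent dynamics
B = rng.normal(size=(LATENT, ACT))         # hidden effect of actions

z = rng.normal(size=(1000, LATENT))        # current semantic states
a = rng.normal(size=(1000, ACT))           # actions taken
z_next = z @ A.T + a @ B.T                 # resulting semantic states

# Fit the world model with one least-squares solve: [z, a] -> z_next
X = np.hstack([z, a])
W, *_ = np.linalg.lstsq(X, z_next, rcond=None)

# "Semantics of the cup" + "pushing action" => predicted next semantics
z0, a0 = rng.normal(size=LATENT), rng.normal(size=ACT)
pred = np.concatenate([z0, a0]) @ W
print(np.allclose(pred, A @ z0 + B @ a0, atol=1e-6))  # True
```

A 16-dimensional target is orders of magnitude cheaper to predict than a full frame of pixels, which is where the compute savings come from.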

03

Perfect data is a superstition

The part of this paper with the greatest impact on the industry is that it shatters the business fantasy of perfect-data worship.

Today, robot training logic is largely borrowed from large language models. Over the past three years, the large-model field has repeatedly confirmed an iron law: low-quality corpora, such as logically incoherent text and harmful code, contaminate the model. Garbage in, garbage out. Robot companies naturally followed suit, spending heavily to hire professional operators to record near-perfect data, considered a prerequisite for any breakthrough in capability.

But the data logic of the physical world is different from that of the text world.

In the real world, failure itself is the most complete demonstration of physical laws. When a robot fails to grab a water cup, knocks over an object, or makes an operation error and then retries, these are considered garbage data to be discarded in traditional algorithms because they don't show "how to perfectly complete a task." However, these processes also strictly follow the laws of gravity, friction, and collision.

Robots that have only seen high - quality data are like plants grown in a sterile greenhouse. They can't survive once they leave the perfect environment. Most embodied intelligence companies target the home environment as their first commercialization goal, but the real - world home environment is far too chaotic for such robots to handle. A slight deviation will cause them to malfunction.

The universal data ingestion mechanism proposed by LDA rewrites this economic equation. Actively harmful data is filtered out. Vast amounts of low-quality, unlabeled wild data, such as short videos casually shot and posted online, become treasure: fed to the world model, they teach it the common sense and boundaries of the physical world. The extremely scarce, high-quality professional operation data is reserved for the final fine-tuning stage, by which point the machine already understands physical law and only needs to learn to select strategies efficiently.

The test data offers an interesting piece of evidence: in the fine-tuning stage, mixing 30% low-quality data, complete with pauses and mistakes, into the perfect data actually raises the robot's execution success rate by 10%. The model learns something the perfect demonstrations never show: which actions mess things up, and how to recover after a failure.
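A mix like that is easy to express as a data-pipeline step. A sketch, where the clip names, counts, and the mix_ratio knob are all illustrative rather than taken from the paper:

```python
import random

# Build a fine-tuning set that is ~30% "imperfect" clips (pauses,
# failed grasps, retries) blended into expert demonstrations.
random.seed(3)

expert_clips = [f"expert_{i}" for i in range(70)]
imperfect_clips = [f"retry_{i}" for i in range(300)]   # cheap, abundant

def build_finetune_set(experts, imperfects, mix_ratio=0.3):
    """Keep every expert demo; add imperfect clips until they form mix_ratio."""
    n_imperfect = round(len(experts) * mix_ratio / (1 - mix_ratio))
    mixed = experts + random.sample(imperfects, n_imperfect)
    random.shuffle(mixed)
    return mixed

dataset = build_finetune_set(expert_clips, imperfect_clips)
share = sum(c.startswith("retry_") for c in dataset) / len(dataset)
print(f"{len(dataset)} clips, {share:.0%} imperfect")   # 100 clips, 30% imperfect
```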

Companies burning investors' money on teams of hundreds or thousands of full-time "manual data collectors" haven't even finished building their moats, and the riverbed is already shifting. The core barrier of the next few years will no longer be who can buy more perfect data, but who has the stronger pipeline: collecting large volumes of rough data cheaply and distilling physical common sense from it. The cost-structure advantage will be decided here.

04

The GPT moment is still far away

Many people call 2026 the Year of Embodied Intelligence, and there are constant voices saying that "the GPT moment is coming soon."

Calm business observers won't easily agree.

Assuming that embodied intelligence follows the same reinforcement-learning path as large language models, the three core elements remain the same: computing power, algorithms, and data. Text data is the digital sediment of thousands of years of human civilization; today, neither OpenAI nor DeepSeek finds it hard to obtain trillions of tokens. Interaction data from the physical world, however, still sits at the bottom of Moravec's paradox, in the era of the handicraft workshop. Without solid underlying data infrastructure, general intelligence is a castle in the air.

Research like LDA-1B doesn't offer a jack-of-all-trades finished product but a correctly oriented signpost. This is more valuable than immediately launching a robot claiming to be all-powerful.

It ends the paradigm of blind imitation and establishes the necessity of causal relationships and world models. Pixel-level waste of computing power is replaced by semantic representation. Most importantly, it subverts the expensive high-quality data collection model and opens a low-cost, waste-to-treasure path for scaling data.

Set aside the obsession with perfect data and let AI learn the physical laws of the real world from roughness and failure. The road is long, but the direction is clear.

This article is from the WeChat official account "Silicon-based Starlight," written by Siqi and published by 36Kr with permission.