
A robot hackathon reveals both the gaps and opportunities in embodied intelligence.

AI Frontline (AI前线), 2026-04-05 11:54

Last Monday, I participated in a robotics hackathon in Shenzhen.

When I arrived at nine o'clock the night before, I expected to be one of the few still working. Instead, the lights were on, rows of tents had been pitched on the floor, and the robotic arms never stopped. Contestants clustered around their workstations, collecting data, training models, and watching evaluation results. Some were so tired they napped beside the venue and went back to work when they woke.

There was a saying circulating at the scene: "I can rest, but the graphics card can't."

This was one of the largest in-person embodied-intelligence developer competitions in the world to date.

Independent Variable provided all participating teams with high-quality datasets and data-collection equipment free of charge, along with a training environment, a high-performance dual-arm manipulation platform, and compute resources.

Participating teams could complete the entire closed loop, from data collection through model training to real-robot deployment, within three days. Under normal circumstances, a professional research lab needs at least six months for a comparable setup.

The organizers distilled four core capabilities from a large pool of candidate tasks: pick-and-place, language understanding, fine manipulation, and long-horizon sequential decision-making. Contestants could choose among tasks such as stacking rings on a pole, sorting fruit by instruction, plugging in power cords, and spelling words. The final score depended not only on whether the steps were completed but also on success rate, stability, and generalization.

Compressed into just three days, the competition made both the gaps and the opportunities in embodied intelligence visible at once.

In three days, two college students can produce a video demo that looks like it came from an academic paper

What does 3 days mean?

By the rhythm of academic research, going from experiment to published paper often takes years.

But judged purely on task completion, many teams of college students born after 2000, using the compute, data, and base models provided by Independent Variable, got robots to complete the pick-and-place tasks familiar from papers and demo videos in just two days, producing demos that looked remarkably real.

This is exactly what makes the hackathon most noteworthy: three days is both an exciting number and one that easily creates an illusion.

Let's first look at its exciting aspect.

The hackathon was split into two stages: an A list and a B list. A-list tasks were public, so contestants could train and optimize toward clear goals; B-list tasks and their data distributions were not announced in advance, putting more weight on a model's generalization in the real environment.

For the first two days, contestants mainly worked on the A-list tasks: stacking rings on a pole, sorting fruit by instruction, plugging in power cords, and spelling words. These tasks had very clear operating requirements and metrics. Take fruit sorting: the fruit types were fixed and the pick and place points were essentially fixed, so a model could be trained repeatedly against a stable set of conditions and scores could climb quickly.

Gan Ruyi, Independent Variable's algorithm partner, noted that on day one most contestants scored low, with success rates on the ring task mostly between 20% and 70%; by day two, many teams had found their footing and begun focused optimization. Some even showed clear overfitting on a single task, with success rates approaching 100%.

What does this mean?

It means that in today's embodied-intelligence industry, quickly tuning a model to complete a specific task is no longer as far-fetched as many assume. For solution providers and factories, that is a confidence-boosting signal. In the past, switching an industrial robot to a new task usually meant a long cycle of programming, simulation, engineering debugging, and on-site tuning; if the adaptation period for some tasks can really be compressed to a few days, then even though that is far from "generalization", it is enough to change expectations about robots entering factories to take on specific work.

But the problem lies exactly here.

A task completed quickly in a few days does not mean the model's generalization has improved.

It was just as A-list scores were climbing, with some teams nearing full marks, that Independent Variable released the hidden B list.

Once the tasks were no longer known in advance, models optimized around a single goal quickly exposed their limits.

Yuan Haokuan, a contestant from Nanjing University of Posts and Telecommunications whose team took third prize, told InfoQ that they chose the fruit-sorting task. In the B-list stage, the competition not only added new fruit types but also introduced distractor items and changed the spatial layout of picking and placing. "The fine-tuning we did for the A list was basically useless. We had to go back to the base model and collect more diverse real-robot data again."

They collected about 30 randomly placed demonstrations on-site and fine-tuned for roughly an hour, about 10,000 steps in total, but the results were still not ideal. The main problem: the data was too small in quantity and too uniform in diversity.
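The loop they describe, a handful of demos driven through thousands of gradient steps, can be sketched in miniature. The linear "policy" below is a deliberately toy stand-in for a pretrained policy head (real VLA fine-tuning updates a large backbone); the demo count and step count mirror the figures the team reported, everything else is illustrative.

```python
import random

# Toy stand-in for fine-tuning a policy on a handful of demos:
# sample a demo, compute the error, take one gradient step.
random.seed(0)

# ~30 "demos": observation x -> action y, drawn from y = 2x + 1 + noise
demos = [(x / 10.0, 2 * (x / 10.0) + 1 + random.gauss(0, 0.05))
         for x in range(30)]

w, b, lr = 0.0, 0.0, 0.1  # toy policy parameters and learning rate

def mse(w, b):
    """Mean squared error of the toy policy over all demos."""
    return sum((w * x + b - y) ** 2 for x, y in demos) / len(demos)

loss_before = mse(w, b)
for step in range(10_000):        # the team reported ~10,000 steps
    x, y = random.choice(demos)   # batch size 1 for simplicity
    err = w * x + b - y
    w -= lr * err * x             # gradient of squared error w.r.t. w
    b -= lr * err                 # gradient w.r.t. b
loss_after = mse(w, b)
```

The loss falls quickly on the demos themselves, which is exactly the A-list pattern: the loop says nothing about how the policy behaves on inputs outside the narrow demo distribution.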

This was not a one-off problem for individual teams but common feedback across teams in the B-list stage. Scoring high on a single task is not that hard; once generalization requirements enter, such as more fruit types or a different placement layout, the model can hardly keep up stably.

I saw two things in this hackathon.

On one hand, task adaptation really is getting faster, lowering the threshold for robots to enter real-world scenarios; on the other, teams doing serious work on base models and teams producing surface-level results from off-the-shelf bases plus task fine-tuning will now be told apart much faster.

An open-source base model, some on-site data, a few GPUs, and short-term fine-tuning around one task: that is enough to reproduce the effects seen in papers or promotional videos.

Such results are not without value: they show that existing base models and toolchains can support rapid deployment of certain tasks. But they should not be misread as "the model already has general abilities", because such demos presuppose a clear task, a fixed environment, and limited variables, not continuous adaptation in an open world.

What really separates embodied-AI companies is who has the stronger base model and who can stay stable across task changes, environmental changes, and sustained execution.

In other words, the gap between teams seriously building base models and teams overfitting on top of ready-made shells will only widen.

If this hackathon offers one direct lesson, it is that evaluating a model can no longer stop at whether it has a polished real-robot demo; we need to see whether it holds up under multi-task, unfamiliar-task, and continuous-task pressure on real hardware.

That is why more and more Chinese vendors are launching their own real-robot evaluation systems and challenges: Yuanli Lingji has RoboChallenge, Zhiyuan has the AgiBot World Challenge, and Independent Variable has launched ManipArena. The consensus behind them is simple: unless a model is taken out of the demo and tested repeatedly in real-robot, multi-task, constrained settings, the industry risks being led around by demo effects.
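The core bookkeeping such evaluations need is unglamorous: log every trial and report success rates broken down by task and by whether the task was seen in training. The schema below is hypothetical (none of the named challenges publish theirs in this article), but it captures the seen-versus-unseen split that separates a leaderboard score from a generalization claim.

```python
from collections import defaultdict

# Hypothetical trial log from a real-robot eval: (task, unseen?, success)
trials = [
    ("sort_fruit", False, True), ("sort_fruit", False, True),
    ("sort_fruit", False, True), ("sort_fruit", False, False),
    ("sort_fruit_new_items", True, False), ("sort_fruit_new_items", True, True),
    ("sort_fruit_new_items", True, False), ("plug_cable", False, True),
]

def success_rates(trials):
    """Return per-task and per-split (seen vs unseen) success rates."""
    per_task, per_split = defaultdict(list), defaultdict(list)
    for task, unseen, ok in trials:
        per_task[task].append(ok)
        per_split["unseen" if unseen else "seen"].append(ok)
    rate = lambda xs: sum(xs) / len(xs)
    return ({t: rate(v) for t, v in per_task.items()},
            {s: rate(v) for s, v in per_split.items()})

per_task, per_split = success_rates(trials)
```

With this toy log, the seen-task rate is 0.8 while the unseen rate is one in three, the same shape of gap the B-list stage exposed.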

Of course, many current leaderboards are still hard to make fully transparent. To ease participants' worries about information leakage, many evaluation systems do not require disclosing model ownership and isolate model parameters and code behind interfaces so they are never directly exposed.

That arrangement is pragmatic, but it also means the industry still needs a more mature standard for distinguishing "the ability to rank on a leaderboard for specific tasks" from "truly generalizable ability".

In this sense, the overfitted results achievable in two days at a hackathon are more than a competition phenomenon. They are a reminder: the industry should be more skeptical of model performance, and model teams should be pushed to produce results that withstand real-robot, multi-task pressure.

Independent Variable's choice: don't stack task-specific model systems and engineering patches just to ship vertical scenarios faster

The lessons from the competition also confirm Independent Variable's own thinking to some extent.

For many participating teams, the competition quickly exposed a problem: post-training and fine-tuning can patch some abilities, but past a certain point, the ceiling is still set by the base model itself.

Based on this judgment, Independent Variable did not pick scenarios where results are easy to engineer. Instead, it put complex environments such as homes in a prominent position, hoping to accumulate data from real-world interaction and keep iterating the base model on it.

Wang Hao, Independent Variable's CTO, told media including InfoQ that the company's core direction is to "keep iterating the base model forward". In his view, the team can certainly explore scenarios to verify the base model's capabilities and whether it can be deployed at scale; but one thing must be resisted: stacking task-specific model systems and engineering patches just to make the robot ship faster in vertical scenarios. For example, if vision has a blind spot, bolting on a small vision model to detect and compensate. That approach "can speed up deployment in the short term, but in the long run it harms the improvement of the base model".

This statement is not only a technical judgment but also a business judgment.

Judging from its external partnerships, Independent Variable does have industrial customers, but the scenarios it invests most in clearly tilt toward service environments: homes, nursing homes, and hotels.

Wang Hao does not dodge this point. He told us that, in both product and business strategy, Independent Variable wants robots deployed at scale and in commercial scenarios as early as possible. Service scenarios such as homes, nursing homes, and hotels matter because "such scenarios can provide us with a source of data".

At the same time, Independent Variable regards the home as one of the most complex and open environments. Pushing capabilities toward such scenarios and then covering more vertical ones is essentially generalizing first and "reducing dimensionality" afterward: once the base model is strong enough, the additional demands vertical scenarios place on it actually decrease.

And general ability ultimately comes down to the base model.

This is why embodied-AI vendors are starting to aim at "embodied-native" models.

From an engineering standpoint, today's mainstream design for the embodied "brain" has settled into a rough consensus: visual, language, and at most tactile inputs are processed by a large language model that outputs actions, while the world model is mostly used to generate simulation data or build environments.
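The consensus pipeline can be caricatured in a few lines of dataflow: encode each modality, fuse, decode an action. Every component below is a toy stand-in (pooling, token hashing, a fixed random projection), not any vendor's architecture; the 7-dimensional action is an assumed end-effector command format.

```python
import random

EMBED = 8  # toy embedding width; real VLAs use thousands of dimensions

def encode_vision(image):
    """Stand-in visual encoder: pools pixel values into a fixed-size vector."""
    mean = sum(image) / len(image)
    return [mean] * EMBED

def encode_language(instruction):
    """Stand-in text encoder: hashes tokens into a fixed-size bag-of-words vector."""
    vec = [0.0] * EMBED
    for tok in instruction.split():
        vec[hash(tok) % EMBED] += 1.0
    return vec

def policy(image, instruction, action_dim=7):
    """Fuse modalities and decode one action (e.g. a 7-DoF arm command)."""
    fused = [v + l for v, l in
             zip(encode_vision(image), encode_language(instruction))]
    # toy "decoder": a fixed random projection from fused features to actions
    rng = random.Random(42)
    return [sum(f * rng.uniform(-1, 1) for f in fused)
            for _ in range(action_dim)]

action = policy(image=[0.2, 0.5, 0.9],
                instruction="put the apple in the bowl")
```

The point of the caricature is the topology, one shared trunk feeding an action head, which is exactly the structure Wang Hao's critique in the following paragraphs is aimed at.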

But the question is, is this architecture really suitable for the physical world?

In Wang Hao's view, the traditional training path contains a typical misconception: train modalities separately and then align them, or make language sufficiently general first and then align vision to language. That approach tends to sacrifice visual ability, because it assumes vision exists only to serve language. Embodied scenarios are not like that. Language is good at expressing macroscopic intent but struggles to describe how an action evolves continuously at centimeter scale in space and second scale in time; video models capture pixel-level detail but do not naturally understand which contacts, motions, and collisions matter physically.

Independent Variable's new direction is to integrate the world model and the VLA more deeply inside an end-to-end framework. Through joint modeling, vision and actions are aligned at an earlier stage, so predictions better follow physical laws.

This does not mean giving up the large language model.

Wang Hao told InfoQ that the large language model remains the basis of training; the key change is the reconstruction of the representation space: "The large language model is still the training basis, but we need to bring language and actions into the same space, rather than having all of vision serve language as before."

In his view, the differences among language, vision, and actions show up first in information scale. Language leans macroscopic ("the information language describes is very macroscopic") and struggles to depict an object's continuous changes at centimeter scale in space and second scale in time; video models are the opposite, focusing on pixel-level detail ("the color and brightness of each pixel can be predicted very accurately"). Naturally integrating these two scales of information in one model is hard.

Under this framework, actions are no longer just outputs; they become one of the key modalities.

According to Wang Hao, the value of actions lies in expressing both levels at once: "Actions are a very good modality. Macroscopically, they express what a behavior means and what results it leads to; microscopically, they help vision better observe the key changes in motion." This also means the model no longer just "sees a static world": it must understand motion itself, advancing vision from static perception to the modeling of dynamic processes.

"Put these modalities together," Wang Hao summarized, "and we can build a model that truly belongs to the physical world."

To get there, the way actions are encoded in the model changes accordingly: no longer treated as a single modality's output, they can be jointly or conditionally encoded with language and vision and expressed on a finer-grained time scale.
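The article does not specify Independent Variable's encoding scheme, but one widely used way to put actions in the same token space as language and vision is uniform binning: each continuous action dimension at each timestep is discretized into a bin id that can share a vocabulary with other tokens. The bin count and range below are assumptions for illustration.

```python
N_BINS = 256          # assumed vocabulary size reserved for action tokens
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def encode_action(value):
    """Map a continuous action value in [LOW, HIGH] to a discrete bin id."""
    clipped = max(LOW, min(HIGH, value))
    return min(N_BINS - 1, int((clipped - LOW) / (HIGH - LOW) * N_BINS))

def decode_action(bin_id):
    """Map a bin id back to the continuous value at the bin centre."""
    return LOW + (bin_id + 0.5) * (HIGH - LOW) / N_BINS

# One action dimension sampled over four timesteps
trajectory = [0.0, 0.31, -0.77, 0.99]
tokens = [encode_action(a) for a in trajectory]
recovered = [decode_action(t) for t in tokens]
```

The round-trip error is bounded by the bin width, so a finer time scale or more bins trades sequence length and vocabulary size against motion fidelity, which is the design tension behind "expressed on a finer-grained time scale".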

The choice of model structure also directly affects the data route.

Almost every embodied-AI company today talks about its "data pyramid", but they disagree about what belongs at the bottom.

Take Xinghaitu and Independent Variable. Both stress the importance of egocentric data, but they understand the concept differently. Xinghaitu's core egocentric data is mainly first-person human video; Independent Variable's egocentric data includes human wearable devices. As Wang Hao put it: "In terms of degrees of freedom, egocentric data matches human degrees of freedom exactly. All handheld and wearable devices sit somewhere between human and machine degrees of freedom."

This looks like a mere difference in data taxonomy, but it reflects different judgments about where general ability comes from. Some believe the model first needs a large amount of human-perspective experience; some want data closer to the robot's control structure as early as possible; others prize real-robot takeover, teleoperation, and feedback from real tasks. Everyone appears to be talking about data, but the real divergence lies in how data is defined at the most basic level.

Three days are enough to produce a respectable result. That means demos are no longer scarce, and no longer enough to trust.

What the real world demands is continuous improvement of the base model and a genuinely hands-on process: understanding the model, the hardware, the data, and also the failures and boundaries that never appear in the video.

Under such a standard, many gaps are just beginning to emerge.

This article is from the WeChat official account "AI Frontline" (ID: ai-front), author: Yao Ge, republished by 36Kr with authorization.