"Zero-Data" Robots: Ready for Operation in Two Months, Tsinghua Doctors' Breakthrough in World Models Enables Instant Skill Mastery

Robot behaviors are endless, but the laws underlying them never go beyond these three.

How many degrees of freedom are needed for a robotic arm to pick up a bank card on the table?

Three? Five? Most people intuitively think that the more degrees of freedom, the safer the bet, and that adding tactile sensing, vision, and force control would be better still... After all, even a human hand may not manage this task on the first try.

However, one robotics company affiliated with Tsinghua University doesn't buy this. It reduced the solution to a single degree of freedom.

The module you see can't even be called a robotic arm. It's more like the most common industrial gripper on the production line: two wedge-shaped black clamping pieces that move along a fixed rail and bite together like a bird's beak.

The inner side of each clamping piece is wrapped in a silver tactile sensing material. Beyond that, there is no external camera, no cloud-based brain, and no learning from any "demonstration trajectory data." Everything runs locally on the edge.

But it can pick up that white card, less than 1 millimeter thick and lying flat on the table. More precisely, it doesn't "pick" the card up but "pries" it up: one clamping piece first presses down on one edge of the card so that, with the table acting as a fulcrum, the opposite edge lifts; the other clamping piece then closes in, and the two pieces squeeze together to lift the whole card.

As seen in the video, the whole process is not elegant and even a bit clumsy: if the angle is a bit off or the force is a bit too strong, the card slips. But the gripper tries again and again, correcting as it goes, and eventually it always finds a better point at which to apply force.

You're not the only one surprised by the experimental results. Even Jiang Yao, the initiator of Acorn Robot (a Ph.D. in mechanical engineering from Tsinghua University and a postdoctoral fellow in neuroscience at Harvard University), called it a "pleasant surprise." "It didn't succeed on the first try," he recalled. "But after trying eight or nine times, it actually found a way on its own."

As he talks about it, Jiang Yao's eyes still light up, just as they did the first time he sensed intelligence emerging in a language model. He calls the strategy the robot figured out for itself "the emergence of behavior driven by instinct." What drives it is Natus, Acorn's edge-side autonomous decision-making model.

This module is Acorn's first product for B2B flexible manufacturing. It has completed the proof of concept (POC) phase with a top domestic cosmetics company and achieved large-scale deployment.

In Acorn's R&D pipeline there are more forms of execution module. Every day they tirelessly practice grasping all kinds of items: from mineral water bottles to soft rubber balls, from bananas to tofu, along with irregularly shaped parts.

These seemingly clumsy attempts all point to the same discovery:

If a robot can still figure out an effective strategy through practice with almost no "demonstration data templates," what the execution layer really lacks may not be more trajectory data but a set of underlying mechanisms that can inspire it to "start moving and trying."

This is also Jiang Yao's sharpest reflection on the current mainstream embodied intelligence route: VLA, world models, and simulation learning are not without value, but they are too likely to fail at the "last mile" of operation execution.

The Execution Side: The Most Silent Dilemma of Embodied Intelligence

Whether it's the VLA trying to achieve an end-to-end closed loop or the "world model" for predicting the physical future, they essentially carry the inertia of the language model's "brute force creates miracles": they think that as long as they've seen enough videos and fed enough data, operational intelligence will naturally emerge. But once they get involved in real physical interactions, this logic will undoubtedly hit two mountains: contact and the body.

The essence of operation is physical contact. Friction, damping, force conduction... These variables that are ubiquitous in the real world are difficult to model stably in the world model. It may be able to accurately generate a predicted video of "a robot grasping a water cup," but it can't calculate the relative friction at the moment when the fingertips touch the cup wall, let alone predict the tiny deformation before the glass slips.

The "seeming ability" in vision can't cover up the "inability" of the execution layer.

In addition, operation has to be carried out through a specific body, and every robot differs slightly in joint wear and assembly tightness. The Acorn team ran a comparative experiment: two grippers of the same model, running the same model parameters and differing only in the tightness of their rails, behaved very differently on the execution side.

The unpredictability of contact and the slight differences between bodies make the data-driven route a bottomless pit: behavior patterns can hardly be exhausted, yet training would have to cover every scenario and every hardware deviation. Even though the world's largest open-source robot dataset has reached the million-trajectory scale, it still hasn't unlocked the model's ability to generalize at the execution layer.

Caption: Open X-Embodiment (OXE), currently the world's largest open-source robot dataset, contains more than one million robot trajectories collected on 22 different robot embodiments, with data from 34 research laboratories around the world.

Even more fatal, on a production line where every second counts, no one can wait several seconds for a large model to close its inference loop. With latency that can run to several seconds, VLA doesn't even qualify for on-site operation.

This has convinced Jiang Yao that there is no universally best model, only the model best suited to a particular machine. VLA tries to solve operation with data, but collecting hundreds or thousands of hours of high-quality teleoperation data itself sets a very high operational bar.

"Operation must be learned in practice, but the premise of practice is that you must be able to start practicing first." This is Jiang Yao's second key judgment on robot execution. It reveals the Achilles' heel of VLA on the execution side and is also the starting point for Acorn to "start from scratch."

The Uncharted Territory Where Disciplines Collide

This judgment is not derived from literature.

During his Ph.D. studies in the Department of Mechanical Engineering at Tsinghua University, Jiang Yao dealt with impedance control and mechanical modeling every day, which gave him an ingrained intuition about physical interactions: The essence of operation is a mechanical behavior, not a visual problem.

In 2016, he went to Harvard for a postdoctoral fellowship in neuroscience, and his research direction changed to the motor control of the human brain. The laboratory conducted a large number of perception interference experiments: shielding vision and interfering with touch to observe the changes in human hand operations. He found that no matter how much interference there was, the most basic grasping action of humans remained unchanged.

"That part that never changes is instinct," Jiang Yao realized. Language can't be learned without an environment, but no one has taught babies how to grab things, yet all humans grab things in a highly consistent way. This is not because they've seen enough scenarios, but because there is an innate mechanism based on touch and mechanics.

Two seemingly unrelated disciplinary languages lined up in Jiang Yao's mind: the essence of operation is fitting mechanical laws, not visual trajectories; and humans' general ability to operate comes from instinct, not data.

Transplanting "instinct" to robots was an absolute "uncharted territory" at that time. When he returned to China to establish a laboratory in 2018, "embodied intelligence" hadn't become popular yet, and VLA hadn't become widespread. Colleagues thought he was talking about metaphysics, and investors didn't understand...

Jiang Yao didn't rush to convince the outside world. Instead, he took great care in cultivating like-minded people. The laboratory requires students to join the group as observers as early as their sophomore year. It looks first at their abilities and, more importantly, at "whether they understand that the data-driven approach can't break the deadlock on the execution side and whether they believe in instinct." He won't accept people who are strong in algorithms but don't agree with this point. New members must be approved by all the Ph.D. students in the laboratory. One of the longest-serving members has been with him for 10 years.

When the company was founded in 2024, eight Ph.D. students in the group unanimously chose to join Acorn. What's even rarer is that they share a private understanding: if the company one day stops adhering to "instinct-driven," they won't continue to work for it. This is not the usual sentimentality of startup stories but a line of defense that has to be in place before a shared conviction goes up against industry inertia.

Because what they're looking for is the "universal gravitation at the operational level."

It's a Law, Not a Rule

This means having to give up the obsession with fitting trajectories. Newton didn't exhaust every motion trajectory but used a law of universal gravitation without any motion parameters to govern all motions. Jiang Yao applied the same logic to operations: VLA is learning trajectories, while Acorn is looking for laws.

Rules hard-code how an operation is carried out, while laws only provide constraints. Building on his in-depth study of physical interaction, Jiang Yao distilled this law into three types of operational instinct:

Orientation instinct solves "where to go" - In coordination with vision, it guides the end to move towards the target, just like a baby naturally turns its head to follow a moving object;

Exploration instinct solves "how to touch" - This is the most complex and also the most representative part of the emergence of intelligence. After contact occurs, the robot doesn't rely on pre-set programs or imitation but autonomously explores along the surface of the object to find a stable contact configuration;

Execution and interaction instinct solves "how to grasp" - With "minimizing slippage" as the core, it adjusts the grasping force in real-time. It's gentle when grasping tofu, firm when grasping a hammer, and adapts to resistance during assembly. All adjustments are based on real-time tactile feedback without any training data.
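To make the "minimizing slippage" idea concrete, here is a minimal sketch of a grip-force update driven only by a slip reading. It is not Acorn's controller: the function names, the gains, the slip tolerance, and the toy contact model are all hypothetical, chosen only to show how one rule can settle on a gentle force for tofu and a firm one for a hammer without knowing which object it holds.

```python
def adjust_grip(force_n, slip_um, slip_tol_um=5.0, gain=0.005,
                relax=0.01, f_min=0.1, f_max=20.0):
    """One update step: tighten when slip exceeds the tolerance, otherwise relax slightly.

    All parameter values are illustrative assumptions, not measured ones.
    """
    if slip_um > slip_tol_um:
        force_n += gain * (slip_um - slip_tol_um)
    else:
        force_n -= relax
    return min(f_max, max(f_min, force_n))


def simulated_slip_um(force_n, required_n):
    """Toy contact model (hypothetical): slip grows with the force deficit."""
    return max(0.0, required_n - force_n) * 100.0


def settle(required_n, steps=200):
    """Run the update loop against the toy model and return the final force."""
    force = 0.1
    for _ in range(steps):
        force = adjust_grip(force, simulated_slip_um(force, required_n))
    return force


tofu_force = settle(required_n=0.5)    # soft object, little force needed
hammer_force = settle(required_n=8.0)  # heavy object, much more force needed
print(round(tofu_force, 2), round(hammer_force, 2))
```

The same rule ends up holding each object at the edge of its slip tolerance; the only thing that differs between the two runs is the feedback signal, not any stored knowledge of the object.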

No one told the gripper at the beginning of the article to "pry the card from the side." It only has the underlying expectation of "finding stable contact," and the prying action naturally emerges under physical constraints.

But to make this set of instincts truly form a closed loop, a key technical threshold must be crossed: Slip perception. "Just like when you're standing on a high-speed train and want to sense the relative speed between the carriage and the ground," Jiang Yao explained. "You're embedded in one of them, and there are almost no reference objects."

The team spent seven years and went through more than a dozen prototype versions before micron-level slip perception became stable and usable. With it, whatever object the robot encounters, it can sense during contact, in real time, that the object is about to slip and correct automatically, without needing to know in advance what the object is. This is also the physical foundation that makes a zero-data cold start possible.

With these three sets of "instinctive laws," countless behaviors of the machine can be inspired.

Natus and Magis: From Instinct to Skill

The "emergence of behavior" driven by the Natus model runs as real-time control on the edge: a 200 Hz response rate, millisecond-level latency, and no cloud dependency, with each unit individually adapted to the mechanical characteristics of its specific hardware at the time of production. Its core mission is to break the deadlock described earlier, that a robot can't practice until it can already operate: make the robot "usable on the production line on the first day."
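As a quick check on what those figures imply, the arithmetic below converts the 200 Hz rate into a control period and counts how many control cycles would pass during a multi-second model delay. The 2-second delay is an assumed illustrative value standing in for "several seconds"; the helper names are hypothetical.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time between consecutive control updates, in milliseconds."""
    return 1000.0 / rate_hz


def cycles_elapsed(delay_s: float, rate_hz: float) -> int:
    """Number of control cycles that fit inside a given delay."""
    return round(delay_s * rate_hz)


period = control_period_ms(200.0)     # one update every 5 ms
missed = cycles_elapsed(2.0, 200.0)   # assumed 2 s delay -> 400 cycles
print(period, missed)
```

Under that assumption, a 200 Hz loop makes hundreds of corrections in the time a slow inference loop would take to produce one.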

But relying on instinct for exploration all the time is too inefficient. This is the meaning of the second-layer model, Magis.

The data generated by Natus exploration is not ordinary video trajectories. It's a record with tactile semantics: When the vision sees "a banana," the touch synchronously marks "weighing 120 grams, with the center of mass to the left, and a rough skin."

This visual data with mechanical annotations is sent to Magis for training, and the resulting skill model understands the physical world far more deeply than one trained on purely visual data: it knows how to grasp, rather than merely looking as if it is grasping.
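One way to picture such a record is as a visual label paired with tactile measurements. The sketch below is only an illustration of that pairing: the class and field names are hypothetical, and the article does not describe Acorn's actual data format. The values are the banana example from the text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TactileAnnotatedSample:
    """A visual observation paired with mechanical annotations (hypothetical schema)."""
    visual_label: str   # what vision sees
    mass_g: float       # weight sensed through touch, in grams
    com_offset: str     # qualitative center-of-mass offset
    surface: str        # qualitative surface texture


sample = TactileAnnotatedSample(
    visual_label="banana", mass_g=120.0, com_offset="left", surface="rough"
)
record = asdict(sample)  # plain dict, ready to store as a training record
print(record)
```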

Once Magis matures, it can call skills directly in familiar scenarios and hand back to Natus for exploration in unfamiliar ones. The new data is then folded back into Magis. On the one hand, instinct keeps emerging and serves as a fallback; on the other, skills keep accumulating and evolving.
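The division of labor described here can be sketched as a simple dispatch: use a stored skill when one exists, otherwise explore and store the result. Everything in the sketch is a stand-in. `SkillLibrary` plays the role of Magis, `explore` plays the role of Natus, and neither reflects how the real models are built or trained.

```python
from typing import Dict, List, Optional, Tuple


class SkillLibrary:
    """Stand-in for Magis: a store of skills keyed by object type (hypothetical)."""

    def __init__(self) -> None:
        self._skills: Dict[str, str] = {}

    def lookup(self, obj: str) -> Optional[str]:
        return self._skills.get(obj)

    def learn(self, obj: str, strategy: str) -> None:
        self._skills[obj] = strategy


def explore(obj: str) -> str:
    """Stand-in for Natus: instinct-driven exploration that yields a strategy."""
    return f"explored-grasp:{obj}"


def handle(obj: str, library: SkillLibrary, log: List[Tuple[str, str]]) -> str:
    """Use a known skill if available; otherwise explore and fold the result back."""
    skill = library.lookup(obj)
    if skill is not None:
        log.append(("magis", obj))
        return skill
    strategy = explore(obj)
    library.learn(obj, strategy)  # new data flows back into the skill store
    log.append(("natus", obj))
    return strategy


library, log = SkillLibrary(), []
first = handle("banana", library, log)   # unfamiliar: falls back to exploration
second = handle("banana", library, log)  # familiar: served from the skill store
print(log)
```

The first encounter goes through exploration and the second is served from the skill store, which is the fallback-plus-accumulation pattern the paragraph describes.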

"We've subverted all current data collection methods," Jiang Yao said. "The best data source is not simulation, not manual teleoperation, but the data generated by the product itself in the real physical world."

Changing Production without Stopping the Line: The Real Value of Zero-Data

Where is this ability to "generate data by itself and develop skills by itself" most urgently needed?

Acorn chose flexible manufacturing. Jiang Yao believes that this is the best intersection point after weighing the thresholds on the execution side and the market pain points.

The cosmetics ODM industry is a typical example. There are more than a hundred SKUs, and they change every few weeks. At every changeover, the line has to stop for parameter adjustment. The pain point is not that the machines are too slow but that they can't recognize new materials. What's more troublesome is the materials themselves: powder compacts are extremely fragile, and slightly too much force leaves marks; the wicks of scented candles are soft and uneven, so too much force pulls them out while too little won't move them. Rules can't cover this kind of task, VLA is extremely expensive to train for it, and traditional automation is helpless.

But for Natus, when the SKU

The views expressed in this article are the author's own; the 36Kr platform only provides information storage services.
