Daimon Robotics Secures RMB 100 Million in Series A Financing from Inovance and China Telecom, Ranks No. 1 in the Industry for "Visual-Tactile" Sensor Shipments | 36Kr Exclusive

Huang Nan, 2026-06-04 09:30
Building a world model with "visuo-tactile" technology.

Author | Huang Nan

Editor | Yuan Silai

Hard Krypton has learned that Daimon Robotics recently completed a 100-million-yuan Series A financing round, jointly invested by Inovance Technology's industrial fund Inovance Industrial Investment and China Telecom. The funds will be used to further build an ultra-large-scale dataset containing physical interaction information, accelerate the research and development of physical world models, and drive the data flywheel and commercial closed-loop in real physical scenarios.

Daimon Robotics officially started operations in 2023. Its core team has long been focused on the fields of robotic dexterous manipulation and physical interaction intelligence. Professor Wang Yu, the co-founder and chief scientist, was the founding dean of the Robotics Research Institute at the Hong Kong University of Science and Technology. The concepts he proposed, such as "embodied skills" and "skill cloning," are important components of Daimon Robotics' core technology roadmap. Dr. Duan Jianghua, the founder and CEO, and the main technical leaders are all from the core team of the Robotics Research Institute at the Hong Kong University of Science and Technology, with 10 years of know-how in manipulation intelligence. Yuan Weihao, the chief AI scientist, was a multi-modal research expert at Alibaba's Tongyi Laboratory, with cutting-edge experience in migrating world models to robotic physical manipulation.

As interest in embodied intelligence keeps rising, the logic of the industry is changing profoundly. The sector has evolved along a clear path: from early competition over walking and motion control, to the exploration of differentiated algorithm architectures and the "embodied brain." Each wave of attention has laid key foundations for the next breakthrough.

As humanoid robots move from stage demonstrations to real-world work, the bar for fine-grained, whole-robot manipulation keeps rising. Whether a company can collect high-quality physical interaction data has become a key dividing line for deployment.

In mainstream vision-only perception solutions, sensors capture only the appearance of objects. They cannot identify physical characteristics such as softness, hardness, friction coefficient, and deformation under stress, which makes it difficult for robots to predict how objects will change. In contrast, physical interaction data that includes touch can fully record key parameters such as instantaneous force and material properties. Used in large-scale model training, it distills physical common sense, accelerates convergence, helps robots build an understanding of physical causality, and supports a range of fine-grained operations.

Daimon starts by collecting and annotating physical interaction data, gradually builds a complete technical link covering perception, operation, and learning, and then constructs a world model that can provide physical common sense for robots.

At the cognitive level, its model aligns the vision and touch modalities, enabling robots to infer the physical properties of objects from images and, conversely, to infer an object's shape from touch. At the execution level, high-frequency tactile feedback lets the robot complete perception, judgment, and action correction within milliseconds of contact, forming closed-loop control.

Performing fine-grained operations such as threading grapes and placing eggs with physical intuition (Source/Enterprise)

"For robots to be able to work, both an understanding of physical causality and feedback based on real contact are essential," Dr. Duan Jianghua, CEO of Daimon Robotics, told Hard Krypton. "If a robot can do parkour and somersaults but can't pick up a sponge with just the right amount of force to wipe an object, its application value will be greatly reduced. Vision is a non-contact, remote signal. It can tell you where an object is, but it can't tell you why a sponge deforms when touched. Touch, on the other hand, is the 'feel' at the moment of contact and is the key to judging physical causality and achieving fine-grained operations."

However, having only technology and models is not enough. How to drive the continuous iteration of the physical world model through a data closed-loop and professional evaluation standards is another major challenge currently faced by the industry. Duan Jianghua pointed out to Hard Krypton that "the essence of the tactile data shortage is that the data representation method for vision has been relatively unified, while there is no standard for touch, and there is a lack of a large-scale, multi-modal real data collection system."

To solve this problem, Daimon has built an "outward-distributed" embodied data collection network. Unlike the traditional model, which relies on fixed laboratories and teleoperation, this network moves collection out of the centralized laboratory and distributes it across real-world settings, which preserves the authenticity of scenarios, brings a step change in collection efficiency, and reduces marginal costs.

In April 2026, Daimon Robotics, in collaboration with dozens of leading domestic and international institutions including Google DeepMind, released the world's largest full-modal physical world dataset containing tactile information, Daimon-Infinity, which includes contact information such as texture, softness, hardness, and mechanics. It also open-sourced 10,000 hours of data for free use by the industry. Based on this dataset, a systematic evaluation standard was established, and in June, a full-modal tactile Benchmark system for physical interaction capabilities, RobOmni, was launched, supporting both "real data training" and "simulator training" modes.

Human babies learn about the world and develop their intelligence through touch. For robots that are about to move from factories into households, this lesson cannot be skipped either. With "seeing clearly" and "walking steadily" addressed, "touching accurately" is becoming the last and most crucial "kilometer" for embodied intelligence to enter the physical world. Daimon told Hard Krypton that the shipment volume of its visual-tactile sensors currently ranks first in the world. The company is trying to define its own standards for the sense of touch.

The following is an excerpt from an interview between Hard Krypton and Duan Jianghua, CEO of Daimon Robotics (slightly edited):

Hard Krypton: From perception to execution, embodied intelligence needs to bridge the gap from "understanding" to "working." How does Daimon's physical world model handle the fusion of visual and tactile modalities and low-level control? What tasks that robots couldn't do before can this architecture help them complete when facing complex operation tasks?

Duan Jianghua: Our model infers physical causality. In terms of model structure, we split physical contact into two layers: the cognitive layer and the execution layer.

The cognitive layer maps vision and touch bidirectionally in the same semantic space. This is similar to human synesthesia. When you see a strawberry, you know it will have a granular texture without squeezing it. When you use a key to open a door and insert it into the lock, your hand may block your view. Without seeing the contact state between the key and the keyhole, humans rely on intuition and the sense of touch to complete the operation - whether it's inserted, stuck, or needs to be turned. We hope robots can do the same thing.
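As an editorial illustration of what a bidirectional mapping in a shared space makes possible, the sketch below retrieves the closest tactile embedding for a visual one, and the reverse, by cosine similarity. It is a toy under stated assumptions and not Daimon's model: the three-dimensional vectors, the object names, and the helper functions are all hypothetical, and real systems learn such embeddings from data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, candidates):
    """Return the key whose embedding is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# Hypothetical, hand-made embeddings standing in for learned ones.
# Both modalities live in the same 3-D space, so they can be compared directly.
VISION = {
    "strawberry": [0.9, 0.1, 0.2],
    "steel_key": [0.1, 0.9, 0.1],
    "sponge": [0.2, 0.1, 0.9],
}
TOUCH = {
    "strawberry": [0.8, 0.2, 0.3],
    "steel_key": [0.2, 0.8, 0.0],
    "sponge": [0.1, 0.2, 0.8],
}

def touch_from_vision(name):
    """Seeing an object, look up the tactile signature it should have."""
    return nearest(VISION[name], TOUCH)

def vision_from_touch(name):
    """Feeling an object, look up what it most likely looks like."""
    return nearest(TOUCH[name], VISION)

if __name__ == "__main__":
    for obj in VISION:
        print(obj, "->", touch_from_vision(obj), "/", vision_from_touch(obj))
```

Because both directions use the same space and the same similarity measure, "see, then expect a feel" and "feel, then picture the object" are the same lookup run in opposite directions.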

Daimon Robotics uses a gripper to pick up an egg (Source/Enterprise)

There are two mechanisms running simultaneously in the execution layer. One is a high-frequency tactile servo at the hundred-hertz level, similar to a spinal reflex. Without upper-layer reasoning, as soon as an object starts to show a slipping tendency, a compensating action is sent out before the visual frame switches. It's like when you're washing dishes and the plate covered in dish soap starts to slip a little. You don't need to look at it; your fingers will instinctively tighten to hold the plate.
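The reflex described above can be sketched with a simple Coulomb friction check: an object starts to slip when the tangential force exceeds the friction coefficient times the normal (grip) force, so the loop tightens the grip as soon as the reading approaches that limit. This is a minimal sketch under assumptions, not Daimon's controller: the 200 Hz loop rate, the 30 Hz camera rate, the friction coefficient, the safety margin, and all function names are hypothetical.

```python
CONTROL_HZ = 200  # assumed reflex loop rate ("hundred-hertz level")
CAMERA_HZ = 30    # assumed camera frame rate, for comparison

# Number of reflex ticks that fit inside a single camera frame.
TICKS_PER_FRAME = CONTROL_HZ // CAMERA_HZ

def reflex_step(normal_n, tangential_n, mu=0.5, margin=0.8, max_normal_n=20.0):
    """One reflex tick: return the grip (normal) force to apply next.

    Coulomb friction: slip begins when |tangential| > mu * normal.
    The grip tightens once the tangential force passes `margin` of that limit.
    """
    slip_limit = mu * normal_n
    if abs(tangential_n) > margin * slip_limit:
        # Smallest grip force that puts the reading back inside the margin.
        needed = abs(tangential_n) / (mu * margin)
        return min(needed, max_normal_n)
    return normal_n

def run_reflex(tangential_readings, normal_n=4.0):
    """Feed a stream of tangential-force readings through the reflex."""
    history = []
    for f_t in tangential_readings:
        normal_n = reflex_step(normal_n, f_t)
        history.append(normal_n)
    return history

if __name__ == "__main__":
    # A plate getting heavier in the hand: tangential load ramps up.
    print(run_reflex([1.0, 1.5, 1.8, 2.4]))
    print("reflex ticks per camera frame:", TICKS_PER_FRAME)
```

With these assumed rates, several reflex corrections fit inside one camera frame, which is the point of running the loop without waiting for vision or upper-layer reasoning.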

The other is physical world reasoning. The model continuously predicts the state of the operation over the next few steps and proposes a correction before a mistake actually occurs. It's like pouring water from a kettle into a cup with one hand. As the water flows out, the kettle's center of gravity keeps shifting. Your brain continuously predicts the kettle's weight distribution over the next second based on the flow rate and smoothly adjusts the tilt of your wrist in advance to keep the stream steady.
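The idea of predicting a few steps ahead and correcting before the error occurs can be illustrated with a deliberately simple stand-in: a constant-flow model of a cup filling up. The sketch rolls the state forward over a short horizon and, if the prediction overshoots the target, lowers the flow in advance. The model, the parameter values, and the function names are assumptions made for illustration; the article does not describe how Daimon's world model is built.

```python
def predict_levels(level_ml, flow_ml_s, dt_s, horizon):
    """Roll a constant-flow model forward and return the predicted fill levels."""
    return [level_ml + flow_ml_s * dt_s * k for k in range(1, horizon + 1)]

def plan_flow(level_ml, flow_ml_s, target_ml, dt_s=0.1, horizon=5):
    """Return the flow rate to command for the next step.

    If the current flow would overshoot the target within the horizon,
    slow down now so the level reaches the target at the end of the horizon.
    """
    predicted = predict_levels(level_ml, flow_ml_s, dt_s, horizon)
    if predicted[-1] > target_ml:
        corrected = (target_ml - level_ml) / (dt_s * horizon)
        return max(0.0, corrected)
    return flow_ml_s

if __name__ == "__main__":
    # Far from the target: keep pouring at the current rate.
    print(plan_flow(level_ml=100.0, flow_ml_s=50.0, target_ml=220.0))
    # Close to the target: the prediction overshoots, so the flow is cut early.
    print(plan_flow(level_ml=200.0, flow_ml_s=50.0, target_ml=220.0))
```

The correction is issued while the cup is still below the target, which is the difference between this kind of lookahead and a reflex that reacts only after contact signals change.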

These two mechanisms correspond to millisecond-level reaction and multi-step lookahead, respectively. They work together on the same task at different time scales. This is the most important structural difference from vision-only manipulation models.

Hard Krypton: Daimon recently released a dataset and a Benchmark for robotic physical interaction capabilities. What is the relationship between these and the physical world model you're working on?

Duan Jianghua: The dataset is the fuel, the physical world model is the engine, and the Benchmark is the tachometer.

Traditional datasets, whether visual or simulated, record "pixel changes" or "trajectories." However, this is far from enough for robots to understand the physical world. For example, is an object soft or hard? Is its surface smooth or rough? What are the normal pressure, tangential force, and slipping tendency during a grasp? All of these are physical property information. The Daimon-Infinity dataset collects more than a dozen modalities, including pressure, deformation, texture, stiffness, and slipping tendency.

The greatest difficulty is not collecting any single modality but strictly aligning these dozen-plus tactile modalities with visual images and action commands, in both space and time, at millisecond precision.

Daimon Robotics achieves the task of threading grapes autonomously (Source/Enterprise)

For example, when a robot's finger touches an object, the tactile sensor records the pressure distribution and texture at the contact point, the camera records the image at that moment, and the control system records the joint angles and torques. These three streams must be synchronized in time to millisecond accuracy; otherwise, the model will have difficulty learning the correct causal logic.
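One common way to enforce this kind of synchronization, shown here only as a sketch, is nearest-timestamp matching with a tolerance: each tactile sample is paired with the closest camera frame and the closest joint reading, and the sample is discarded if either is too far away in time. The 2 ms tolerance, the timestamp values, and the function names are assumptions; the article does not say how Daimon aligns its streams.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the entry closest to t in a sorted, non-empty list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align(tactile_ts, camera_ts, joint_ts, tol_ms=2.0):
    """Pair each tactile timestamp with the nearest camera and joint timestamps.

    All timestamps are in milliseconds and each list is sorted.
    A tactile sample is kept only if both partners lie within tol_ms.
    """
    triples = []
    for t in tactile_ts:
        c = camera_ts[nearest_index(camera_ts, t)]
        j = joint_ts[nearest_index(joint_ts, t)]
        if abs(c - t) <= tol_ms and abs(j - t) <= tol_ms:
            triples.append((t, c, j))
    return triples

if __name__ == "__main__":
    tactile = [0, 10, 20, 30]   # hypothetical tactile sample times
    camera = [1, 33]            # hypothetical camera frame times
    joints = [0, 9, 21, 29]     # hypothetical joint-state sample times
    print(align(tactile, camera, joints, tol_ms=2.0))
    print(align(tactile, camera, joints, tol_ms=3.0))
```

Dropping unmatched samples trades data volume for causal correctness: a training example whose image predates the contact by tens of milliseconds would pair the wrong picture with the force reading.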

With data and the model in place, the next question arises - how to judge whether the model has really learned physical causality? This is the significance of Daimon's launch of RobOmni.

Existing benchmarks in the embodied field tend to focus on the visual modality, emphasizing generalized grasping and long-horizon planning tasks. Evaluation standards for tactile perception and fine-grained contact operations are still immature.

The industry still lacks a standardized evaluation benchmark for tactile perception and dexterous manipulation. There is no unified standard among different models and data, making it difficult to quantify tactile capabilities and systematically verify the model's generalization ability.

We have noticed that some teams working on simulation and Sim2Real have recently started to introduce visual-tactile fusion evaluation. This shows that the frontier of the industry is reaching a consensus: vision alone is not enough for robots to truly understand and interact with the world, and touch is indispensable. RobOmni fills this gap, providing a standardized, comparable, reproducible, and scalable way to verify physical interaction capabilities.

Without a ruler, it's impossible to measure progress. Without standards, the industry can't form a joint force. So we need to make a ruler first and then measure the world.

Comments from investors:

A representative of Inovance Industrial Investment said that for embodied intelligence to achieve a generational leap in real-world operations, using tactile perception to fill in physical causal logic is the inevitable path. Daimon Robotics is one of the few companies in the industry that starts from physical causal logic and uses massive visual-tactile data to bring physical world models into fine-grained operation scenarios. Inovance Technology has long been deeply involved in industrial automation and intelligent robots and is well aware of the strategic value of multi-modal perception in such scenarios. Drawing on Inovance's scenario and industry knowledge, the firm looks forward to working with Daimon to build a "tactile neural network" for the era of embodied intelligence.

A representative of China Telecom Investment Company said that large-scale commercial deployment of embodied intelligence requires not only continuous upgrades to cloud-based computing power for large models but also high-precision physical perception and a multi-modal data system to support it. Daimon Robotics has built deep expertise in visual-tactile perception and a solid core technology barrier. As a key force in the construction of Digital China, China Telecom is fully implementing its "Cloud Transformation, Digital Transformation, and Intelligence Benefit" strategy. The company looks forward to cooperating closely with Daimon Robotics to create deployable and replicable industry solutions for embodied intelligence, build new digital infrastructure that supports the development of new quality productive forces, and help accelerate the high-quality development of the embodied intelligence industry for the benefit of the whole ecosystem.