
Exclusive Interview with Yingke | Wang Xiaogang, Co-founder of SenseTime, Leads a New Embodied Intelligence Business to Help Robots Understand the Real World Anew

Huang Nan | 2025-12-15 09:30
Build a "world model" that understands the physical laws of the world and the logical patterns of human behavior.

Author | Huang Nan

Editor | Yuan Silai

In the AI industry, SenseTime is an 11-year-old company that has long been accustomed to the industry's ebbs and flows.

In the era when visual AI was on the rise, it emerged from the CUHK laboratory and opened the door to large-scale deployment. But the To B business has never been easy: most companies, SenseTime included, have had to serve customers' long-running demands for customized development.

Then ChatGPT arrived out of nowhere, and companies collectively pivoted to large models. SenseTime, which had moved early on computing power, found room to show its strengths. According to SenseTime's annual report, its generative AI revenue in 2024 was 2.4 billion yuan, accounting for 63.7% of total revenue, up from 34.8% in 2023, making it the company's most important business.

However, after three years of rapid large-model development, a practical question has emerged: beyond single-point breakthroughs in specific scenarios, how can AI truly enter the physical world and become a practical tool that changes production and daily life?

This is also the core question SenseTime has asked itself at each technological iteration.

As embodied intelligence becomes the main battlefield of the next generation of AI, Daxia Robotics was recently established, with Wang Xiaogang, co-founder and executive director of SenseTime, serving as chairman, marking SenseTime's formal entry into the embodied intelligence arena.

Wang Xiaogang told Yingke that Daxia Robotics was not founded to join the race over robot bodies or to show off complex tricks, but to return to real pain points and propose a new "Human-centric" research paradigm: build a "brain" focused on understanding the laws of the physical world, and ultimately deliver hardware-software integrated products that meet the needs of real scenarios.

This also reflects an industry trend. The embodied intelligence industry, which only last year was still working out locomotion stability and applicable scenarios, looks completely different one year on. Some companies have won orders worth hundreds of millions of yuan and entered factory workshops in Shenzhen, Shanghai, and Suzhou, so embodied intelligence is no longer just a story told to VCs.

AI technology is evolving from "digital intelligence" to "physical intelligence", and established AI companies will find themselves facing another major transformation in the process.

SenseTime's net loss in the first half of 2025 was 1.162 billion yuan, down 50% year on year, while its R&D investment keeps growing. It needs to find more practical directions.

The breakthrough toward general intelligence lies not in chasing an overnight AGI fantasy, but in distilling reusable capabilities from real interactions. A robot's ultimate value lies not in a cool form factor, but in whether it can solve practical problems in the physical world. From visual AI and large models to embodied intelligence, with Daxia Robotics as a fulcrum, SenseTime is trying to pry open not only a multi-billion-dollar embodied intelligence market, but also the possibility of deep interaction between AI and the physical world.

The following is a transcript of the interview between Yingke and Wang Xiaogang, lightly edited:

Not Just an Embodied Brain Company

Yingke: This year is generally regarded as the first year of embodied intelligence deployment. Why did SenseTime choose this moment to establish Daxia Robotics and enter the embodied intelligence track?

Wang Xiaogang: It is mainly based on considerations from two dimensions: industrialization implementation and technological paradigm.

On industrialization: embodied intelligence is a vast track worth tens of trillions of yuan, with potential for even more. As NVIDIA founder Jensen Huang has said, in the future everyone may own one or more robots; the number of robots is expected to exceed that of mobile phones, while the unit value can rival that of a car.

For SenseTime, which has historically focused on To B software, the vertically integrated nature of the robotics track is an important breakthrough point for expanding the scale of the business and upgrading to a combined hardware-software model. At the same time, thanks to our accumulated experience across vertical industries, our team understands users' pain points and needs. Compared with embodied intelligence companies that understand scenarios poorly and struggle to solve practical problems, SenseTime's ability to land in real scenarios gives more cause for optimism, and its industrialization should progress faster.

From the perspective of technological paradigm, there are obvious shortcomings in the development of traditional embodied intelligence.

Robot body hardware has developed rapidly, but intelligence at the "brain" end lags behind. The core problem is the "Machine-centric" technical route: first design robot bodies with huge differences in form and parameters, then train a general model on data collected from those bodies. This idea does not hold up. Just as humans and animals in nature cannot share a single brain, robots with different structures, such as dexterous hands, grippers, and mechanical arms with differing configurations, are hard to fit under one unified model.

Yingke: What are the differences in the technical solutions adopted by the Daxia Robotics team?

Wang Xiaogang: We propose a new "Human-centric" technical paradigm. First, study how humans interact with and move through the physical world. Using tools such as wearable devices and third-person-view devices, combined with multi-dimensional data covering vision, touch, and mechanics, we record human behavior in real production and daily life, especially complex, common-sense behavior.

Feeding this data into the world model lets it deeply understand the laws of the physical world and the logic of human behavior, thereby building a powerful robot "brain". In turn, a mature world model can guide hardware design, making the hardware form better suited to actual application needs.
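To make the multi-signal recording concrete, here is a minimal sketch of what one time step of such a "Human-centric" recording might look like in code. All class and field names here are hypothetical illustrations of the signal types described above (vision, touch, mechanics from wearables), not SenseTime's actual data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecordingFrame:
    """One time step of a hypothetical human-behavior recording."""
    timestamp_s: float               # seconds since the session started
    rgb_frame: bytes                 # encoded first-person camera image
    fingertip_forces_n: List[float]  # per-fingertip normal force, in newtons
    grip_torque_nm: float            # wrist torque from a wearable sensor
    tactile_contact: bool            # whether the sensor glove registers contact

@dataclass
class RecordingSession:
    """A labeled activity plus its stream of multi-modal frames."""
    activity: str                    # e.g. "wiping a table"
    frames: List[RecordingFrame] = field(default_factory=list)

    def contact_ratio(self) -> float:
        """Fraction of frames in which the hand touches an object,
        a simple indicator of how interaction-heavy the activity is."""
        if not self.frames:
            return 0.0
        return sum(f.tactile_contact for f in self.frames) / len(self.frames)
```

The point of the schema is that each frame pairs the visual stream with contact signals that vision alone cannot capture, which is exactly the dimension the interview argues camera-only approaches miss.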

In August and September this year, companies such as Tesla and Figure AI announced that they would abandon the real-robot data route in favor of a vision solution based on first-person-view cameras. In essence, however, that approach records human behavior through vision alone and does not cover key dimensions such as force, touch, and friction, which are core requirements for embodied intelligence to make three-dimensional contact with the physical world.

Relying on vision alone, robots can imitate actions such as dancing and punching, but in scenarios that require interaction with the physical world, such as moving bottles or driving screws, they inevitably hit technical bottlenecks.

The Human-centric paradigm proposed by Daxia Robotics has already been verified in practice. Earlier, the team of Professor Liu Ziwei, a core member of Daxia Robotics, collaborated on the EgoLife dataset, which contains 300 hours of real human behavior data from first-person and third-person perspectives. The embodied vision model developed on this dataset has been shown in testing to address a key pain point: most existing data consists of simple, low-information behaviors that cannot support the learning of complex motion.

Members of the Daxia Robotics team: In the first row, from left to right, are Li Hongsheng, Tao Dacheng, Wang Xiaogang, and Pan Xingang; in the second row, from left to right, are Lü Jianqin, Zhao Hengshuang, Liu Ziwei, and Liu Xihui (Source: the enterprise)

Yingke: Public data shows that China's embodied intelligence market exceeded 80 billion yuan in 2024, and hundreds of start-ups have entered the field in the past two years. Against this backdrop, how does Daxia Robotics define its niche in the industry?

Wang Xiaogang: The Daxia team's ultimate goal is to deliver hardware-software integrated products that actually solve practical problems in various scenarios, not simply to be a model-making company.

Along the way, we found that existing hardware designs often fail to match scenario requirements, which pushed the team onto the path of joint R&D and customized hardware manufacturing.

Take the quadruped robot (robot dog) as an example. Traditional robot dogs in the industry carry cameras with a narrow field of view, mounted low, so they cannot accurately identify the direction of travel at intersections or capture traffic-light signals when crossing the road. We worked with Insta360 to develop a panoramic camera module with 360-degree full-view coverage, solving the limited-vision problem.

In addition, many current robot dogs still suffer from pain points such as insufficient waterproofing, costly compute platforms, and limited battery life, and so cannot meet the demands of normal use in real scenarios.

Yingke: In the concrete practice of joint development, what is the cooperation model between the two parties?

Wang Xiaogang: Our strengths are at the brain end: the model, navigation, and manipulation. In the past, although the company had B-end software services and large-scale systems providing underlying infrastructure, it never formed a standardized product at the device end.

Relying on the body-hardware and component companies SenseTime has invested in over the past two years, the Daxia team takes an ecosystem approach: we provide hardware design specifications and jointly design and develop hardware bodies with partners. At the model end we also stay open, providing base models and supporting materials.

Yingke: SenseTime has rich data and accumulated technology in fields such as security and autonomous driving. Which core capabilities can be directly reused when these resources are migrated and extended to embodied robots?

Wang Xiaogang: There are two core capabilities. The first is the R&D system and safety standards. Both autonomous driving and embodied robots rely on massive data to drive technical iteration. The R&D system, data closed loop, and data flywheel built up in autonomous driving have been shown to improve the iteration efficiency of robot technology. Likewise, the strict safety and data-quality standards of the autonomous driving field can be carried over to embodied-robot R&D, underwriting product reliability.

Second, application functions. The Ark platform we built for smart cities accumulates hundreds of application functions that were previously used mainly with fixed cameras. By connecting it to embodied robots, devices heading outdoors can seamlessly inherit those functions and expand their boundaries through the platform's back-end analysis capability.

"Within one or two years, Human-centric will first reach large-scale application in robot dogs"

Yingke: Looking back over SenseTime's 11 years, the company has witnessed and participated in the full transformation from the large-scale deployment of visual AI to today's explosion of embodied intelligence. How do you understand the different paths of technological iteration at each stage, and the logic underlying them?

Wang Xiaogang: SenseTime's development clearly traces AI technology's evolution from its 1.0 to its 3.0 form.

When the company was founded in 2014, AI was in its 1.0 era, with face recognition, its flagship application, surpassing the human eye in recognition accuracy. At that time, the "intelligence" came from manual annotation: by adding labels, "cognitive ability" was injected into images that had no intelligent attributes of their own.

But because labels carry little information and are narrowly task-specific, each task required separately annotated images and videos, giving rise to the saying "as much intelligence as there is manual labor". Limited by the data, the models of that era were not only small but also hard to generalize across scenarios and industries.

In the 2.0 large-model era, the situation changed fundamentally. The core difference is that the data itself contains more intelligence. We use text and image-text data from the Internet: a poem, an article, or a piece of code records behavioral intelligence that humans have accumulated over thousands of years, far more than a simple label carries.

Combined with this data, large models achieved an explosion of intelligence, letting a single model cross scenarios and industries with strong generality.

However, the value of Internet data is gradually being exhausted, and the marginal gains in generality are slowing.

In the 3.0 embodied intelligence era we are moving toward, we turn to direct interaction with the physical world. To build a "world model" that understands physical laws and the logic of human behavior, studying text and image-text data alone is far from enough; we must actually interact with the physical world. Whether cleaning a room or providing services, these concrete scenarios contain complex real-time intelligence. Through direct contact and interaction with the world, AI will break through the limits of existing data and open new paths for intelligence to grow.

Yingke: In terms of industry trends, the R&D focus of the embodied intelligence track has shifted from the "embodied brain" last year to the "cerebellum" of motor control now. What is the essential reason behind this shift?

Wang Xiaogang: I think the core reason is that most people's research paradigm is still "Machine-centric".

In that paradigm, machine interaction naturally reduces to motor control, that is, the cerebellum, because it is tightly coupled to the underlying hardware. But precisely because different robot bodies collect different data, it is hard to form a general, unified brain.

Second, it cannot generate complex activities. The traditional model of collecting data by operating real machines has obvious limitations: it can only produce simple action data completable in a dozen seconds, such as picking up, moving, and placing. Complex, long-horizon activities such as cleaning a room or providing services cannot be completed that way.

This also proves the necessity of our "Human-centric" approach of training the world model on data collected from the environment.

Yingke: Compared with existing world models, how does the underlying logic of Daxia Robotics' "Kairos 3.0" world model differ? How does it solve the problem of hallucinations about the physical world?

Wang Xiaogang: The world model we built differs from existing models based on synthetic data, such as Sora and the Marble model from Fei-Fei Li's World Labs team. The difference is that Kairos 3.0 adopts a three-stage architecture: multi-modal understanding and fusion, a synthesis network, and behavior prediction.

The model unifies multi - modal understanding and generation centered on the camera, supports spatial imagination, and enables flexible cross - perspective applications such as world exploration (Source: the enterprise)

Our model has three parts. The first is multi-modal understanding and fusion. Existing models mainly take images, videos, and text descriptions as input, whereas our input system is more diverse, covering multi-modal information such as images, videos, camera poses, 3D target trajectories, and tactile mechanics, enabling the model to understand the physical world better.

For example, in joint research between Daxia and Nanyang Technological University, the model can infer the camera pose from a single photo. When the camera on a mechanical arm's wrist captures an image, the model can accurately locate the arm's position and, from changes in the image, work backwards to the arm's movement trajectory, achieving a deep understanding of interaction in the physical world.
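As a toy illustration of the geometry behind "working backwards from image change to arm motion", the sketch below uses the standard pinhole-camera relation: a scene point at known depth that shifts by some number of pixels between two wrist-camera frames implies a proportional lateral translation of the camera, and hence of the arm. This is a 1-D special case chosen for clarity; the function name and numbers are illustrative, and the actual Kairos model infers full 6-DoF poses with a learned network rather than this closed-form rule.

```python
# Pinhole-camera relation: a point at depth Z (metres), viewed through a
# camera with focal length f (pixels), projects with scale f / Z. So a
# horizontal image shift of du pixels between two frames implies the
# camera translated laterally by du * Z / f metres (assuming the scene
# is static and the motion is a pure sideways translation).

def lateral_shift_m(du_px: float, depth_m: float, focal_px: float) -> float:
    """Camera translation (metres) implied by a horizontal pixel shift."""
    return du_px * depth_m / focal_px

# A 50-pixel shift of a point 0.4 m away, seen through a camera with an
# 800-pixel focal length, implies the arm moved about 0.025 m sideways.
print(lateral_shift_m(50.0, 0.4, 800.0))
```

Stringing such per-frame estimates together over a video recovers a trajectory, which is the intuition behind deducing the arm's motion from image change alone.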

The second part is the synthesis network. Building on the understanding and fusion of the first step, Kairos 3.0 can synthesize a wide range of videos, including operation tasks performed by different types of robots.

The third part is prediction. Given an instruction, the model can predict how the mechanical arm should operate next, thereby guiding the robot to act. This lets our model simulate dynamic scenarios, separate dynamic targets, and flexibly replace elements in the scene, such as changing bottles, mobile phones, desktops, and even