An 11-year-old AI company joins the battle for embodied intelligence.
This year has been called the first year of embodied intelligence, and the field has become the hottest battleground for putting AI into practice.
Recently, Yufan Intelligence, a well-known visual AI company with 11 years of history, released two embodied intelligence products and announced full-stack self-development of "intelligence + hardware", fully embracing the era of embodied intelligence.
It seems like a significant leap, but within the industry, Yufan's move into embodied intelligence is quite logical.
On the one hand, vision has become the core gateway through which machines understand the physical world and the foundation of multimodal intelligence, and teams with a computer vision background have become a backbone of the embodied intelligence field. Entering embodied intelligence is therefore a natural next step in the evolution of the company's capabilities.
In addition, on the "intelligence + hardware" path, Yufan has long-standing R&D experience in integrating software and hardware. In the visual AI era, the computing performance of most device terminals could not support running AI algorithms directly. Yufan was the first in the industry to restructure its algorithms around the capabilities of edge-side chips, cutting the algorithms' hardware footprint and achieving end-to-end performance optimization.
This experience in software-hardware co-development, from low-level hardware adaptation to upper-layer AI algorithm optimization, paid off for Yufan in the visual AI era, where it quickly achieved commercial deployment and large-scale delivery. In the era of embodied intelligence, putting intelligent robots into the field likewise tests software-hardware collaboration, and Yufan's past experience clearly helps here.
"We have figured out how to do embodied intelligence and are determined to use the accumulation of the past decade to quickly become a leading player in the field of embodied intelligent robots in this wave of AI. We not only want robots to see, understand, communicate, and act but also to truly learn to think and make decisions independently," said Zhao Hongyi, the chairman of Yufan Intelligence.
01 Why fully embrace embodied intelligence?
A new player has joined the embodied intelligence track.
A few days ago, Yufan Intelligence, a well-known visual AI company, held its 11th-anniversary celebration and partner conference. Besides releasing a new generation of visual AI hardware and Agent products, Yufan officially launched two embodied intelligence products, the spatial cognition large model Manas and a quadruped robot dog, announcing that this 11-year-old AI company has formally entered the era of embodied intelligence.
The spatial cognition large model Manas was first unveiled on Yufan Intelligence's official WeChat account in July this year. It is a multimodal large language model (MLLM). According to Yufan, Manas achieved state-of-the-art results among models of comparable scale on the widely used spatial-understanding benchmarks VSI-Bench and SQA3D.
With this official release, Manas's role in Yufan's embodied intelligence strategy has become clearer: it will serve as the brain of Yufan Intelligence's embodied hardware, acting as a spatial-cognition base that lets intelligent hardware perceive the real physical world and make decisions on its own.
The newly released quadruped robot dog is the first embodied intelligent robot from Yufan Intelligence. Its mechanical structure, motors, and motion-control platform and capabilities are reportedly all developed in-house by the Yufan team.
The release of these two products also reveals Yufan Intelligence's strategy for the era of embodied intelligence: continue the "intelligence + hardware" gene, pursue full-stack self-development of the brain, cerebellum, and body, and fully embrace Physical AI.
Yufan's decision to enter the embodied intelligence track at this point does not come as a surprise to the industry.
In fact, as large language model technology has advanced, the intelligence of all kinds of hardware has been upgraded. Leading machine vision players such as Hikvision are embedding multimodal models into their devices to raise the intelligence of the hardware.
In robotics, the deep integration of robots with large model technology and the growth of multimodal capabilities, especially vision, have brought stronger generalization, and the robot "brain" is evolving accordingly. Robots that could once only complete single-body, single-scenario tasks are now expected to become "generalists" with much broader generalization ability.
Many companies from the visual AI field are entering the embodied intelligence track. At the end of last month, for example, SenseTime released an embodied intelligence brain at WAIC to stake out its position on this track.
At the same time, researchers and practitioners from the vision field have become an important force in embodied intelligence. Professor Sun Fuchun of Tsinghua University noted in his speech at the 2025 Beijing Zhiyuan Conference in June that embodied intelligence has always been driven by two groups: the computer vision community, centered on vision, with Li Fei-fei as a typical representative, and practitioners from the traditional robotics field.
In his speech, Zhao Hongyi elaborated on the strategic thinking behind the release, stressing that multimodal capability, and vision in particular, is crucial to the development of embodied intelligence.
Zhao Hongyi pointed out that Yufan Intelligence's entry into embodied intelligence is not only the strategic choice of an AI company with 11 years of technological accumulation to follow the broader trend on the eve of an industrial shift, but also a return to the founding team's original ambition of building robots, finally realized once the technological conditions inside and outside the company matured.
He revealed a detail of Yufan's entrepreneurial journey that the outside world had rarely noticed: in 2014, Yufan raised its first angel round with a home robot demo. "Our original entrepreneurial dream was to build intelligent robots."
At that time, robotics spanned three major technological peaks: image recognition (perception), voice interaction (understanding and dialogue), and motion control (action). Constrained by the technology of the day and the size of the team, Yufan ultimately chose the image recognition track it knew best to complete the cycle of commercialization, but the team never gave up its original dream of intelligent robots.
With this wave of large models, artificial intelligence is evolving from AI 1.0 to AI 2.0, and embodied intelligence has become one of the main battlefields for putting AI to work. Robots are evolving from being "able to see, hear, speak, and move" to genuinely making decisions on their own, and vision is becoming the key support for their cognitive and decision-making abilities.
"Among all perception methods, visual information has the highest density and the strongest universality. It is the core entrance for machines to understand the physical world and is also the foundation of multimodal intelligence. In the scenario of embodied intelligence, vision not only determines what the machine sees but also determines what the machine does next."
In Zhao Hongyi's view, this release is better understood as a strategic evolution for Yufan. Vision was the clearest path to deployment in the AI 1.0 era, and it is now expected to become the gateway to more intelligent robots. Combined with the founding team's long-held robot dream, it was inevitable that they would take this step once the technological reserves matured.
02 What has Yufan done to embrace Physical AI?
Beyond its vision heritage, Yufan's simultaneous launch of two embodied intelligence products also shows the company's technological reserves in multimodal models and intelligent hardware.
Take multimodal capability as an example: over the past year Yufan has done substantial thinking and engineering on how to give intelligent agents spatial understanding.
The industry is still exploring how to give robots a more intelligent brain, and the technological routes have not yet converged. Some industry insiders see multiple routes in play, such as the end-to-end VLA (Vision-Language-Action) model, the brain-cerebellum architecture, and the world model.
Although the routes differ, there is a consensus that robots need multimodal reasoning, which is regarded as the key to letting AI perceive, understand, and decide comprehensively, the way humans do. The multimodal vision-language model is considered the core foundation for multimodal reasoning because it can map pixels, 3D structures, and text into the same high-dimensional vector space, achieving cross-modal alignment.
Here, natural language serves as the explicit intermediate layer of the reasoning process: it can be read by humans and consumed by downstream policy networks. The vision-language model thus plays the role of the control center connecting perception, decision-making, and human instructions in embodied intelligence.
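To make "mapping into the same vector space" concrete, the toy sketch below projects image and text features into one shared embedding space and scores them by cosine similarity, in the spirit of CLIP-style contrastive alignment. It is an illustrative assumption, not Yufan's Manas architecture; the dimensions and module names are invented for the example.

```python
# Minimal sketch of CLIP-style cross-modal alignment (illustrative only,
# not Yufan's Manas architecture). Image and text features are projected
# into one shared embedding space so that matching pairs score highest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAligner(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)  # vision features -> shared space
        self.txt_proj = nn.Linear(txt_dim, shared_dim)  # language features -> shared space

    def forward(self, img_feats, txt_feats):
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        # Cosine-similarity matrix: entry (i, j) scores image i against text j.
        return z_img @ z_txt.T

aligner = ToyAligner()
sim = aligner(torch.randn(4, 512), torch.randn(4, 768))
print(sim.shape)  # torch.Size([4, 4]); training would make the diagonal dominate
```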
However, not all multimodal models are fit to be the brain. One industry insider noted that using GPT-4o as a robot's brain is not ideal, because it lacks long-horizon planning and spatial understanding. Many multimodal language models on the market share this problem: they perform well on perception tasks such as image recognition and language understanding, but they still fall clearly short in spatial perception, for example in fine-grained, local, and geometric detail, where they are less accurate than traditional pure-vision models.
In embodied scenarios, a robot needs to grasp objects precisely. The model must not only "understand" the semantic content of an image but also perceive three-dimensional space accurately: geometric information such as an object's actual size, relative orientation, and spatial layout underpins downstream tasks like path planning, object manipulation, and environment understanding.
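As a minimal illustration of why metric 3D information matters, the sketch below back-projects a pixel plus a depth reading into a 3D point using a standard pinhole camera model. The intrinsics are made-up example values, not parameters of any Yufan product.

```python
# Back-project a pixel (u, v) with a depth reading into a 3-D point in the
# camera frame. Intrinsics (fx, fy, cx, cy) are illustrative placeholders.
def pixel_to_3d(u, v, depth_m, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """Return the (x, y, z) point in metres, camera coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# A point roughly 0.8 m in front of the camera, slightly right and below centre.
print(pixel_to_3d(400, 260, 0.8))
```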
Wang Tao, CTO of Yufan Intelligence, explained that this means the robot's "brain" must deeply integrate a language model with spatial perception to operate and interact robustly in the real world. Only when semantic understanding and spatial reasoning are both in place can embodied intelligence truly move toward large-scale application.
Manas, unveiled in July this year, is a multimodal large language model (MLLM) strengthened for embodied intelligence scenarios. It is built on an open-source large language model and was given targeted training and reinforcement specifically at the level of spatial understanding. It embodies much of the Yufan team's work on spatial cognition and multimodal technology for embodied intelligence.
First, at the end of last year, Yufan developed its own multimodal reasoning architecture, UUMM. It borrows from the architecture of large language models and adapts it to embodied scenarios: the system receives human language and visual input and outputs action instructions, forming a closed loop for rapid iterative optimization.
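To make the "language + vision in, actions out" closed loop concrete, here is a hypothetical sketch of one perceive-reason-act cycle. The function names, message formats, and stub behaviour are assumptions for illustration only, not Yufan's UUMM interface.

```python
# Hypothetical sketch of a perception -> reasoning -> action closed loop, in the
# spirit of the UUMM description above. All names and formats are illustrative.
from dataclasses import dataclass

@dataclass
class Action:
    command: str   # e.g. "move_to", "grasp"
    target: str    # object or location the command refers to

def perceive(camera_frame: bytes) -> str:
    """Stand-in for the vision encoder: turn raw pixels into a scene description."""
    return "a red cup on the table, 0.8 m ahead"

def reason(instruction: str, scene: str) -> Action:
    """Stand-in for the multimodal model: fuse language + vision into an action."""
    return Action(command="grasp", target="red cup")

def act(action: Action) -> str:
    """Stand-in for the motion-control layer: execute and report feedback."""
    return f"executed {action.command} on {action.target}"

# One iteration of the loop; in a real system the feedback would be fed back
# into perception and reasoning for the next cycle.
scene = perceive(b"\x00" * 10)
action = reason("bring me the red cup", scene)
print(act(action))
```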
On this basis, in March this year the Yufan team released HiMTok, which serves Yufan's VLA work. Through an innovative method it builds image segmentation natively into the large model: while keeping the model structure and parameter count essentially unchanged, it organically integrates tasks such as image understanding, image segmentation, and object detection. This work is a step toward upgrading the large model from text-only output to multimodal outputs such as images and robot actions (Robot Action).
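The general idea of letting a language model emit non-text outputs through its token stream can be illustrated with a toy example: a segmentation mask downsampled to a coarse grid and written out as discrete tokens. This is emphatically not the actual HiMTok method, only a simplified picture of masks sharing the model's output vocabulary.

```python
# Illustrative-only sketch: represent a binary segmentation mask as discrete
# tokens that could live in a language model's output vocabulary. This is NOT
# the actual HiMTok technique, just a toy picture of multimodal token output.
import numpy as np

GRID = 8  # coarse grid resolution used for illustration

def mask_to_tokens(mask: np.ndarray) -> list[str]:
    """Downsample a binary mask to GRID x GRID and emit one token per occupied cell."""
    h, w = mask.shape
    tokens = []
    for i in range(GRID):
        for j in range(GRID):
            cell = mask[i * h // GRID:(i + 1) * h // GRID,
                        j * w // GRID:(j + 1) * w // GRID]
            if cell.mean() > 0.5:
                tokens.append(f"<mask_{i}_{j}>")
    return tokens

mask = np.zeros((64, 64))
mask[16:40, 8:32] = 1  # a toy object region
print(mask_to_tokens(mask)[:5])  # e.g. ['<mask_2_1>', '<mask_2_2>', ...]
```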
After that, they also improved the model's multimodal output ability based on reinforcement learning technology.
This line of work enables Yufan's MLLM Manas to perform strongly on spatial-understanding benchmarks covering object counting, absolute and relative distance, physical size, path planning, and egocentric spatial relations. The release of Manas means Yufan's capability reserve for the embodied intelligence brain has matured.
The other released product, the self-developed quadruped robot dog, shows that Yufan also has the robot body and cerebellum capabilities in hand. "With the robot parts supply chain now mature, we developed core components such as the motors and the control platform ourselves. After multiple iterations and plenty of trial and error, we have reached the third-generation product."
Yufan's R&D team revealed that they will accelerate the integration of the robot's brain and cerebellum going forward.
03 Continuing the "intelligence + hardware" gene and taking the path of full-stack self-development
Full-stack self-development of the robot's brain, cerebellum, and body is a significant challenge for any newcomer. Why did Yufan choose this path?
In the observation of Shuzhi Frontline, this reflects the current state of the embodied intelligence industry, while Yufan Intelligence's corporate genes and development history have also reinforced the team's commitment to the "intelligence + hardware" route.
From an industry perspective, the technological routes around embodied intelligence have not yet converged and hardware standards have not been unified, so a manufacturer with strong algorithm capabilities can hardly focus on the robot brain alone without considering the hardware body.
One industry insider previously noted that with so many embodied intelligence manufacturers, the degrees of freedom and sensor counts of their bodies all differ, so the data is not interchangeable at all. That makes it hard for algorithms trained on one body's data to transfer to another, which in turn means that manufacturers developing algorithms today must fully consider how they will work with the embodied hardware.
The Yufan team told Shuzhi Frontline that they are taking the full-stack self-development route to better guarantee the quality, quality control, and effectiveness of their embodied intelligence products. "The brain and cerebellum need to be integrated, and this dual system also has to work with the body. If we bought components from external teams, it would be hard to achieve the best results at this stage."
On the other hand, the industrial chain has matured considerably compared with a few years ago. Thanks to China's strong manufacturing base, the supply chain for robot hardware parts has become very mature: beyond the self-developed core motor and control components, other parts can be sourced from the supply chain, which also lays a foundation for companies like Yufan to pursue full-stack self-development.
At the same time, Yufan's heritage has made the team firmly committed to the "intelligence + hardware" route in the era of embodied intelligence.
"'Intelligence + hardware' is our established path. In the AI 1.0 era, based on the 'intelligence + hardware' route, we have successfully embedded visual AI technology deeply into specific scenarios such as security, construction sites, communities, and hotels, achieving rapid commercialization and large - scale delivery of technology," said Zhao Hongyi.
Behind this lies Yufan's accumulated capability in software-hardware collaboration. Zhao Hongyi recalled that early camera hardware could not support good algorithm performance because edge-side computing power was insufficient; at the time, many face-recognition vendors would attach an accelerator stick to the device to make deployment work.
Yufan instead solved the problem through software-hardware co-design and algorithmic innovation. Working within the hardware's performance limits, they used an approach akin to the "replace floating point with integer compression and push toward the hardware limit layer by layer" method from the quantitative trading field: they rewrote the model's algorithms from floating-point to integer computation and made
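The float-to-integer idea described here is, in textbook form, post-training quantization. The sketch below shows a generic symmetric int8 quantization of a weight tensor; it is a standard technique for illustration, not Yufan's proprietary edge-side optimization.

```python
# Generic post-training int8 quantization (illustration of the float-to-integer
# rewrite described above, not Yufan's proprietary optimization).
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 plus a scale factor (symmetric quantization)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
# Integer weights let edge chips use cheap int8 multiply-accumulate units
# instead of floating-point hardware, cutting compute and memory cost.
```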