
With the rapid development of multimodal AI, how does Super Xiaodu reshape smart hardware?

碧根果 · 2025-11-17 21:00
Advance from an "AI assistant" to an "AI partner".

In 2025, the era of AI hardware truly began.

After the 2024 technology upgrades of GPT-4o and Gemini 1.5, multimodal large language models became capable of moving from theoretical research to practical application. AI is no longer confined to text generation or voice Q&A; it can now understand images, perceive its environment, and respond proactively. AI hardware has finally moved beyond being a "toy" for geeks and has truly entered most people's lives. As a result, this year AI hardware took the stage at an almost explosive rate.

From voice recorders, cameras, and speakers to glasses, rings, and necklaces, every item is being redefined: some people pursue the efficiency of instant recording, some explore more anthropomorphic and immersive interactions, while others value the connection between emotions and semantics. Regardless of the form, these are all attempts to bring AI closer to humans.

Behind this lies a more profound question: In what form should AI integrate into the physical world?

At this very moment, on November 13th, Xiaodu launched a brand-new multimodal AI assistant, Super Xiaodu, at the Smart Hardware Sub-forum of the World Conference. Unlike most AI hardware startups, which focus on a single scenario, Xiaodu chose a comprehensive upgrade spanning its entire product line, its full installed base, and its whole ecosystem.

New hardware products equipped with Super Xiaodu also made their debut at the forum, including the Xiaodu AI Glasses Pro, the triple-camera Xiaodu Smart Camera C1200, the video-call Xiaodu Smart Camera C800, and the Xiaodu Smart Speaker Fun.

What Xiaodu aims to capture is the intersection between AI and the real world. "Since its inception, Xiaodu has always pursued a revolution in human-machine interaction. Super Xiaodu is the new carrier of this mission," Li Ying, CEO of Xiaodu Technology, said on-site.

01

Xiaodu's Super Evolution: From Assistant to Partner

If AI is treated merely as added value for hardware, then no matter how the hardware's form changes or how rich its software features become, the result is essentially just a stack of technologies. Only when AI becomes the primary driver of hardware interaction, and even redefines it, can the "next-generation human-machine relationship" truly arrive.

While most hardware manufacturers worldwide compete on how to better integrate an AI assistant into their devices, Xiaodu chose to focus on evolving the assistant's capabilities in perception, learning, and memory, and to use that evolution to drive hardware innovation.

The release of Super Xiaodu is the ultimate manifestation of this logic.

As a multimodal AI assistant, it retains its existing voice-interaction capabilities and adds the ability to process visual information such as images and video. It can even carry out complex reasoning and planning by combining perception of the surrounding environment.

One demo at the press conference was particularly striking: "Intelligent Item Search." When you ask the camera, "Where did I put the remote control?", Super Xiaodu first scans a real-time image of the room. If the remote is not there, it automatically traces back through historical footage from the past 24 hours or longer to locate the last time and place the remote appeared, and displays the corresponding video record.
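As a rough illustration (the names and data structures here are hypothetical, not Xiaodu's actual implementation), the two-stage lookup described above, live frame first and then a bounded walk back through stored detections, might look like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Frame:
    timestamp: datetime
    location: str           # area label assigned by the vision model
    objects: list[str]      # object labels detected in this frame

def find_object(label: str, live: Frame, history: list[Frame],
                lookback: timedelta = timedelta(hours=24)) -> Optional[Frame]:
    """Return the most recent frame containing `label`, or None.

    Stage 1: check the live camera frame.
    Stage 2: walk stored frames newest-first, within the lookback window.
    """
    if label in live.objects:
        return live
    cutoff = live.timestamp - lookback
    for frame in sorted(history, key=lambda f: f.timestamp, reverse=True):
        if frame.timestamp < cutoff:
            break  # remaining history is older than the lookback window
        if label in frame.objects:
            return frame
    return None
```

If stage 2 hits, the assistant could surface the frame's timestamp and location, together with the stored clip, as the answer.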

Although it solves the daily problem of "losing the remote control," the significance behind this idea goes far beyond that.

Technically, this means AI must not only "see" and correctly identify objects but also understand their spatial and temporal relationships, building a multi-dimensional, dynamic map of the real world, which remains one of the main challenges for today's large language models.

Xiaodu summarized the upgrade of Super Xiaodu into three major evolutions:

1. From single-point response to global understanding: it is no longer limited to executing single commands; it performs deep contextual understanding and makes comprehensive judgments across time, space, people, and actions, enabling more complete, multi-dimensional perceptual decision-making.

2. From passive to active intelligence: moving beyond the old "you call, I respond; you ask, I answer" mode, it can actively understand, analyze, and even anticipate user needs and offer solutions.

3. Strengthened personalized memory: it remembers habits and preferences, perceives tone and emotion, reads expressions, anticipates what you are thinking, and understands what you need, turning the human-machine relationship from "tool" into "partner."

Li Ying also announced on-site that Super Xiaodu's rollout across the full product line, full installed base, and full ecosystem will cover not only new products such as the Xiaodu AI Glasses, Xiaodu Smart Camera, and Xiaodu Smart Speaker Fun, but also bring a free upgrade to tens of millions of devices already sold, delivering a more natural, deeper, and more considerate interaction experience and truly letting the "AI assistant" leap to "AI partner."

02

When an AI Partner Enters the Physical World

At the press conference, several new hardware products fully equipped with Super Xiaodu also became the focus of the event.

Take the Xiaodu AI Glasses as an example. They are equipped with the Qualcomm Snapdragon AR1 chip and a Sony 12-megapixel 109° ultra-wide-angle lens, support 4K photos and 1440p video, and have built-in EIS intelligent stabilization. An open-ear, anti-sound-leakage dual speaker works with a five-microphone array, an inverse sound-field directional acoustic system, and a self-developed ENC call noise-reduction algorithm to cut noise interference during calls, music listening, and voice interaction.

For battery life, the glasses run about 7.5 hours of continuous mixed use, and with the companion smart charging case total endurance reaches about 68 hours, enough for worry-free daily use.

In appearance and wearing comfort, the Xiaodu AI Glasses Pro weigh only 39 grams. Beyond the Boston and cat-eye frame styles shown on-site, Xiaodu also offers sunglass and photochromic lenses, plus adjustable soft silicone nose pads, optimizing for style, usage scenarios, and face-shape fit.

Of course, the AI glasses market is highly competitive, where both "hardware capabilities" and "software capabilities" matter.

As one of the early domestic manufacturers in this field, Xiaodu has used the multimodal upgrade to demonstrate a "1 + 1 > 2" effect from combining hardware and software in the glasses' real-world functions.

For example, when it is inconvenient to take out your phone but you need to record a parking spot or a notice from your community's property management, just tell Xiaodu, "Help me record this." The glasses automatically take a photo, analyze it, and generate a memo. Later you can ask at any time, "Where did I park my car?" or "When will the water be cut off tomorrow?", or even place a one-tap call to the property office, truly achieving "record what you see, get answers when you ask."

In office scenarios, AI's role is further amplified. On top of regular speech-to-text transcription and content summarization, the glasses' "AI Meeting Minutes" function can photograph key materials such as whiteboard notes and slides and automatically attach them at the matching positions in the minutes. It can also go further: inferring speakers' intentions, flagging potential points of contention, and generating suggestions for communication strategy, follow-up actions, and process efficiency.
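One way to picture the "attach photos at the matching positions" step (a simplified sketch under assumed data shapes, not Xiaodu's actual pipeline) is to align each photo's capture time with the transcript segment that spans it:

```python
from bisect import bisect_right

def attach_photos(segments, photos):
    """Attach each photo to the transcript segment covering its timestamp.

    segments: list of (start_s, end_s, text), sorted by start time
    photos:   list of (timestamp_s, path)
    Returns minutes entries with any matching photo paths attached.
    """
    minutes = [{"text": text, "photos": []} for _, _, text in segments]
    starts = [start for start, _, _ in segments]
    for ts, path in photos:
        i = bisect_right(starts, ts) - 1      # last segment starting <= ts
        if i >= 0 and ts <= segments[i][1]:   # and ts still inside its span
            minutes[i]["photos"].append(path)
    return minutes
```

Photos falling outside every segment's span (say, taken after the meeting ended) are simply dropped rather than misfiled.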

It is reported that this function will be officially launched in December this year.

In addition, the "Ambiance Playlist" function, created jointly with NetEase Cloud Music, gives the AI a more expressive outlet. Say, "Play a song that fits the occasion," and the glasses generate an exclusive BGM based on the scene in front of you: whether it's the light and shadow of a twilight street or the view from a mountaintop, the AI can capture the mood and match it with music.

There are countless similar scenarios. Through the glasses as a portable carrier, Super Xiaodu has integrated into every moment of our daily lives.

Li Ying noted that if AI glasses, as an extension of our senses, deliver "first-person-perspective intelligence," then smart cameras' perception and understanding of the surroundings open up a new form of "God's-eye-view intelligence."

Xiaodu launched two smart cameras this time:

One is a video-call version with a screen, the C800, aimed at families with elderly members and children, supporting convenient two-way WeChat video calls. The other, the newly released triple-camera Xiaodu Smart Camera C1200, pairs a pan-tilt long- and short-focus lens assembly with a fixed ultra-wide-angle lens: the two feeds can be linked to better track moving targets, and 10x hybrid optical zoom captures high-definition detail, making it especially suitable for pet owners.

Also, powered by Super Xiaodu's multimodal capabilities, the Xiaodu Smart Camera offers an "AI Care-at-Will" function. It recognizes specific actions by specific subjects, such as people and pets, and actively intervenes based on its semantic understanding of the scene: for example, issuing a voice reminder when a child's study posture is poor, or dispatching a robot vacuum to deter a pet that is making a mess.
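Conceptually (a hypothetical sketch; the subject and action labels are illustrative, not Xiaodu's real taxonomy), this kind of active intervention can be modeled as a mapping from recognized (subject, action) events to responses, gated by detection confidence so the camera does not nag on false positives:

```python
# Map recognized (subject, action) events to an intervention.
INTERVENTIONS = {
    ("child", "poor_posture"): "play_voice_reminder",
    ("pet", "making_mess"): "dispatch_robot_vacuum",
}

def on_event(subject: str, action: str, confidence: float,
             threshold: float = 0.8) -> str:
    """Act only on confident detections with a known rule."""
    if confidence < threshold:
        return "no_action"
    return INTERVENTIONS.get((subject, action), "no_action")
```

Unknown events and low-confidence detections both fall through to `"no_action"`, which keeps the camera a quiet observer by default.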

It is evident that the current Chatbot-style question-and-answer mode can hardly meet people's imagination of higher-level intelligent applications.

Letting intangible intelligence enter real life, understanding what we are experiencing at the moment, and actively providing help and companionship may be the more anticipated form of AI.

03

Multimodality Is Not the End

From Siri a decade ago to Xiaodu in the smart-speaker era, people have been trying to open the door to intelligence through dialogue: voice interaction is almost a standard feature of all smart hardware, bringing convenience but always struggling to become a necessity.

In the past two years, with the rapid development of multimodal technology, the focus of the large-model race has shifted quickly: OpenAI's GPT-4o was the first to achieve real-time multimodal understanding and generation of text, images, audio, and video in a single model; Google's Project Astra agent, built on Gemini, can observe and understand its surroundings through cameras and microphones and has long-context memory; Meta is also adding more multimodal AI features, including visual Q&A, to the smart glasses it builds with Ray-Ban.

Within this industry narrative, Xiaodu's "super" evolution takes a longer path that adds more value for users over time: from voice and vision to emotion, from understanding commands to understanding people, truly redefining the "AI assistant."

As Li Ying said on-site, "AI is the core that gives smart hardware a soul and opens new space for imagination." From smart speakers, smart displays, companion devices, fitness mirrors, and learning tablets to today's AI glasses and smart cameras, every product evolution at Xiaodu points clearly to the same goal.

If a device is just "placed there" but cannot be truly used, then the value of AI cannot be realized. On the contrary, if AI can interact with and accompany the user through hardware, that is the starting point for the symbiosis of humans and technology.

Market trends support this idea. Global Market Insights puts the global AI hardware market at approximately $5.9 billion in 2024, expects it to grow to $66.8 billion in 2025, and projects about $296.3 billion by 2034, a compound annual growth rate of roughly 18%. Coherent Market Insights estimates the "On-Device AI" market (AI running on wearable terminal devices) at $26.61 billion in 2025, expanding to $124.07 billion by 2032, a CAGR of about 24.6%.
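The growth rates quoted above can be sanity-checked with the standard CAGR formula, (end / start)^(1/years) − 1:

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

# Global Market Insights: $66.8B (2025) -> $296.3B (2034), 9 years
print(f"{cagr(66.8, 296.3, 9):.1%}")     # ~18.0%
# Coherent Market Insights: $26.61B (2025) -> $124.07B (2032), 7 years
print(f"{cagr(26.61, 124.07, 7):.1%}")   # ~24.6%
```

Both reported CAGRs are consistent with the projected end-year values in the respective reports.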

Facing the industry's rapid growth, Xiaodu has used the release of the new multimodal AI assistant and the inclusive upgrade of old and new devices to further sharpen its strategic position of "AI at the center, hardware as the carrier."

According to official data, Xiaodu's self-branded products have reached 54 million households and counting. At the same time, Super Xiaodu will be opened up as an intelligence engine, helping industry partners in sectors such as hotels and elderly care upgrade their capabilities and serving as an AI foundation that manufacturers of all kinds can build on. "We hope everyone can work together to create a smarter, more convenient, and friendlier experience for users," Li Ying said.

Looking back from 2025, the evolution from the familiar voice assistant to today's multimodal AI assistant is not just a technology upgrade; Super Xiaodu is reshaping the connection between humans, machines, and the world.

When the barriers between language, images, and sound are finally broken, and machines transform from passive tools into digital partners that can listen, see, speak, and think, the revolution in the future form of human-machine interaction will have only just begun.