
Rokid is the first in the industry to promote AIUI. What impact will the next-generation AI Agent glasses have?

X研究媛 | 2026-03-26 10:26
This big move in AI glasses has piqued everyone's curiosity.

This time, Misa really piqued everyone's curiosity.

Text by | Shuyue

Edited by | JeffHill

Last time I visited Misa, the CEO of Rokid, I was shocked to learn that the Rokid team was collaborating with a mysterious, highly regarded domestic foundation-model company on a major initiative, exploring the technical and product feasibility of a heavyweight feature.

The two parties seemed to be engaged in "secret co-development", and results appear to have arrived recently: Misa just posted a teaser image:

The Always-On and Hands-Free features of glasses make them the best carrier for large models

Previously, many people recognized the potential of AR glasses, believing that they would eventually take over from smartphones and become the next-generation consumer electronics.

A key reason is that AR and AI glasses share a core trait: they are online in real time over long stretches (Always On) with extremely low presence. They are used frequently and are a daily necessity.

In fact, glasses are a terminal form factor that people wear for long stretches of daily life, with very low presence and very high comfort requirements. They sit close to the source of voice input, stay stable on the head, and don't require users to change their existing habits.

Compared with smartphones, glasses are "always online" to our senses, while a smartphone must be actively taken out, unlocked, and opened before any interaction can happen.

Glasses are part of how we "see" the world and integrate naturally with our visual and auditory senses. Whether wearing prescription glasses or sunglasses, people are used to having them always in the periphery or center of their vision. When AI capabilities are built into them, they become an "unobtrusive", always-ready intelligent presence.

Taken together, these traits arguably make glasses the best carrier for large models.

AI glasses offer a shorter interaction link. Say you want the rating and signature dishes of a restaurant ahead. With a smartphone, you need to take out the phone, unlock the screen, find and open a map or review app, type or dictate a search, and wait for results: a sequence that takes at least seven to eight seconds. With AI glasses, you only need to say, "Hey, glasses, how's the restaurant ahead?" and the glasses immediately present the information through voice or a micro-display. The Hands-Free nature of glasses means a faster, more immediate response.

The first-person perspective and multi-modal data-collection capabilities of glasses make them an extension of the human senses. A smartphone is an external tool that must be actively used, while AI glasses can capture real-world data from the three-dimensional space in front of them, in real time and unobtrusively: the scenery we see, the sounds we hear, the words we speak, the direction of our gaze, and the posture of our head. These personalized data, accumulated all day long, are invaluable for training a personal AI Agent. An AI that truly understands you not only knows your schedule and preferences but can also see what you see and hear what you hear, providing more accurate, timely, and predictive services. This capability holds great commercial potential.

AI and glasses are deeply coupled: one is a disruptive intelligence, the other the hardware form closest to the human senses. Glasses begin to "think" because of AI, transforming from a passive optical tool into an active intelligent assistant; in turn, glasses give AI a "perceiving" window for understanding three-dimensional physical space in real time, beyond virtual text and online data.

Today, the number, frequency, and duration of ordinary people's interactions with AI are crossing a critical threshold. When conversing with AI becomes as natural a basic need as "using electricity", the advantages of glasses as an AI terminal will be magnified enormously.

Old-fashioned human-machine interaction and UI have no future

AR glasses have long been regarded as the next-generation consumer electronics, but realizing that potential has been difficult. A key bottleneck is that ordinary users cannot accept their interaction methods; positioning, selecting, and operating in three-dimensional space in particular are inefficient and laborious.

AR glasses must deliver efficient, natural, and accurate human-machine interaction while remaining portable, comfortable, and low-power. Before large models emerged, and especially before the essential breakthrough in multi-modal models, this was an almost unsolvable problem.

Touch is currently the mainstream interaction. From smartphones to smartwatches, touch is everywhere. Today's AI glasses integrate a touchpad on the temple, which is easy to learn, but it completely forfeits the Hands-Free advantage: to perform a touch operation, users must raise a hand to the glasses, interrupting their natural activity. Moreover, the touch area on the temple is small, operation accuracy is limited, complex operations (such as text input and fine selection) feel poor, and accidental touches are common.

Voice control is the other main solution in current AI glasses, and it does achieve Hands-Free operation: users only need to say the wake-up word and a command to take pictures, make calls, or query information. But the core problem with voice control is privacy. In public places such as elevators, meeting rooms, and libraries, speaking commands loudly into the air is embarrassing, and users worry that their commands and the glasses' responses will be overheard. On noisy streets, in restaurants, or outdoors in strong wind, recognition accuracy drops sharply, latency rises, or the interaction fails outright, gradually building frustration.

Eye tracking was also a key exploration direction for AR glasses, because "the cursor follows where the eyes look" matches human intuition perfectly. By tracking the user's point of gaze, the system can precisely locate the object or area the user is currently interested in and drive the next interaction step from it. The problem is that accurate eye tracking requires high-frame-rate cameras, infrared light sources, and complex image-processing algorithms, which inevitably add weight, volume, and power consumption. On glasses that pursue extreme portability and long battery life, integrating a high-precision eye-tracking system is a huge engineering challenge. In addition, distinguishing "looking" from "selecting", that is, designing the mechanism for "gazing" versus "confirming", is not easy.
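To make the "gazing versus confirming" problem concrete, here is a minimal sketch of the common dwell-time approach, in which a sustained gaze doubles as confirmation. The gaze_point() callback and the target objects are hypothetical stand-ins for a real eye-tracking stack, not any specific vendor's API.

```python
import time

DWELL_THRESHOLD_S = 0.8  # a gaze must rest this long on one target to count as a "click"

def dwell_select(gaze_point, targets, poll_interval=0.05):
    """Block until the user has gazed at a single target continuously for
    DWELL_THRESHOLD_S, then return that target as the confirmed selection."""
    current, since = None, 0.0
    while True:
        x, y = gaze_point()                               # hypothetical eye-tracker callback
        hit = next((t for t in targets if t.contains(x, y)), None)
        if hit is not current:
            current, since = hit, time.monotonic()        # gaze moved: restart the dwell timer
        elif current is not None and time.monotonic() - since >= DWELL_THRESHOLD_S:
            return current                                # sustained gaze doubles as confirmation
        time.sleep(poll_interval)
```

The entire trade-off lives in DWELL_THRESHOLD_S: set it too short and every glance "clicks" (the classic Midas-touch problem); set it too long and selection feels sluggish.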

There is also gesture recognition, which lets users interact with the virtual world through gestures such as waving and pinching. In practice, however, it faces challenges similar to eye tracking: it requires high-quality sensors, usually cameras or depth sensors, plus substantial local computing power to analyze hand skeletons and movements in real time. That is very unrealistic for glasses, especially AI glasses, where space is at a premium. And without visual feedback, gesturing can feel awkward and tiring, like drawing symbols in the air, with operation efficiency far below that of physical buttons.

Electromyography (EMG) is advanced and cool, and it sidesteps many of these disadvantages. An EMG wristband uses sensors worn on the arm to capture the electrical signals generated by muscle movement and identify subtle finger motions. Meta demonstrated this with its Orion glasses: users don't need to raise their hands; slight finger movements in a pocket or under a table are enough to control the glasses, neatly solving the privacy and "social embarrassment" problems while remaining natural and "unobtrusive". EMG interaction is promising, but it still faces problems of technological maturity and cost. High-precision EMG signal acquisition and processing require high-end hardware, the algorithms still struggle with generality and individual differences, and large-scale commercialization remains expensive.

Basically, the more efficient, natural, and advanced the interaction method, the higher the hardware cost, the greater the weight and volume of components, and the higher the power consumption.

Pursuing the most natural interaction (such as eye tracking plus gestures) may make the glasses bulky and cut battery life drastically, which actually reduces users' willingness to wear them. That runs against the core values of AI glasses: portability and Always On.

How to solve it?

The likely key is the AI User Interface (AIUI), and behind it, an AI Agent driven by a multi-modal large model.

The revolutionary AI-native UI is exciting

What is AIUI? It is by no means just giving a voice assistant a new look and a new name. It is an active human-machine interaction paradigm built on large AI models and multi-modal integration: it makes AI adapt to users instead of making users adapt to machines. When AIUI can dramatically lower the difficulty of interaction at extremely low hardware cost and power consumption, AI glasses can gain broad acceptance. That would be a huge boost to the industry.

The essence of AIUI lies in "multi-modal integration" and "context awareness". The traditional interaction mode is a single path of command and execution: users issue explicit commands, the device executes them, then waits for the next one. AIUI, in principle, fuses voice, vision, and even environmental sensor data to understand the user's "intention" rather than just the "command".

For example, you are wearing AI glasses, look at an unfamiliar building, and subconsciously say, "This building is really beautiful." A traditional voice assistant might be confused or simply record the sentence. Under the AIUI framework, the glasses' AI Agent processes information from multiple dimensions simultaneously:

Visual information: Through the glasses' camera, the AI "sees" the building you are looking at and identifies its shape, style, and possible name.

Voice information: The AI hears you say, "This building is really beautiful", which is an emotional evaluation rather than a clear command.

Context information: The AI knows your current location (via GPS) and the current time, and may even know whether you have searched for building-related information before.

Based on this integrated multi - modal data, AIUI can make an intelligent and active response.
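As a rough illustration of that fusion step, the sketch below bundles the three information streams into a single model request, so the model judges intent instead of parsing a command. All names here (Observation, build_intent_request, the message format) are illustrative assumptions, not Rokid's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    frame: bytes      # first-person camera frame: what the user is looking at
    utterance: str    # transcribed speech, e.g. "This building is really beautiful"
    location: tuple   # (lat, lon) from GPS
    timestamp: str    # current local time

def build_intent_request(obs: Observation) -> list:
    """Pack all modalities into one multi-modal request. The key point: the
    utterance is framed as context to interpret, not a command to execute."""
    return [
        {"type": "image", "data": obs.frame},
        {"type": "text",  "data": f'User said (unprompted): "{obs.utterance}"'},
        {"type": "text",  "data": f"Location: {obs.location}, time: {obs.timestamp}"},
        {"type": "text",  "data": "If the remark invites a proactive response, "
                                  "identify the building and offer one brief fact; "
                                  "otherwise stay silent."},
    ]

# Example: the "beautiful building" moment described above
obs = Observation(frame=b"<jpeg bytes>",
                  utterance="This building is really beautiful",
                  location=(30.27, 120.16), timestamp="2026-03-26 10:26")
request = build_intent_request(obs)
```

Note the final instruction in the request: deciding when not to respond is what separates an active agent from a chatty assistant.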

However, implementing an AI-native UI is not easy. It requires rebuilding the underlying architecture and designing a user interface driven at its core by the AI model rather than by traditional deterministic logic. It is fundamentally different from the traditional graphical user interface (GUI) or from an "old interface with AI buttons bolted on". Its core features can be summarized as follows:

Dynamic generation instead of static preset

The controls of a traditional UI (buttons, menus, windows) are pre-drawn by designers, and the interaction paths are fixed. The interface elements of an AI-native UI are often generated in real time: they appear dynamically according to the user's intention, context, and model output (see the sketch after this list).

Conversation is the interface, blurring the boundary between "interaction" and "result"

In the traditional UI, users operate controls first and then wait for feedback. The AI-native UI makes natural-language conversation the main interaction method: users directly express their goals, and the system's language, content, and actions serve as both interaction and result.

Multi-modal integration: the interaction medium is no longer limited to mouse and keyboard

The traditional UI relies on pointer and keyboard input. The AI-native UI can process voice, images, gestures, and even eye movements simultaneously and present output in the form best suited to the current task.

Predictive and proactive

The traditional UI is passive: the user triggers, the system responds. The AI-native UI can predict the user's intention, prepare information in advance or make proactive suggestions, and even complete some decisions on the user's behalf.

Personalized adaptation instead of a one-size-fits-all approach

The traditional UI lets users manually adjust preferences through a settings panel. The AI-native UI automatically adapts and optimizes the interface and interaction methods by continuously learning the user's behavior, language habits, and workflow.

No interface (Zero UI) or invisible interface

The most advanced AI-native UI may have no traditional interface at all: the UI retreats into the background, and the AI completes the task directly. Users don't perceive an "operation interface"; they only see the task being completed automatically.

Collaborative human-machine co-creation instead of simply executing commands

In the traditional UI, the machine is a tool that executes explicit commands. The AI-native UI encourages humans and AI to explore together: the AI offers options, asks questions, and adds suggestions, forming a cyclical, iterative process of creation or decision-making.

Allow fuzzy input and error correction

The traditional UI requires precise input (form fields, drop-down options). The AI-native UI accepts fuzzy, incomplete, or even contradictory commands and clarifies them through conversation or corrects them automatically.
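To ground "dynamic generation" and "fuzzy input" from the list above, here is a minimal sketch in which the model returns a small JSON interface spec and the client renders whatever it asks for, including a clarifying question when the input was ambiguous. The spec schema and the render loop are illustrative assumptions, not an actual AIUI format.

```python
import json

def render(spec: dict):
    """Render a model-generated interface instead of a designer-fixed one."""
    for element in spec.get("elements", []):
        kind = element["type"]
        if kind == "text":
            print(f"[label] {element['value']}")
        elif kind == "button":
            print(f"[button] {element['label']} -> action: {element['action']}")
        elif kind == "clarify":
            # Fuzzy or contradictory input: the UI itself asks the follow-up.
            print(f"[question] {element['question']}")

# Example spec a model might emit for "how's the restaurant ahead?"
spec = json.loads("""
{
  "elements": [
    {"type": "text", "value": "Blue Bottle Bistro - 4.6 stars"},
    {"type": "button", "label": "Show signature dishes", "action": "expand_menu"},
    {"type": "clarify", "question": "Did you mean the one on the left corner?"}
  ]
}
""")
render(spec)
```

The point of the indirection is that no button in this flow was designed in advance: the same renderer can display whatever controls the model decides the moment calls for.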

The fundamental differences between AIUI and the traditional UI can be summarized as follows:

| Dimension | Traditional UI | AI-native UI |
| --- | --- | --- |
| Interface elements | Static, pre-drawn by designers | Generated dynamically from intention, context, and model output |
| Main interaction | Operate controls, wait for feedback | Natural-language conversation; output is both interaction and result |
| Input modalities | Pointer and keyboard | Voice, images, gestures, eye movement |
| Initiative | Passive: user triggers, system responds | Predictive and proactive |
| Adaptation | Manual settings | Continuous learning of user behavior and habits |
| Visibility | Interface always present | Zero UI: the interface retreats, tasks complete automatically |
| Machine's role | Tool executing explicit commands | Collaborative co-creator |
| Input tolerance | Precise input required | Fuzzy input accepted, clarified, or auto-corrected |

The real AI-native UI is still at an early stage of evolution, and many products merely bolt AI functions onto an old shell. But this time Rokid is actually implementing it and doing the practical work: the AIUI teaser has been released!

What follows is a very rough little demo video that Misa sent me.

The future of AI glasses will definitely include a GUI Agent

GUI Agents have come a long way: from Zhipu's first demonstration in 2024, where a large model helped us open WeChat and send red envelopes to group chats, to this year, when Lobster took over the computer and, for the first time, a single simple instruction could drive a fully automated workflow that never tires, 24 hours a day.

The large model has demonstrated a long