
It can read lips: Apple's new patent may save head-mounted devices.

3eLife (三易生活) · 2025-12-01 10:12
Once users can use such products without reservation, they will have a real chance of catching on with the general public.

To save the Vision Pro, Apple has come up with a new move. A source recently revealed a newly granted patent indicating that Apple's future head-mounted devices will support lip-reading: the device can receive commands by recognizing lip movements, without the user making a sound.

In a patent document titled "Electronic Device with Voice Input Structure", Apple describes how built-in visual sensors can read lip movements to provide voice input when the wearer cannot speak aloud. If this patent is implemented, it could have a major positive impact on head-mounted devices as a category.

After Xiaomi and Alibaba entered the market one after another, a wave of AI-glasses enthusiasm has arrived. Yet despite the boom, such products face a considerable crisis. According to statistics from VR Vision, on Douyin e-commerce alone the average return rate of AI glasses reaches 40%–50%. Among the many reasons netizens list them as "gadgets that gather dust", beyond the added wearing burden of extra components, voice interaction is the other frequently cited pain point.

For devices like smart speakers, whose usage is concentrated in the home, voice interaction is a perfect match. AI glasses and XR headsets are quite different: a significant share of their use happens in public places.

It is true that, with the leap in AI large-model technology, such devices now have good semantic recognition: they can understand the intent behind a user's statement and, with the help of noise-reduction engines, isolate the target voice stream from ambient sound to accurately identify the wearer's commands.

Although sound pickup and semantic understanding are solved problems, the real pain point is that not everyone can overcome the embarrassment of talking to the air in public. Moreover, because the acoustic environment in public places is more complex, even with advanced noise reduction users often need to speak louder to control the device, and raising one's voice in public is obviously disruptive. Voice interaction also requires users to state their needs at an audible volume, and many people do not want that much of their privacy exposed.

Ten years ago, when talking about why Tencent didn't develop a voice assistant, Ma Huateng said, "We didn't focus on it. This function seems convenient, but actually it may not be. For example, it's so silly for a person to say to their phone what they want to do. I'd be too embarrassed to do so in a crowded place, and it's not private either. I'd rather press the buttons a few more times."

Note that AI glasses with cameras already face major privacy disputes, since not everyone accepts living under a lens. Because AI glasses generally rely on voice as the core interaction, using them in public imposes a heavy psychological burden on users. In addition, audio carries far more information than pictures or text, and more information is not always better: users must spend more time parsing and filtering it, and compared with vision, hearing is less friendly to the brain for contextual association, so voice interaction demands more of the user's energy.

So the question is: don't the makers of AI glasses and XR headsets know the defects of voice interaction? They do; voice interaction is simply the most cost-effective option available today. Its major pain point remains that the user must make a sound. Strengthening ASR (Automatic Speech Recognition), NLP (Natural Language Processing), and far-field pickup enough to capture the voiceprint of a user speaking under their breath would drive costs up sharply, which in turn shrinks the potential audience; but if cost is kept down, the social pressure of speaking aloud still objectively exists.

Before voice interaction, the main interaction method for smart glasses was touch on the temples. But temple touch is not ergonomic: users must raise a hand to eye level, and frequent hand-raising is tiring. TWS earphones get away with touch interaction because users rarely adjust volume or toggle noise cancellation, so touch is feasible for such low-frequency operations.

With touch interaction unsuited to head-mounted devices and voice interaction flawed, Apple's silent lip-reading input mode could be a way out. Receiving commands by reading lip movements, without any sound, resolves a whole series of pain points of interacting with devices in public. Nor is lip-reading exotic technology nowadays; it can be achieved by pairing the device with a mature AI vision model.

By feeding the AI model lip movements of users speaking different languages and pre-training it sufficiently, a device equipped with the model can understand what the user is saying. Once recognition accuracy is solved, the biggest obstacle to the wide use of AI glasses and XR headsets in public may be removed.
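To make the idea concrete, here is a minimal toy sketch of silent-command recognition. It is not Apple's patented method: real visual speech recognition uses trained deep models over video, whereas this sketch assumes a hypothetical per-frame "mouth-opening" feature and classifies an observed sequence by nearest-template matching against hand-made command templates.

```python
# Toy sketch of silent lip-command classification.
# Assumptions (illustrative, not from the patent): each video frame is
# reduced to a single mouth-opening ratio, and each supported command
# has a fixed-length template of such ratios.

from math import sqrt

# Hypothetical per-frame mouth-opening templates for three commands.
COMMAND_TEMPLATES = {
    "open_camera": [0.1, 0.6, 0.8, 0.3, 0.1],
    "take_photo":  [0.2, 0.2, 0.7, 0.7, 0.2],
    "go_back":     [0.5, 0.1, 0.5, 0.1, 0.5],
}

def distance(a, b):
    """Euclidean distance between two equal-length feature sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_lip_sequence(frames):
    """Return the command whose template is closest to the observed frames."""
    return min(COMMAND_TEMPLATES, key=lambda c: distance(frames, COMMAND_TEMPLATES[c]))

# A noisy observation resembling the "take_photo" template.
observed = [0.25, 0.15, 0.65, 0.75, 0.25]
print(classify_lip_sequence(observed))  # → take_photo
```

A production system would replace the hand-made templates and nearest-neighbor matching with a model pre-trained on large corpora of lip-movement video, as the paragraph above describes.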

Once users can use AI glasses and XR headsets without reservation, such products will have the chance to go from niche early-adopter toys to mainstream consumer electronics.

This article is from the WeChat official account "3eLife" (ID: IT-3eLife), written by the 3eLife editorial team. It is published by 36Kr with authorization.