
Lip movements can be recognized: a new Apple patent could save headset devices.

三易生活 · 2025-12-01 10:12
Once users can use this kind of product without hesitation, it has a chance to become a mass-market product.

To save the Vision Pro, Apple is making another move. A newly surfaced Apple patent suggests that future Apple headset devices will support lip reading: users could issue commands by letting the device recognize their lip movements, without having to speak out loud.

In the patent document titled "Electronic device with speech input structure", Apple describes how the device performs lip reading via a built-in visual sensor, enabling speech input even when the user cannot speak aloud. If this patent is actually implemented, it could have an enormous positive impact on all current headset devices.

After Xiaomi and Alibaba entered the market one after another, a new wave of AI glasses has arrived. Yet despite the seemingly bright outlook, these products face a serious crisis. According to statistics from VR Vision, the average return rate of AI glasses on the TikTok e-commerce platform is 40% to 50%. The reasons netizens dismiss these products as "dust collectors" include not only the added weight of the extra components, but also voice interaction itself, which is another central problem.

For smart speakers, which are used mainly in the home, voice interaction is an optimal solution. AI glasses and XR headsets, by contrast, are typically used in public spaces, which is a fundamentally different context.

In fact, thanks to the rapid progress of large AI models, these smart devices are now quite capable at semantic recognition: they can understand the intent behind user instructions, and a noise-suppression module can filter the desired speech stream out of assorted ambient noise so that the user's commands are recognized accurately.
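The noise-suppression step described above can be illustrated with a minimal spectral-subtraction sketch. This is a generic, toy technique (the article does not name the actual algorithm any vendor uses): estimate the noise floor from a noise-only sample, subtract it from the signal's magnitude spectrum, and resynthesize. All names and parameters here are illustrative.

```python
import numpy as np

def suppress_noise(signal, noise_sample, fft_size=256):
    """Toy spectral subtraction: estimate the noise floor from a
    noise-only recording and subtract it, bin by bin, from the
    magnitude spectrum of the noisy signal."""
    noise_floor = np.abs(np.fft.rfft(noise_sample, n=fft_size))
    spectrum = np.fft.rfft(signal, n=fft_size)
    # Clamp at zero so subtraction never produces negative magnitudes.
    magnitude = np.maximum(np.abs(spectrum) - noise_floor, 0.0)
    # Keep the original phase; only the magnitude is denoised.
    cleaned = magnitude * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(cleaned, n=fft_size)

# Demo: a pure tone buried in random noise.
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(256)
tone = np.sin(2 * np.pi * 10 * np.arange(256) / 256)
denoised = suppress_noise(tone + noise, noise)
```

Production systems use far more sophisticated methods (multi-microphone beamforming, learned masks), but the basic idea of separating the wanted speech stream from ambient noise is the same.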

So voice capture and semantic recognition are not the problem. The real problem is that not everyone can overcome the embarrassment of speaking loudly in public. The acoustic environment in public spaces is also more complex: even with advanced noise suppression, users often have to raise their voices to operate the device, which runs against social norms. On top of that, many users simply do not want to compromise their privacy by speaking their requests out loud.

Ten years ago, explaining why Tencent had not built a voice assistant, Ma Huateng said: "We didn't regard it as a priority. This function seems convenient, but it may not be. If someone says into their phone in public what they want to do, it looks stupid. I wouldn't dare to speak like that in a crowd, and it's not private either. I'd rather accept a few extra button presses."

Camera-equipped AI glasses already raise serious privacy concerns, since not everyone is willing to be constantly on camera. On top of that, AI glasses usually rely on voice interaction, which places a heavy psychological burden on users in public spaces. And while audio can carry more information than images and text, that is not always an advantage: users need more time to analyze and filter it, and hearing is less efficient than sight when associating context, so voice interaction costs users more mental energy.

So the question is: are the makers of AI glasses and XR headsets unaware of these drawbacks? The answer is that voice interaction is currently the most cost-effective option. Its core weakness is that users must speak loudly: pushing ASR (Automatic Speech Recognition), NLP (Natural Language Processing), and far-field voice capture to the point of reliably picking up soft-spoken users drives costs up drastically, which shrinks the potential customer base. Cut the costs, and the social pressure of speaking loudly remains.

Before voice interaction, the main input method of smart glasses was touch control on the temples. But this is not ergonomic: the hand must be raised to the level of the glasses, and frequent raising quickly causes fatigue. TWS earphones can get away with touch control because users rarely need to adjust the volume or toggle noise cancellation; for such infrequent actions, touch is fine.

Since touch control is unsuitable for headset devices and voice interaction has these drawbacks, Apple's new silent input mode, lip reading, could be an optimal solution. Users could issue commands by letting the device recognize their lip movements without speaking out loud, which would resolve a whole series of problems with using these devices in public. Nor is lip reading exotic technology anymore: it can be realized with a mature AI visual model.

By training an AI model on the lip movements of users speaking different languages, a device equipped with that model can work out what the user is saying. If the recognition-rate problem is solved, the biggest obstacle to using AI glasses and XR headsets in public spaces could be removed.
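Conceptually, the recognition step maps observed lip-shape features to learned templates ("visemes"). The toy sketch below is purely illustrative and is not Apple's patented method: the template vectors, feature dimensions, and viseme names are all made-up stand-ins for what a trained visual model would learn from real lip-landmark data.

```python
import numpy as np

# Hypothetical viseme templates: per-mouth-shape mean feature vectors
# that a real system would learn from training data (toy 3-D features).
VISEME_TEMPLATES = {
    "closed":  np.array([0.0, 0.1, 0.0]),
    "open":    np.array([1.0, 0.8, 0.2]),
    "rounded": np.array([0.4, 0.3, 0.9]),
}

def classify_frames(frames):
    """Assign each lip-feature frame to its nearest viseme template,
    a nearest-centroid stand-in for the learned visual model."""
    return [
        min(VISEME_TEMPLATES,
            key=lambda v: np.linalg.norm(VISEME_TEMPLATES[v] - frame))
        for frame in frames
    ]
```

A real lip-reading system would feed whole frame sequences into a sequence model rather than classify frames independently, but the principle of matching observed mouth shapes against learned representations is the same.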

If users can use AI glasses and XR headsets without hesitation, there is a possibility that these products will change from a niche product line to popular consumer electronics.

This article is from the official WeChat account "Three-Easy Life" (ID: IT-3eLife). Author: Three-Easy Jun. 36Kr has obtained permission for publication.