When AI Meets Voice: Reshaping the Future Interactive Experience
Justin Uberti, one of the founders of WebRTC and the co-founder and CTO of Fixie.ai, recently announced that he is joining OpenAI to lead the development of real-time AI projects. He believes that voice interaction is the future of AI and that we are returning to a conversational society.
In hindsight, Justin Uberti's decision to join OpenAI seems logical. As early as May this year, OpenAI released GPT-4o, the first end-to-end voice-in, voice-out large model, and the scenes depicted in the movie "Her" ten years ago began to become reality. Low-latency, highly intelligent AI that replies instantly, never drops the connection, and offers emotional companionship around the clock is taking AI beyond the role of a simple productivity tool and integrating it deeply into everyday life.
Over the past two years, AI has progressed from "being able to speak" to "being able to converse", and the discussion is no longer limited to "AI assistants". Topics such as "AI boyfriend/girlfriend" now come up frequently, and AI companionship has become one of the main trends in social applications. According to a report published by A16Z in August, AI companion products make up 16% of the Top 100 apps and account for 6 of the top 20.
As the market space and growth potential of AI companion applications gain wide recognition, one point stands out: voice interaction has become the most critical entry point.
This holds for native AI applications such as Xingye, Character.AI, Zhumeng Island, and Poly.AI, and equally for leading domestic pan-entertainment apps such as TT Voice and Soul, which have launched AI avatars, AI pets, and AI companion features. Although each application has its own design and features, the core is the same: conversation is the main mode of interaction, and it is used to give users an emotional experience.
Behind these phenomena lies a shift: as the AI companion market keeps growing, users will demand a better voice interaction experience. In an era of AI-driven change, how can products keep pace with users' changing needs and improve the experience?
Recently, "Zego Technology" released its self-developed audio engine, the Purio AI Audio Engine, as a new answer to this question. It aims to deliver a pure, high-fidelity, and comfortable listening experience through three core technologies: AI noise reduction, AI echo cancellation, and volume equalization. It gives users of social applications better sound quality and can also be combined with the latest AI companion solution to make AI companionship more realistic.
"Zego Technology" released its self-developed audio engine - Purio AI Audio Engine
When Voice Becomes the Key Entry for Interaction
As the most natural and convenient way for humans to communicate, voice is the key entry point for human-computer interaction in the intelligent era.
On the one hand, RTC technology provides the low latency and fast response that bring interaction between humans and AI closer to a real conversation. On the other hand, speech recognition makes it possible for machines to recognize human emotion and tone and, ultimately, to give more accurate and intelligent responses.
The product roadmaps of the major AI vendors also show that voice is indispensable. Since the release of GPT-4o, end-to-end real-time multimodality has become a direction followed by vendors in China and abroad. Abroad, Character.AI launched a calling feature, and Microsoft AI stated that it will have a real-time voice interface by the end of the year. In China, Doubao announced in August that its large model supports real-time voice calls, and Kimi released a voice call feature in October.
It is reasonable to expect that voice will become the ultimate form of interaction with conversational multimodal large models.
Voice interaction is not new to users, however. It has spread rapidly through smart homes, mobile phones, vehicles, smart wearables, robots, and other fields that are already part of daily life. Most of the time, a spoken command is enough for the machine to complete a series of tasks.
Moreover, because voice interaction helps strangers build social relationships and makes communication between acquaintances more efficient, it is widely used in social, office, and other scenarios and has become a basic capability of applications. New interactive scenarios keep emerging on top of voice calls, including playing games together, chat parties, online karaoke, live streaming, and education, and these rich, personalized real-time experiences have entered users' daily lives.
This convenience also means that voice interaction happens anytime and anywhere: joining an online meeting while commuting, chatting remotely with a partner over a meal, or live streaming outdoors. The habit of interacting anywhere brings a more complex call environment than before, and sound quality problems occur more often.
For example, in human-computer interaction, a noisy environment significantly reduces recognition accuracy. In a multi-person conference, one user with obvious noise on the microphone spoils the atmosphere of the whole room; the negative feedback discourages that user from speaking, and the noise can even drive listeners away. In KTV applications used in complex environments, the voice can sound "dull and muddy" or suffer from "echo leakage" and "swallowed syllables", resulting in a poor user experience.
How can users get a smooth interaction experience in a complex environment? Breakthroughs in core voice interaction technologies are the key, and the focus is on removing noise more cleanly without distorting the voice, so that users can hear each other clearly.
Against this background, "Zego Technology" released the Purio AI Audio Engine. It upgrades the AI noise reduction algorithm and introduces a new AI echo cancellation algorithm and a dynamic loudness equalization algorithm to give users a pure, high-fidelity, and comfortable listening experience.
Bringing High-Quality Experience to Users through Innovation
Purio AI is the latest technology from "Zego Technology" focused on sound quality enhancement.
Reportedly, "Zego Technology" has been developing its own audio engine since 2015, the year it built its in-house 3A audio engine. By 2018, it was serving more than 70% of the top Internet customers and had launched a series of one-click voice interaction solutions for various industries. In 2021, it was the first to launch a complete KTV solution integrated with music copyright holders. On the technical side, it pioneered scene-based AI noise reduction and a KTV-specific AEC algorithm, and it was the first to support 10,000 connected participants in a single room.
In 2022, "Zego Technology" officially released its AI noise reduction function. At that time, AI noise reduction was already widely used in overseas markets, while the domestic market was relatively conservative, fundamentally because it did not yet depend on a quiet environment for interaction. Over time, however, users have run into noise more and more often: voices and traffic in public places and busy outdoor areas, television and music indoors, and small sounds such as keyboard typing, headphones being plugged in or unplugged, coughing, and swallowing.
A good voice interaction experience has therefore become users' most urgent need. In other words, improving sound quality by optimizing the three core "3A" capabilities (noise reduction, echo cancellation, and automatic gain control) has become central to solving users' pain points.
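Of the three capabilities, automatic gain control is the simplest to illustrate. The following is a minimal sketch of a frame-based gain controller, not the Zego implementation: the function names and the values of `target_rms`, `max_gain`, and `smoothing` are assumptions chosen for illustration.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def agc(frames, target_rms=0.1, max_gain=10.0, smoothing=0.9):
    """Scale each frame toward target_rms, smoothing the gain between frames."""
    gain = 1.0
    out = []
    for frame in frames:
        level = rms(frame)
        # Desired gain for this frame, capped so near-silence is not blown up.
        desired = min(max_gain, target_rms / level) if level > 1e-9 else gain
        # One-pole smoothing avoids abrupt, audible gain jumps.
        gain = smoothing * gain + (1.0 - smoothing) * desired
        out.append([s * gain for s in frame])
    return out

# Usage: a quiet 440 Hz tone (amplitude 0.01) in 10 ms frames at 16 kHz.
quiet = [[0.01 * math.sin(2 * math.pi * 440 * (i + 160 * k) / 16000)
          for i in range(160)] for k in range(100)]
levelled = agc(quiet)
```

Here the tone is so quiet that the gain settles at the `max_gain` cap rather than at the target level, which is the usual trade-off between raising soft speakers and amplifying background noise.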
Take noise reduction and echo cancellation as examples. Traditional noise reduction cannot effectively suppress transient noise, and traditional echo cancellation does more damage to the human voice, so both adapt poorly to changing environments. AI, with its strong generalization ability, makes up for the weak adaptability of traditional methods in complex environments.
Beyond that, AI lets noise reduction and echo cancellation adapt to users' changing environments, suppressing interference while preserving the human voice, and it also gives them the ability to recognize scenes. For example, AI can distinguish "interference" from "human voice" and separate the two precisely. It can also switch behavior by scene: entrance music is not treated as noise, and neither is applause in a conference.
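The weakness with transient noise can be seen in a toy version of a classical suppressor. The sketch below is an illustrative baseline, not any vendor's algorithm: it tracks a slowly adapting noise floor and derives a Wiener-style gain per frame. The frame size, smoothing constant, and test signal are assumptions.

```python
import math

SAMPLE_RATE = 16000
FRAME = 160  # 10 ms at 16 kHz

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def suppression_gains(frames, floor_smoothing=0.95):
    """Per-frame gain from a slowly tracked noise floor (classical approach)."""
    floor = frame_energy(frames[0])  # assume the first frame is noise only
    gains = []
    for frame in frames:
        energy = frame_energy(frame)
        # Gain is near 0 when the frame looks like the tracked noise and
        # near 1 when it carries far more energy than the floor.
        gain = max(0.0, 1.0 - floor / energy) if energy > 0.0 else 0.0
        gains.append(gain)
        # The floor adapts slowly, so it only models steady noise.
        floor = (floor_smoothing * floor
                 + (1.0 - floor_smoothing) * min(energy, 4.0 * floor))
    return gains

# Steady 100 Hz hum (amplitude 0.01) with one loud click in frame 50.
frames = [[0.01 * math.sin(2 * math.pi * 100 * (k * FRAME + i) / SAMPLE_RATE)
           for i in range(FRAME)] for k in range(100)]
frames[50][0] += 1.0
gains = suppression_gains(frames)
```

The steady hum is attenuated almost completely, while the frame containing the click receives a gain close to 1 and passes through nearly untouched, because an energy-based floor cannot tell a sudden noise from the onset of speech.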
In practice, Qutoutiao Technology, one of the first companies to bring AI into mobile social products, uses the audio technology of "Zego Technology" to provide a smooth, high-quality on-mic experience for 200 million users and keeps creating new voice social features for them.
The "Online KTV" feature that Qutoutiao Technology launched in 2022 is built on the Zego Purio AI audio engine and achieved a significant improvement in the KTV scenario. Dynamic loudness equalization keeps the voice and the accompaniment precisely balanced, resolving the clash between the two during singing. The KTV scoring technology gives accurate and timely feedback on singing and evaluates the user's level more comprehensively and objectively by combining multiple dimensions such as pitch, rhythm, articulation, and breath.
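The article does not describe how the loudness equalization works internally. As a rough sketch of the general idea, one way to balance a voice against an accompaniment is to measure both levels per frame and steer the vocal gain toward a fixed level offset. The function name, the 3 dB offset, and the other parameters below are hypothetical.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def balance_vocal(vocal_frames, music_frames, offset_db=3.0,
                  smoothing=0.8, max_gain=8.0):
    """Per-frame vocal gains that hold the vocal offset_db above the music."""
    target_ratio = 10.0 ** (offset_db / 20.0)
    gain = 1.0
    gains = []
    for vocal, music in zip(vocal_frames, music_frames):
        v, m = rms(vocal), rms(music)
        if v > 1e-6 and m > 1e-6:
            desired = min(max_gain, target_ratio * m / v)
            gain = smoothing * gain + (1.0 - smoothing) * desired
        # During silence in either track the previous gain is held.
        gains.append(gain)
    return gains

# Usage: a soft vocal (level 0.05) against louder music (level 0.2).
vocal = [[0.05] * 160 for _ in range(60)]
music = [[0.2] * 160 for _ in range(60)]
gains = balance_vocal(vocal, music)
level_db = 20.0 * math.log10(gains[-1] * 0.05 / 0.2)  # vocal vs. music, in dB
```

Smoothing the gain rather than applying the desired value directly keeps the balance from jumping between phrases.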
TT Voice × Zego Technology cooperation case
It is worth noting that AI also brings new challenges to voice interaction. The most prominent is that the models are complex and computationally expensive, which makes delivery in real-time scenarios difficult. On mid-range and low-end devices in particular, latency is high, power consumption is large, and the device heats up easily, so applications struggle to use AI capabilities.
To keep AI from becoming a drag on performance, "Zego Technology" uses techniques such as reparameterization, parameter sharing, and model quantization to achieve low overhead, low latency, and high fidelity. The new Purio AI Audio Engine keeps the ultra-low latency and lightweight footprint of earlier versions: AI algorithm latency is under 10 ms, and the increase in CPU consumption on low-end devices is under 4%, which keeps the engine usable for end users.
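Two of the techniques named above can be shown on a toy linear layer. This is only a sketch under simplifying assumptions: the article does not say how they are applied inside the engine, real networks fold convolution and normalization layers rather than plain matrices, and the weights here are made up.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def merge_branches(w_a, w_b):
    """Reparameterization: fold y = A x + B x into a single matrix (A + B)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(w_a, w_b)]

def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> (int8 values, scale)."""
    peak = max(abs(w) for row in weights for w in row)
    scale = peak / 127.0 if peak > 0 else 1.0
    q = [[max(-127, min(127, round(w / scale))) for w in row]
         for row in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float weights."""
    return [[v * scale for v in row] for row in q]

# Usage with made-up weights for two parallel branches.
w_a = [[0.20, -0.50], [0.75, 0.10]]
w_b = [[0.05, 0.25], [-0.25, 0.40]]
x = [1.0, 2.0]

merged = merge_branches(w_a, w_b)
two_branch = [a + b for a, b in zip(matvec(w_a, x), matvec(w_b, x))]
one_branch = matvec(merged, x)            # same result from one multiply

q, scale = quantize_int8(merged)
approx = matvec(dequantize(q, scale), x)  # close to one_branch
```

Merging branches removes work at inference time without changing the output, while quantization trades a small rounding error for weights that need a quarter of the storage of 32-bit floats.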
Building on earlier versions, Purio AI can recognize and eliminate more than 400 types of noise across all scenarios, and its suppression effect is 52% better than that of the 2022 version. Multiple high-precision AI techniques remove noise layer by layer and restore the human voice accurately, and the objective metric for voice fidelity reaches an industry-leading level. It also offers scene-based AI noise reduction, which adjusts the noise reduction strategy intelligently. For echo, the AI algorithm identifies and eliminates up to 99.9% of audio echo by separating the near-end signal from the echo signal, which preserves sound quality.
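For context on the echo figure, echo removal is commonly reported as echo return loss enhancement (ERLE) in decibels. The article does not say how its 99.9% is measured; if it referred to echo power it would correspond to about 30 dB, and if it referred to amplitude, about 60 dB. The sketch below shows a classical NLMS adaptive filter, the kind of traditional baseline that AI echo cancellation is compared against, together with an ERLE measurement. The echo path and all parameters are hypothetical.

```python
import math
import random

def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Adaptively estimate the echo of far_end inside mic and subtract it."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate
        # Normalizing by input power keeps the step size stable.
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual

def erle_db(mic, residual):
    """Echo power before cancellation relative to after, in dB."""
    before = sum(s * s for s in mic)
    after = max(sum(s * s for s in residual), 1e-30)
    return 10.0 * math.log10(before / after)

# Usage: white-noise far-end signal through a short, made-up echo path.
random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.5, -0.3, 0.2, 0.1]
mic = [sum(echo_path[k] * far[n - k]
           for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far))]
residual = nlms_echo_canceller(far, mic)
erle = erle_db(mic[-1000:], residual[-1000:])  # measured after convergence
```

This idealized case has no near-end speech or background noise, which is why a linear filter does so well. With a talker on the near end or a nonlinear loudspeaker, the classical filter either diverges or damages the voice, which is the gap the article attributes to AI-based separation.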
Today, the sound quality enhancement technology of "Zego Technology" is widely used in pan-entertainment social applications such as live streaming, music, social networking, and audio radio, as well as in industry applications such as financial dual recording, online education, video conferencing, and smart hardware.