
Conversational AI: Waiting for the Next "Trillion-Minute Moment"

Xiao Xi · 2025-11-05 14:47
When conversational AI possesses human intelligence, the scenario in "Her" transitions from the silver screen to reality.

When you pour out your heart to the cute AI plush toy "Fu Zai" on your desk after a rough day, you may not realize that natural dialogue between humans and AI is quietly driving a new wave of real-time voice technology and unlocking a vast commercial blue ocean.

On October 31st, the 11th Real-Time Internet Conference, Convo AI & RTE 2025, jointly hosted by Agora and the RTE Developer Community, officially kicked off in Beijing. At the conference, Zhao Bin, the founder and CEO of Agora, shared the following set of data:

According to data from Deepgram and Opus Research, in 2025, 67% of enterprises place conversational AI agents at the core of their strategies, and 84% plan to increase related investment over the coming year.

Correspondingly, usage of Agora's conversational AI-related services grew 151% quarter-on-quarter in the third quarter of 2025, a sign of strong market demand.

Conversational AI integrates technologies such as large language models (LLM), automatic speech recognition (ASR), text-to-speech (TTS), and real-time interaction (RTE).

Most people's impression of conversing with AI is of something "reading from a script" without emotion, like a playback machine, with an unmistakably artificial feel. With the emergence of conversational AI, however, AI can hold natural, genuine, and fluent conversations just as humans do.

It is as if Samantha from the movie "Her" has become real: the small device in the protagonist's hand can not only accurately recognize voice, text, and images, but also adjust the tone of its responses to match the speaker's tone, emotion, and language habits.

To help enterprises and developers seize this historical opportunity of conversational AI, Agora officially released the "2025 Conversational AI Development White Paper" at the conference. In this white paper, after conducting industry research and deeply integrating its experience in the RTE industry, the Agora team outlined a comprehensive roadmap for conversational AI, including technological evolution, core technologies, mainstream solutions and business models, quality evaluation systems, industry practice cases, and future trend outlooks.

Beyond theory, Agora is also taking practical steps to advance conversational AI. At the conference, it launched a series of conversational AI products, including a next-generation conversational AI engine, companion development kits, a model evaluation platform, and an orchestration platform.

A wave of conversational AI is rising.

Conversational AI Becomes a Reality

Think back: Under what circumstances do you usually call up Siri on your iPhone to have a conversation with it?

Most people might answer: probably when setting an alarm before bed. The data is telling: industry figures show that only 21% of users are satisfied with the current AI conversation experience, and the churn rate of some services is unacceptably high.

Actually, it's not that humans are reluctant to talk to AI; it's that AI doesn't understand humans well enough. In human conversation, only about 7% of the information comes from the words themselves; more than 90% is perceived through non-verbal cues such as intonation, facial expressions, and body language.

To enable AI to have "human-like conversations," enterprises still face many technological challenges to overcome.

For example, the end-to-end latency of most current conversational AI systems is generally over 3 seconds, while the normal gap between turns in human conversation is about 400 milliseconds. That roughly three-second wait is a major pain point in human-machine interaction: in business settings, even a few seconds of silence can exhaust a user's patience. Closing this response-latency gap is therefore a key barrier between conversational AI and a truly human conversational experience.
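To make the gap concrete, here is an illustrative latency budget for a voice pipeline. The per-stage numbers are hypothetical round figures chosen to sum to the roughly three-second total cited above, not measurements from any particular system:

```python
# Illustrative latency budget for a voice-agent turn.
# All per-stage numbers are hypothetical, not measured values.

HUMAN_TURN_GAP_MS = 400  # typical gap between turns in human conversation

stage_latency_ms = {
    "endpoint_detection": 500,    # confirming the user has stopped talking
    "asr_final_transcript": 400,  # speech-to-text
    "llm_first_token": 900,       # large-model reasoning
    "tts_first_audio": 700,       # text-to-speech synthesis
    "network_round_trips": 600,   # transport overhead between services
}

total_ms = sum(stage_latency_ms.values())
print(f"end-to-end: {total_ms} ms "
      f"({total_ms / HUMAN_TURN_GAP_MS:.1f}x the human gap)")
```

Even with generous assumptions, the serial stages stack up to several times the human conversational rhythm, which is why each stage must be streamed and overlapped rather than run strictly one after another.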

Beyond latency, another technical challenge for conversational AI is how to give AI a human-like conversational "mind."

Many people will recognize these human-machine interaction scenarios: mid-conversation, the AI is thrown off by a user clearing their throat, keyboard noise, or a momentary hesitation, and the context breaks. Or at a noisy party or exhibition, the AI fails to lock onto the actual user's voice and loses focus.

Although these experiences seem minor, they are crucial for building trust and emotional dependence between humans and AI. For users, what they expect is not just a machine that can provide correct answers but an AI with a "human touch."

For this reason, Zhao Bin summarized the technical challenges of conversational AI as: low-latency response, natural interruption, context management, and emotional understanding and expression.
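One of these challenges, natural interruption (barge-in), can be sketched as a tiny state machine: if the user genuinely starts speaking while the agent is talking, the agent yields the floor, while coughs or keyboard noise are ignored. This is an illustrative sketch of the general idea, not any vendor's implementation:

```python
# Toy barge-in state machine: yield the floor on genuine user speech,
# but don't let background noise cut the agent off. Illustrative only.

class VoiceAgentState:
    LISTENING, SPEAKING = "listening", "speaking"

    def __init__(self):
        self.state = self.LISTENING

    def on_agent_reply_ready(self):
        # The agent has an answer and starts talking.
        self.state = self.SPEAKING

    def on_user_audio(self, is_real_speech: bool):
        # is_real_speech would come from voice activity detection that
        # filters out coughs, throat-clearing, and keyboard noise.
        if self.state == self.SPEAKING and is_real_speech:
            self.state = self.LISTENING  # barge-in: stop and listen

agent = VoiceAgentState()
agent.on_agent_reply_ready()
agent.on_user_audio(is_real_speech=False)  # noise: keep talking
print(agent.state)  # speaking
agent.on_user_audio(is_real_speech=True)   # genuine interruption
print(agent.state)  # listening
```

Real systems add more nuance (partial playback cancellation, resuming after false barge-ins), but the core decision is the same: distinguish real speech from noise before interrupting.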

To tackle these difficulties, the industry's current mainstream technical approach is the cascaded model. In short, the cascaded model is a well-organized "assembly line" that breaks a voice conversation into three independent steps running serially: speech-to-text (ASR), text understanding by a large model (LLM), and text-to-speech (TTS).

Compared with other models, the cascaded model is more modular. Developers can flexibly choose the most suitable suppliers for the three steps, just like building blocks, to optimize costs and improve effectiveness. Therefore, the cascaded model has become the technical solution of choice for most AI customer service, smart speakers, and other applications in the industry.
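The "building blocks" idea can be sketched in a few lines: each stage sits behind its own small interface, so any provider can be swapped in independently. The interfaces and the echo implementations below are hypothetical stand-ins for vendor SDKs, included only so the sketch runs end to end:

```python
# Minimal sketch of a cascaded voice-agent pipeline (ASR -> LLM -> TTS).
# All provider classes are hypothetical; real systems would wrap vendor
# SDKs behind these same interfaces.
from typing import Protocol

class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def reply(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class CascadedAgent:
    """Runs the three stages serially; each stage is independently swappable."""
    def __init__(self, asr: ASR, llm: LLM, tts: TTS):
        self.asr, self.llm, self.tts = asr, llm, tts

    def turn(self, audio_in: bytes) -> bytes:
        text = self.asr.transcribe(audio_in)  # speech -> text
        answer = self.llm.reply(text)         # text -> text
        return self.tts.synthesize(answer)    # text -> speech

# Toy implementations so the pipeline runs without external services.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class EchoLLM:
    def reply(self, text: str) -> str:
        return f"You said: {text}"

class EchoTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

agent = CascadedAgent(EchoASR(), EchoLLM(), EchoTTS())
print(agent.turn(b"book a table").decode())  # -> You said: book a table
```

Swapping a provider means changing one constructor argument, which is exactly the cost-versus-quality flexibility that makes the cascade attractive.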

Take Agora as an example. Around the cascaded model, it has built three product forms covering different customer groups. For application developers who want to ship quickly, Agora launched Conversational AI Engine 2.0, an out-of-the-box, one-stop solution aimed at the pain points described above.

Specifically, Engine 2.0 relies on a global real-time network to deliver ultra-low end-to-end latency. It also builds in advanced functions such as intelligent interruption and voiceprint recognition for smarter in-conversation interaction. And it is designed to be developer-friendly: it supports multiple mainstream large models, lets developers pick module functions as needed, and integrates quickly into different application scenarios.

Of course, for companies that want flexible selection and in-depth customization, Agora also provides modular SDKs, such as speech recognition SDKs, allowing developers to freely "build blocks." For customers who are already using Agora's real-time audio and video services and want to add AI capabilities without changing the architecture, Agora also offers a series of extension kits that can "plug in" a range of conversational AI functions.

Through these three product forms - the engine, SDK, and extension kits - Agora covers different customer groups from novices to experts, ensuring that all customers can find the most suitable way within its ecosystem to bring conversational AI into reality.

A "Ruler" for Conversational AI

Whether with humans or AI, conversation is inherently subjective. Yet for the long-term development of conversational AI, the industry has lacked a comprehensive, complete, and objective evaluation framework; such a framework would act like a navigation channel for the field.

The industry has proposed some evaluation methods, such as task completion rate and word error rate, but these are single-point technical indicators: too fragmented and too limited. The elements of voice and conversation in conversational AI are simply too complex, so existing methods inevitably diverge widely from actual user experience.

Therefore, in the "2025 Conversational AI Development White Paper," Agora proposes a "Three Dimensions, Two Tracks" evaluation framework. The three dimensions assess the AI's own capabilities: understanding, expression, and interaction. The two tracks are two methods of assessment: benchmark testing and user-oriented testing.

It may seem a bit abstract, but imagine you are interviewing an AI assistant using this evaluation framework. You give it an instruction: "Book me an Italian restaurant tonight that's suitable for a business dinner."

An AI assistant with stronger understanding ability can extract and understand the key words in your instruction, such as "tonight," "business dinner," and "Italian restaurant." An assistant with weaker understanding ability may only catch the keyword "restaurant" and recommend the nearby McDonald's.

Next, an AI assistant with strong expression ability will pick up on the emotion in your instruction and introduce suitable restaurants in a natural, pleasant tone. One with mediocre expression ability will just read out a long list of addresses in a stiff broadcast monotone, like a soulless playback machine.

While the AI assistant is introducing the restaurant, you suddenly interrupt it and ask, "Is there a parking lot near the restaurant?"

At this point, an assistant with poor interaction ability may simply ignore the question and finish its introduction. One with strong interaction ability, with good conversational rhythm and interruption handling, will stop immediately, look up the answer, and then follow up: "Would you also like me to check the menu?"

Notably, this framework uses benchmark testing to verify that conversational AI has solid fundamentals, while user-oriented testing goes beyond hard technical indicators to capture users' subjective evaluations in practice.
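As a toy illustration of how the two tracks might be combined, the sketch below gives each dimension a benchmark score and a user-rating score and blends them with an assumed 50/50 weight. All numbers and weights here are invented for illustration and are not taken from the white paper:

```python
# Toy scoring sketch of the "Three Dimensions, Two Tracks" idea.
# Scores and weights are illustrative assumptions, not real data.

scores = {
    "understanding": {"benchmark": 0.82, "user": 0.74},
    "expression":    {"benchmark": 0.78, "user": 0.69},
    "interaction":   {"benchmark": 0.65, "user": 0.58},
}
W_BENCH, W_USER = 0.5, 0.5  # assumed equal weighting of the two tracks

combined = {
    dim: W_BENCH * s["benchmark"] + W_USER * s["user"]
    for dim, s in scores.items()
}
for dim, val in combined.items():
    print(f"{dim}: {val:.2f}")
```

The point of the structure is that a model strong on benchmarks but weak with users (or vice versa) shows up clearly per dimension, rather than being hidden in a single aggregate number.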

If "Three Dimensions, Two Tracks" gives conversational AI a "ruler," defining the principles of what good conversational AI looks like, Agora doesn't stop there: it also provides developers with a set of practical tools built on this framework.

Agora's AI model evaluation platform targets the core pain points of conversational AI scenarios to serve as a "decision support system." Reportedly, through interactive tests that simulate real conversations, the platform dynamically monitors and updates data at ten major city nodes worldwide, giving an intuitive, real-time comparison of mainstream ASR, LLM, and TTS models.

For example, a developer building an "AI social companion" application on Agora's conversational AI engine can use the evaluation platform to compare how different ASR, LLM, and TTS models perform on response latency, which matters greatly in companion scenarios, and then select the model combination best suited to their business.
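The selection step reduces to a small optimization over measured per-model latencies. The vendor names and numbers below are made up for illustration; in practice they would come from the evaluation platform's live data:

```python
# Hypothetical: pick the lowest-latency ASR/LLM/TTS combination from
# per-model latency figures. All names and numbers are invented.
from itertools import product

asr_ms = {"asr_a": 320, "asr_b": 410}
llm_ms = {"llm_a": 900, "llm_b": 750}
tts_ms = {"tts_a": 280, "tts_b": 350}

best = min(
    product(asr_ms, llm_ms, tts_ms),
    key=lambda c: asr_ms[c[0]] + llm_ms[c[1]] + tts_ms[c[2]],
)
total = asr_ms[best[0]] + llm_ms[best[1]] + tts_ms[best[2]]
print(best, f"{total} ms")  # lowest end-to-end latency combination
```

A real selection would of course weigh quality and cost alongside latency, but the exhaustive comparison over stage combinations is the same.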

As conversational AI gradually moves beyond the concept and rapidly improves in terms of technical routes, product solutions, evaluation standards, and tools, it is destined to take root and flourish in more fields.

Conversational AI: Quietly Making an Impact

Currently, conversational AI has achieved large-scale implementation in three scenarios: smart hardware, emotional companionship, and online education.

2025 is regarded as the breakout year for AI hardware. From the AI companion hardware represented by "Fu Zai," which set off an industry wave, to the "hundred-glasses war" triggered by AI glasses such as Ray-Ban Meta, conversational AI plays a crucial role in the bustling AI hardware market, giving cold hardware a human-like soul and intelligence.

In the field of emotional companionship, AI social applications represented by Xingye and Character.AI show how conversational AI has evolved from mechanical responses into a social engine with memory, personality, and empathy, allowing AI to truly keep humans company.

In the field of education, conversational AI is driving a teaching revolution: for example, making oral language practice more human-like and creating more immersive language-learning environments. AI dual-teacher systems built on conversational AI (such as Dou Shen AI) also promote more equitable access to educational resources.

All these signs indicate that the ecosystem of conversational AI is being rapidly built. However, you may wonder what kind of imagination the future of conversational AI holds. Agora also depicts the future picture in its white paper.

First, conversational AI will make a qualitative leap in multimodal interaction, gaining abilities such as listening and speaking simultaneously and reading users' facial expressions and gestures. Human-machine interaction will come infinitely close to human-to-human interaction. At Agora's launch event, when Zhao Bin demonstrated an "AI customer service" agent, the audience found it hard to tell whether they were talking to a machine or a person.

Second, conversational AI may evolve beyond a single-point, passively responding tool into a "super assistant" built on multi-agent collaboration.

From a business perspective, the future of conversational AI may also hide more commercial value. Perhaps its form of existence will be more hidden and diverse. It may be integrated into business processes, become a digital employee in an organization, or serve as a new information entry and service hub.

You can also let your imagination run wild: in the future, conversational AI may become a kind of "digital life form." In childhood, it is a guide and guardian of early learning; in working life, an all-round assistant; and as personal data accumulates, it evolves to understand you ever better.

Conclusion

Conversational AI was born in the 1960s, but only truly entered a period of rapid development after gaining human-like capabilities in the era of large models. Seemingly simple, the technology's long-run significance is far-reaching.

Firstly, conversational AI has completely changed the underlying logic of human-machine communication, evolving the previous human-machine interaction mainly based on graphical user interfaces (GUI) to a more natural way of communication that conforms to human instincts. This transformation will greatly lower the threshold for using AI technology, thereby achieving the equalization of AI and the popularization of technology.

Secondly, conversational AI will help humans free themselves from cumbersome and complex tasks, allowing them to focus more on creative and strategic work. There are infinite commercial possibilities hidden in this, and more new business models and forms may emerge, not limited to the currently popular companion AI native apps and AI companion hardware.

In summary, conversational AI is not only a technological high ground; it fundamentally changes how we interact, reshapes relations of production, and drives economic growth.

At Agora's press conference, Zhao Bin, the founder and CEO of Agora, stated that as of now, Agora's annual service minutes have exceeded 1 trillion minutes for the first time. This is a milestone, indicating that RTE technology (Real-Time Engagement) has become an indispensable part of the industry, just like electricity, water, and gas.

With all of its constituent technologies maturing, conversational AI is fully prepared. It is waiting for its "trillion-minute moment."