OpenAI enters the voice-model battle, unleashing its most powerful model yet, GPT-RealTime, with more features and lower prices.
On August 29th, Zhidx reported that early this morning, OpenAI released the voice-to-voice model GPT-RealTime for developers and simultaneously updated the API features, including support for remote MCP servers, image input, and SIP (Session Initiation Protocol) phone calls.
OpenAI claims this is its most advanced speech-to-speech model to date. GPT-RealTime improves on following complex instructions, invoking tools precisely, and producing more natural, expressive speech. The model can read out repeated letters and numbers naturally, switch languages seamlessly mid-sentence, and even capture non-verbal signals such as laughter.
Today, OpenAI also released two new voices, Cedar and Marin, which will be exclusively available in the Realtime API.
In terms of pricing, the general-availability Realtime API and the new GPT-RealTime model are now open to all developers. GPT-RealTime costs $32 per million audio input tokens (about 228 RMB), $0.40 per million cached input tokens (about 2.85 RMB), and $64 per million audio output tokens (about 456 RMB), 20% lower than gpt-4o-realtime-preview.
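Using the per-million-token rates above, a rough session cost works out as follows. The token counts in the example are made-up illustrative figures, not measurements:

```python
# Rough cost estimate for a GPT-RealTime session, using the published
# per-million-token rates quoted above. Token counts are illustrative only.
RATES_PER_MTOK = {
    "audio_input": 32.00,   # USD per 1M audio input tokens
    "cached_input": 0.40,   # USD per 1M cached input tokens
    "audio_output": 64.00,  # USD per 1M audio output tokens
}

def session_cost(usage: dict) -> float:
    """usage maps the rate keys above to raw token counts."""
    return sum(RATES_PER_MTOK[k] * n / 1_000_000 for k, n in usage.items())

# Example: 30k fresh input tokens, 20k served from cache, 30k output tokens.
cost = session_cost({"audio_input": 30_000, "cached_input": 20_000, "audio_output": 30_000})
print(f"${cost:.4f}")  # → $2.8880
```

Note how cheap cached input is relative to fresh audio input; prompt caching dominates the savings in long sessions.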
OpenAI has added fine-grained control over the conversation context, allowing developers to set intelligent token limits and truncate multiple rounds at once, significantly reducing the cost of long conversations.
OpenAI released the public beta of the Realtime API in October last year, and thousands of developers have since used it and offered feedback.
Judging from comments on OpenAI's posts on X, reactions are mixed: some users are excited, saying voice applications will become more interesting, while some developers report that the model still sounds robotic and that the old voices are only slightly more expressive.
Progress on voice models is accelerating both in China and abroad. At the beginning of this month, MiniMax, one of China's six major model startups, launched the voice generation model Speech 2.5, covering more than 40 languages. Earlier this year, the Doubao app updated its real-time voice-call feature, free to users, which can imitate different voices and perceive emotion. On the same day as OpenAI's release, Microsoft launched its first highly expressive, natural voice generation model, MAI-Voice-1, which can generate different audio performances from the same prompt.
01. Talk to Your Voice Assistant Like a Friend When Buying a House, Tickets, or Booking a Doctor
OpenAI posted examples of collaborating with five companies to build voice assistants on its blog.
First, there is Zillow, an information service platform for the US real estate market. OpenAI's new model can talk to users naturally, helping them filter housing listings according to their lifestyle needs or analyze purchase prices.
Second, as T-Mobile's mobile assistant, the AI can handle rapid conversational turn-taking: even if users interrupt mid-sentence and start a new topic, the conversation stays on track.
Third, for the ticket trading platform StubHub, OpenAI's new model can help users make payments and guide them through any issues they encounter during the payment process.
Fourth, when it comes to helping users call and book doctors, on the Oscar Health platform, this new model can help users confirm available appointment times, appointment notes, and appointment addresses.
Finally, for the insurtech company Lemonade, the AI assistant helps users who run into questions while buying car insurance: it understands their needs during the conversation and then completes the purchase using the personal and bank-card information stored on file.
02. Capture Laughter, Seamlessly Switch Languages, and Adjust Tone
OpenAI has improved GPT-RealTime along three axes: audio quality, comprehension of user speech, and instruction following.
For a voice agent to keep users engaged in a conversation, the model needs to have intonation, emotion, and rhythm like a human being to create a pleasant conversation experience. The blog mentioned that GPT-RealTime can produce more natural and high-quality voices and follow fine-grained instructions, such as "speak quickly and professionally" or "speak sympathetically with a French accent."
In terms of understanding user instructions, GPT-RealTime can capture non-verbal cues such as laughter, switch languages within a sentence, and adjust the tone. According to OpenAI's internal evaluation, the model is also more accurate in detecting alphanumeric sequences such as phone numbers in languages like Spanish, Chinese, Japanese, and French.
In the Big Bench Audio evaluation, GPT-RealTime achieved an accuracy rate of 82.8%, surpassing OpenAI's old model released in December 2024. The Big Bench Audio benchmark test is an evaluation dataset used to assess the reasoning ability of language models that support audio input.
When building voice-to-voice applications, developers provide the model with a series of behavioral instructions, including how to speak, what to say, what to do, or what not to do in specific situations. OpenAI focuses on improving the model's ability to follow these instructions, enabling even the smallest instructions to convey more information to the model.
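Behavioral instructions of this kind are passed in the session's `instructions` field. The wording below is an illustrative example, not OpenAI's, and the event shape should be checked against the current Realtime API reference:

```python
import json

# Sketch of a session.update event carrying behavioral instructions for a
# voice agent. The instruction text is a made-up example of the fine-grained
# guidance the article describes ("how to speak, what to do or not do").
instruction_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a support agent. Speak quickly and professionally. "
            "If the caller asks for a refund, first confirm the order number. "
            "Never read card numbers back aloud."
        )
    },
}
# ws.send(json.dumps(instruction_update))  # with an open Realtime WebSocket
print(json.dumps(instruction_update))
```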
In the MultiChallenge audio benchmark test, which measures the accuracy of following instructions, GPT-RealTime scored 30.5%, a significant improvement compared to the old model's 20.6%. The MultiChallenge evaluation assesses the performance of large models in handling multi-round conversations with humans. OpenAI selected a subset of test questions suitable for audio presentation and converted them into voice through text-to-speech (TTS) technology to create the audio version of this evaluation.
To build a powerful voice agent with a voice-to-voice model, the model needs to invoke the right tools at the right time. OpenAI has improved function calling along three dimensions: invoking relevant functions, invoking them at appropriate times, and invoking them with appropriate arguments. In the ComplexFuncBench audio evaluation, which measures function-calling performance, GPT-RealTime scored 66.5%, up from the 49.7% scored by the model released in December 2024.
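A function tool of the kind such an agent would call is declared with a name, description, and JSON-schema parameters, mirroring other OpenAI APIs; the booking function below is hypothetical:

```python
import json

# Sketch of a function tool a voice agent might expose. The tool shape
# (name / description / JSON-schema parameters) mirrors OpenAI's function
# tools elsewhere; book_appointment itself is a made-up example.
book_appointment_tool = {
    "type": "function",
    "name": "book_appointment",
    "description": "Book a doctor's appointment for the caller.",
    "parameters": {
        "type": "object",
        "properties": {
            "doctor_id": {"type": "string"},
            "time_slot": {"type": "string", "description": "ISO 8601 start time"},
        },
        "required": ["doctor_id", "time_slot"],
    },
}
tool_update = {"type": "session.update", "session": {"tools": [book_appointment_tool]}}
print(json.dumps(tool_update))
```

The model then decides, from the conversation, when to emit a call to this tool and with which arguments, which is exactly what the three dimensions above measure.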
In addition, OpenAI has also improved asynchronous function calls. Long-running function calls will no longer interrupt the conversation flow, and the model can continue to have a smooth conversation while waiting for the results. This feature is natively supported in GPT-RealTime, and developers do not need to update their code.
03. Preserve Voice Nuances: Four New Features for the Realtime API
Different from the traditional multi-model chaining process of voice-to-text and text-to-voice, the Realtime API directly processes and generates audio through a single model and API. This reduces latency, preserves the nuances in the voice, and makes the responses more natural and expressive.
The new features of the Realtime API include:
First, remote MCP server support: developers can enable MCP in a session by passing the URL of a remote MCP server in the session configuration. Once connected, the API handles tool calls automatically, with no manual integration work.
With this setup, pointing a session at a different MCP server makes that server's tools immediately available.
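Concretely, enabling a remote MCP server looks like sending one `session.update` event over the WebSocket. The field names below follow the MCP tool shape OpenAI uses elsewhere in its APIs and the server URL is a placeholder, so treat this as a sketch rather than the definitive payload:

```python
import json

# Sketch: enable a remote MCP server for a Realtime session via session.update.
# server_label and server_url are placeholders; require_approval set to "never"
# lets the API handle tool calls without pausing for confirmation.
mcp_session = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "my-tools",                  # hypothetical label
                "server_url": "https://mcp.example.com/sse",  # placeholder URL
                "require_approval": "never",
            }
        ]
    },
}
payload = json.dumps(mcp_session)
# ws.send(payload)  # with an open WebSocket to the Realtime API
print(payload)
```

Swapping capabilities is then just a matter of sending another `session.update` with a different `server_url`.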
In terms of image input, developers can add images, photos, and screenshots to a Realtime API session and use them together with audio or text. Now the model can build conversations based on what the user actually sees, allowing users to ask questions such as "What do you see?" or "Read the text in this screenshot."
Rather than treating images as real-time video streams, the system is more like adding pictures to the conversation. Developers' applications can decide which images to share with the model and when to share them, thus controlling what the model sees and when to respond.
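Sharing a picture with the model, in this scheme, amounts to adding a conversation item that carries the encoded image plus an optional text question. The event and content-type names below mirror OpenAI's published conversation-item examples but should be checked against the current API reference; the image bytes are a stand-in:

```python
import base64
import json

# Sketch: attach an image to a Realtime conversation. The app decides which
# picture to share and when, base64-encodes it as a data URL, and creates a
# user message item combining the image with a text question.
fake_png = b"\x89PNG\r\n\x1a\n"  # stand-in bytes; a real app reads an image file
data_url = "data:image/png;base64," + base64.b64encode(fake_png).decode()

item_create = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Read the text in this screenshot."},
            {"type": "input_image", "image_url": data_url},
        ],
    },
}
# ws.send(json.dumps(item_create))  # then request a response from the model
print(json.dumps(item_create)[:80])
```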
OpenAI has also added features to make the Realtime API easier to integrate, including Session Initiation Protocol (SIP) support and reusable prompts.
SIP support allows developers to directly connect their applications to public telephone networks, PBX systems, office phones, and other SIP endpoints through the Realtime API.
Reusable prompts allow developers to save and reuse prompts, including developer messages, tools, variables, and example user/assistant messages. They can be used across Realtime API sessions and follow the same logic as in the Responses API.
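Referencing a saved prompt from a session might then look like the sketch below. The prompt object (id/version/variables) follows the Responses API convention the article says this feature shares; the id is a made-up placeholder:

```python
import json

# Sketch: attach a saved, reusable prompt to a Realtime session.
# "pmpt_example123" is a placeholder id; variables fill the prompt's
# template slots, per the Responses API prompt convention.
prompt_session = {
    "type": "session.update",
    "session": {
        "prompt": {
            "id": "pmpt_example123",
            "version": "1",
            "variables": {"city": "Berlin"},
        }
    },
}
print(json.dumps(prompt_session))
```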
04. Conclusion: Multi-Layered Safeguards to Prevent Model Abuse
To prevent abuse of real-time voice conversations, the Realtime API ships with multiple layers of safety protections and mitigations. OpenAI runs active classifiers over Realtime API sessions, so conversations detected violating harmful-content guidelines can be terminated. Developers can also add their own safeguards using the Agents SDK.
Ultra-realistic real-time voice conversation has already found a wide range of applications: Doubao's real-time voice calls and Baidu's newly launched digital employees both use voice as the primary interface with users. OpenAI's new voice-to-voice model adds stronger reasoning and more natural expressiveness, letting it handle complex multi-step requests and power AI agents across domains.
This article is from the WeChat official account “Zhidx” (ID: zhidxcom), authored by Cheng Qian and edited by Li Shuiqing. It is published by 36Kr with authorization.