OpenAI Launches gpt-realtime: Voice Agents Enter the Era of "Second-Level Response," and Developers Say Interaction Feels More Natural

Geekbang Technology InfoQ, 2025-09-16 18:41
OpenAI launches gpt-realtime: building production-ready, end-to-end voice agents.

OpenAI has officially released gpt-realtime, its latest speech-to-speech model, and has made the Realtime API generally available. The update aims to reduce latency, improve voice quality, and give developers more powerful tools, including support for MCP servers, image input, and SIP-based phone calls, so that they can build production-ready AI voice agents.

Together, the Realtime API and gpt-realtime handle voice end to end within a single system, so developers no longer need to chain separate speech-to-text and text-to-speech models. This architecture shortens response time and preserves the nuances of speech, which matters for real-time voice interaction because even a few hundred milliseconds of latency can disrupt the flow of a conversation.

gpt-realtime has been trained to produce higher-quality speech with more natural pacing and intonation. It also follows tone and style instructions more reliably, such as "speak with empathy" or "use a professional tone". Two new synthetic voices, Cedar and Marin, have been added, and the existing voices have been updated to sound more realistic.

gpt-realtime has also made significant progress in comprehension. The model can recognize non-verbal cues, switch languages within a single sentence, and more accurately process alphanumeric sequences (such as phone numbers and vehicle identification numbers) in languages including Spanish, Chinese, Japanese, and French. According to internal test results, gpt-realtime reaches 82.8% accuracy on Big Bench Audio, up from 65.6% for the previous-generation model. On instruction following, its score on the MultiChallenge audio benchmark rose from 20.6% to 30.5%.
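To put the reported benchmark numbers in perspective, the short Python sketch below computes both the percentage-point gain and the relative gain for each benchmark. The scores are the ones quoted above. The helper name `gain` is purely illustrative and is not part of any OpenAI tooling.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), each rounded to one decimal."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


# Scores as reported in the article: (previous model, gpt-realtime).
scores = {
    "Big Bench Audio": (65.6, 82.8),
    "MultiChallenge (audio)": (20.6, 30.5),
}

for name, (before, after) in scores.items():
    points, relative = gain(before, after)
    print(f"{name}: +{points} points ({relative}% relative improvement)")
```

The distinction matters when comparing benchmarks: a smaller absolute gain on a low baseline can be a larger relative improvement.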

Function calling has also improved. The new model is better at identifying relevant functions, calling them at the right time, and passing correct arguments. On ComplexFuncBench, accuracy rose from 49.7% to 66.5%. The system also adds asynchronous function calling, which lets a voice agent continue the conversation while it waits for results. This is particularly useful in customer service and transaction scenarios.
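The idea behind asynchronous function calling can be illustrated with plain Python `asyncio`. This is a minimal conceptual sketch, not the Realtime API itself: `lookup_order_status` and `run_turn` are hypothetical names, and the slow backend is simulated with a short sleep.

```python
import asyncio


async def lookup_order_status(order_id: str) -> str:
    """Hypothetical slow backend call, simulated here with a short sleep."""
    await asyncio.sleep(0.05)
    return f"Order {order_id} has shipped."


async def run_turn(order_id: str) -> list[str]:
    """Start a function call, keep talking, then report the result."""
    transcript: list[str] = []

    # Kick off the function call without blocking the conversation.
    pending = asyncio.create_task(lookup_order_status(order_id))

    # The agent keeps the dialogue going while the call is in flight.
    transcript.append("agent: Let me check that for you.")
    transcript.append("agent: Is there anything else I can help with in the meantime?")

    # Once the result arrives, it is folded back into the conversation.
    result = await pending
    transcript.append(f"agent: {result}")
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(run_turn("A123")):
        print(line)
```

In a synchronous design, the agent would fall silent until the lookup returned. Here the filler turns are produced first and the result is appended once it is ready.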

The Realtime API has also been upgraded to better meet production requirements. Developers can now connect remote MCP servers directly to a session, which removes the need for manual tool integration. The API also supports image input, so applications can ground conversations in visual content such as screenshots or photos. SIP support lets voice agents integrate with existing phone systems, including PBXs and desk phones. Reusable prompts simplify session management, and EU data residency support helps meet compliance requirements for European deployments.
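As a rough illustration of what attaching a remote MCP server to a session might look like, the sketch below builds a session-update payload as a plain Python dictionary. The field names (`type`, `server_label`, `server_url`, `require_approval`) are assumptions modeled on OpenAI's published examples and should be checked against the current Realtime API documentation. The helper `build_mcp_session_update` and the URL are hypothetical.

```python
import json


def build_mcp_session_update(server_label: str, server_url: str) -> dict:
    """Build a session-update payload that lists one remote MCP server as a tool.

    Field names are assumptions; verify them against the Realtime API docs.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "tools": [
                {
                    "type": "mcp",
                    "server_label": server_label,
                    "server_url": server_url,
                    "require_approval": "never",
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder server; replace with a real MCP endpoint.
    payload = build_mcp_session_update("example", "https://mcp.example.com")
    print(json.dumps(payload, indent=2))
```

The point of the design is that the tool wiring lives in session configuration, so the application does not need its own glue code for each tool.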

According to the release notes, early enterprise partners have tested these features in near-production scenarios. Zillow has launched a pilot of voice-driven real estate search, while T-Mobile is exploring real-time responses in customer service. Both companies emphasize that AI voice agents are shifting interaction away from traditional scripted automation and toward more flexible conversations grounded in domain expertise.

OpenAI has also strengthened its deployment safety measures. The Realtime API has a built-in classifier that can halt harmful conversations, and developers can add domain-specific safety constraints through the Agents SDK. In addition, the Realtime API's preset voices help reduce the risk of impersonation.

The gpt-realtime model and the Realtime API are now available to all developers. Developers can consult the Realtime API documentation and the prompting guide to get started, and can try the new gpt-realtime demo in the Playground.

Original link: https://www.infoq.com/news/2025/09/openai-gpt-realtime/

This article is from the WeChat official account "InfoQ". Author: Hien Luu. Republished by 36Kr with permission.