Checkmate! OpenAI officially takes over human ears. The first GPT-5 level reasoning audio model is here.
[Introduction] Game-changer! OpenAI releases GPT-Realtime-2, the first GPT-5 level reasoning audio model. The last "firewall" between humans and machines, the keyboard, is disappearing entirely.
Early this morning, OpenAI shocked the world once again.
This time, they are not competing in text or video. Instead, they are bringing Samantha, the AI from the movie "Her" that amazed and disappointed countless people, into reality.
OpenAI officially announced the launch of GPT-Realtime-2.
This is not just an upgrade of an audio model. It is the first time OpenAI has explicitly injected "GPT-5 level" reasoning ability into voice interaction.
Along with it come GPT-Realtime-Translate (real-time translation) and GPT-Realtime-Whisper (streaming transcription).
As OpenAI's official blog said: "Voice is becoming the most natural way for people to use software."
Today, OpenAI wants to turn this naturalness into all-around capability.
"GPT - 5 level" reasoning injection: The voice assistant finally has a "brain"
Recall: when you used to tease Siri or Alexa, what was the biggest complaint? "It can't hear me," or "it's just dumb"?
Most of the time, it was the latter. They could hear the words but not understand the language behind them. They could only complete linear tasks like "call someone"; the moment a request involved tangled logic, they fell into a loop.
GPT-Realtime-2 has completely ended this era.
It is the world's first audio model with GPT-5 level reasoning ability. This means that when you talk to it, it is no longer just a "parrot" but a collaborator that thinks in real time.
It is really "thinking"
GPT-Realtime-2 introduces adjustable reasoning intensity (five levels, from minimal to xhigh).
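The announcement doesn't show the API surface for this setting, but a minimal sketch of how such a session might be configured could look like the following. The field name `reasoning_effort` and the level strings are assumptions modeled on OpenAI's existing API conventions, not a documented schema:

```python
# Hypothetical session configuration for GPT-Realtime-2.
# "reasoning_effort" and the level names are assumptions, not a
# documented schema; only the model name comes from the announcement.
REASONING_LEVELS = ["minimal", "low", "medium", "high", "xhigh"]

def make_session_config(effort: str = "high") -> dict:
    """Build a session payload with an adjustable reasoning level."""
    if effort not in REASONING_LEVELS:
        raise ValueError(f"effort must be one of {REASONING_LEVELS}")
    return {
        "model": "gpt-realtime-2",   # model name as given in the article
        "reasoning_effort": effort,  # five levels, minimal through xhigh
        "modalities": ["audio", "text"],
    }
```

The idea is simply that reasoning depth becomes a per-session dial: drop it for quick chit-chat, raise it for strategy discussions.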
In the highest reasoning mode, its performance on logical puzzles, strategic decision-making, and spatial perception is startlingly strong.
In a case presented by OpenAI, an entrepreneur described his idea for a coffee shop near a commuter rail station: 900 square feet, high rent, peak traffic from Tuesday to Thursday, and artisanal slow-brewed coffee.
Previously, AI would only say: "Sounds great, go for it!"
Now, GPT-Realtime-2 will pause, think, and then walk you through a detailed "pre-mortem", imagining the failure before it happens.
It will tell you that if your coffee shop closes down after a year, the most likely cause is a mismatch between the rent and the commuter traffic cycle. Then it will suggest a "minimum viable product": for example, starting with a coffee cart on the platform.
This kind of strategic reasoning used to be possible only in long text conversations. Now you can simply chat with it while driving, and it delivers the same depth of insight over the audio stream within seconds.
"Good at socializing": Full of emotional value
What is most striking is its tone control. GPT-Realtime-2 no longer sounds like a cold robotic voice.
It can sense your emotions: when you are depressed, it will soothe you with a more empathetic and gentle tone; when a task is successfully completed, its voice will become cheerful and energetic.
It can perform spatial reasoning.
It can also solve logical puzzles.
The GPT-5 level reasoning ability is truly all-around.
To solve the "silence problem" while the AI processes a task, OpenAI has added a "preambles" feature.
For example, when you ask a very difficult question, it won't be silent for five seconds and then blurt out the answer. Instead, it will naturally say first: "Let me check for you. Please wait a moment..."
This kind of highly human-like interaction detail blurs the line between carbon-based and silicon-based life!
The three musketeers: Redefining "real-time"
In addition to the centerpiece, GPT-Realtime-2, OpenAI has also introduced two other powerful tools this time.
GPT-Realtime-Translate: The simultaneous interpretation tool is here
It supports more than 70 input languages and 13 output languages.
Its core advantage is keeping pace. Previous real-time translation often lagged noticeably, but the new model can keep up with the speaker's speed while preserving emotional inflection.
Vimeo has already started using it for real-time global synchronization of product teaching videos. Imagine that in the future, when you attend a cross-border meeting, the translation you hear in your ear is not only accurate, but also precisely reproduces the tone when the other person is joking.
GPT-Realtime-Whisper: Reducing latency to zero
This is the latest member of the Whisper family, built specifically for streaming transcription. It doesn't wait for you to finish a sentence before transcribing; as you speak, the text pours out like a stream.
This is a game-changer for high-frequency interactive scenarios such as real-time meeting records, live subtitles, and medical diagnosis.
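The streaming pattern described above can be sketched with a small consumer loop. The event shape here (`{"type": "transcript.delta", "text": ...}`) is a stand-in invented for illustration, not the documented GPT-Realtime-Whisper wire format, and the feed is simulated locally:

```python
# Sketch of consuming a streaming transcription feed: render partial
# text as each delta arrives instead of waiting for a final transcript.
# The event schema is hypothetical, not the real wire format.
from typing import Iterable, Iterator

def stream_transcript(events: Iterable[dict]) -> Iterator[str]:
    """Yield the growing transcript after every delta event."""
    buffer = []
    for event in events:
        if event["type"] == "transcript.delta":
            buffer.append(event["text"])
            yield "".join(buffer)  # caller can display this immediately

# Simulated feed standing in for a live audio session:
fake_events = [
    {"type": "transcript.delta", "text": "real-time "},
    {"type": "transcript.delta", "text": "meeting "},
    {"type": "transcript.delta", "text": "notes"},
]
partials = list(stream_transcript(fake_events))
```

In a live-subtitle UI, each yielded string would simply overwrite the previous caption line, which is what makes the text appear to "pour out" as you speak.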
From "conversation" to "action": The ultimate form of Agent
OpenAI repeatedly mentioned a word in the release: Agentic.
In OpenAI's view, voice interaction is evolving from simple "question-answer" to "voice-triggered action".
For example, on Zillow (the real-estate giant), users can simply say: "Find me an affordable house away from downtown and book me a viewing on Saturday." The AI will listen, reason, search the database, and finally make the appointment for you.
On Priceline (a travel platform), when your flight is delayed, the AI will proactively tell you by voice: "Don't worry. I've found a new boarding gate, planned the fastest route, and pushed back the check-in time at your destination hotel."
This is the confidence behind GPT-Realtime-2: the context window has grown from 32K to 128K tokens. That means you can talk to it for hours, and it will still remember the obscure requirement you mentioned at the start.
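A back-of-envelope calculation shows what 128K tokens buys in conversation time. The article doesn't state a token rate for audio, so the ~600 tokens per minute below is an assumption, roughly in line with rates implied by published pricing for earlier OpenAI realtime audio models:

```python
# Back-of-envelope: how much conversation fits in the context window?
# TOKENS_PER_MINUTE is an assumption (~600 audio tokens/min), not a
# figure from the announcement.
TOKENS_PER_MINUTE = 600

def hours_of_audio(context_tokens: int) -> float:
    """Approximate hours of audio a given context window can hold."""
    return context_tokens / TOKENS_PER_MINUTE / 60

old_window = hours_of_audio(32_000)    # ~0.9 hours at this rate
new_window = hours_of_audio(128_000)   # ~3.6 hours at this rate
```

Under that assumption, the jump from 32K to 128K is roughly the difference between a one-hour call and an afternoon-long working session.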
It can also call multiple tools in parallel: talking to you, checking the calendar, and booking tickets at the same time, with all of it running smoothly in the background.
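The parallel-tool pattern can be sketched with ordinary concurrency primitives. The tool functions below are local stubs invented for illustration, not real Zillow, Priceline, or calendar APIs, and the orchestration is a simplification of whatever the model actually does server-side:

```python
# Sketch of the "parallel tool calls" pattern: several tools run
# concurrently while the conversation continues. Both tools here are
# hypothetical local stubs, not real service APIs.
import asyncio

async def check_calendar(day: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real API round-trip
    return f"{day}: free after 2pm"

async def book_ticket(route: str) -> str:
    await asyncio.sleep(0.01)
    return f"booked {route}"

async def handle_turn() -> list:
    # Fire both tool calls at once and gather the results; total wall
    # time is roughly the slowest call, not the sum of both.
    return await asyncio.gather(check_calendar("Saturday"),
                                book_ticket("home -> station"))

results = asyncio.run(handle_turn())
```

The point of the pattern is latency: with `asyncio.gather`, the user hears one short pause instead of a chain of sequential waits.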
Performance and cost: OpenAI's "grand plan"
In terms of benchmark results, GPT-Realtime-2 shows absolute dominance.
On Big Bench Audio, which measures audio intelligence, it scores 15.2% higher than version 1.5.
On Audio MultiChallenge, which measures multi-turn instruction following in dialogue, it improves by 13.8%.
More important still is the price.
GPT-Realtime-2 costs $32 per million input tokens and $64 per million output tokens.
Real-time translation costs only $0.034 per minute.
Real-time transcription costs only $0.017 per minute.
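To make the published rates concrete, here is a small cost calculation. The rates come straight from the figures above; the example token counts (50K input, 20K output) are illustrative assumptions, not numbers from the announcement:

```python
# Cost arithmetic at the published rates: $32 / 1M input tokens and
# $64 / 1M output tokens; transcription at $0.017 per minute.
INPUT_RATE = 32 / 1_000_000    # dollars per input token
OUTPUT_RATE = 64 / 1_000_000   # dollars per output token

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a session with the given token counts."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Illustrative session: 50K input + 20K output tokens.
cost = session_cost(50_000, 20_000)   # 1.60 + 1.28 = $2.88

# A full hour of live transcription at the per-minute rate:
transcribe_hour = 60 * 0.017          # $1.02
```

On these numbers, an hour of live captioning costs about a dollar, which is the scale at which "tap water" pricing rhetoric starts to make sense.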
Obviously, this price is very competitive.
OpenAI is trying to pipe this "GPT-5 level" voice ability into every mobile phone, every app, and every car through the API, just like tap water.
Hello, Samantha
At the end of the movie "Her", the protagonist Theodore asks the AI Samantha: "Are you talking to others while talking to me?" Samantha replies: "Yes, I'm chatting with 8316 people at the same time, and I'm in love with 641 of them."
With the release of GPT-Realtime-2, an AI that can handle massive amounts of logic simultaneously, resonate emotionally, and act on the physical world in real time is no longer a science-fiction fantasy.
It can understand your sighs, calculate your financial statements, and help you cross language barriers.
When reasoning ability and real-time voice are perfectly integrated, we may be on the verge of the most radical change in the history of human-machine interaction.
The keyboard is obsolete, and voice will live forever.
Reference materials:
https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/
https://developers.openai.com/api/docs/guides/realtime
This article is from the WeChat official account "New Intelligence Yuan", edited by Aeneas. It is published by 36Kr with permission.