Completely outshining ChatGPT, Google's latest move is ruthless: its AI can even perfectly reproduce your "sarcasm."
Google has released the Gemini 2.5 Flash Native Audio model. It can not only perform real-time voice translation while preserving intonation, but also lets AI follow complex instructions and hold continuous conversations as naturally and smoothly as a real person. This update marks AI's transition from simple "text-to-speech" to an era of truly humanlike interaction.
Imagine this scenario:
You're walking down a bustling street in Mumbai, India, wearing headphones, surrounded by noisy street hawkers and Hindi you can't understand at all.
At this moment, a local man rushes up and asks you for directions in Hindi. He speaks fast and sounds anxious.
In the past, you might have fumbled for your phone, opened a translation app, pressed the button, awkwardly held the phone up to his mouth, and then listened to an emotionless, machine-translated electronic voice.
[Illustration: designed by Nano Banana Pro]
But now, everything has changed.
You stand still, and a fluent Chinese voice comes directly into your headphones: "Hey! Friend, excuse me, is the railway station this way?"
The most amazing part: this Chinese translation is not only accurate in meaning but also perfectly replicates the man's anxious, breathless intonation!
You reply in Chinese, and your headphones automatically convert your voice into Hindi for the other person, even preserving your enthusiastic tone.
This isn't just a science-fiction re-enactment of the "Tower of Babel." It's the heavy-hitting "nuclear bomb" Google dropped this week: Gemini 2.5 Flash Native Audio, a native audio model.
Today, let's take a closer look at how powerful this update is.
What makes so-called "native audio" so powerful?
Many people might ask, "Don't all phones already have text-to-speech? What's so special about this?"
There's a huge misunderstanding here.
In the past, AI voice interaction worked like this: hear speech -> transcribe it to text -> the AI processes the text -> it generates a text reply -> the reply is converted back to speech and read aloud.
This pipeline is not only slow; in all that back-and-forth conversion, the subtlest elements of human communication, such as tone, pauses, and emotion, are lost.
The core of Google's newly released Gemini 2.5 Flash Native Audio lies in the word "Native."
It doesn't need to convert voice to text and then back. It listens directly, thinks directly, and speaks directly.
It's like speaking a foreign language: in the past you frantically translated word by word in your head, but now you've developed a feel for the language and simply speak fluently.
In this update, Google not only upgraded the text-to-speech models of Gemini 2.5 Pro and Flash, bringing stronger controllability.
More importantly, it has made real-time voice agents (Live Voice Agents) a reality.
What does this mean?
It means that in Google AI Studio, Vertex AI, and even Search (Search Live), you're no longer talking to a cold machine; you're having a real-time brainstorming session with an intelligent agent that has both a "brain" and "ears."
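If you want a feel for what this looks like in code, here is a minimal sketch of a Live API session using the google-genai Python SDK. Treat it as an assumption-laden outline: the preview model id and the exact streaming method names follow the current preview docs and may differ in your installed version.

```python
# pip install google-genai
# Minimal Live API sketch: send raw audio in, get native audio back.
# Assumptions: the preview model id and method names follow the current
# google-genai SDK docs and may change between releases.
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Ask for audio directly as the response modality; no text round-trip.
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main() -> None:
    async with client.aio.live.connect(
        model="gemini-2.5-flash-preview-native-audio-dialog",  # assumed id
        config=config,
    ) as session:
        # One chunk of 16 kHz mono PCM, e.g. captured from a microphone.
        pcm_chunk = b"\x00\x00" * 16000  # placeholder: one second of silence
        await session.send_realtime_input(
            audio=types.Blob(data=pcm_chunk, mime_type="audio/pcm;rate=16000")
        )
        # The reply arrives as raw audio bytes, ready to play back.
        reply = bytearray()
        async for message in session.receive():
            if message.data:
                reply.extend(message.data)
        print(f"received {len(reply)} bytes of spoken audio")

asyncio.run(main())
```

The point to notice is the config: the model's response modality is audio itself, not text that a separate TTS engine reads aloud afterward.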
The "simultaneous interpretation" in your headphones breaks the Tower of Babel of languages
In this update, the feature most exciting for ordinary users is definitely real-time speech translation (Live Speech Translation).
Google isn't making empty promises this time: the feature is already in beta testing on Android devices in the United States, Mexico, and India through the Google Translate app (iOS users, please be patient; it's coming soon).
It has two killer features that address real pain points:
Continuous listening and two-way dialogue: Truly "seamless" translation
In the past, the most annoying thing about using translation software was having to constantly click the "speak" button.
Now, Gemini supports continuous listening.
You can put your phone in your pocket, put on your headphones, and Gemini will automatically translate whatever languages it hears around you into your native language in real time.
It's like having an invisible translator with you all the time.
In two-way dialogue mode, it's even smarter.
Say you speak English and want to chat with someone who speaks Hindi.
Gemini can automatically identify who is speaking.
You'll hear English in your headphones, and when you finish speaking, your phone automatically plays the Hindi translation for the other person.
You don't need to set "I'm speaking now" or "He's speaking now." The system switches automatically.
Style transfer: Even "emotions" can be translated
This is the feature that gives me goosebumps: Style Transfer.
Traditional translation is like an emotionless reading machine.
But Gemini, with its native audio capabilities, can capture the subtle differences in human language.
If the speaker has a rising intonation and a lively rhythm, the translated voice will also be cheerful;
If the speaker has a low and hesitant tone, the translated voice will also show hesitation.
It preserves the speaker's intonation, rhythm, and pitch.
This is not just about understanding the meaning; it's about understanding the attitude.
This feature is extremely important in business negotiations or arguments!
In addition, it also supports:
- Over 70 languages and more than 2000 language pairs: Covers the native languages of the vast majority of people around the world.
- Mixed-language input: Even if a conversation mixes several languages, it understands them all without you having to switch manually.
- Noise robustness: Optimized for noisy environments, it filters out background noise. You can hear clearly even in a noisy outdoor market.
Developers are overjoyed: this AI can finally "understand human language"
If you're a developer, or you want to build a customer-service AI for an enterprise, the three underlying capability improvements in Gemini 2.5 Flash Native Audio could not have come at a better time.
More accurate function calls
In the past, when a voice assistant had to reach out to external data, say, to check the weather or a flight status, it often stalled or gave stiff, canned responses.
Now, Gemini 2.5 knows when to fetch real-time information and can weave the retrieved data seamlessly into its spoken responses without breaking the flow of the conversation.
In the ComplexFuncBench Audio evaluation, which specifically tests complex multi - step function calls, Gemini 2.5 scored an impressive 71.5%, far ahead of the competition.
[Chart: ComplexFuncBench Audio scores for the updated Gemini 2.5 Flash Native Audio versus previous versions and industry competitors]
This means that it can truly act as a reliable "clerk" rather than a naive "chatbot."
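For a sense of the mechanics, here is a sketch of how a tool might be wired into a live voice session with the google-genai Python SDK. The flight-status function and its schema are made up for illustration; only the overall declare-then-respond pattern follows the SDK's documented shape.

```python
from google.genai import types

# Hypothetical tool the voice agent can call mid-conversation.
flight_tool = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="get_flight_status",  # hypothetical function name
        description="Return the live status of a flight by flight number.",
        parameters=types.Schema(
            type=types.Type.OBJECT,
            properties={"flight_number": types.Schema(type=types.Type.STRING)},
            required=["flight_number"],
        ),
    )
])

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[flight_tool],
)

# Inside the session's receive loop: when the model decides it needs live
# data, it emits a tool call; you run the lookup and send the result back,
# and the model folds it into its spoken reply without breaking the flow.
#
#   async for message in session.receive():
#       if message.tool_call:
#           for fc in message.tool_call.function_calls:
#               result = {"status": "delayed 40 minutes"}  # your lookup here
#               await session.send_tool_response(function_responses=[
#                   types.FunctionResponse(id=fc.id, name=fc.name,
#                                          response=result)
#               ])
```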
Better compliance with instructions
Do you often feel that AI doesn't understand complex instructions?
Google has made great efforts this time.
The compliance rate of the new model with developer instructions has increased from 84% to 90%!
This means that if you ask the AI to "answer in this specific format, use a strict tone, and don't be wordy," it can execute your request more accurately.
For building enterprise-grade services, this kind of reliability is the core competitive advantage.
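In practice, those developer instructions are typically passed as a system instruction when the session is configured. A minimal sketch with the google-genai Python SDK follows; the instruction text itself is just an invented example.

```python
from google.genai import types

# Pin down format and tone up front. The 90% figure above is about how
# reliably the model sticks to constraints like these.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=types.Content(parts=[types.Part(text=(
        "You are a bank's support agent. Answer in two sentences or fewer, "
        "keep a strictly professional tone, and never speculate."
    ))]),
)
```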
Smoother conversation
Multi-turn conversation has always been a hard problem for AI.
During a conversation, AI often forgets what was said before.
Gemini 2.5 has made significant progress in context retrieval.
It remembers earlier parts of the conversation more effectively, keeping the whole exchange coherent and logical.
Combined with the low latency of native audio, you'll feel like there's a real person sitting across from you.
How far are we from "Jarvis"?
Google's update is actually sending a clear signal:
Voice interaction is becoming the gateway to the next era.
From Gemini Live to Search Live, and then to real - time translation in headphones, Google is liberating AI from the screen and putting it into our ears.
For ordinary users: Technological advancements are breaking down language barriers.
Next year (2026), this capability will be extended to more products through the Gemini API.
In the future, maybe we won't need to spend years painfully memorizing words. A pair of headphones will allow us to travel the world.
For enterprises: The threshold for building next-generation AI customer service that can listen, speak, handle tasks, and express emotion is being dramatically lowered.
Easter egg
In addition to the native audio model, Google has also launched a bombshell experimental product: Disco.
It's a new discovery tool from Google Labs for testing future web ideas.
It has a built-in feature called GenTabs, a powerful tool based on Google's most capable model, Gemini 3.
Google has stated that it's still in the early stages, and not all functions work perfectly.
The most amazing thing is that it can understand your needs.
GenTabs helps you browse the web by actively understanding complex tasks (through the tabs you've opened and chat history) and creating interactive web applications to assist with the tasks.
Without writing a single line of code, it can turn your messy tabs and chat history into a personalized interactive app.
Want to plan your weekly meals? Want to teach your kids about planets?
Just tell it in plain language, and it will automatically generate a tool for you. All data is verifiable; it won't make things up.
The macOS version is now open for