Google ends "wait until you finish speaking" translation, with real-time speech translation for more than 70 languages
Before you finish a sentence, the translated voice is already in your ear, matching the speaker's pace and intonation and trailing by only a few seconds.
Just now, Google launched Gemini 3.5 Live Translate.
It is the company's latest voice-to-voice translation model. In a nutshell, it breaks the old rule of "translate only after you finish speaking".
Jeff Dean, chief scientist of Google DeepMind, posted the announcement himself, and between the lines you can sense the confidence of someone who has spent twenty years sharpening a single sword:
Voice translation is one of Google's longest-running machine learning projects, and this time, it has finally entered the earphone.
Breaking the "walkie-talkie" style of translation
Everyone is familiar with the old translation machines.
You say a sentence, the device waits until you have finished, and only then does it slowly translate for the other person.
With all that back and forth the rhythm falls apart, and the two people might as well be using walkie-talkies.
Worse, real conversations never follow a tidy one-speaker-at-a-time pattern: people interrupt, hesitate, and rephrase halfway through.
Gemini 3.5 Live Translate doesn't work like that. It translates while listening. Before the speaker finishes speaking, the translated voice is already there.
Behind this is a delicate balancing act. Waiting a little longer lets the model hear more context and translate more accurately; speaking immediately keeps up with the speaker but risks guessing the second half of the sentence wrong.
The model weighs these two pressures word by word. The result is coherent output without awkward pauses that stays only a few seconds behind the speaker throughout.
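To make the trade-off concrete, here is a toy sketch of a wait-k schedule, a simple fixed policy from the simultaneous-translation research literature. It is not Google's method, which the announcement does not describe; the function name and the token counts are illustrative assumptions.

```python
def wait_k_schedule(num_source: int, num_target: int, k: int) -> list[int]:
    """Return, for each target token, how many source tokens were read before it is emitted.

    Under a wait-k policy the translator first reads k source tokens, then
    alternates between emitting one target token and reading one more source
    token until the source runs out.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + i, num_source) for i in range(num_target)]


if __name__ == "__main__":
    # A hypothetical 6-token sentence translated into 6 target tokens.
    for k in (1, 3, 6):
        print(k, wait_k_schedule(6, 6, k))
```

A small k starts speaking almost at once but commits to each word with little context, while k equal to the sentence length reproduces the old wait-until-the-end behavior. A model that decides word by word, as the article describes, would in effect vary this lag on the fly instead of fixing it in advance.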
What's even more amazing is the voice itself.
It preserves your speaking speed, pitch, and intonation, so the translation does not come out in a cold machine voice but in one that carries your speaking style. If you are in a hurry, the translated voice hurries too; if you speak slowly, it takes its time as well.
The model card released by DeepMind reveals some details: the model is based on Gemini 3 Pro and can handle an audio context of up to 128K tokens. The evaluation focuses on three indicators: translation quality, latency, and voice naturalness.
In other words, Google is not measuring it on "translating correctly" alone, but on "keeping a conversation flowing".
It covers more than 70 languages and identifies the spoken language automatically, so it keeps up even if you switch languages midway, with no manual settings. Noisy environments are not a problem either: it can be used in markets, in airports, and on the roadside.
Developers, enterprises, and ordinary people are all covered
This time Google is going all out, rolling the model out on three fronts at once.
- Developers can start using it today through the Gemini Live API and the public beta of Google AI Studio;
- Enterprises can participate in the private beta on Google Meet starting this month;
- Ordinary people can use it globally in Google Translate on Android and iOS: just tap "Live Translation" in the lower left corner of the app and connect any pair of earphones.
Office workers will notice the biggest change in Google Meet. Previously, its voice translation supported only 5 languages and could translate only between English and another language.
Now it supports more than 70 languages, and a single meeting can handle more than 2,000 language combinations. English, Mandarin, and Swedish can fly around the table, and everyone understands each other instantly.
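As a rough check on the 2,000 figure, the snippet below counts language pairs for exactly 70 languages. How Google actually counts a "combination" is not stated in the article, so both readings shown here are assumptions.

```python
import math

languages = 70  # the article says "more than 70", so these are lower bounds

# English<->Swedish counted once
unordered_pairs = math.comb(languages, 2)
# English->Swedish and Swedish->English counted separately
ordered_pairs = math.perm(languages, 2)

print(unordered_pairs)  # 2415
print(ordered_pairs)    # 4830
```

Either way of counting clears 2,000, and the unordered count of 2,415 is the reading closest to the quoted number.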
Android also has a hidden feature called "Listening Mode". Hold the phone to your ear as if you were on a call, and the translated voice comes straight from the earpiece, so nobody else hears it.
If you are on a Spanish-language tour and have no earphones on hand, just take out your phone and hold it to your ear.
Ten million calls per month
Specs alone are too abstract, so consider a real scenario.
Google had Grab in Southeast Asia test it. The driver speaks the local language, and the passenger hears it in their native language. Common pick-up phrases such as "Where are you?" and "I'll be there soon" are no longer a communication problem.
Grab users make more than 10 million voice calls every month. This is not a press-conference demo; it is a real workload that has to fit into millions of everyday conversations.
Besides Grab, companies such as CJ ENM and LiveKit also tried it early, and their feedback points the same way: quality, accuracy, and low latency.
Developers also save a lot of effort.
Platforms such as Agora, Fishjam, and LiveKit have already integrated the Gemini Live API and take care of the hardest part, the real-time media streaming infrastructure. Capture, transmission, and echo cancellation are handled for you, so developers only need to focus on the user experience.
Video dubbing, multilingual live broadcasts, cross-language customer service, and online classrooms are all natural applications.
Twenty years in the making, now in the earphone
Looking back, Google has been working toward this for a long time.
Twenty years ago, Google Translate was a small pioneering experiment that aimed to turn the science of language into a tool for connecting people.
Now it translates more than one trillion words every month for billions of users.
From translating text to text, to translating a menu from a photo, to turning what you say into speech in another language in real time, the journey has taken a full twenty years.
Of course, it is too early to get carried away.
Google itself lists some limitations. For now the model accepts audio input only, and voice reproduction can become unstable with strong accents, rapid language switching, several people talking at once, or long pauses.
This is not the finish line, but it is a very promising start.
The direction is clear. Simultaneous interpretation used to be a job only top interpreters could handle, costing thousands of dollars per hour and requiring a week of preparation.
Now it is becoming a feature that runs quietly in your earphones, ready whenever you need it.
When language is no longer a barrier, all that remains is whether people want to talk to each other.
References:
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-live-3-5-translate/
https://deepmind.google/models/model-cards/gemini-3-5-audio/
https://ai.google.dev/gemini-api/docs/live-api/live-translate
https://x.com/JeffDean/status/2064400689825288351
This article is from the WeChat official account "New Intelligence Yuan". Edited by Solomon. Republished by 36Kr with permission.