
Google ushers in a new era of voice agents, where speaking is productivity. Has the most powerful add-on for Siri arrived?

Zhidongxi · 2026-03-27 11:29
Development can start with nothing but your voice, and the update is already available on mobile phones.

Setting the benchmark for voice agents! Google's most powerful audio model is here, and you can build an app just by speaking.

According to a Zhidongxi report on March 27, early yesterday morning Google officially launched its highest-quality audio and voice model: the real-time voice model Gemini 3.1 Flash Live. It is simultaneously available in the Gemini App, Search Live, and Google AI Studio, the last of which offers developers a preview version.

The core of this version is an upgrade to real-time voice Agent capabilities: voice can now directly drive application development (vibe coding), and the Gemini App's real-time multimodal dialogue capabilities have been enhanced in step. In multiple evaluations it outperforms models such as GPT-Realtime-1.5, Qwen3 Omni 30B A3B Instruct, and GPT-4o Audio preview.

As soon as the model was released, overseas netizens called it Siri's "savior". Just yesterday, foreign media reported that Apple's WWDC 2026 will focus on AI and introduce a new version of Siri: Apple has secured direct access to Google's full Gemini model and will deploy self-developed lightweight on-device AI on iPhones via distillation.

This model is designed for real-time voice interaction and has been comprehensively optimized for continuous dialogue, including key capabilities such as response latency, context memory, multilingual processing, and tool invocation.
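For developers, "real-time" here concretely means a bidirectional streaming session. Below is a minimal sketch using the google-genai Python SDK's Live API; the model ID is our own assumption (the actual preview name should be checked in Google AI Studio), and the example sends a single text turn rather than streaming microphone audio, to keep it self-contained.

```python
# pip install google-genai
# Minimal Live API session sketch. The model ID below is an assumption,
# not confirmed by the article; check Google AI Studio for the preview name.
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
MODEL_ID = "gemini-3.1-flash-live-preview"  # hypothetical ID

async def main():
    # Open a bidirectional streaming session and ask for spoken replies.
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # Send one text turn; a real voice app would stream microphone PCM instead.
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Say hello in Spanish.")]),
            turn_complete=True,
        )
        # Collect the streamed audio reply (24 kHz PCM per current Live API docs).
        audio = bytearray()
        async for message in session.receive():
            if message.data is not None:
                audio.extend(message.data)
        print(f"received {len(audio)} bytes of audio")

asyncio.run(main())
```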

In Gemini Live, the context window has doubled. Search Live supports multilingual real-time interaction in more than 200 countries and regions, with overall capabilities oriented toward continuous dialogue and complex-task scenarios.

Judging from public test results, this version shows significant gains in key voice Agent capabilities. In the ComplexFuncBench audio test, Gemini 3.1 Flash Live reaches 90.8% function-call accuracy, a clear improvement over the 71.5% of the December 2025 version of Gemini 2.5 Flash Native Audio and the 66.0% of the September 2025 version.
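Scoring well on ComplexFuncBench means reliably mapping spoken requests onto structured function calls. For a sense of the developer-side setup, here is a hedged sketch of declaring a tool for a Live session with the google-genai SDK; the get_weather function and its schema are hypothetical illustrations, not something from the article or the benchmark.

```python
# Declaring a hypothetical tool for a Live session. The function name and
# schema are illustrative only.
from google.genai import types

get_weather = types.FunctionDeclaration(
    name="get_weather",
    description="Look up the current weather for a city.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"city": types.Schema(type=types.Type.STRING)},
        required=["city"],
    ),
)

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[types.Tool(function_declarations=[get_weather])],
)
# During a session, a spoken request like "What's the weather in Tokyo?"
# should surface as a tool-call message; the client executes the function
# and returns the result via session.send_tool_response(...).
```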

On the Audio MultiChallenge audio-output leaderboard released by Scale, this model scored 36.1%, higher than models such as GPT-Realtime-1.5 (34.7%), Qwen3 Omni 30B A3B Instruct (24.3%), and GPT-4o Audio preview (23.2%).

Meanwhile, this version focuses on the real-time dialogue experience: the model handles intonation, speaking speed, and pauses more precisely in speech recognition; in noisy environments its background-noise filtering is stronger, letting it recognize user instructions and execute tasks more reliably; and in complex-instruction scenarios its adherence to system constraints has also improved.

Some users who have received the update are already trying new uses. Some issue voice commands to have the model generate short sung passages, a capability that can be triggered mid-dialogue.

API pricing has also been announced: text input costs about $0.5 per million tokens and text output about $4.5; audio input is about $3 and audio output about $12. Multimodal input invocation is supported.
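Taking those per-million-token rates at face value, estimating a session's cost is simple arithmetic; the token counts in this sketch are made-up examples, not measured figures.

```python
# Rough cost estimate at the quoted per-million-token rates (USD).
RATES = {
    "text_in": 0.50, "text_out": 4.50,
    "audio_in": 3.00, "audio_out": 12.00,
}

def session_cost(text_in=0, text_out=0, audio_in=0, audio_out=0):
    """Return the estimated USD cost given token counts per modality."""
    usage = {"text_in": text_in, "text_out": text_out,
             "audio_in": audio_in, "audio_out": audio_out}
    return sum(RATES[k] * v / 1_000_000 for k, v in usage.items())

# e.g. a hypothetical session with 50k audio tokens in, 20k audio tokens out:
print(f"${session_cost(audio_in=50_000, audio_out=20_000):.3f}")  # $0.390
```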

Initial community feedback arrived as soon as the model was released. Some netizens called it a "powerful update" and singled out faster voice response as a "key breakthrough in user experience". If latency and continuity in multi-turn dialogue stay stable over longer use, adoption of voice interaction may rise significantly.

Some users remain cautious, however. One developer said bluntly that he had previously given up on voice models because their response quality was markedly worse than text, and questioned whether that has really changed.

Zhidongxi has also given the feature a preliminary try. Its Chinese voice still sounds a bit mechanical, and multi-turn dialogue suffered interruptions, so we were unable to fully test its continuous-interaction capability. The version is being rolled out in batches, and iOS and Android users have begun receiving the update.

01. Just speak to modify code: Redo the UI, interaction, and style all in one go

In this release, Google first demonstrated voice-driven application development (vibe coding): developers can build applications while speaking in Google AI Studio, keeping the development process in step with the pace of brainstorming.

The Live Vibe Coder page, where users can get hands-on practice

Users can continuously adjust the interface by voice. At the start of the dialogue, the user directly requests a change: "Make the microphone a bit bigger", and the interface updates immediately; the user then adds "Add some yellow polka dots to the background", and the page background changes right away.

The user then layers on more requirements, such as "feedback effects when the mouse hovers" and a continuously scrolling background pattern. These changes are completed step by step within the same dialogue.

As the user speaks, the interface changes accordingly. Midway, the user abruptly changes direction: "Just make the whole thing pop-art style." The model redoes the visual style on top of the existing work, and the whole process feels like real-time one-on-one communication with a designer.

02. Three scenarios land at once: design collaboration, cross-language dialogue, and in-game interaction

Beyond application development, Google also presented three practical usage scenarios: interface-design collaboration, cross-language companionship, and character interaction in games.

In the case of the design tool Stitch, voice can take part directly in interface editing. The user first has the interface jump to "practice mode", then switches to the "song library", then starts pointing out specific problems: "These dotted lines and square borders look a bit rigid. Can we make the numbers fit the circles better?" The interface is adjusted in a cleaner direction. The user then changes tack: "Try a color scheme that is more brown and woody", and a new visual version is generated directly.

The interaction case with Ato, an AI hardware device for elderly users, emphasizes the continuity of multilingual dialogue, with conversations built around daily greetings and companionship. The user first chats in English, then inserts a condition: "I want to talk to my grandma, but she only speaks Spanish." The model switches languages within the same dialogue and keeps the conversation going; the language change does not interrupt the content.

Once real-world information enters the dialogue, such as a mention of "I'm a bit tired after coming out of the hospital", the model responds in context and carries the conversation forward.

In the case of the RPG "Wit's End", voice drives the character itself. When the player asks questions such as "Do you have a physical form?" or "Where does your power come from?", the model answers in a tone matching the character's setting. The dialogue always stays within the character's context: answers never break the setting, they develop within the same worldview, and tone and expression remain consistent.

03. Conclusion: Google is building a "full-stack voice Agent", while domestic players grow users and capabilities in parallel

Judging from this release, Google is building voice capabilities into a more complete, general-purpose system: vibe coding in programming scenarios, AI hardware interaction, and the mobile Gemini App entry point are all being pushed forward at once, covering a wide range of usage scenarios.

In product form, the Gemini App already closely resembles domestic products like Doubao: both make dialogue the core entry point, carrying search, tool invocation, and multi-turn interaction. The actual experiences differ, though. Doubao is more proactive in Chinese expression, tonal style, and sense of interaction; its playful phrasing is more likely to build user stickiness, and it has already accumulated a user base in China.

By contrast, Google's current focus is still on expanding capability. Especially in voice-driven development, the continuous-modification ability and real-time interaction rhythm demonstrated by vibe coding are already ahead of existing product forms.

Meanwhile, domestic voice models are also advancing quickly. Step-Audio R1.1 from Jieyue Xingchen ranked first on the Artificial Analysis voice inference leaderboard with 96.4% accuracy, surpassing models such as Grok, Gemini, and GPT-Realtime, making it one of the representative achievements in voice inference.

On one hand, Google keeps raising the capability ceiling and extending coverage to more scenarios; on the other, domestic players are pushing user scale and model capability together. Competition in the field of voice Agents is intensifying.

This article is from the WeChat official account "Zhidongxi" (ID: zhidxcom). Author: Jiang Yu. Editor: Bing Qian. Republished by 36Kr with permission.