
Speaking is three times faster than typing, and the way we use AI is being rewritten.

World Model Workshop · 2026-05-14 20:05
Voice is becoming the new gateway to productivity.

In more and more offices, the dominant sound is no longer the clicking of keyboards but the murmur of people talking to their screens.

Some are dictating product requirements to AI, some are assigning today's task lists to AI via voice, and some are telling AI to extract the key points from a 40-minute meeting.

This is not a sci-fi scene from 2030.

In the shared spaces of YC in Silicon Valley and on the startup floors in Caohejing, Shanghai, a group of early AI users are working in this way.

Actually, voice interaction is not something new.

Siri has been around for nearly 15 years, and smart speakers were popular for a while. The previous two attempts failed to truly change the way people work.

But this time, it's really different.

The office scenario is changing

Let's start with a real-life example.

In an episode of the "Crossroads" podcast, Zhang Haoran, the co-founder of Moxt, described how he prepared for a podcast interview.

"I pressed the voice button and said to the AI, 'I'm going to meet Koji from Crossroads. Go online and find out about this person first. He wants to know about Moxt, and I'm going to do a podcast with him. How do you think I should introduce it? What topics would he be more interested in? Draft a document for me first.'"

This is a complex, multi-layered instruction, given entirely by voice.

The AI searches, understands, and organizes on its own, and outputs a structured first draft.

This way of working is spreading rapidly among the startup circle and tech bloggers.

Their feedback is almost unanimous: the results exceed expectations.

One feeling comes up again and again: after switching to voice, more ideas surface.

Zhang Haoran also talked about how his team holds meetings now.

"In the past, the collaboration model was to send a document, write comments, and then talk it over with you." But now, one-on-one meetings are held like this:

The AI drafts a document first. Two people start talking, and the AI records the whole process. After the conversation, the document is automatically updated.

This is not simply a matter of the AI transcribing the recording.

He mentioned a detail: now when talking to the AI, there's no need to explain "what Moxt is" or "who the other person is".

The AI has enough background information and can look it up on its own without being fed the context.

This is what this way of working has truly changed.

The AI has changed from a passively responding tool into a continuous, always-available participant.

An even more extreme change is happening among programmers.

In early 2025, Andrej Karpathy proposed a concept called "vibe coding": developers command AI programming tools such as Claude Code or Cursor by voice, writing code simply by speaking.

Specifically, while looking at the code on the screen, a programmer can say, "Rewrite that error-reporting function and add exception handling." The AI makes the changes, and the developer hardly needs to touch the keyboard.

Even coding, the job that relies most on the keyboard and requires the most precise input, is being infiltrated by voice.

This shows that voice as a productivity entry point is not only applicable to a specific scenario but is becoming widespread.

Is voice really viable this time?

TechCrunch recently ran a comparative review of such tools. AI voice-input products like Wispr Flow are spreading rapidly.

Wispr has been downloaded more than 2.5 million times globally in the 10 months since its launch.

The signal is clear: voice input is changing from a strange habit to a product category that can be taken seriously.

The underlying logic is simple.

For most people, the mind works faster than the hands: speaking is roughly three times the speed of typing.

In the past, a thought had to be typed out character by character, and thought through before typing began. Type too slowly and the train of thought broke; type too fast and typos crept in.

But voice doesn't have this problem. One can say whatever comes to mind.

Once one gets used to outputting at the speed of thinking, going back to typing will seem very slow.

It's worth noting that this is not humanity's first attempt at voice interaction. Voice assistants have already failed twice. Why should it work this time?

Looking back at the early Siri, the technical goal was simple: to convert what people said into text.

But what came out of the conversion was a pile of raw, colloquial, meandering text full of "um", "then", and "that is to say". No one wanted to use that for work.

Later assistants, like Alexa, Google Assistant, and the upgraded Siri, raised the goal: they were supposed not only to understand the words but also to execute commands.

They could indeed handle requests like "set an alarm for me" or "what's the weather like today".

But for anything slightly more complex, like "help me organize the content of this morning's meeting and send it to the project team", they failed completely.

The two failures, seemingly due to immature technology, actually stem from the same problem:

Voice produces chaotic raw materials, and in the past, there was nothing that could handle the chaos.

But after the emergence of large models, everything has changed.

You can speak in a chaotic way, and the large AI model can still understand what you want.

The AI can handle vague instructions, jumpy logic, half-finished sentences, and the filler that pervades spoken language.

The bottleneck that sank voice products over the past two decades has suddenly disappeared.

Voice interaction is becoming a trend

In the field of voice interaction, recent advances in technology and products are touching something more fundamental.

The first change comes from the interaction model.

The interaction models recently released by Thinking Machines show a more radical direction.

Traditional voice interaction is turn-based: you speak, the AI processes, and then replies. But this is not like a real conversation.

In a real conversation, the other person will interrupt, pick up the conversation while you're speaking, and continue when you pause.

Thinking Machines' solution is real-time streaming interaction: the AI listens, thinks, and responds simultaneously, and the end-to-end delay is compressed to less than 0.4 seconds.

The natural pause between turns in human conversation is about 0.2 seconds, so 0.4 seconds comes close to the rhythm of real dialogue.

This means that the turn-based voice interaction model may become history faster than we think.

When the AI can really "butt in", the voice agent is no longer a tool that starts working only after you finish speaking, but a real in-the-moment collaborator.
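The difference between the two interaction models can be sketched in a few lines of code. This is a hypothetical illustration, not Thinking Machines' actual system: `stt_stream`, `respond`, and the "audio" chunks are all stand-in stubs. The two key ideas are that the agent starts responding to partial transcripts instead of waiting for the user to finish, and that a reply already in flight is cancelled when the user keeps talking (barge-in).

```python
# A minimal sketch of streaming (full-duplex) voice interaction, as opposed
# to a turn-based loop. Hypothetical stubs throughout: stt_stream and
# respond stand in for real speech recognition and speech generation.
import asyncio

async def stt_stream(audio_chunks):
    # A real recognizer would emit growing partial transcripts;
    # here we just pass the fake "audio" through as text.
    async for chunk in audio_chunks:
        yield chunk

async def voice_agent(audio_chunks, respond):
    """Listen, think, and respond concurrently instead of in turns."""
    reply_task = None
    async for partial in stt_stream(audio_chunks):
        # Barge-in: the user kept talking, so cancel any reply in flight.
        if reply_task and not reply_task.done():
            reply_task.cancel()
        # Start responding to the partial transcript without waiting for
        # the user to finish - this is what compresses end-to-end latency.
        reply_task = asyncio.create_task(respond(partial))
    if reply_task:
        await reply_task  # let the final reply complete

# Demo driver with fake "audio" and a fake responder.
async def fake_audio():
    for chunk in ["rewrite that function", "rewrite that function with tests"]:
        yield chunk

replies = []

async def respond(text):
    await asyncio.sleep(0)  # simulate generation latency
    replies.append(f"ok: {text}")

asyncio.run(voice_agent(fake_audio(), respond))
# The reply to the first partial transcript may be interrupted mid-flight;
# the reply to the final transcript always lands.
```

A turn-based agent would instead buffer all chunks, transcribe once, and only then call `respond`; the latency the user feels is the whole pipeline run in series.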

The voice agent is evolving from a demo to a product that can be launched.

Another signal comes from the infrastructure layer.

In the past, voice agents were just showcases.

They sounded cool at product launches, but when it came to integrating them into products, there were problems with latency, stability, and interruption handling.

Currently, platforms like OpenAI's Realtime API, AssemblyAI's Voice Agent API, and Inworld are starting to integrate voice recognition, voice synthesis, model inference, interruption handling, and tool invocation into a more unified interface.

Developers can use a single API to build production-level voice agents, and the entire technology stack can be directly launched.
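What "a single API" bundles together might look roughly like the sketch below. This is not the actual interface of OpenAI's Realtime API, AssemblyAI, or Inworld; every name here is made up purely to illustrate the pieces (model inference, voice synthesis, interruption handling, tool invocation) that such platforms now expose behind one configuration object.

```python
# Hypothetical illustration of a "unified" voice-agent interface; none of
# these names correspond to a real vendor API.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class VoiceAgentConfig:
    model: str                         # which model does the thinking
    voice: str                         # which voice does the speaking
    interrupt_on_speech: bool = True   # barge-in / interruption handling
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

class VoiceAgent:
    """One object where developers used to wire STT + LLM + TTS by hand."""

    def __init__(self, config: VoiceAgentConfig):
        self.config = config

    def handle(self, transcript: str) -> str:
        # Stand-in for the recognize -> infer -> synthesize pipeline;
        # we only show where tool invocation plugs in.
        for name, fn in self.config.tools.items():
            if transcript.startswith(name + " "):
                return fn(transcript[len(name) + 1:])
        return f"[{self.config.voice}] {transcript}"

agent = VoiceAgent(VoiceAgentConfig(
    model="some-llm",
    voice="warm",
    tools={"weather": lambda city: f"Sunny in {city}"},
))
```

With this shape, `agent.handle("weather Shanghai")` routes to the registered tool, while anything else flows through the model-plus-voice path; the point is that recognition, inference, synthesis, and tools share one configuration surface instead of three separately wired stacks.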

This means that the threshold for voice application development is getting lower and lower, and a batch of previously unimaginable product forms may emerge rapidly.

The third change comes from the battle for entry points.

Google integrated Gemini dictation into the Gboard keyboard during the 2026 Android Show I/O Edition event.

This move may seem ordinary, but it's a dangerous signal for startups like Wispr Flow.

Because once the battle for entry points reaches the operating system level, the rules change.

Looking at these things together, a clear picture emerges:

On the technology side, latency is being compressed, and real-time interaction is evolving from research to products.

On the infrastructure side, voice agents are moving from demos to launch-ready products.

On the platform side, large companies are starting to position voice input as an operating-system-level feature.

This is the whole industry moving in the same direction within the same time frame.

The resistance to voice

Technology can solve problems, but it doesn't mean everything. In reality, the often - underestimated resistance comes from human nature.

The stand-up comedian Niaoniao once told a joke. She said that even if she were bitten by a mouse, it would be hard for her to call for help immediately.

"If no one comes to rescue me, I may just die. But once someone comes to rescue me, I still have to greet them."

The whole audience burst into laughter because the feeling was so real.

This is the situation of introverts facing voice input. It's not that they don't want to speak; it's that speaking itself has a cost.

Typing offers the safety of a draft: you can delete mistakes, think things through before sending, and no one sees the process. Speaking has no such buffer.

An open-plan office makes this even more embarrassing.

When you whisper tasks to the screen, your colleagues' ears will perk up.

Being heard is the real obstacle, and noise is secondary.

So the "low-voice recognition" function launched by tools like Wispr is, in a sense, helping socially anxious people. It can recognize speech even when you mumble quietly.

This solves not a technical problem but a psychological threshold.

This is probably the most absurd and yet most real footnote on the road to the popularization of voice input:

The technology is ready, but people aren't.

In the long run, even if voice becomes a new interaction method, it won't replace typing. However, the popularization of voice will create an efficiency gap.

Those who are already using voice for work have meeting records, dictated documents, and a place to capture the fleeting thoughts in their minds. Their ideas are more likely to be captured by AI.

This is the real meaning of voice becoming a productivity entry point.

This article is from the WeChat official account "World Model Workshop". Author: World Model Workshop. Republished by 36Kr with permission.