More and More Office Workers Are Muttering at Their Computers

For the past two decades, voice input has been a marginal add-on feature of input methods. Now large AI models are turning it into a fashionable way of working.

A keyboard has recently become popular on Taobao. It has just 4 buttons, 1 joystick, and 1 microphone port. It has no letter keys and cannot be used for typing. The price starts at 269 yuan, and the version bundled with a DJI microphone costs more than 400 yuan. The product is called the AhaKey-X1. It was developed by Nanjing Jinxinwan Technology Co., Ltd. (hereinafter AhaKey) and launched around this year's Spring Festival.

Its purpose is simple: to make it easy for users to talk to AI.

Users press the voice button and dictate work instructions into the microphone. After AI converts the speech into text, the text is sent to AI tools such as Claude, ChatGPT, DeepSeek, and Cursor for execution. Whether the task is writing code, revising a plan, or organizing meeting minutes, users do not need to type. They just speak, and the AI automatically organizes their colloquial speech into structured text.

Zhang Xinyang, co-founder and CTO of AhaKey, told an Economic Observer reporter that monthly sales have doubled since the product launched. During the "6·18" shopping festival, the company held nearly 1,000 units in inventory. It is currently in talks with several industrial investors and investment institutions about financing.

A keyboard without letter keys can sell well because more and more people are starting to use voice instead of typing to give work instructions to AI. This way of working first became popular among programmers. They use voice to describe their requirements to AI, and the AI generates code. But now, product managers, lawyers, and content creators are also starting to do this.

Zhang Xinyang told the Economic Observer reporter that one user left a deep impression on him: a lawyer in his forties. "He wasn't very proficient in using a Windows computer." But after buying an AhaKey, the lawyer could get his work done by talking to AI, without typing. Zhang Xinyang said this made him and his team realize that demand for voice-based office work in the AI era might be far greater than they had expected.

Voice input is not new. As early as 1997, IBM launched ViaVoice, a commercial Chinese speech recognition system that claimed a recognition rate of up to 95% and came pre-installed on mainstream PCs of the time. In the nearly three decades since, companies such as iFlytek, Sogou, and Baidu have kept investing in voice input, and products have spread from the PC to the mobile phone. Yet voice has never become a mainstream method of input and interaction.

Zhang Xinyang believes the change came once large AI models matured. "In the past, voice input solved the problem of converting speech into text, but it didn't solve the problem of understanding language." He said that earlier voice input methods simply recorded what you said word for word. A single wrong word had to be corrected by hand, and the output read like raw speech, which people found hard to read. Large AI models have changed the receiving end. Even if you speak haltingly and make slips of the tongue, the AI can still grasp your meaning and output fluent text.

In other words, when the recipient of voice input changes from a human to an AI, the requirement for recognition accuracy drops sharply, and voice-based office work becomes truly practical.

According to an incomplete tally by an Economic Observer reporter, as of the end of the first quarter of 2026, voice AI startups worldwide had raised more than 7 billion US dollars in total.

The overseas voice dictation app Wispr is currently raising a new round at a target valuation approaching 2 billion US dollars. Half a year ago, the figure was only 700 million US dollars. On May 12, Google integrated the AI dictation feature Rambler into Gboard, its default keyboard, making it available for free on hundreds of millions of Android phones. In China, Alibaba's Qianwen launched an AI voice input feature for the PC on May 7. On May 28, iFlytek (002230.SZ) released AI glasses whose built-in intelligent agent can automatically organize colloquial speech into structured text.

For the past two decades, voice input was never more than a not-very-useful auxiliary feature of input methods. Now large AI models are turning it into a fashionable way of working.

"AI Can't Feel Pain"

Even though the recognition accuracy of voice input tools is already very high, and features such as simultaneous interpretation and multi-language translation have launched one after another, voice input has never become a mainstream way to interact. Most people still type when they communicate online, work, or go about daily life. Recognition accuracy is clearly not the problem.

Lin Huijie, the general manager of the Wearable Device Business Department of iFlytek, mentioned in an interview with a reporter from Economic Observer that there is an obvious problem with traditional voice input. After voice transcription is completed, "you can't directly send it because others can tell at a glance that it's typed out by voice, and it doesn't look good. Although it's convenient for you, it causes discomfort to others."

People typically speak Chinese about three times as fast as they type it, a clear speed advantage. But speed only improves efficiency for the sender. Colloquial text full of interjections, repetitions, and jumpy logic is a burden for the reader. Receiving a 60-second voice message on WeChat is a headache for the same reason: the speaker is comfortable, but the listener suffers.

This is a problem common to traditional voice input methods. Even at 99% recognition accuracy, the output is still raw speech: it lacks punctuation and paragraphs, and it is often littered with "um", "ah", and half-meaningless sentences, all of which make it hard to read.

But AI can't feel this pain. Colloquial text that is unbearable for humans poses no obstacle to an AI's understanding. However messy and fragmented the speech, the AI can extract the intention. The "convenient for oneself but painful for others" problem of voice input disappears as soon as the recipient is an AI.

As a result, voice-based office work is spreading rapidly in two types of scenarios. In the first, users give spoken instructions to Claude, DeepSeek, or ChatGPT, and the AI understands the intention directly and carries out the task. Nowhere in the process does fluent text have to be produced for a person to read. Voice input has not encountered this situation in decades: when the recipient changes from a human to an AI, the need for polished, well-formed language drops sharply.

In Zhang Xinyang's words, "understanding the intention is more important than being accurate word for word."

Programmers were the first group to adopt this mode at scale. Andrej Karpathy, a co-founder of OpenAI, publicly proposed the concept of "vibe coding" in February 2025: developers describe their requirements in natural language, the AI generates code, and the developers then review and modify it. Karpathy mentioned at the time that he used the voice dictation tool SuperWhisper to dictate programming instructions to the AI. By December 2025, Karpathy had completely stopped typing code and relied 100% on voice input.

From late February to early March 2026, OpenAI's programming agent Codex and Anthropic's programming agent Claude Code launched native voice modes within less than a week of each other. Developers hold down the space bar and speak, and the AI receives the programming instructions.

The AhaKey-X1 is designed for this workflow. Zhang Xinyang said that AI programming tools such as Claude Code frequently ask users to approve operations. Pushing the joystick up switches to automatic approval, and pushing it down switches to confirming operations one by one. "It's like an automatic transmission. All operations that need approval will be automatically approved." Three of the 4 buttons are for speaking, confirming, and rejecting, and the 4th is left for users to customize.
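The control scheme described above can be sketched as a small state machine. This is only an illustration: the class name, method names, return strings, and the choice of one-by-one confirmation as the starting mode are all assumptions made for the example, not details of AhaKey's firmware or software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ApprovalMode(Enum):
    AUTO = "auto"      # joystick pushed up: approve every request
    MANUAL = "manual"  # joystick pushed down: confirm one by one


@dataclass
class Controller:
    # Hypothetical model of the device; the default mode is an assumption.
    mode: ApprovalMode = ApprovalMode.MANUAL

    def push_joystick(self, direction: str) -> None:
        """Switch approval mode according to the joystick direction."""
        if direction == "up":
            self.mode = ApprovalMode.AUTO
        elif direction == "down":
            self.mode = ApprovalMode.MANUAL
        else:
            raise ValueError(f"unsupported direction: {direction!r}")

    def resolve(self, button: Optional[str] = None) -> str:
        """Decide the outcome of one pending approval request."""
        if self.mode is ApprovalMode.AUTO:
            return "approved"
        if button == "confirm":
            return "approved"
        if button == "reject":
            return "rejected"
        # In manual mode, nothing happens until a button is pressed.
        return "pending"
```

In manual mode each request waits for the confirm or reject button, while in automatic mode every request is approved without a button press, which is the "automatic transmission" behavior Zhang Xinyang describes.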

According to Zhang Xinyang, the team first ran into a problem when using AI for their own office work: sitting upright at a computer to type sometimes constrains ideas. "Many ideas come out in a flash, maybe when you're lying on the sofa in your study." If communicating with AI now means speaking, why sit in front of a computer at all?

So they first published an open-source project on GitHub. People who saw it came to buy components and kits, and later some asked for fully assembled units. "It's the users who are pushing us forward," Zhang Xinyang said. On Xiaohongshu, many users have already spent 69 yuan on a small three-button keyboard and a microphone to build a similar device themselves.

In the second scenario, text still has to be produced for people to read, but the AI adds a layer of semantic processing after transcription: it deletes interjections, corrects grammar, straightens out the logic, adjusts sentence structure, and outputs fluent text that can be used as is. The delay this adds is usually only one or two seconds.

"Even if there are mistakes in what you said earlier and you correct them later, the AI can help you sort it all out and turn it into usable copy," Lin Huijie told the reporter. This also means that where voice input once needed extremely high recognition accuracy to be barely usable, a large model can now rely on its understanding to produce better results than word-for-word transcription, even when accuracy is only average.

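To make the idea of a cleanup layer after transcription concrete, here is a deliberately simplified sketch. It uses plain regular expressions as a stand-in for the large-model step, so it only removes filler words and immediate repetitions. It cannot fix logic or handle self-corrections the way the products described here are said to. The filler list and the function name are assumptions chosen for the example.

```python
import re

# Hypothetical filler list for the example; real speech has many more.
FILLERS = ("um", "uh", "ah", "er")


def clean_transcript(raw: str) -> str:
    """Rule-based stand-in for the semantic cleanup step."""
    # Drop standalone filler words and any comma or space right after them.
    filler_pattern = r"\b(?:" + "|".join(FILLERS) + r")\b[,\s]*"
    text = re.sub(filler_pattern, "", raw, flags=re.IGNORECASE)
    # Collapse immediate word repetitions such as "to to" into "to".
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Normalize whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return text
    # Capitalize the first letter and close the sentence.
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


print(clean_transcript("um so I I want to uh move the meeting to to Friday"))
```

A rule list like this breaks down quickly on real speech, which is the article's point: the cleanup only became practical once a model that understands intent could take over this step.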
In fact, a group of startups built around AI voice dictation has grown rapidly over the past two years. The most highly valued is Wispr, based in San Francisco. Founded in 2021, the company initially made brain-computer interface wristbands (for silent voice input) and pivoted to voice dictation software in mid-2024.

Public information shows that as of early 2026, Wispr had raised approximately 81 million US dollars in total. According to data disclosed by Wispr, among users who have used the product continuously for more than 6 months, 72% of the characters they input each day are entered by voice rather than by keyboard. Since launch, the user base has grown more than 100-fold year on year, and 70% of users are still actively using the product after 12 months.

In September 2025, Reid Hoffman, the co-founder of LinkedIn, announced on social media that he had been "voicepilled" and called it "a new way to amplify capabilities."

As of May 2026, Wispr's target valuation has approached 2 billion US dollars, nearly tripling in half a year. A dictation app with a valuation of 2 billion US dollars indicates that the capital market is clearly betting on the scenario where voice replaces part of keyboard input.

iFlytek Input Method is following the same trend. At the end of 2025, it added an AI button to its keyboard interface. Users can long-press the button to give the AI voice instructions directly, without switching to another app. According to iFlytek's 2025 annual report, user penetration of the input method's large-model service rose by 900%, and input efficiency rose by 77%.

This may indicate that demand for voice-based office work is spreading from geek circles to the wider workforce.

"Whisper Quietly!"

The speed advantage of voice-based office work is obvious, but office work is not just about speed. Writing a carefully worded email, revising code with complex logic, and polishing a proposal for a client all require precise control rather than rapid expression. Whether voice can cover these scenarios is one of the key questions that will determine how far voice-based office work can go.

A reporter from Economic Observer asked Zhang Xinyang in an interview: Some people think that the prompts typed on the keyboard are more organized, and the typing process itself helps you sort out your thoughts. Can voice input replace this process? Zhang Xinyang's answer was, "The value of typing will always exist."

He drew a clear line between the two: voice belongs to expression, and the keyboard belongs to organization. "When you want to modify something, the thinking process itself is valuable to you." Voice solves the problem of quickly "pouring out" your ideas, while editing and in-depth thinking still require the keyboard.

Zhang Xinyang also pointed to a change. Two years ago, "prompt engineer" was a sought-after job title, and users had to design their input format carefully to get satisfactory results from the AI. Now the position has basically disappeared. The AI can structure, break down, and schedule scattered colloquial input by itself. "Purely from the perspective of the effect, there is no need for people to edit and type anymore."

AI is growing ever more tolerant of input formats, and the way an instruction is phrased matters less and less. On this premise, the input method with the highest speed and the lowest cognitive burden will naturally win, and speaking does not require translating your thoughts into written language first. In other words, now that AI can understand natural language at this level, office products built around voice as the core interaction method have become viable for the first time.

In fact, the idea of operating a computer by voice predates large AI models.

On May 15, 2018, Smartisan Technology held a press conference at the Bird's Nest in Beijing, where founder Luo Yonghao demonstrated the Nut TNT workstation on stage. TNT stands for Touch and Talk: a desktop computer operated by voice and touch. Users could search, edit documents, and send emails by speaking to the screen. Smartisan billed the product as revolutionary, but it was widely ridiculed after the event. The mocking line "Quiet! You're disturbing my use of TNT!" became a widely circulated internet meme at the time.

The core reason for the netizens' ridicule of TNT was the poor voice interaction experience demonstrated by Luo Yonghao on site. Although the voice recognition technology in 2018 could already achieve a relatively high accuracy rate, there was no large model to understand the intention. Every recognition error was a friction point that required users to correct manually. Users had to speak clearly and logically for the machine to give the correct response. A little ambiguity would ruin the experience.

In other words, in 2018 the recipient of voice interaction was a traditional software system, which needed precise input to operate and had little tolerance for colloquial expressions. Even if speech recognition accuracy had itself reached more than 95%, without a large model behind it, every one of the remaining 5% of errors became a breaking point in the user experience.
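A little arithmetic shows why a seemingly high accuracy figure still breaks a system that needs exact input. The sketch below treats the accuracy figure as a per-word rate and assumes errors are independent of one another. Both are simplifying assumptions, as are the function names and the 20-word command length, and the article does not say whether the 95% figure is measured per word or per character.

```python
def error_free_probability(accuracy: float, units: int) -> float:
    """Chance that every unit in a command is recognized correctly,
    assuming independent errors (a simplification)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if units < 0:
        raise ValueError("units must be non-negative")
    return accuracy ** units


def expected_errors(accuracy: float, units: int) -> float:
    """Average number of misrecognized units in a command."""
    return (1.0 - accuracy) * units


# A hypothetical 20-word spoken command at 95% per-word accuracy.
print(round(error_free_probability(0.95, 20), 3))
print(round(expected_errors(0.95, 20), 2))
```

Under these assumptions, only about 36% of 20-word commands come through with no error at all, and the average command contains one misrecognized word. A system that needs exact input stumbles on most commands, while a recipient that infers intent can often absorb the mistake.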

Under the technical conditions at that time, a desktop computer with voice as the main operation method could not deliver on its promise and could not provide the expected experience. If TNT were equipped with a large model that can understand natural language and launched today, it would face a different situation.

Large models have solved the problem of "not understanding", but the problem of speaking being inconvenient remains. In Zhang Xinyang's view, the first obstacle to promoting voice-based office work in practice is noise. "In an open-plan office, if seven or eight people are muttering to their computers at the same time, even if each person is lowering their voice, the combined noise can be a headache."

Edward Kim, the co-founder of the US human resources software company Gusto, also said in a recent media interview that he has been promoting voice-based office tools within the company, and he "almost always talks to the computer now." But doing this continuously in the office "is really a bit embarrassing."

Zhang Xinyang said that the AhaKey paired with a DJI microphone supports low-voice recognition, maintaining 99% accuracy at a volume of 20 decibels, about as loud as a whisper in a bedroom at night. Colleagues sitting nearby can hardly hear what you are saying.

There are other technical solutions to this problem as well. On May 28, Kong Changqing, director of the voice translation line at iFlytek Research Institute, told an Economic Observer reporter that iFlytek's newly released AI glasses use a multimodal noise reduction approach that combines lip-movement recognition with a microphone array. In high-noise settings such as exhibitions, subways, and restaurants, recognition accuracy can improve by 30% to 40%.

Lip-movement recognition and low-voice recognition are two different technical paths, but they address the same market demand: being able to work by voice in a noisy, crowded environment. "Especially for some extremely noisy scenarios that were completely unusable before, (lip-movement recognition) has basically reached the threshold for use," Kong Changqing said.

The second problem facing voice-based office work is privacy. Dictated content travels as sound, so people nearby can overhear email content, code logic, and business ideas. There are also security concerns about voice data being processed in the cloud.

In November 2025, some users found on
