
Hands-on with Xiaomi's fastest 1T-parameter large model: throughput exceeds 1000 tokens per second, and vibe coding delivers results in 7 seconds

QbitAI, 2026-06-11 12:44
And it runs on general-purpose GPUs

The global large-model arms race is opening up a new battlefield beyond "intelligence":

Inference speed.

And it is Xiaomi that has just taken this battlefield to a new level.

Xiaomi has released the brand-new MiMo-V2.5-Pro-UltraSpeed, a high-speed version of MiMo-V2.5-Pro.

It has 1T total parameters, supports a 1M context, and pushes single-API inference speed past 1000 TPS (tokens per second), setting a new world record for inference speed among flagship models.

Moreover, unlike Groq, which relies on custom chips, it achieves this on general-purpose GPUs.

This also means that Xiaomi's new model has broken the industry's "impossible triangle", the assumption that high speed, high performance, and general-purpose GPUs cannot coexist. What Xiaomi has demonstrated is end-to-end inference optimization from the model layer down to the engine layer, and its underlying inference engineering clearly sits in the world's first tier.

QbitAI has obtained test access to MiMo-V2.5-Pro-UltraSpeed. Let's see whether it is really that fast.

Hands-on test of Xiaomi's "fastest flagship model"

Let's first see whether MiMo-V2.5-Pro-UltraSpeed can generate a complete web application.

I connected it to Claude Code and asked it to write a web-based Pomodoro timer application.

To be honest, given the capabilities of current models, this task is relatively simple, so what we mainly want to see here is speed.

Implement a Pomodoro work timer that runs directly in the browser using HTML, CSS, and JavaScript. Requirements: three switchable modes (25-minute focus, 5-minute short break, and 15-minute long break); a large-font countdown display; start/pause/reset buttons; after a Pomodoro is completed, automatically switch to break mode and play a prompt sound (generated with the Web Audio API); show the number of Pomodoros completed today and a history list on the right; support customizing the duration of each stage; and follow the Linear design style for the color scheme.

The speed genuinely surprised me.

In the first 5 seconds after submitting the task, I saw it thinking slowly and thought it was going to fail.

It turned out that it was just winding up. Before I could react, all the code for the Pomodoro web page came out in a flash.

More than 500 lines of HTML, in just 7 seconds in total, thinking included.

This GIF plays at original speed. Don't blink!

In contrast, Claude takes at least 40 seconds even with its lightest model, Haiku, set to Low Effort.

When the same task is run on the web version, the thinking process is longer, so the overall time is much longer than with Claude Code connected to MiMo-V2.5-Pro-UltraSpeed.

However, the web version of MiMo-V2.5-Pro-UltraSpeed has a built-in speed display, which shows that the average speed in the output stage exceeds 1000 TPS.

Looking at peak values, throughput is estimated to reach over 600 TPS during the thinking stage and to soar past 3300 TPS in the output stage that follows.
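For readers who want to reproduce this kind of per-stage figure, the arithmetic is simply tokens divided by elapsed seconds, computed separately for each stage. The sketch below is a minimal illustration; the function names and the token counts and timings in the example are hypothetical and are not measurements from this test.

```python
def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Average throughput over one stage of a response."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def stage_throughput(stages: dict[str, tuple[int, float]]) -> dict[str, float]:
    """Map each stage name to its average TPS and add an 'overall' entry.

    Each value in `stages` is (token_count, elapsed_seconds).
    """
    result = {name: tokens_per_second(t, s) for name, (t, s) in stages.items()}
    total_tokens = sum(t for t, _ in stages.values())
    total_seconds = sum(s for _, s in stages.values())
    result["overall"] = tokens_per_second(total_tokens, total_seconds)
    return result


# Hypothetical numbers, chosen only to illustrate the arithmetic.
example = stage_throughput({
    "thinking": (2_400, 4.0),
    "output": (6_000, 2.0),
})
```

Note that the overall figure is total tokens over total time, not the mean of the per-stage rates, which is why a slow thinking stage pulls the end-to-end number well below the output-stage peak.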

Of course, simple as the task is, the functionality still needs to be verified.

Once the page is running, the default durations match the requirements and can be customized, and the required prompt sound plays correctly when the timer ends.

Moreover, when a focus or break timer finishes, the app automatically switches to the other mode, and the breaks follow a rhythm of three short breaks followed by one long break.
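The switching rhythm described here can be sketched as a small state function. This is an illustration in Python rather than the JavaScript the model actually generated, and it assumes that every fourth completed focus session earns the long break; the names below are hypothetical.

```python
FOCUS, SHORT_BREAK, LONG_BREAK = "focus", "short_break", "long_break"

# Default durations in minutes, as given in the task prompt.
DEFAULT_MINUTES = {FOCUS: 25, SHORT_BREAK: 5, LONG_BREAK: 15}

# Assumption: three short breaks, then a long break after the fourth focus session.
FOCUS_SESSIONS_PER_CYCLE = 4


def next_mode(current: str, completed_focus: int) -> str:
    """Return the mode to switch to when the current timer ends.

    `completed_focus` counts finished focus sessions, including the one
    that just ended when `current` is FOCUS.
    """
    if current != FOCUS:
        return FOCUS  # any break is followed by another focus session
    if completed_focus % FOCUS_SESSIONS_PER_CYCLE == 0:
        return LONG_BREAK
    return SHORT_BREAK


def breaks_after(sessions: int) -> list[str]:
    """List the break chosen after each of the first `sessions` focus sessions."""
    return [next_mode(FOCUS, done) for done in range(1, sessions + 1)]
```

Calling `breaks_after(4)` yields three short breaks followed by one long break, which is the pattern observed in the test.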

It's certainly a good thing that the model runs fast, but if the speed is achieved by "dumbing down", it's putting the cart before the horse.

So after the simple speed test, it's time to raise the difficulty and see whether any "dumbing down" is hiding behind the speed of MiMo-V2.5-Pro-UltraSpeed.

At the same time, to test whether MiMo-V2.5-Pro-UltraSpeed adapts well to different harnesses, I switched the environment to Hermes.

Build a real-time chat room for a local area network. Requirements: the backend uses Node.js + Express + WebSocket; multiple users can be online simultaneously. Users enter a nickname on entry, and it is bound to the device, so a device only needs to enter a nickname the first time, but the nickname must be editable; the chat interface follows the Slack style and supports switching between multiple channels; messages support plain text and code blocks (code blocks are highlighted automatically); an online user list is displayed, with system notices when users log in and out; quoted replies to messages are supported; message records are persisted with SQLite, and history can be loaded when entering a channel; output the complete code of all files, then start the service and deploy it on port 11451.

When it finished writing, MiMo-V2.5-Pro-UltraSpeed reported the project files, feature list, and startup instructions directly to me.

Let's go straight to how it runs.

First, the basics all work as expected: real-time chat, login/logout notices, and input prompts.

Special formats such as code and bold text also render correctly.

Quoted replies work as well.

After the page is refreshed, the previously set device nickname is retained as required, a logout notice appears on the other client, and the online list updates in sync.

In short, MiMo-V2.5-Pro-UltraSpeed completed the entire development process, including the front end, back end, and database, in one go.
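To make the database part of the task concrete, here is a minimal sketch of persisting messages and loading a channel's history. The project in this test used Node.js; the sketch swaps in Python's standard `sqlite3` module purely for illustration, and the table, column, and function names are hypothetical rather than taken from the generated code.

```python
import sqlite3

# Hypothetical schema: reply_to points at the message being quoted, if any.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    channel   TEXT    NOT NULL,
    nickname  TEXT    NOT NULL,
    body      TEXT    NOT NULL,
    reply_to  INTEGER REFERENCES messages(id),
    sent_at   TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the message store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_message(conn: sqlite3.Connection, channel: str, nickname: str,
                 body: str, reply_to=None) -> int:
    """Insert one message and return its id."""
    cur = conn.execute(
        "INSERT INTO messages (channel, nickname, body, reply_to) "
        "VALUES (?, ?, ?, ?)",
        (channel, nickname, body, reply_to),
    )
    conn.commit()
    return cur.lastrowid


def load_history(conn: sqlite3.Connection, channel: str, limit: int = 50):
    """Return up to `limit` most recent messages in a channel, oldest first."""
    rows = conn.execute(
        "SELECT id, nickname, body, reply_to FROM messages "
        "WHERE channel = ? ORDER BY id DESC LIMIT ?",
        (channel, limit),
    ).fetchall()
    return rows[::-1]
```

A server built this way would call `load_history` when a client joins a channel and `save_message` on every incoming message, so history survives a restart when a file path is used instead of the in-memory default.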

This example shows that, even at higher speed, MiMo-V2.5-Pro-UltraSpeed can still complete full-stack development tasks with high quality; in other words, its intelligence remains on par.

However, what role can such speed play in actual production?

I asked MiMo-V2.5-Pro-UltraSpeed to act as a senior script editor and lead three analysts in an emergency joint review of a movie outline ahead of an investment committee meeting.

You are a senior script editor with three capable reviewers under your command. You need to conduct an emergency joint review of the following theatrical movie script outline before tomorrow morning's project review meeting. Please complete the task according to the following division of labor: Your story structure analyst goes first and examines whether the three-act structure is complete, whether the pacing balance between the main plot and subplots is reasonable, and whether the climax is properly set up, then issues a structural review opinion. At the same time, your character analyst works in parallel, examining whether the protagonist's motivation is credible, whether the character arc is complete, and whether the supporting characters have clear functions, then issues a character review opinion. Your market analyst simultaneously examines, from a commercial perspective, whether the target audience for this subject is clear, how similar movies have performed in the market, and whether the project has enough differentiated selling points, then issues a market feasibility opinion. After receiving all three opinions, you, as the script editor, make a comprehensive judgment: Can this outline move forward to project approval? List the problems that must be fixed and directly output a complete revised outline.

The summary of the story is as follows:

Theatrical movie script outline: "Migratory Birds Don't Fly South", a realistic emotional drama targeting urban women aged 25-40.

One-sentence summary: A Hunan woman who has been working in Beijing for twelve years is forced to return home after her mother suddenly falls ill. Caught between caring for her mother and leaving, she comes to a new understanding of her relationship with her family.

Main characters: Xie Wanqing, 38 years old, a director at a public relations company in Beijing, divorced, living alone, and long estranged from her mother; Xie's mother, 64 years old, a retired teacher in a Hunan county, strong-willed, traditional, and used to applying pressure through silence; Chen Mo, 40 years old, a former colleague of Xie Wanqing, who returned to his hometown earlier for family reasons to start a business and now runs a homestay.

Story summary:

Act 1: Xie Wanqing receives a call from her father saying that her mother has had a sudden cerebral infarction and is hospitalized. She takes leave to return home. She originally plans to leave once things are handled, but finds that her mother's recovery requires long-term care and that her father can no longer manage alone. She is caught in a dilemma between her career and her family.

Act 2: Xie Wanqing stays in the county. While caring for her mother, she has several intense conflicts with her. Her mother's strong will and need for control push her to the verge of collapse. At the same time, she reconnects with Chen Mo, whose life choices make her start to re-examine the path she has taken over the past twelve years.

Act 3: Her mother recovers and is discharged from the hospital. Xie Wanqing faces the final choice of whether to return to Beijing. She ultimately chooses to return, but reaches a kind of reconciliation with her mother: not forgiveness, but acceptance of each other as different people.

Core themes: Escape and belonging, self-realization and family responsibility, the Chinese-style mother-daughter relationship. Expected running time: 105 minutes.

I built a three-agent workflow using Hermes and asked MiMo-V2.5-Pro-UltraSpeed to start three sub-agents simultaneously to review the movie outline in parallel.

Among them, the story structure analyst examines the three-act rhythm, the character analyst examines motivation and character arc, and the market analyst examines commercial feasibility.

After summarizing the three opinions, the main Agent makes a comprehensive judgment and directly outputs a revised outline.
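The shape of this workflow is a fan-out followed by a fan-in: three reviews run concurrently, and the lead agent acts only once all of them have returned. The sketch below illustrates that pattern with Python's `asyncio`; it is not the Hermes API, and `run_reviewer` is a hypothetical stand-in for whatever call dispatches a sub-agent to the model.

```python
import asyncio

ROLES = ["structure", "character", "market"]


async def run_reviewer(role: str, outline: str) -> dict[str, str]:
    """Hypothetical stand-in for one sub-agent; a real one would call the model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"role": role, "opinion": f"[{role}] reviewed {len(outline)} characters"}


def summarize(opinions: list[dict[str, str]]) -> str:
    """Placeholder for the lead editor's comprehensive judgment."""
    return "; ".join(o["opinion"] for o in opinions)


async def joint_review(outline: str) -> dict:
    # Fan out: the three reviews run concurrently.
    opinions = await asyncio.gather(*(run_reviewer(r, outline) for r in ROLES))
    # Fan in: gather returns results in the order the tasks were passed in.
    opinions = list(opinions)
    return {"opinions": opinions, "verdict": summarize(opinions)}


report = asyncio.run(joint_review("Migratory Birds Don't Fly South"))
```

Because the sub-agents do not wait on each other, the wall-clock time of the review stage is bounded by the slowest reviewer rather than the sum of all three, which is where a fast model pays off most in agent workflows.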

In less than two minutes, all three sub-agents had completed their tasks, and the final report was delivered to me in full.

The three sub-agents each found problems that the others did not notice.

The structure analyst pointed out that the midpoint and the second turning point of the second act were completely missing from the original outline, which was a major flaw.

The character analyst found that the protagonist, Xie Wanqing, was always being pushed forward and lacked an active desire that ran through the whole movie. And the character of Chen Mo could be removed without affecting the story, indicating that he didn't have an irreplaceable position in the narrative.

The market analyst compared the project with comparable films, gave a box-office range of 30 million to 1.2 billion, and pointed out that the gap comes down to emotional intensity and the ability to ignite discussion on social media.

After all three opinions were received, the revised outline given by the main Agent filled in all the structural gaps.

The second act, which was a single sentence in the original version, was split into three levels of escalating conflict; the midpoint and the second turning point were added; the father changed from a simple messenger into the most important "translator" in the whole movie; and Chen Mo's "peaceful life" was overturned as well, a twist that directly shatters the protagonist's romanticized idea of "another kind of life".