
66,000 Applicants in the Queue: Xiaomi Extends the Trial of Its Flagship "UltraSpeed" Model and Says Fortune 500 Companies Are Vying to Use It

ZDXX (智东西), 2026-06-24 15:39
Demand for Xiaomi's UltraSpeed model far exceeded expectations.

ZDXX reported on June 24 that the Xiaomi MiMo Open Platform announced the previous day, June 23, that it is extending the chat and API trial period for its MiMo-V2.5-Pro-UltraSpeed model. The model launched on June 9, and the trial window was originally scheduled to close on June 23. Because the number of applications far exceeded expectations, the team decided to keep the trial open longer.

▲ Notice on the extension of the limited-time experience of MiMo-V2.5-Pro-UltraSpeed (Source: Xiaomi MiMo)

Official data shows that, as of June 23, MiMo-V2.5-Pro-UltraSpeed had received more than 66,000 access applications. The applicants include Fortune Global 500 companies, industry-leading enterprises, and individual developers, spanning fields such as law, finance, communications, logistics, automobile manufacturing, culture and media, and universities.

The Xiaomi MiMo team stated in the announcement that the number of applications was "far beyond expectations" and emphasized that "the ultimate inference speed will bring new usage scenarios and paradigms to the industry."

During the extension, new users can still apply for the closed beta, and users who have already been approved can keep using the model. The shutdown date will be announced separately, depending on resource availability.

To recap the original release: MiMo-V2.5-Pro-UltraSpeed is an ultra-fast inference mode jointly launched by the Xiaomi MiMo team and the AI inference system team TileRT. It marks the first time a flagship model with one trillion (1T) parameters has reached an output speed above 1,000 tokens/s, with a peak of about 1,200 tokens/s.

The model is built on a Mixture-of-Experts (MoE) architecture with 1T total parameters. About 42 billion parameters are activated in a single forward pass, and the model supports a context window of 1 million tokens.

▲ Lei Jun posted an announcement about the new progress of MiMo-V2.5-Pro-UltraSpeed (Source: Sina Weibo)

Xiaomi said that UltraSpeed does not rely on dedicated hardware such as Cerebras wafer-scale chips or Groq's custom SRAM-based chips. Instead, it runs on a standard node with 8 general-purpose GPUs, where joint optimization of the model and the serving system lets the 1T model reach an output speed above 1,000 tokens/s.

On the model side, Xiaomi uses FP4 mixed-precision quantization: the MoE experts are quantized to FP4, while other modules keep higher precision. This shrinks the model and reduces memory-access pressure. MiMo also introduces DFlash speculative decoding, which replaces the token-by-token autoregression of a traditional draft model with block-level masked parallel prediction, so the large model can verify more candidate tokens in a single step.
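To make the "verify more candidate tokens at once" idea concrete, here is a minimal sketch of the verification step in generic greedy speculative decoding. It is not DFlash itself: the article does not describe DFlash's internals, so the function names, the toy target model, and the hand-written draft blocks below are all assumptions chosen for illustration. A real system would score every draft position in one batched forward pass of the large model; the loop below calls the target once per position only to keep the logic readable.

```python
from typing import Callable, List

Token = int


def verify_block(
    prefix: List[Token],
    draft: List[Token],
    target_next: Callable[[List[Token]], Token],
) -> List[Token]:
    """Greedy verification of one drafted block.

    Accepts the longest prefix of `draft` that matches what the target
    model would have produced, then appends one token from the target:
    either its correction at the first mismatch, or a bonus token if the
    whole block was accepted.
    """
    accepted: List[Token] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Every drafted token matched; the target contributes one more.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: counts upward mod 10."""
    return (context[-1] + 1) % 10


prefix = [1, 2, 3]
partly_wrong_draft = [4, 5, 9, 7]   # third drafted token is wrong
fully_right_draft = [4, 5, 6, 7]

step_a = verify_block(prefix, partly_wrong_draft, toy_target)
step_b = verify_block(prefix, fully_right_draft, toy_target)
print(step_a)
print(step_b)
```

In this toy setup, one verification step yields three tokens when the draft is partly wrong and five when it is fully right, which is why a drafter that proposes whole blocks accurately can raise output speed without changing what the large model would have generated.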

On the system side, TileRT built a custom compilation engine and compute kernels for the FP4 quantization and DFlash pipeline, and cuts operator launch and synchronization overhead through techniques such as a resident kernel engine and heterogeneous pipeline coordination. Xiaomi's MiMo-V2.5-Pro-FP4-DFlash model card on Hugging Face states that this is the model underlying UltraSpeed. It consists of an FP4-quantized backbone and a BF16 DFlash drafter, and is released under the MIT license.
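A back-of-the-envelope estimate helps show why low-precision weights matter on a node with 8 GPUs. The sketch below uses only the parameter counts quoted in this article; the bytes-per-parameter figures are the nominal storage sizes of each format. It deliberately ignores quantization scale factors, the modules that Xiaomi keeps at higher precision, the drafter, the KV cache, and activations, so the FP4 figure should be read as a rough lower bound rather than the model's actual footprint.

```python
TOTAL_PARAMS = 1e12          # 1T total parameters (from the article)
ACTIVE_PARAMS = 42e9         # about 42B activated per forward pass (from the article)
GPUS_PER_NODE = 8            # standard node described in the article

# Nominal storage per parameter for each numeric format.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active fraction per forward pass: {active_fraction:.1%}")

for fmt, nbytes in BYTES_PER_PARAM.items():
    total = weight_gb(TOTAL_PARAMS, nbytes)
    per_gpu = total / GPUS_PER_NODE
    print(f"{fmt}: {total:,.0f} GB total, {per_gpu:,.1f} GB per GPU")
```

Under these simplifying assumptions, storing every weight in BF16 would need about 2,000 GB, while FP4 would need about 500 GB, or roughly 62.5 GB per GPU when spread evenly across the node.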

On pricing, the UltraSpeed API is offered at a limited-time trial price that is three times that of the standard MiMo-V2.5-Pro, in exchange for roughly 10 times the output speed. According to the official price list, the standard MiMo-V2.5-Pro costs 0.025 yuan per million tokens for cached input, 3 yuan per million tokens for uncached input, and 6 yuan per million tokens for output. UltraSpeed output is therefore priced at about 18 yuan per million tokens (about 2.65 US dollars per million tokens).
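The pricing arithmetic above can be checked with a short script. It uses only the figures quoted in the article; the exchange rate is not stated in the source, so the script derives the rate implied by the article's own yuan and dollar figures rather than assuming one. The 500,000-token workload is a hypothetical example.

```python
STANDARD_OUTPUT_CNY_PER_M = 6.0      # yuan per million output tokens (article)
ULTRASPEED_PRICE_MULTIPLIER = 3      # "three times" the standard price (article)
ULTRASPEED_OUTPUT_USD_PER_M = 2.65   # dollar figure quoted in the article

ultraspeed_output_cny_per_m = STANDARD_OUTPUT_CNY_PER_M * ULTRASPEED_PRICE_MULTIPLIER

# Exchange rate implied by the article's two figures (not an official rate).
implied_cny_per_usd = ultraspeed_output_cny_per_m / ULTRASPEED_OUTPUT_USD_PER_M


def output_cost_cny(num_tokens: int, price_per_million: float) -> float:
    """Cost in yuan of generating `num_tokens` output tokens."""
    return num_tokens / 1_000_000 * price_per_million


HYPOTHETICAL_TOKENS = 500_000  # illustrative workload, not from the article
standard_cost = output_cost_cny(HYPOTHETICAL_TOKENS, STANDARD_OUTPUT_CNY_PER_M)
ultraspeed_cost = output_cost_cny(HYPOTHETICAL_TOKENS, ultraspeed_output_cny_per_m)

print(f"UltraSpeed output price: {ultraspeed_output_cny_per_m} yuan per million tokens")
print(f"Implied exchange rate: {implied_cny_per_usd:.2f} yuan per US dollar")
print(f"500k output tokens: {standard_cost} yuan standard, {ultraspeed_cost} yuan UltraSpeed")
```

The implied rate of about 6.79 yuan per dollar is consistent with the conversions the article uses elsewhere.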

For reference, the public API price of Claude Opus, Anthropic's latest flagship model, is 5 US dollars per million input tokens (about 34 yuan) and 25 US dollars per million output tokens (about 170 yuan).

A speed of 1,000 tokens/s also stands out against the rest of the industry. According to data from the AI benchmarking platform Artificial Analysis, GPT-5.5 outputs about 62 to 68 tokens/s, Claude Opus about 71 tokens/s, and Gemini Flash about 192 to 200 tokens/s.
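To translate these throughput figures into waiting time, the sketch below estimates how long each model would take to stream a response of a given length. The speeds are the ones quoted in this article; the 2,000-token response length is a hypothetical example, UltraSpeed is entered at its 1,000 tokens/s headline figure, and time to first token, network latency, and queueing are ignored.

```python
# (low, high) output speed in tokens per second, as quoted in the article.
SPEEDS_TOKENS_PER_S = {
    "MiMo-V2.5-Pro-UltraSpeed": (1000, 1000),
    "GPT-5.5": (62, 68),
    "Claude Opus": (71, 71),
    "Gemini Flash": (192, 200),
}


def generation_seconds(num_tokens: int, tokens_per_s: float) -> float:
    """Pure streaming time for `num_tokens` at a steady output speed."""
    return num_tokens / tokens_per_s


RESPONSE_TOKENS = 2000  # hypothetical response length

for name, (low, high) in SPEEDS_TOKENS_PER_S.items():
    slowest = generation_seconds(RESPONSE_TOKENS, low)
    fastest = generation_seconds(RESPONSE_TOKENS, high)
    if low == high:
        print(f"{name}: about {fastest:.1f} s")
    else:
        print(f"{name}: about {fastest:.1f} to {slowest:.1f} s")
```

For a response of this length, the gap is the difference between roughly two seconds and roughly half a minute, which is the kind of change the MiMo team has in mind when it talks about new usage scenarios.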

UltraSpeed had already drawn a strong response from the overseas developer community. It became a popular thread on the technology forum Hacker News. On the social platform X, some developers said outright that "running a one-trillion MoE model at 1000 tokens/s on an 8-card general-purpose GPU node is crazy," while others questioned how comparable a "one trillion parameter" figure is for an MoE architecture, where only a fraction of the parameters is active for each token.

Application entrance: https://platform.xiaomimimo.com/ultraspeed

Chat experience entrance: https://ultraspeed.xiaomimimo.com

Hugging Face address: https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

This article is from the WeChat official account “ZDXX” (ID: zhidxcom). The author is Chen Jia, and it is published by 36Kr with authorization.