Xiaomi MiMo is trying to ride on the coattails of DeepSeek to enter the arena.
On May 27th, Xiaomi permanently reduced the prices of the MiMo-V2.5 series of APIs.
For MiMo-V2.5-Pro, input now costs $0.025 per million tokens on a cache hit and $3 per million tokens on a cache miss, and output costs $6 per million tokens. The standard MiMo-V2.5 is cheaper still: $0.02 for cache hits, $1 for cache misses, and $2 for output, all per million tokens.
This is not an ordinary promotion.
Put the prices side by side and it becomes clear that this is no arbitrary cut: Xiaomi is positioning itself directly against DeepSeek.
MiMo-V2.5-Pro is positioned against DeepSeek V4-Pro, and MiMo-V2.5 against DeepSeek V4-Flash.
DeepSeek is no longer just the name of a model. In China's domestic large-model market, at least, it is becoming a price yardstick.
This yardstick keeps prodding major model companies: How much do you sell your model for?
When the same question is put to everyone, it creates new openings. A latecomer like Xiaomi's MiMo can move more flexibly than established models and "ride on" DeepSeek for a chance to enter the game.
1
The price segmentation of tokens is getting finer
Let's first look at how the price cut happened.
In this price list, the most important detail is that it clearly separates the prices for cache hits and cache misses.
This has become an underlying trend in today's large model price war.
Put simply, a cache hit occurs when the beginning (the prefix) of a request is identical to that of an earlier request. The platform then does not have to recompute that part from scratch and can reuse the intermediate results it saved earlier.
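The idea can be sketched in a few lines of Python. This is a toy illustration, not any provider's implementation: the function names are hypothetical, tokens are plain strings, and real serving systems typically match prefixes in fixed-size blocks rather than token by token.

```python
def shared_prefix_len(cached_tokens, request_tokens):
    """Count how many leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n


def split_hit_miss(cached_tokens, request_tokens):
    """Return (cache-hit tokens, cache-miss tokens) for a new request."""
    hit = shared_prefix_len(cached_tokens, request_tokens)
    return hit, len(request_tokens) - hit


# Two requests that share the same system prompt (hypothetical content).
system_prompt = ["You", "are", "a", "code", "assistant", "."]
first_request = system_prompt + ["Explain", "main.py"]
second_request = system_prompt + ["Fix", "the", "bug"]

hit, miss = split_hit_miss(first_request, second_request)
print(f"hit tokens: {hit}, miss tokens: {miss}")
```

Only the tokens after the shared prefix would be billed at the cache-miss rate in this simplified picture.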
When large models process long contexts, the costs are generally divided into two stages.
The first stage is called prefill, which can be understood as "reading the question". System prompts, project code, corporate documents, and conversation history all have to be read into the model first.
The second stage is called decode, which can be understood as "answering the question". The model then generates responses token by token.
In the past, when people talked about API prices, they mainly focused on input and output. But now, large models are increasingly used in Agents, Coding, knowledge bases, and long conversations, and many inputs are actually repetitive.
Code assistants always need to look at the same repository, corporate assistants always need to read the same set of institutional documents, and Agents carry the same set of tool instructions and system rules in each round.
Perhaps the only real difference is the last instruction.
At this time, the cache becomes a key variable in the cost structure.
Think of solving a problem on paper. The first time, you have to do all the scratch work. If the first half of the problem is the same the second time, you can reuse that scratch work instead of redoing it. This is why the price for cache hits can be so low.
Take MiMo-V2.5-Pro as an example: input costs $3 per million tokens on a cache miss but $0.025 on a cache hit, a 120-fold difference.
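A rough sketch of what this means for a bill, using the per-million-token prices quoted above. The workload (a 100,000-token repository context, a 500-token instruction, a 1,000-token answer) and all names in the code are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, as quoted in the article."""
    cache_hit: float
    cache_miss: float
    output: float


MIMO_V25_PRO = Pricing(cache_hit=0.025, cache_miss=3.0, output=6.0)
MIMO_V25 = Pricing(cache_hit=0.02, cache_miss=1.0, output=2.0)


def request_cost(pricing, hit_tokens, miss_tokens, output_tokens):
    """Cost of one request in USD, billing each token class at its own rate."""
    total = (
        hit_tokens * pricing.cache_hit
        + miss_tokens * pricing.cache_miss
        + output_tokens * pricing.output
    )
    return total / 1_000_000


# Hypothetical coding-assistant request.
context, instruction, answer = 100_000, 500, 1_000

# First call: nothing is cached, so the whole input is a miss.
cold = request_cost(MIMO_V25_PRO, 0, context + instruction, answer)
# Later call: the repository context hits the cache, only the instruction is new.
warm = request_cost(MIMO_V25_PRO, context, instruction, answer)

print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
print(f"miss/hit price ratio: {MIMO_V25_PRO.cache_miss / MIMO_V25_PRO.cache_hit:.0f}x")
```

Under these assumed sizes the cold call costs roughly 31 cents and the warm call roughly 1 cent, which is why the hit rate matters so much for Agent and Coding workloads that resend the same context on every round.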
The price war is intense, but large model manufacturers are no longer selling tokens as a unified commodity. New inputs, cached inputs, and output tokens have three completely different cost structures behind them. This round of price war is not about "making all tokens cheaper together" but about manufacturers starting to reprice tokens according to the real costs.
2
The price cut comes from the "server room"
"Up to a 99% price cut" is the biggest gimmick, but the real reason lies elsewhere.
In the price cut announcement, the Xiaomi team said it has implemented full support for SWA (Sliding Window Attention) on top of SGLang HiCache. This cuts the volume of KV Cache data moved among the GPU memory, CPU memory, and SSD storage tiers to nearly 1/7 of its pre-optimization level, and at the same time raises the number of cacheable tokens nearly fivefold.
This passage explains another reason for this price cut.
Every time a large model generates a token, it needs to refer back to the preceding context. If it recomputed the entire context at every step, the cost would be very high. The KV Cache stores the Keys and Values that the attention mechanism has already computed for the previous tokens.
It is equivalent to turning the content that the model has already read into reusable "calculation drafts".
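A toy single-head attention in plain Python shows why the "drafts" save work. Everything here is a simplified assumption: the projections are stand-in scalings rather than learned matrices, the query is the raw token vector, and the class name is hypothetical. The point is only the count of Key/Value computations with and without a cache.

```python
import math


class ToyAttention:
    def __init__(self):
        self.kv_projections = 0  # how many times K/V were computed
        self.k_cache = []
        self.v_cache = []

    def _kv(self, x):
        """Stand-in for the learned Key and Value projections of one token."""
        self.kv_projections += 1
        key = [2.0 * v for v in x]
        value = [0.5 * v for v in x]
        return key, value

    @staticmethod
    def _attend(query, keys, values):
        """Softmax-weighted average of values, scored by query-key dot products."""
        scale = math.sqrt(len(query))
        scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        return [sum(w * v[i] for w, v in zip(weights, values)) / total
                for i in range(dim)]

    def step_no_cache(self, context):
        """Recompute K and V for every token in the context at each step."""
        pairs = [self._kv(x) for x in context]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        return self._attend(context[-1], keys, values)

    def step_cached(self, new_token):
        """Compute K and V only for the new token and reuse the rest."""
        key, value = self._kv(new_token)
        self.k_cache.append(key)
        self.v_cache.append(value)
        return self._attend(new_token, self.k_cache, self.v_cache)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

naive = ToyAttention()
naive_out = [naive.step_no_cache(tokens[: t + 1]) for t in range(len(tokens))]

cached = ToyAttention()
cached_out = [cached.step_cached(x) for x in tokens]

print("K/V computations without cache:", naive.kv_projections)
print("K/V computations with cache:   ", cached.kv_projections)
```

Both paths produce the same attention outputs; the cached path simply trades repeated computation for stored Keys and Values, which then have to live somewhere.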
But these drafts also need somewhere to live. The best place is GPU memory, which is the fastest but also the most expensive; next comes CPU memory; and then SSD, which is cheap but slow. The more cache entries there are, the less likely it is that they will all fit in GPU memory.
So which entries should be kept in GPU memory, which in CPU memory, and which on SSD? When should they be moved? How much should be moved? And how can the transfers themselves be kept from slowing down inference?
This is what is meant by "reducing the data transfer volume among multi-level storage" in Xiaomi's announcement.
In the past, reusing context meant either tying up expensive GPU memory or shuttling data back and forth between storage tiers, and the transfer cost ate up the compute cost that had been saved. Now the scheduling is smarter: less is transferred, more is stored, and the hit rate is higher, which makes it possible to lower the cache price further.
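The trade-off can be made concrete with a minimal three-tier cache that counts how often entries move between tiers. This is a sketch under simple assumptions (least-recently-used eviction, capacities counted in entries, hypothetical class and tier names); it does not describe how SGLang HiCache or Xiaomi's system actually schedules transfers.

```python
from collections import OrderedDict


class TieredKVCache:
    def __init__(self, capacities):
        # capacities: list of (tier name, max entries), fastest tier first
        self.names = [name for name, _ in capacities]
        self.caps = [cap for _, cap in capacities]
        self.tiers = [OrderedDict() for _ in capacities]
        self.transfers = 0  # entries moved between tiers

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.caps[level]:
            old_key, old_value = tier.popitem(last=False)  # evict LRU entry
            if level + 1 < len(self.tiers):
                self.transfers += 1  # demote to the next, slower tier
                self._insert(level + 1, old_key, old_value)
            # otherwise the entry falls off the slowest tier and is dropped

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # keep at most one copy across tiers
        self._insert(0, key, value)

    def get(self, key):
        """Return (tier name, value) on a hit, or None on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                if level > 0:
                    self.transfers += 1  # promote back to the fastest tier
                self._insert(0, key, value)
                return self.names[level], value
        return None


cache = TieredKVCache([("gpu", 2), ("cpu", 2), ("ssd", 2)])
for prefix in ["A", "B", "C", "D", "E"]:
    cache.put(prefix, f"kv-for-{prefix}")

tier_hit, _ = cache.get("A")  # the oldest prefix has been pushed down to SSD
print("found in:", tier_hit, "| transfers so far:", cache.transfers)
```

Even in this toy, one lookup of an old prefix triggers a chain of promotions and demotions. Cutting that traffic while holding more entries is the kind of gain the announcement describes.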
So, if the low price is only due to subsidies, it is just a waste of money. If the low price comes from KV Cache, SWA, multi-level storage, expert parallelism, and input length bucketing, it represents infrastructure capabilities.
The former can only attract traffic for a while, while the latter may change the long-term price. According to Xiaomi, a technical paper with more detailed information will be released later.
3
Can the challenge posed by DeepSeek become a lifesaver for Xiaomi?
A price cut will undoubtedly bring a model more users in the short term. Beyond the technical changes disclosed in the official announcement, Xiaomi has also clearly planned the timing and pace of the cut.
It chose to follow closely right after DeepSeek's latest round of price cuts.
DeepSeek has posed a question to all model manufacturers. When even a powerful model like DeepSeek can be used at a low price, why should other model manufacturers maintain their original prices?
In the past, domestic model companies only needed to be cheaper than GPT and Claude to explain their cost-effectiveness. But after DeepSeek lowered the price anchor, the industry has entered a more difficult stage.
If you are much more expensive than DeepSeek, you must prove that your capabilities are much stronger. If your capabilities are similar, you must prove that you are faster, more stable, and have a better ecosystem. If you don't have obvious advantages in capabilities, price, and experience, you can only retreat to narrower scenarios, such as multi-modal, edge-side, enterprise privatization, industry models, and toolchain binding.
If you don't have any of these, you can only exit early.
DeepSeek is like the catfish in the proverbial "catfish effect". It doesn't make every model cheaper overnight, but it forces anyone who is "expensive" to justify it all over again.
Claude can explain its price with its coding and complex task capabilities, and GPT can explain its price with its complete ecosystem, multi-modal capabilities, and toolchain.
What about latecomers like Xiaomi that have not yet achieved any user scale effect? Especially since Xiaomi's current core business is not centered around an independent model brand but around mobile phones, cars, IoT, HyperOS, and the smart hardware ecosystem.
So the biggest challenge for MiMo right now, both internally and externally, is this: how can a foundation model that is not the default choice get onto developers' shortlists?
This time, MiMo has clearly decided to seize the opening created by DeepSeek and match its prices pixel by pixel. This may be its only chance. It has to hold on to DeepSeek to get a seat at the table.
Only by matching DeepSeek's price level can it attract users. In the API market, developers won't hand over their call volume to a new model for no reason. Especially in scenarios such as Agents, Coding, and long contexts, a single task may involve dozens of calls. As long as the price is significantly higher than DeepSeek's, developers will be discouraged by the bill before they even notice the differences between the models.
There is also internal pressure: MiMo needs to prove as soon as possible whether it can become the foundational AI capability of Xiaomi's ecosystem.
For Xiaomi, the model API may not be the end goal. Its ultimate destination is not just the developer console but Xiaomi's own ecosystem.
But for the model to enter these scenarios, it can't rely solely on press conferences and parameter sheets. It needs a large number of real calls, developers need to test it repeatedly in real tasks, and users need to use it continuously in scenarios such as long conversations, coding, Agents, knowledge bases, in-vehicle systems, and device control. Only when this usage data is available can the model know which capabilities are really useful, which scenarios are worth optimizing, and which interfaces need to be redesigned.
So, even though Luo Fuli recently proposed that models should not "blindly cut prices", MiMo still has to launch a price war today. Luo Fuli also explained this in her latest tweet:
"Running at the new API price after the price cut, our production inference engine is operating close to full capacity and can still basically achieve a balance between revenue and expenditure. We previously advised LLM companies not to blindly cut prices precisely because few model architectures and inference optimizations can prevent API costs from incurring losses. If more architectures that save computing and KV cache appear, accompanied by better inference infrastructure to lower API costs, this will form an excellent positive cycle in the industry."
Just one day after the price cut, this still reads more like a best-case assumption. If it holds, MiMo will truly be in the game; if not, it will be a different story.
This article is from the WeChat official account "Silicon Star People Pro", author: Dong Daoli. Republished by 36Kr with permission.