
MiniMax aims to secure the "ticket" to the AI Agent arena first.

GeekPark (极客公园) 2026-06-04 08:11
Why is everyone caught up in token anxiety?

After much anticipation, on June 1st, 2026, Children's Day, MiniMax officially launched its third-generation flagship model, M3.

Judging from the official introduction alone, six keywords sum up the highlights of this model: coding ability, 1M context, native multimodality, Computer Use, the low-cost Token Plan, and open source.

In terms of capabilities, M3 is the first open-source model from China to integrate the frontier trifecta: frontier coding and agentic ability, an ultra-long context of one million tokens, and native multimodality. Its strength goes without saying.

After all, before this, only leading overseas closed-source models such as Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5 could meet all three requirements at once.

While its capabilities are indeed remarkable, what I mainly want to talk about this time is its price.

Official information shows that the MiniMax Token Plan offers three tiers for individual developers: Plus costs 49 yuan per month for 6 billion tokens; Max costs 119 yuan per month for 18 billion tokens; Ultra costs 469 yuan per month for 55 billion tokens.

Converted to usage, the Max tier offers approximately 15 times the quota of a Claude subscription at a similar price.
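To make the tier comparison concrete, the following sketch works out the unit price implied by the figures quoted above. The tier names, prices, and quotas are taken from the article as stated; the dictionary layout and function name are only illustrative.

```python
# Token Plan tiers as quoted above:
# (monthly price in yuan, monthly quota in billions of tokens).
TIERS = {
    "Plus": (49, 6),
    "Max": (119, 18),
    "Ultra": (469, 55),
}


def yuan_per_billion_tokens(price_yuan: float, quota_billion: float) -> float:
    """Unit price implied by a tier: yuan paid per billion tokens of quota."""
    return price_yuan / quota_billion


unit_prices = {
    name: round(yuan_per_billion_tokens(price, quota), 2)
    for name, (price, quota) in TIERS.items()
}

for name, unit_price in unit_prices.items():
    print(f"{name}: {unit_price} yuan per billion tokens")
```

On the quoted figures, Max comes out as the cheapest tier per token, while Ultra mainly buys a larger quota rather than a lower unit price.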

In the past, during the Chatbot era, many people might not have had a clear sense of this kind of cost-effectiveness. After all, the interaction was simple: the user asked a question, the model gave an answer, and the cost was relatively moderate. In the Agent era, the model has learned to read repositories, scan files, run tests, view logs, fix bugs, and rerun tests. Behind a single task, there may be dozens or even hundreds of model calls.

As a result, the model has become smarter, but the cost has become unbearable for many.

For many individuals and enterprises, a smart and cost-effective model is often the last crucial step for AI to be truly implemented.

01 From the pain points of Agent economics to the 49-yuan Plus Token Plan

In the past, when people discussed AI replacing and liberating humans, it was often assumed that AI would definitely be cheaper.

However, this statement holds true only under certain conditions.

This is especially true in the Coding Agent scenario. A recent study on the cost of agentic coding analyzed the trajectories of 8 frontier models on SWE-bench Verified and found an interesting phenomenon:

In agentic coding tasks, token consumption does not grow linearly and can reach 1,000 times that of ordinary code Q&A. More troublesome still, consuming more tokens does not necessarily keep raising accuracy: for many tasks, accuracy peaks in the medium-cost range and then saturates.

The underlying logic is that in coding, users need to feed the complete project files and code context to the AI to generate truly usable code. This is a typical scenario where the input tokens far exceed the output tokens. In production-level scenarios, the context cost is extremely high and sometimes even exceeds the labor cost itself.

This explains why many enterprises that were very aggressive in AI adoption in the past have started to change their attitudes this year:

An extreme case is OpenClaw. Its founder, Peter Steinberger, once showed a bill for consuming approximately $1.3 million in OpenAI API tokens in 30 days, covering 603 billion tokens and 7.6 million requests. Behind this were about 100 Codex agents running automated development tasks.

Uber's CTO and COO publicly complained one after another that the company had exhausted its annual Claude Code budget by April 2026.
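The OpenClaw bill above is easier to interpret once it is broken down per token, per request, and per agent. The sketch below uses only the figures quoted in the article; since the agent count is given as "about 100", the per-agent number is a rough estimate.

```python
# Figures quoted in the article for the 30-day OpenClaw bill.
TOTAL_COST_USD = 1_300_000
TOTAL_TOKENS = 603_000_000_000
TOTAL_REQUESTS = 7_600_000
AGENTS = 100  # "about 100", so per-agent numbers are approximate
DAYS = 30

usd_per_million_tokens = TOTAL_COST_USD / (TOTAL_TOKENS / 1_000_000)
tokens_per_request = TOTAL_TOKENS / TOTAL_REQUESTS
usd_per_request = TOTAL_COST_USD / TOTAL_REQUESTS
usd_per_agent_per_day = TOTAL_COST_USD / AGENTS / DAYS

print(f"Blended price: ${usd_per_million_tokens:.2f} per million tokens")
print(f"Average request size: {tokens_per_request:,.0f} tokens")
print(f"Average cost per request: ${usd_per_request:.3f}")
print(f"Average cost per agent per day: ${usd_per_agent_per_day:,.2f}")
```

An average of roughly 79,000 tokens per request suggests that each call carries a large amount of context, which fits the earlier point that context dominates the cost of agentic coding.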

In this context, the cost-effectiveness of MiniMax M3 is not just about being a little cheaper; it is the last crucial step before the real popularization of Agents:

Agents cannot handle complex tasks without room for trial and error. However, if trial and error is too expensive, enterprises will hesitate, and individual developers will become conservative.

In the past, the core of model competition was the upper limit of intelligence. In the Agent era, the effective workload per unit cost is the real focus.
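One way to make "effective workload per unit cost" concrete is the expected spend per completed task. The sketch below assumes independent attempts with a fixed success rate, so the expected number of attempts is 1 divided by that rate. The function name and every number in the usage example are hypothetical and are not measurements of any model.

```python
def expected_cost_per_completed_task(
    tokens_per_attempt: float,
    price_per_million_tokens: float,
    success_rate: float,
) -> float:
    """Expected spend to get one successful task, assuming independent retries.

    With a fixed per-attempt success probability p, the number of attempts
    until the first success is geometric with mean 1 / p.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million_tokens
    return cost_per_attempt / success_rate


# Hypothetical comparison: a stronger but pricier model vs. a cheaper one.
strong = expected_cost_per_completed_task(2_000_000, 10.0, 0.8)
cheap = expected_cost_per_completed_task(2_000_000, 1.0, 0.5)
print(f"strong: {strong:.2f} per completed task")
print(f"cheap:  {cheap:.2f} per completed task")
```

Under these made-up inputs, the cheaper model wins on cost per completed task despite a lower success rate, which is the sense in which price becomes part of capability.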

This is why I believe that the cost-effectiveness of M3 is actually part of its product capabilities.

But where does the root of this cost-effectiveness lie? And what is the actual product experience behind the cost-effectiveness?

02 Why, at this stage of industry development, do we need stronger Coding and long-range autonomous iteration?

The price determines whether users dare to use the model. The next thing users care about is whether it is worth using.

The official coding benchmarks for M3 are quite impressive: SWE-Bench Pro 59.0%, Terminal Bench 2.1 66.0%, SWE-efficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.

These numbers are of course important, but I suggest taking them as a reference rather than a conclusion. The real highlights are two practical cases that the official team carried out with M3: replicating a research paper and optimizing a CUDA Hopper FP8 GEMM kernel.

Let's first look at the case of optimizing the Hopper FP8 GEMM kernel.

In this task, M3 started with only the task description, a benchmark script, and a non-runnable Triton skeleton, without a reference high-performance implementation.

M3 completed 147 benchmark submissions and 1,959 tool calls in about 24 hours, raising the hardware peak utilization of Hopper FP8 GEMM from 7.6% to 71.3%, a 9.4-fold speedup.

The most important detail here is not the final 71.3% but that the optimal solution appeared in the 145th submission. In contrast, except for Opus 4.7 and M3, most other models stopped making new progress and voluntarily exited within the first 30 submissions.
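The figures from this run can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; treating the speedup as the ratio of the two utilization figures assumes the hardware peak is the same before and after.

```python
# Figures quoted in the article for the Hopper FP8 GEMM run.
START_UTILIZATION = 7.6   # percent of hardware peak
FINAL_UTILIZATION = 71.3  # percent of hardware peak
SUBMISSIONS = 147
TOOL_CALLS = 1959
BEST_SUBMISSION = 145

# With a fixed hardware peak, the speedup equals the utilization ratio.
speedup = FINAL_UTILIZATION / START_UTILIZATION
tool_calls_per_submission = TOOL_CALLS / SUBMISSIONS
best_position = BEST_SUBMISSION / SUBMISSIONS

print(f"Speedup: {speedup:.1f}x")
print(f"Tool calls per submission: {tool_calls_per_submission:.1f}")
print(f"Best result arrived {best_position:.1%} of the way through the run")
```

The last number is the telling one: the best kernel appeared almost at the very end of the run, so a model that gave up within the first 30 submissions would never have reached it.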

That is to say, the model does not complete the task with a sudden flash of inspiration in the first few rounds. Instead, it continues to diagnose, attempt, verify, discard, and then try again during multiple plateaus.

In this process, the model needs to maintain its goal, remember history, understand benchmark feedback, and avoid messing up the system during multiple rounds of changes.

This is also the dividing line between Coding Agents and code completion tools. What casual vibe coders may not realize is that in a real production-level environment, it is normal for both AI and humans to fail to run the code on the first try; it is also normal for the performance to be poor after it runs; and it is normal to introduce new bugs after optimization. Most of the time in engineering tasks is spent on diagnosis, verification, rollback, and retries.
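The diagnose, attempt, verify, discard cycle described above can be sketched as a simple keep-the-best loop. This is only an illustration of the workflow, not MiniMax's agent or benchmark harness: `propose` and `benchmark` are hypothetical callables standing in for the model's edit step and the benchmark script, and the toy objective is made up.

```python
import random
from typing import Callable, Tuple


def iterate_with_rollback(
    initial: float,
    propose: Callable[[float], float],
    benchmark: Callable[[float], float],
    max_submissions: int,
) -> Tuple[float, float, int]:
    """Keep the best candidate seen so far; discard any attempt that regresses.

    Returns (best_candidate, best_score, index_of_best_submission).
    """
    best, best_score, best_index = initial, benchmark(initial), 0
    for submission in range(1, max_submissions + 1):
        candidate = propose(best)     # attempt a change on top of the best state
        score = benchmark(candidate)  # verify against the benchmark
        if score > best_score:        # keep only real improvements
            best, best_score, best_index = candidate, score, submission
        # otherwise the candidate is dropped, i.e. rolled back to `best`
    return best, best_score, best_index


# Toy usage: search for x that maximizes -(x - 3)^2 with random perturbations.
rng = random.Random(0)
best, score, index = iterate_with_rollback(
    initial=0.0,
    propose=lambda x: x + rng.uniform(-0.5, 0.5),
    benchmark=lambda x: -(x - 3.0) ** 2,
    max_submissions=147,
)
print(f"best x = {best:.3f}, score = {score:.4f}, found at submission {index}")
```

A real agent differs in the hard part, which is `propose`: choosing the next change from logs and benchmark feedback rather than at random. The loop structure, with its explicit rollback, is what separates this kind of work from one-shot code completion.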

This ability comes not only from larger model parameters but also from training data that is closer to how real users behave. For this reason, MiniMax has built an interactive user simulator that imitates real developers continuously supplementing requirements, adjusting solutions, assigning tasks, and providing feedback and corrections within the same session.

This is why I said earlier that although a good benchmark result is very important, it does not transfer directly to the production environment. Many coding benchmarks today are still single-turn tasks, but real-world collaboration must be multi-turn, multi-file, multi-tool, and multi-objective. Whoever can advance training and evaluation from one-time problem-solving to continuous collaboration is closer to the next-generation Coding Agent.

Now let's look at the case of replicating a research paper, which is also very interesting. M3 was required to replicate the ICLR 2025 Outstanding Paper Award paper "Learning Dynamics of LLM Finetuning". It ran autonomously for nearly 12 hours, produced 18 commits and 23 experimental charts, completed the core experiments, and observed the changes in prediction probability during the SFT stage, the squeezing effect of DPO, and the mitigation method of Extend.

What characterizes this task is that it is complex and requires multiple capabilities. The model needs to read the main text of the paper, understand formulas and charts, write experimental code, run training scripts, check whether the results align with the paper's conclusions, and then adjust the experimental settings according to the deviations. This requires the model's intelligence ceiling, long-context, programming, multimodal, tool-calling, and fact-correction capabilities to all work simultaneously.

One of the major features of M3 is that it starts mixed multimodal training from Step 0 and uses data where text, images, and other modalities are naturally interleaved.

In the context of Agents, this means that the model can more easily enter real-world work scenes. It can help developers view architectural diagrams, error screenshots, performance curves, PR pages, and terminal outputs, and assist researchers in reading the main text of papers as well as tables, images, curves, and formulas. It can also help enterprise employees switch between ERP, Excel, web back-ends, local clients, and chat tools, making multimodality and intelligence an inseparable whole.

In my own test, I asked the model directly to create an interactive map based on the novel "Journey to the West".

The first difficulty is that the model needs to find the original text of "Journey to the West", 100 chapters and more than 600,000 words, and read and understand it.

On this basis, the most difficult part of creating a "Journey to the West" interactive map is that the place names in the original work are scattered, and virtual and real spaces are mixed. The itinerary descriptions mention only mileage and give no coordinates. The routes and events are spread across the book's hundred chapters, and the spatial relationships must be sorted out from the full-text context. In addition, there is no real-world GIS reference for the fictional scenes in multi-layer parallel spaces such as celestial caves. At the same time, although some mortal locations have real-world prototypes, they are not clearly stated in the book.

Converting these text descriptions into map images and automatically generating development code is a significant test for the model's context ability, tool-calling ability, multimodal ability, agent collaboration ability, and even aesthetics.

This is a screenshot of the final generated HTML page. As can be seen, not only does the route map perfectly match the plot, but the possible corresponding real-world locations of different places are also basically consistent.

For example, Wuxing Mountain corresponds to Wuzhi Mountain in Hebei, Famen Temple is in Xi'an, Shaanxi, Tongtian River is near Yushu, Qinghai, and Liusha River corresponds to the Kaidu River in the Tarim Basin, Xinjiang. These match the reference locations of the real-world prototypes almost one by one.

03 Sparse attention handling 1M context is no longer new, but how do you ensure the hit rate?

With price and coding covered, the logic behind the 1M context, supported by the sparse attention mechanism designed for M3, should now be easy to understand.

Long context is no longer rare. Many models are promoting 200K, 1M, or even longer contexts. The problem is that a long window does not mean the model knows how to use it.

An Agent cannot start thinking from scratch at every step. It must deposit past failures, user preferences, project structures, and tool feedback into the context. Correspondingly, the model's context will be filled with extremely long code files, terminal logs, failure records, benchmark outputs, user feedback, historical tool calls, and intermediate reasoning traces.

Long context is the foundation for achieving all this. However, the longer the window, the more noise accumulates from intermediate states and irrelevant content, which degrades output quality and makes a cost explosion more likely.

In this context, using dense attention will limit the expansion of context length and output efficiency, and the cost will also get out of control.

Using ordinary sparse attention can save costs but is likely to sacrifice the ability to locate fine-grained information.

However, during Agent execution, missing details is what one fears most. A critical error message in a tool call, a boundary condition in a code file, or an abnormal curve in a graph may determine whether the task can continue.

Therefore, it is not difficult to achieve long context itself. The real challenge is how to achieve a balance among cost, efficiency, and hit rate.
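To see why the cost side of this trade-off bites at 1M tokens, the sketch below counts query-key pairs under causal dense attention and under a generic sparse scheme in which each query attends to at most a fixed budget of keys. This is a back-of-the-envelope model, not a description of MSA or of any specific kernel; the function names and the budget of 8,192 keys are assumptions chosen for illustration.

```python
def dense_causal_pairs(n: int) -> int:
    """Query-key pairs when every token attends to itself and all earlier tokens."""
    return n * (n + 1) // 2


def budgeted_sparse_pairs(n: int, budget: int) -> int:
    """Query-key pairs when each token attends to at most `budget` keys.

    The first `budget` tokens have too few predecessors to hit the cap,
    so that prefix is counted as dense.
    """
    if n <= budget:
        return dense_causal_pairs(n)
    return dense_causal_pairs(budget) + (n - budget) * budget


n = 1_000_000
budget = 8_192  # hypothetical per-query key budget
dense = dense_causal_pairs(n)
sparse = budgeted_sparse_pairs(n, budget)
print(f"dense:  {dense:,} pairs")
print(f"sparse: {sparse:,} pairs ({dense / sparse:.1f}x fewer)")
```

The dense count grows quadratically with context length while the budgeted count grows roughly linearly, which is the cost argument. What the count cannot show is which keys get dropped, and that is exactly the hit-rate risk discussed above.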

Those who are familiar with the industry background know that MiniMax has been working on long context and sparse attention for some time.

In early 2025, MiniMax-01 used Lightning Attention and extended the model training context to 1M. In inference, it also attempted to extrapolate to a longer context of 4M.

Later, around this time last year, MiniMax-M1 continued to use hybrid attention, combined with MoE and reinforcement learning, focusing on long context, long-range reasoning, and complex software engineering tasks.

With M2, MiniMax briefly reverted to the dense attention approach. With M3, MiniMax has returned to sparse attention in the form of MSA.

Compared with other sparse attention solutions in the industry, such as DSA and MoBA, MSA makes training and inference complexity linear through designs such as scalable sparse attention, document-wise RoPE, KV cache compression, and Memory Parallel. When expanding from 16K to 100M tokens, its performance degrades by less than 9%. Through the precise KV block upgrade, it reduces repeated reads by using KV outer gather Q at the operator level, and its overall compute-to-memory-access ratio is more than 4 times that of the open-source Flash-Sparse-Attention and FlashMoBA.

With MSA, under a 1M context, M3's per-token computation is only 1/20 of the previous-generation model's, with more than a 9-fold speedup in prefill and more than a 15-fold speedup in decoding. In most scenarios, its capabilities are directly comparable to the full-attention mode.

These optimizations may seem very low-level, but users will experience two things: long-term tasks are cheaper to run, and information is grasped very accurately. For example, I fed the entire "The Wealth of Nations" to M3 and created a simulated world game based on Adam Smith's logic.

The difficulty here is that "The Wealth of Nations" is full of qualitative social-science discussion. The economic transmission logic of division of labor, taxation, foreign trade, capital, and wages is scattered throughout the volume. Only a million-token context can take in the entire book at once, extract the interlocking quantitative rules, and convert Smith's written theory into numerical formulas linking tax rates, productivity, and wealth.

On this basis, to build the simulated world game, the Agent needs to carry out long-horizon deductions continuously, work out the likely results of player policies such as tax cuts and road construction, and iterate the panel data over the short, medium, and long terms according to the logic of classical economics without violating the underlying economic laws of the original work.