MiniMax Evolution: A Group of "Paranoids" Riding the Waves Forward
If a technology takes three years to go from obscurity to changing the world, we call it the Nth Industrial Revolution.
If, during the process of this technology going from a research paper to real-world deployment, the leading players change as quickly as figures on a revolving lantern, and billions of dollars in traffic are poured in only to make a brief splash and then sink into silence, we call it a capital meat grinder.
If a technology combines all of the above characteristics and can evolve at ten times the speed of conventional technologies, rendering the ecological moats, capital barriers, and scale effects of the Internet era completely ineffective and allowing startups to stand at the center of the world stage, then it is large-model technology.
This trend has become particularly evident since DeepSeek "shook up the game" at the beginning of this year. To this day, the only survival rule for enterprises that want to stay in the large-model game is innovation above all else.
So, how should we understand the principle of "innovation above all else" in the large-model industry? Why do traditional Internet business models fail completely in the era of large models? Why can eliminations in the large-model industry happen on a quarterly basis?
The just-concluded MiniMax Week might be the perfect starting point for examining these questions.
Using this as an entry point, you can see how a startup in an innovative industry breaks free from the reach of tech giants; how the bridge between large-model innovation and changing the world is built; and how a group of "paranoid" people ride the waves in an industry where everything is accelerating.
How a Diving Cat Made the World Go "Aha"
Initially, attention on MiniMax Week was confined to discussions within the large-model industry: how many SOTA (state-of-the-art) results MiniMax would attain this time.
Then, on social media platforms worldwide, videos of cats of every kind — orange cats, cow-patterned cats, calico cats — diving into the water began to go viral. Soon after, alpacas, pandas, and giraffes appeared too, leaping from a ten-meter platform with three-and-a-half aerial spins and backflips. In the videos you can even see that animals of different weights produce different vibration amplitudes on the diving board and different sizes of splash when they jump.
Just like the release of DeepSeek R1 at the beginning of the year, MiniMax had its own "Aha Moment".
The "Aha Moment" originates from the fields of psychology and product design. It refers to the moment when a user suddenly realizes the value and potential of a product or tool, often accompanied by a sense of epiphany, surprise, or a cognitive leap.
Behind it, there is often a key quantitative-to-qualitative breakthrough in the technology, usually followed by a significant jump in product penetration and a turning point toward large-scale industry explosion.
The reason the cat-diving video is regarded as the Aha Moment for video AI lies not only in the sensation it caused on social media, but also in the fact that, for a long time, complex actions such as diving, gymnastics, and multi-person interactions have been regarded as the "Turing test" for video AI.
These actions require AI not only to precisely control every frame but also to ensure that the motion trajectories across consecutive frames — posture changes, jump and rotation angles, speed — conform to real-world physical laws such as gravity and inertia. Even complex environmental interactions, such as how much the diving board flexes under animals of different weights and the angle of the splash produced by different entry postures, must faithfully reproduce reality.
All of this is due to MiniMax's newly released video model, Hailuo 02.
Compared with its predecessor, the parameters of Hailuo 02 have tripled over Hailuo 01 and the video resolution has been upgraded to native 1080P. It can generate 10-second high-definition content in a single run, covering detailed limb movements, fluid-dynamics simulation, mirror effects, and real-world physical interactions. It can even reproduce complex dynamics at the level of acrobatic performances and deliver professional-grade native camera movements.
The video is a demo generated by Hailuo AI Super Creator: Hu Sheng's AIGC.
Compared against its competitors, Hailuo 02 ranks second globally on the Image-to-Video leaderboard of the Artificial Analysis Video Arena. And while Hailuo 02 outperforms Google Veo 3 in performance, its API cost is only one-ninth of Veo 3's.
The ranking data starts from the listing date and is up - to - date as of now.
So, why can Hailuo 02 maintain high fidelity while keeping costs low?
On the one hand, it is boosted by the scaling law: the total number of parameters in Hailuo 02 has tripled compared to Hailuo 01 and the amount of training data has quadrupled, enabling the model to understand more complex instructions and physical scenarios.
On the other hand, Hailuo 02 adopts an innovative NCR (Noise-aware Compute Redistribution) architecture. This architecture uses a noise-aware mechanism to allocate computing resources to different regions according to their needs: information in high-noise regions, which has lower density, is compressed, while more compute is allocated to low-noise regions to capture key details. As a result, HBM memory read/write volume is reduced by more than 70%, and training and inference efficiency is increased by 2.5 times.
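MiniMax has not published the internals of NCR, but the stated principle — compress low-density, high-noise regions and spend full-resolution compute on low-noise ones — can be illustrated with a toy sketch. Everything below (the pooling scheme, the `threshold`, the `keep_ratio`) is a hypothetical illustration, not the actual architecture:

```python
import numpy as np

def redistribute(tokens, noise, keep_ratio=0.25, threshold=0.5):
    """Toy noise-aware compute redistribution (hypothetical, for illustration).

    High-noise tokens carry less information per token, so they are
    average-pooled down to keep_ratio of their count; low-noise tokens are
    kept at full resolution so later layers spend their compute on details.
    """
    low = tokens[noise < threshold]              # keep every low-noise token
    high = tokens[noise >= threshold]            # compress the high-noise region
    if len(high):
        k = max(1, int(len(high) * keep_ratio))
        high = np.stack([c.mean(axis=0) for c in np.array_split(high, k)])
    return np.concatenate([low, high])           # fewer tokens enter the next layer

# a sequence of 16 token vectors, the second half in a high-noise region
tokens = np.random.randn(16, 8)
noise = np.array([0.1] * 8 + [0.9] * 8)
out = redistribute(tokens, noise)
print(out.shape)   # (10, 8): 8 low-noise tokens + 2 pooled high-noise tokens
```

The point of the sketch is only the accounting: downstream layers see 10 tokens instead of 16, which is where the memory-traffic and efficiency savings would come from.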
Of course, this logic of focusing on the right things and constantly innovating is not only the underlying technical concept of NCR but also the best summary of MiniMax's corporate spirit and how it has achieved its current status.
Behind the Innovation of M1: How Large Models Break Free from the Capital Gravitational Pull of Tech Giants
At this time last year, one of the most headache-inducing problems for countless large-model entrepreneurs must have been:
Every move of the tech giants is a life - or - death test for small and medium - sized enterprises. So, how can you escape the capital gravitational pull of the giants?
Not only did partners care about this; on any public occasion, the media and investors would invariably ask the question again and again.
The situation did seem grim: almost all domestic and international Internet and technology giants had entered the large-model field. The war of a hundred models was so fierce that for a while it looked like a repeat of the bike-sharing and food-delivery battles.
The reasons for doubt seemed sound: large-model parameter counts have reached the trillion level, so both training and inference demand deep capital reserves. The evolution of large models depends on vast amounts of data, and the giants happen to have the data. Large-model R&D requires a high density of talent, and the generous resources of big companies seem sufficient to attract anyone they want.
However, the reality is that just one year later, the battle among hundreds of models has subsided. Most of the SOTA rankings on various lists are occupied by startups such as OpenAI, Anthropic, MiniMax, and DeepSeek.
The logic is simple. Massive capital investment is only one of the conditions for model training. Developing large models is like investing: the stronger the consensus around a technology path, the more it is a lagging indicator, and enterprises must keep exploring new, effective Alpha factors to earn excess returns. Here, nimble startups often have a sharper nose and a shorter decision chain than the traditional giants.
Specifically for MiniMax: in the market, in just the first eight months of last year, global downloads of its overseas product Talkie quickly exceeded ten million, surpassing Character AI to become the fourth most-downloaded artificial intelligence application in the US market. According to a report by the Financial Times, MiniMax's revenue in 2024 was around $70 million.
Technologically, the MiniMax M1 model, which just achieved the second-best global result on the professional large-model benchmark, the Artificial Analysis leaderboard, is another good example. M1 is a large model with 456 billion parameters. Besides ranking near the top on 17 mainstream industry evaluation sets, M1 is also the reasoning model with the longest context in the world, natively supporting an input length of one million tokens — eight times that of DeepSeek R1. It also supports 80,000 output tokens, breaking Gemini 2.5 Pro's 64,000-token limit to become the model with the longest output in the world.
For large models, a longer context generally means a stronger intelligent experience. In high-complexity scenarios such as deep search and scientific research, a long context is the core source of deep reasoning (math problems, code) and deep content synthesis (paper writing, industry research). This matters especially for agents: as multi-agent orchestration becomes a new industry trend, the output of each sub-agent is fed back as input to the main agent, and if the context length is insufficient, the whole system breaks down.
Meanwhile, in the tool-use scenario (TAU-bench), MiniMax-M1-40k leads all open-weight models and even outperforms the closed-source Gemini 2.5 Pro. Data shows that even in tasks involving more than 30 rounds of long-chain reasoning and tool calls, MiniMax-M1-40k remains highly stable.
So the question is: since innovation is the path to success in the large-model era, what are the core innovations behind M1's excellent performance?
The first answer is the innovation in M1's architecture.
In line with common industry practice, M1 is built by performing reinforcement learning on a pre-trained base model (MiniMax-Text-01) and also uses a Mixture of Experts (MoE) structure. What few people know is that around 2023, before MoE had become an industry consensus, MiniMax had already launched the first MoE large model in China.
Also during this period, while most of its peers were still using the traditional Transformer's softmax self-attention, MiniMax began exploring a hybrid-attention mechanism and later applied it to the M1 model. The hybrid mechanism uses softmax self-attention for one-eighth of the computation and the in-house Lightning Attention (a linear attention) for the remaining seven-eighths. By first tiling the sequence into blocks — using traditional attention within each block and linear attention to pass information between blocks — it captures global semantics while avoiding the slowdown caused by the cumulative-sum operation (cumsum). This is also the underlying technical support for the longer context window.
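The tiling idea described above — exact computation inside each block, a small running state carried between blocks instead of a cumulative sum over all past positions — can be sketched for the purely linear (softmax-free) case. The block size and shapes are illustrative; this is a simplified sketch of blockwise causal linear attention, not MiniMax's actual Lightning Attention kernel:

```python
import numpy as np

def linear_attn_blockwise(Q, K, V, block=4):
    """Causal linear attention computed block by block (simplified sketch).

    Inside a block: a causally masked Q K^T, like attention without softmax.
    Between blocks: a running d x d state kv = sum of k_t^T v_t summarizes
    all earlier positions, so cost stays linear in sequence length.
    """
    T, d = Q.shape
    out = np.zeros_like(V)
    kv = np.zeros((d, V.shape[1]))                 # state carried across blocks
    for s in range(0, T, block):
        q, k, v = Q[s:s+block], K[s:s+block], V[s:s+block]
        mask = np.tril(np.ones((len(q), len(q))))  # causal mask within the block
        out[s:s+block] = (q @ k.T * mask) @ v + q @ kv
        kv += k.T @ v                              # fold this block into the state
    return out

# sanity check against the quadratic O(T^2) formulation
T, d = 16, 8
Q, K, V = (np.random.randn(T, d) for _ in range(3))
naive = (Q @ K.T * np.tril(np.ones((T, T)))) @ V
print(np.allclose(linear_attn_blockwise(Q, K, V), naive))   # True
```

In the hybrid model as described, layers like this would handle seven-eighths of the computation, with ordinary softmax attention interleaved for the remaining eighth to recover precise token-to-token comparisons.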
Beyond the architecture, in its training method MiniMax M1 also uses CISPO (Clipped IS-weight Policy Optimization) to replace the traditional PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization, developed by DeepSeek), greatly reducing costs and improving training efficiency.
When dealing with hybrid architectures, the traditional PPO/GRPO algorithms either discard the updates of tokens such as "However", "Wait", and "Aha" — low-frequency but very important — or assign them very low weights, leading to problems such as logical confusion in complex reasoning. MiniMax's CISPO algorithm instead clips the importance-sampling weights of tokens rather than dropping their updates, making long responses not only longer but also higher in quality.
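Based on the technical report's description, the key difference can be shown in a few lines. In PPO-style clipping, any token whose importance ratio leaves the trust region with a favorable advantage ends up on the flat, clipped branch of the objective and contributes zero gradient; CISPO clips the ratio itself and treats it as a constant weight on the token's log-probability gradient, so no token is dropped. The numbers are illustrative:

```python
import numpy as np

def ppo_surrogate(r, adv, eps=0.2):
    # PPO/GRPO-style clipped surrogate: min(r*A, clip(r)*A)
    return np.minimum(r * adv, np.clip(r, 1 - eps, 1 + eps) * adv)

def ppo_grad_coeff(r, adv, eps=0.2, h=1e-5):
    # numerical d(surrogate)/d(log-prob), using r = exp(logp_new - logp_old):
    # once r*A sits on the clipped constant branch, this derivative is zero
    return (ppo_surrogate(r * np.exp(h), adv, eps)
            - ppo_surrogate(r * np.exp(-h), adv, eps)) / (2 * h)

def cispo_grad_coeff(r, adv, eps_low=0.2, eps_high=0.2):
    # CISPO: the clipped IS weight is treated as a constant (stop-gradient)
    # multiplying grad log-pi, so this coefficient is the token's update scale
    return np.clip(r, 1 - eps_low, 1 + eps_high) * adv

# a rare pivot token ("Wait") whose probability rose a lot: r = 3, positive advantage
r, adv = 3.0, 1.0
print(float(ppo_grad_coeff(r, adv)))    # 0.0 -> PPO drops this token's update
print(float(cispo_grad_coeff(r, adv)))  # 1.2 -> CISPO keeps a bounded update
```

The clipping bounds `eps_low`/`eps_high` here are placeholder values; the point is only that CISPO bounds the weight instead of zeroing the gradient.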
According to the technical report, with CISPO the MiniMax team completed the reinforcement-learning phase in just three weeks on 512 H800 GPUs, at a compute-rental cost of only $530,000. Even compared with the newer DAPO, CISPO reaches the same performance with only half the training steps.
On the inference side, when generating 100,000 tokens M1 requires only 25% of the inference compute of DeepSeek R1, and M1 is more efficient than DeepSeek R1 on tasks such as mathematics and programming.
That