
Hands-on test: Qwen's next-generation architecture arrives by surprise, solving AIME math competition problems instantly, with over 10x the speed and 10x the cost-effectiveness

QbitAI (量子位) 2025-09-12 19:05
Surpasses the closed-source model Gemini-2.5-Flash-Thinking

Qwen's next-generation model architecture has arrived ahead of schedule!

Qwen3-Next has been released. Lin Junyang, head of the Qwen team, said this is an early preview of Qwen3.5.

Based on Qwen3-Next, the team first open-sourced Qwen3-Next-80B-A3B-Base.

The model has 80B parameters, yet its training cost is less than one-tenth that of Qwen3-32B. Moreover, at context lengths above 32K, its inference throughput reaches more than ten times that of Qwen3-32B.

Building on this model, the team followed up immediately, developing and releasing two new models simultaneously:

Qwen3-Next-80B-A3B-Instruct: It shows significant advantages in 256K ultra-long-context processing tasks.

Qwen3-Next-80B-A3B-Thinking: It surpasses the closed-source model Gemini-2.5-Flash-Thinking in multiple benchmark tests.

Netizens marveled at how fast the updates keep coming.

Without further ado, let's take a quick look at what the new model improves.

Four Important Improvements

There are four core improvements in Qwen3-Next:

  • Hybrid Attention Mechanism
  • High-Sparsity MoE Structure
  • Training Stability Optimization
  • Multi-Token Prediction Mechanism

Hybrid Attention Mechanism

Linear attention is highly efficient for long-context processing but has limited recall; standard attention has high computational overhead and low inference efficiency. Used alone, either has clear limitations.

To address this, the Qwen team introduced Gated DeltaNet, which outperforms the commonly used sliding-window attention and Mamba2 in in-context learning ability. With a 3:1 hybrid strategy (75% of layers use Gated DeltaNet, 25% retain standard attention), the model balances performance and efficiency.
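For intuition, here is a minimal sketch of what such a 3:1 layer layout could look like; the 48-layer count and the repeating pattern are illustrative assumptions, not Qwen3-Next's published configuration.

```python
# Minimal sketch of a 3:1 hybrid stack: 75% Gated DeltaNet (linear attention)
# layers, 25% standard attention layers. Layer count and pattern are assumed.

def build_layer_types(num_layers: int = 48) -> list[str]:
    pattern = ["gated_deltanet"] * 3 + ["standard_attention"]
    return [pattern[i % len(pattern)] for i in range(num_layers)]

layers = build_layer_types()
assert layers.count("standard_attention") / len(layers) == 0.25
print(layers[:4])  # ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'standard_attention']
```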

Meanwhile, in the retained standard attention layers, they introduced several further optimizations:

1. Retain the output gating mechanism from prior work to alleviate the low-rank problem in attention;

2. Expand the dimension of each attention head from 128 to 256;

3. Apply rotary position encoding (RoPE) to only the first 25% of each head's dimensions, enhancing long-sequence extrapolation (a sketch follows this list).
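To make item 3 concrete, here is a hedged sketch of partial RoPE under the stated 25% fraction and the 256-dimension heads from item 2; the shapes and function name are assumptions for illustration, not the team's actual implementation.

```python
# Hedged sketch: rotary position embedding applied to only the first 25%
# of each head's dimensions; the remaining 75% pass through unrotated.
import torch

def partial_rope(x: torch.Tensor, rope_fraction: float = 0.25) -> torch.Tensor:
    """x: (batch, seq, heads, head_dim). Rotate only the leading slice."""
    _, s, _, d = x.shape
    rot_dim = int(d * rope_fraction)            # e.g. 64 of 256 dims
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    half = rot_dim // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(half) / half))
    angles = torch.arange(s).unsqueeze(-1) * inv_freq   # (seq, half)
    cos = angles.cos().view(1, s, 1, half)
    sin = angles.sin().view(1, s, 1, half)

    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

q = torch.randn(1, 8, 4, 256)                   # head_dim 256, as in item 2
print(partial_rope(q).shape)                    # torch.Size([1, 8, 4, 256])
```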

High-Sparsity MoE Structure

Qwen3-Next adopts a high-sparsity MoE architecture with 80 billion total parameters, of which only about 3 billion are activated at each inference step.

Compared with Qwen3-MoE's 128 total experts and 8 routed experts, Qwen3-Next expands to 512 total experts, combining 10 routed experts with 1 shared expert to maximize resource utilization while preserving performance.
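As a rough sketch of this routing scheme (toy hidden size and a naive per-token loop, for illustration only, not the actual implementation):

```python
# Hedged sketch of high-sparsity MoE routing: 512 total experts, top-10
# routed per token, plus 1 always-active shared expert. D is a toy size.
import torch

NUM_EXPERTS, TOP_K, D = 512, 10, 64

router = torch.nn.Linear(D, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(torch.nn.Linear(D, D) for _ in range(NUM_EXPERTS))
shared_expert = torch.nn.Linear(D, D)           # fires for every token

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, D). Only TOP_K of NUM_EXPERTS routed experts fire per token."""
    weights, idx = router(x).softmax(dim=-1).topk(TOP_K, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
    outs = []
    for t in range(x.size(0)):
        y = shared_expert(x[t])
        for w, e in zip(weights[t], idx[t]):
            y = y + w * experts[int(e)](x[t])
        outs.append(y)
    return torch.stack(outs)

print(moe_forward(torch.randn(2, D)).shape)     # torch.Size([2, 64])
```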

Training Stability Optimization

To further improve training stability, the team adopted Zero-Centered RMSNorm in Qwen3-Next and, on top of it, applied weight decay to the norm weights to prevent them from growing without bound.

They also normalized the MoE router's parameters at initialization, ensuring each expert is selected without bias early in training and reducing the noise that initialization introduces into experimental results.
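Below is a minimal sketch of a zero-centered RMSNorm, assuming the common formulation in which the learnable scale is stored as an offset around 1; weight decay on the stored weight then pulls the effective scale toward 1 rather than toward 0. This is an illustration, not Qwen's exact code.

```python
# Hedged sketch: the scale parameter is stored zero-centered, so standard
# weight decay regularizes the effective scale (1 + weight) toward 1.
import torch

class ZeroCenteredRMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.zeros(dim))  # starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.weight)

norm = ZeroCenteredRMSNorm(16)
print(norm(torch.randn(2, 16)).shape)   # torch.Size([2, 16])
```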

Multi-Token Prediction Mechanism

Qwen3-Next introduces a native Multi-Token Prediction (MTP) mechanism, which not only yields an MTP module with a high acceptance rate for speculative decoding but also improves the overall performance of the model backbone.

The team also specifically optimized MTP's multi-step inference: by keeping the multi-step strategy consistent between training and inference, they further raised the acceptance rate of speculative decoding in practical scenarios.
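For intuition about why the acceptance rate matters, here is the generic greedy speculative-decoding acceptance loop, not Qwen3-Next's actual MTP implementation: the draft head proposes several tokens, the main model verifies them in one pass, and the longer the accepted prefix, the larger the speedup.

```python
# Generic greedy speculative-decoding acceptance, for illustration only.
# The draft (e.g. an MTP head) proposes tokens; the main model verifies.

def accept_draft(draft_tokens: list[int], verified_tokens: list[int]) -> list[int]:
    """Keep the longest matching prefix, then the main model's correction."""
    accepted = []
    for d, v in zip(draft_tokens, verified_tokens):
        if d == v:
            accepted.append(d)       # draft token confirmed
        else:
            accepted.append(v)       # main model overrides, stop here
            break
    return accepted

# 3 of 4 drafted tokens accepted -> 4 tokens emitted per verification pass.
print(accept_draft([5, 9, 2, 7], [5, 9, 2, 3]))  # [5, 9, 2, 3]
```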

Ten Times Faster, Ten Times Cheaper

Next, let's take a look at the performance of the new model.

First, Qwen3-Next was trained on a uniformly sampled subset of the Qwen3 36T pre-training corpus, containing only 15T tokens.

Its training required less than 80% of the GPU hours of Qwen3-30B-A3B; compared with Qwen3-32B, it needed only 9.3% of the GPU compute to achieve better performance.

Moreover, thanks to the innovative hybrid architecture, Qwen3-Next also delivers outstanding inference efficiency.

Compared with Qwen3-32B, Qwen3-Next-80B-A3B shows excellent throughput capacity in the prefill stage:

At a context length of 4K tokens, its throughput is nearly 7 times that of Qwen3-32B; beyond 32K, the improvement exceeds ten times.

The decode stage is similarly efficient: throughput at 4K context improves by about 4 times, and the advantage stays above ten times in long-context (32K+) scenarios.

Based on Qwen3-Next, the Qwen team first trained the Qwen3-Next-80B-A3B-Base model.

Using only one-tenth the non-embedding activated parameters, this model already surpasses Qwen3-32B-Base on most benchmarks and clearly outperforms Qwen3-30B-A3B, demonstrating excellent efficiency and performance.

Based on the excellent performance of Qwen3-Next-80B-A3B-Base, the team further developed and released Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking.

Qwen3-Next-80B-A3B-Instruct

First, Qwen3-Next-80B-A3B-Instruct significantly outperforms Qwen3-30B-A3B-Instruct-2507 and Qwen3-32B-Non-thinking, and approaches Qwen3-235B-A22B-Instruct-2507 on most metrics.

In addition, on the RULER test, Qwen3-Next-80B-A3B-Instruct outperforms Qwen3-30B-A3B-Instruct-2507, which has the same number of layers but more attention layers, at every context length.

Within 256K context, it even beats Qwen3-235B-A22B-Instruct-2507, which has more layers, fully demonstrating the advantage of the Gated DeltaNet plus Gated Attention hybrid model in long-text processing scenarios.

Qwen3-Next-80B-A3B-Thinking

Let's look at Qwen3-Next-80B-A3B-Thinking. Its performance is also quite good.

It surpasses the closed-source model Gemini-2.5-Flash-Thinking in multiple benchmark tests and is close to Qwen's latest flagship model Qwen3-235B-A22B-Thinking-2507 in some indicators.

Solid Reasoning Ability

Next, let's put Qwen3-Next-80B-A3B's reasoning ability to a hands-on test.

Using the Qwen Chat web page, we opened with an AIME math competition problem:

Since Qwen3-Next-80B-A3B supports multimodal input, we could upload the problem directly as an image.

Almost instantly, the model began laying out detailed solution steps and calculations. Its final answer, 588, exactly matches the official AIME answer.

After this warm-up, let's move on to the programming test.

Create a playable Minesweeper game using p5.js.

The code ran successfully, and we gave it a quick try. It plays smoothly enough (doge).

But who can explain why the background of this game is bright red and there are no grid lines?

Some netizens also got creative, using it to generate weather cards.