
Chinese AI Breaks Into the Global Top Two in Programming, With Only Claude Ahead

新智元 2026-05-27 08:09
Code Arena has released its latest rankings. Qwen3.7-Max scored 1541 points and took fourth place globally, making it the only non-Claude model in the top five. It is the first time a Chinese model has reached this position in programming.

The latest Code Arena leaderboard came out today!

Qwen3.7-Max has entered the global top four with 1541 points, overtaking a host of top models such as GPT-5.5 and Gemini 3.5 Flash in one go.

Only Claude models, such as Opus 4.7 and Opus 4.6, rank above it.

In other words, among the world's programming models, Alibaba is the only Chinese vendor at the top table, and by vendor it ranks second only to Anthropic.

Qwen3.7-Max Enters the Global Top Five

The Only Non-Claude Model

In fact, even before the Code Arena leaderboard came out, Qwen3.7-Max had already made a name for itself in the overseas developer community.

Atomic Chat ran a head-to-head comparison, pitting Opus 4.7, GPT-5.5, and Qwen3.7-Max against each other on the task of writing a self-training Tetris AI.

Qwen3.7-Max beat both Opus 4.7 and GPT-5.5 at a token cost of only $1.32, and was also reported to improve performance by 56%.

Another overseas developer had Qwen3.7-Max build a 3D model of the universe, and the result was striking.

In the task of generating a "3D pixel-style miniature pagoda model", Qwen3.7-Max also came out ahead on both output speed and quality.

Developer Paul Couvert even said that once Qwen3.7-Max is connected to Hermes Agent and OpenCode, it can basically replace GPT-5.5 and Opus 4.7.

Exceptional in Programming

Still, however high the scores, nothing beats a hands-on test.

We set up a hardcore "racing game" challenge for Qwen3.7-Max.

Given a detailed prompt, Qwen3.7-Max quickly generated a playable HTML file.

There was a small bug in the first version: the left and right steering keys, A and D, were reversed.

But after a simple fix in a second round of conversation, a complete 3D racing game was up and running.

To be honest, I was a bit shocked when I opened it.

Four cars raced three laps around a circular track, with more than 100 gold coins scattered along it. Hitting an obstacle slowed the car down and made it lose control.

The post-race result panel included rankings, lap times, the number of gold coins collected, and the fastest single lap, leaving nothing out.

But the real surprise was two details that only Qwen3.7-Max got right.

One was the start screen. In a side-by-side test of four models, only Qwen3.7-Max created a proper start page for the game, where you had to click "Start" to enter the race. The other three models began running as soon as you opened them, without even a title screen.

The other was the sound effects. The end of the prompt included a bonus requirement: add the roar of the engine and a sound for collecting gold coins. Among the four models, only Qwen3.7-Max fulfilled it, adding both sounds.

Let's take a look at how the other contestants performed.

The visuals from Gemini 3.5 Flash were noticeably flatter, lacking a sense of depth.

There was also a problem with the UI layout. The dashboard information was scattered across the four corners of the screen, leaving no clear visual focus.

In contrast, Qwen3.7-Max concentrated the key indicators in the center of the screen, which better matches where the player naturally looks.

The result from Claude Opus 4.6 was a bit hard to describe.

Not only were there very few gold coins on the track, but the three AI cars drove almost in sync, with no randomness, as if they had been copy-pasted.

Finally, there was GPT-5.5.

Its visuals were clearly better than the previous two, and the controls were smoother.

But for some reason, the gold coins were rendered as yellow "doughnuts"...

The shape was a minor issue. The key point was that Gemini, Claude, and GPT-5.5 all needed several rounds of bug fixes before every feature worked properly.

Only Qwen3.7-Max was basically playable in the first round of generation.

The scores are comparable, the hands-on results hold up, and the price is only a fraction of the competition's. The rest is up to developers, who will vote with their feet.

The "Foundation" Model in the Agent Era

The reason Qwen3.7-Max can reach this level on the most competitive programming stage lies in its product positioning.

A few days ago, when Alibaba released Qwen3.7-Max, it gave the model a very special label: Agent Foundation Model.

It is a model designed for long-running autonomous task execution.

Internal test data shows that in an autonomous programming task, Qwen3.7-Max ran continuously for 35 hours and executed 1158 tool calls.

The final generated code achieved an impressive 10x geometric-mean speedup over the Triton reference implementation.
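For readers unfamiliar with the metric, a geometric-mean speedup averages the per-task speedup ratios multiplicatively, so one outlier kernel cannot dominate the result. The sketch below is only an illustration of that arithmetic: the function name and the timings are invented for this example and are not taken from Alibaba's test data.

```python
import math

def geometric_mean_speedup(reference_times, optimized_times):
    """Geometric mean of per-task speedups (reference time / optimized time)."""
    if not reference_times or len(reference_times) != len(optimized_times):
        raise ValueError("need two non-empty lists of equal length")
    # exp(mean(log(ratio))) is the n-th root of the product of the ratios.
    log_sum = sum(
        math.log(ref / opt) for ref, opt in zip(reference_times, optimized_times)
    )
    return math.exp(log_sum / len(reference_times))

# Hypothetical timings in milliseconds, invented purely for illustration.
reference_ms = [8.0, 20.0, 50.0]
optimized_ms = [2.0, 2.0, 2.0]  # per-task speedups of 4x, 10x and 25x

speedup = geometric_mean_speedup(reference_ms, optimized_ms)
print(f"geometric-mean speedup: {speedup:.1f}x")
```

With these made-up numbers the arithmetic mean of the three speedups would be 13x, while the geometric mean is 10x, which is why benchmark suites tend to prefer the latter when ratios vary widely.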

Even more impressive is its endurance over long tasks.

After 30 hours of reasoning, the model remained sharp and kept discovering new optimization opportunities.

There was zero context degradation, zero instruction drift, and zero infinite loops throughout the process!

To be fair, the difficulty of this task does not lie in the 1,000-plus tool calls themselves. Now that MCP has been widely adopted, calling tools 1,000 times is not uncommon.

The difficulty lies in 35 hours of continuous reasoning.

Most models break down on long tasks: either the context grows more and more chaotic and the goals set early on are completely forgotten, or they fall into an infinite loop and keep retrying the same failed solution.
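To make the second failure mode concrete, here is a minimal sketch of how an agent harness might notice that the same failed tool call is being retried over and over. It is an illustration only, not a description of how Qwen3.7-Max or any particular framework works: the `RepeatGuard` class, its threshold, and the tool names are all hypothetical.

```python
from collections import Counter

class RepeatGuard:
    """Hypothetical helper: flags when one failed action is retried too often."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.failures = Counter()

    def record(self, tool_name, arguments, succeeded):
        """Record one tool call; return True if the agent looks stuck."""
        # Fingerprint the call by tool name plus its sorted arguments.
        key = (tool_name, repr(sorted(arguments.items())))
        if succeeded:
            # A success clears the failure history for this exact call.
            self.failures.pop(key, None)
            return False
        self.failures[key] += 1
        return self.failures[key] >= self.max_repeats

guard = RepeatGuard(max_repeats=3)
# Simulate the same failing call three times in a row.
results = [
    guard.record("run_tests", {"target": "kernel_a"}, succeeded=False)
    for _ in range(3)
]
print(results)
```

In this sketch the third identical failure trips the guard, at which point a harness could force the agent to re-plan instead of trying the same thing again.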

Qwen3.7-Max has achieved the goal of "continuously doing the right thing".

Revealing the Core Technology

We believe that Qwen3.7-Max's leap in programming may be related to the upgrades of two training methods.

The first is environment expansion.

During Qwen3.7-Max's programming training, each task is split into three independent dimensions: the task itself, the execution framework, and the verification method. The three can be freely combined.

For the same problem, sometimes it is solved within the Claude Code framework, sometimes within the OpenClaw framework, and sometimes with a different verification method.

The effect is like an intern being rotated through every project team. The model is forced to learn general problem-solving strategies, rather than "how to take shortcuts in a specific framework".

This explains a counterintuitive phenomenon: Qwen3.7-Max performs consistently across frameworks such as Claude Code, OpenClaw, and Qwen Code, rather than being strong in its own framework but weak in others.
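The "free combination" idea can be pictured as a Cartesian product over the three dimensions. The sketch below is a guess at the shape of such a setup, not Alibaba's actual training pipeline: the framework names come from this article, while the task names, verifier names, and the `sample_environment` helper are hypothetical.

```python
import itertools
import random

tasks = ["fix_failing_test", "optimize_kernel"]        # hypothetical task names
frameworks = ["Claude Code", "OpenClaw", "Qwen Code"]  # frameworks named in the article
verifiers = ["unit_tests", "benchmark_threshold"]      # hypothetical verification methods

# Every (task, framework, verifier) triple is a distinct training environment.
environments = list(itertools.product(tasks, frameworks, verifiers))

def sample_environment(rng):
    """Pick one combination uniformly at random for a training episode."""
    return rng.choice(environments)

rng = random.Random(0)  # seeded so the sketch is reproducible
episode = sample_environment(rng)
print(len(environments), "environments; sampled:", episode)
```

Two tasks, three frameworks, and two verifiers already give 12 distinct environments, so the same problem is rarely seen twice under identical conditions.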