Just now, Alibaba's most powerful programming model was open-sourced. It has 480 billion parameters, and its Agent score crushes that of Kimi K2. The training details have been made public.
According to a July 23 report by Zhidx, Alibaba's Qwen team has open-sourced its latest flagship programming model, Qwen3-Coder-480B-A35B-Instruct. The Qwen team calls it the most powerful open-source intelligent agent programming model it has developed so far: 480 billion total parameters with 35 billion active, native support for a 256K-token context that can be extended to 1 million tokens (input) through extrapolation, and a maximum output of 65,000 tokens.
In benchmark tests, Qwen3-Coder delivered strong performance on programming and intelligent agent tasks, achieving open-source state-of-the-art (SOTA) results on three types of tasks: Agentic Coding, Agentic Browser-Use, and Agentic Tool-Use. It outperformed open-source models such as Kimi K2 and DeepSeek V3 as well as closed-source models such as GPT-4.1, and is comparable to Claude Sonnet 4, a model known for its programming capabilities.
Qwen3-Coder will come in multiple sizes, and the version open-sourced this time is its most powerful variant. Its parameter count exceeds the 235 billion of Alibaba's flagship model Qwen3 but is below the 1 trillion of Kimi K2. According to Alibaba's official introduction, with Qwen3-Coder a novice programmer can complete in one day what takes a senior programmer a week, and it can generate a brand's official website in as little as 5 minutes.
In addition to the model, Qwen also open-sourced an intelligent agent programming command-line tool, Qwen Code, forked from the Gemini CLI. The tool has been adapted with custom prompts and function-call protocols so that it can fully unleash Qwen3-Coder's capabilities on intelligent agent programming tasks.
The model is live on Bailian, Alibaba Cloud's large model service platform. Its API uses tiered pricing, with the rate adjusted according to the number of input tokens. In the 256K-1M tier, input costs $6 per million tokens and output costs $60 per million tokens. For comparison, Claude Sonnet 4 charges $3 and $15 per million input and output tokens respectively, the same as Qwen3-Coder's 128K-256K tier.
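To make the tiered pricing concrete, here is a minimal cost estimator covering only the two tiers the article mentions. The tier boundaries and the assumption that the 128K-256K rate applies to all smaller inputs are illustrative; Bailian's actual price sheet may define more tiers.

```python
# Rough cost estimator for Qwen3-Coder's tiered API pricing on Bailian.
# Only the two tiers named in the article are modeled; prices are USD per
# million tokens, and exact tier boundaries are assumptions.
TIERS = [
    # (max input tokens for this tier, input $/M tokens, output $/M tokens)
    (256_000, 3.0, 15.0),    # 128K-256K tier (matches Claude Sonnet 4)
    (1_000_000, 6.0, 60.0),  # 256K-1M tier
]

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Pick the tier by input size, then price input and output separately."""
    for limit, in_price, out_price in TIERS:
        if input_tokens <= limit:
            return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    raise ValueError("input exceeds the 1M-token context limit")

# e.g. a 300K-token input with a 10K-token completion lands in the top tier:
# (300_000 * 6 + 10_000 * 60) / 1e6 = 1.8 + 0.6 = 2.4 USD
```

Because pricing is keyed to input size, a long-context request is billed at the higher rate for the whole request, not just the tokens above the tier boundary.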
Qwen3-Coder has also been launched on the web version of Qwen Chat, allowing users to experience it for free. Additionally, its 480B version has been released on open-source communities such as Hugging Face and ModelScope, and can be downloaded and deployed locally. Qwen also shared detailed technical details of the model in a blog post.
01 Launched Quietly on Qwen Chat Late at Night, Overseas Users Went Wild
Before the Qwen team officially announced the release of Qwen3-Coder, the model had quietly gone live on the official website of Qwen Chat. Quick-fingered overseas users contributed a number of real-world test cases.
In one case, Qwen3-Coder was asked to create a Wordle word game, where the rule is to guess a 5-letter word within six attempts. The game page and source code delivered by Qwen3-Coder are as follows.
The user who shared the case said Qwen3-Coder showed impressive instruction following, UI design, and animation, and that most test results ran successfully on the first try without any reasoning steps. In the Wordle task, however, Qwen did not use a word parser or cite word sources; it decided to enumerate all 5-letter words on its own.
In a development case of a "spot the difference" game, it can be seen that compared with Qwen3-235B-A22B-2507, which was released yesterday, Qwen3-Coder is significantly better in terms of aesthetics and completion.
Zhidx asked Qwen3-Coder to build a Chinese-English terminology library with basic create, read, update, and delete functions. The immediate impression: with reasoning disabled, Qwen3-Coder worked extremely fast, finishing an initial version in just over 20 seconds, and subsequent modifications to the generated result were also quite quick.
The final result generated by Qwen3-Coder is indeed aesthetically pleasing and clear from a UI perspective, and the functions work properly. However, it did not follow the instruction in the prompt to use PHP + MySQL for development. The final delivered result is more than sufficient for function demonstration and prototype display, but its scalability in real deployment scenarios needs further optimization.
Zhidx also gave Qwen3-Coder a 3D HTML development task: a rotating 3D cube display stand with a different color on each of its six faces, automatic rotation, and lighting and shadow effects. The delivered result was well executed, achieving the main functions with rotation and shadows handled properly.
In addition to programming capabilities, Qwen3-Coder also offers many other features, including image and video generation, and supports the upload of documents, pictures, videos, and audio, which may be achieved through tool calls.
After the official release, the Qwen team also provided some use cases for Qwen3-Coder.
For example, it can be used to create a physics-based chimney demolition simulation with controlled explosions.
It can create an interactive solar system simulation with relatively accurate planetary relationships.
The web games developed by it have a good level of completion.
02 There Is Still Room for Expansion in Pre-training, Reinforcement Learning Conducted in 20,000 Independent Environments
The Qwen team shared some training details of Qwen3-Coder in a technical blog. The team believes that there is still room for further expansion in pre-training.
During the pre-training phase, Qwen3-Coder was trained on 7.5 trillion tokens of data, 70% of which is code, so the model excels at programming while retaining general and mathematical capabilities.
In terms of context, Qwen3-Coder natively supports 256K tokens and can be extended to 1M via YaRN. It is optimized for repository-scale and dynamic data (such as pull requests) to suit intelligent agent programming scenarios.
Qwen3-Coder's predecessor, Qwen2.5-Coder, was used to scale up synthetic data: it cleaned and rewrote noisy data, improving overall data quality.
In the post-training phase, the Qwen team took a view that differs from the common focus on competition-level code generation: all code tasks are naturally suited to execution-driven, large-scale reinforcement learning. The team therefore scaled up code reinforcement learning training on a broader range of real-world programming tasks.
By automatically expanding test cases for diverse programming tasks, the Qwen team created high-quality training instances, further unleashing the potential of reinforcement learning. This not only increased the code execution success rate but also brought benefits to other tasks.
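The core of execution-driven RL with auto-expanded test cases can be sketched in a few lines: a trusted reference solution generates expected outputs for fresh inputs, and a candidate's reward is the fraction of those cases it passes. The function names and the single-`solve`-function setup are illustrative assumptions, not Qwen's actual pipeline.

```python
# Minimal sketch of execution-driven reward scoring for code RL.
# A candidate program is rewarded by the fraction of (auto-generated)
# test cases it passes; names and setup are illustrative.
def execution_reward(candidate_src: str, test_cases: list) -> float:
    """Run candidate code against (args, expected) pairs; reward in [0, 1]."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # candidate must define `solve`
        solve = namespace["solve"]
    except Exception:
        return 0.0  # code that does not even load gets zero reward
    passed = 0
    for args, expected in test_cases:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case simply earns no credit
    return passed / len(test_cases)

def expand_cases(reference, inputs: list) -> list:
    """Auto-expand test cases by running a trusted reference solution
    on freshly sampled inputs."""
    return [(args, reference(*args)) for args in inputs]
```

A graded (rather than all-or-nothing) reward like this gives the policy a learning signal even on partially correct programs, which is one reason scaling the number of test cases per task helps.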
This also inspired the team to further explore task types that are difficult to solve but easy to verify, which are expected to become fertile ground for reinforcement learning.
In real-world software engineering tasks (such as SWE-Bench), Qwen3-Coder must interact with the environment in multiple rounds, involving planning, using tools, receiving feedback, and making decisions. During the post-training phase of Qwen3-Coder, the Qwen team introduced long-horizon reinforcement learning (intelligent agent reinforcement learning), encouraging the model to solve real-world tasks through multi-round interactions using tools.
The key challenge in intelligent agent reinforcement learning lies in environment expansion. To address this issue, the team built a scalable system capable of running 20,000 independent environments in parallel. This infrastructure provides the necessary feedback for large-scale reinforcement learning and supports large-scale evaluation.
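The scaling pattern described above, many isolated environments rolled out in parallel, each returning a reward signal, can be sketched with a thread pool. The `Environment` class here is a toy stand-in for Qwen's cloud infrastructure, and all names are illustrative.

```python
# Sketch of parallel environment rollout for agent RL feedback collection.
# Environment is a toy stand-in; the real system runs 20,000 isolated
# environments, each executing a multi-round plan/tool-use/feedback loop.
from concurrent.futures import ThreadPoolExecutor
import random

class Environment:
    """Toy stand-in for one isolated software-engineering environment."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)

    def rollout(self) -> float:
        # Real systems would score the agent's patch by executing tests;
        # here we just return a deterministic pseudo-random reward.
        return self.rng.random()

def collect_feedback(num_envs: int, workers: int = 32) -> list:
    """Run all environments concurrently and gather their rewards in order."""
    envs = [Environment(seed=i) for i in range(num_envs)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: e.rollout(), envs))

rewards = collect_feedback(num_envs=200)  # scale num_envs toward 20_000
```

Because each environment is independent, throughput scales with the worker count until the rollouts themselves (code execution, tool calls) become the bottleneck, which is why the infrastructure, not the algorithm, is called the key challenge.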
As a result, Qwen3-Coder achieved the best performance among open-source models on SWE-Bench Verified without using reasoning (test-time scaling).
The simultaneously open-sourced Qwen Code is a command-line interface (CLI) tool for research purposes. It is developed based on the Gemini CLI and enhanced with a parser and tool support for the Qwen-Coder model.
In addition to Qwen Code, Claude Code can also be used for programming with Qwen3-Coder. Simply apply for an API key on the Dashscope platform and install Claude Code to start programming.
03 Conclusion: More Sizes to Be Released, Exploring Self-improvement of Programming Agents
At a time when Cursor has stopped providing models such as Claude, the open-source release of Qwen3-Coder gives domestic developers a fresh alternative.
The Qwen team revealed that they are still working hard to improve the performance of the Coding Agent, aiming to enable it to handle complex and tedious tasks in software engineering, thereby freeing up human productivity.
More model sizes of Qwen3-Coder will be released soon to balance deployment costs and performance. Additionally, the team is exploring whether the Coding Agent can achieve self-improvement.
This article is from the WeChat official account “Zhidx” (ID: zhidxcom), author: Chen Junda. It is published by 36Kr with permission.