
LeCun's team open-sources the first code world model: it can generate code, then self-test and self-repair, making traditional programming models seem obsolete overnight.

量子位 (QbitAI) | 2025-09-25 17:26
Outperforms Qwen and approaches the capabilities of GPT-4

Meta FAIR has just launched the Code World Model!

CWM (Code World Model) is a dense language model with 32 billion parameters and a context window of up to 131k tokens, a research model designed specifically for code generation and reasoning.

It is the world's first language model to systematically bring world modeling into code generation.

Compared with existing large code models, CWM's most distinctive trait is that it does not merely generate code and understand its semantics.

More importantly, it "understands" how code executes: it can simulate how variable states change and how the environment responds as a program runs, which lifts its code understanding, debugging, and even planning abilities.

In other words, its way of thinking comes close to a human programmer's.

In multiple code and reasoning tasks, CWM has shown excellent performance. For example, it scored 65.8% on SWE-bench Verified, leading all open-source models of the same scale and approaching the level of GPT-4.

More importantly, Meta FAIR has open-sourced the model code, training details, and weight checkpoints from multiple training stages, an unusually open release.

Someone left a message for LeCun asking:

"Don't you always think that language models are just a side branch on the AI path (LLMs are an off ramp)? Why have you launched a world model based on language models?"

LeCun replied casually:

Yes, but now we're talking about programming, not ASI.

Enabling large models to "understand dynamic execution"

CWM was born to address a major pain point of today's large models in code generation:

Although existing large models can already write code, execution results are unstable: generated code can be hard to debug, may not run, and may even hide logical errors.

The FAIR team believes the root cause is that large models treat code purely as text to predict.

Such a model does not understand how the code will actually run; it has only a partial grasp (or none at all) of how variable states change and what side effects function calls produce.

In the view of the FAIR team:

If you want the model to think like a programmer, you must teach it the "world state" changes during code execution.

Therefore, CWM introduces code world modeling into the training process for the first time, explicitly teaching the model how program state evolves step by step as code executes.

This means that CWM's understanding dimension has shifted from static text to dynamic execution.

Gabriel Synnaeve, a senior research scientist at Meta FAIR specializing in AI for code generation and a senior core contributor to CWM, shared an example on 𝕏 of CWM tracing code that counts the 'r's in "strawberry":

You can think of it as a neural 'pdb': you can set it to any initial frame state and use inference as a tool to query it in token space.
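To make the "neural pdb" idea concrete, here is a minimal sketch of the kind of line-by-line state trace being described. The code and the commented trace format are illustrative assumptions, not CWM's actual trace representation:

```python
# Illustrative only: a toy function plus the kind of line-by-line state
# trace a "neural pdb" would be trained to predict. The trace format in
# the comments is invented for this sketch, not CWM's data layout.
def count_r(word: str) -> int:
    count = 0              # state: {word: "strawberry", count: 0}
    for ch in word:        # visits s, t, r, a, w, b, e, r, r, y in turn
        if ch == "r":
            count += 1     # after 3rd char: count=1; 8th: count=2; 9th: count=3
    return count           # predicted return value: 3

assert count_r("strawberry") == 3
```

A model that can predict the commented states, rather than just the next token of source text, is doing exactly the "world state" tracking the FAIR team describes.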

Compared with the static, token-by-token prediction of traditional large code models, CWM is upgraded in three major capabilities:

First, code execution simulation.

CWM can simulate the code execution process line by line, predict how each line affects the variable state, and even anticipate potential runtime errors in advance.

This ability makes it possible to build a "neural debugger".

During inference, CWM can continuously update the variable state as the simulated code runs.

It can even simulate termination conditions, loop unrolling, and boundary cases to understand the program logic more accurately.

Second, self-debugging and repair.

CWM can not only write code but also self-test and fix errors.

After generating code, it can automatically construct test cases and, when the code fails, attempt self-repair along multiple modification paths.

The whole process mirrors a human programmer's everyday closed loop: write → test → modify → retest (a minimal sketch of this loop follows).
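As a rough illustration of that closed loop, here is a hedged sketch. The `generate`, `repair`, and `tests` callables are hypothetical placeholders for model and test-harness calls, not CWM's actual API:

```python
# A minimal, hypothetical sketch of the write -> test -> modify -> retest
# loop the article describes. The three callables stand in for model and
# harness calls; they are placeholders, not CWM's actual interface.
from typing import Callable

def solve_with_self_repair(
    task: str,
    generate: Callable[[str], str],        # prompt -> candidate source code
    repair: Callable[[str, str], str],     # (code, error) -> revised code
    tests: Callable[[str], str | None],    # code -> error message or None
    max_rounds: int = 3,
) -> str | None:
    code = generate(task)                  # write
    for _ in range(max_rounds):
        error = tests(code)                # test
        if error is None:
            return code                    # all tests pass
        code = repair(code, error)         # modify, then retest
    return None                            # give up after max_rounds
```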

Third, reasoning and planning capabilities.

When facing complex problems, CWM can also perform reasoning and planning.

For example, in programming competitions or mathematical tasks, it can break the problem description down into steps, plan the function structure, and then generate and verify code step by step in combination with execution prediction, demonstrating multi-round logical reasoning.

CWM model information: parameters, architecture, and performance at a glance

CWM's architecture is a 64-layer decoder-only Transformer with 32 billion parameters.

It supports long-context input of up to 131k tokens, which greatly expands its ability to handle complex projects, multi-file code, and documentation context.

Correspondingly, its attention layers alternate between local and global attention, balancing efficiency against context coverage.
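To illustrate what such an alternating layout can look like, here is a minimal sketch; the 3-local-to-1-global interleaving and the window size are illustrative assumptions, not confirmed CWM hyperparameters:

```python
# Hypothetical layer layout for alternating local/global attention.
# The 3:1 ratio and the 8k window below are assumptions for illustration.
N_LAYERS = 64
LOCAL_WINDOW = 8_192  # assumed sliding-window size for "local" layers

def layer_attention_kinds(n_layers: int = N_LAYERS) -> list[str]:
    """Label each layer 'local' or 'global', with one global every 4 layers."""
    return ["global" if (i + 1) % 4 == 0 else "local" for i in range(n_layers)]

print(layer_attention_kinds()[:8])
# ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global']
```

Local layers keep attention cost bounded by the window, while the periodic global layers let information flow across the full 131k-token context.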

FAIR provides the following three checkpoints for researchers (a loading sketch follows the list):

  • CWM pre-trained model: e.g., for new post-training methods.
  • CWM SFT: e.g., for reinforcement-learning research.
  • CWM: e.g., for research on inference-time scaling.
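For researchers who want to try a checkpoint, here is a hedged loading sketch using the Hugging Face transformers API. The repo id "facebook/cwm" is an assumption; check Meta's official release for the exact identifiers of the pretrain, SFT, and final checkpoints:

```python
# Hedged sketch: loading a CWM checkpoint with Hugging Face transformers.
# The repo id below is an assumption, not a verified identifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/cwm"  # assumed repo id; verify against the release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a 32B dense model needs multiple large GPUs
    device_map="auto",
)

prompt = "def count_r(word: str) -> int:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```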

In evaluations against multiple first-tier models, CWM's results are as follows:

  • SWE-bench Verified: 65.8%, leading all open-source models of the same scale and approaching the level of GPT-4;
  • LiveCodeBench v5: 68.6%, demonstrating accuracy on high-complexity programming tasks;
  • Math-500: 96.6%, with 76.0% on AIME 2024 mock questions;
  • Terminal-Bench: 26.3%, higher than Gemini 2.5 Pro;
  • Aider Polyglot (multi-language code generation): 35.1%, on par with Qwen3-32B.

Overall, CWM shows strong performance across understanding, generation, verification, and repair.

The FAIR team said that CWM verifies the value of "code world modeling" in improving reasoning and code generation.

Gabriel Synnaeve said:

I'm extremely proud of the work my CodeGen team has done! The team is made up of PhD students and experienced senior staff. We all pulled together, went all out, and never pointed fingers when problems came up. The whole Meta AI community worked together on this. Many thanks to our leadership for their consistent support.

Three-stage training process and dataset construction

CWM is trained in three stages:

First stage: pre-training (Pretrain).

In this stage, CWM is trained on 8T tokens of general language and code data, with code accounting for about 30% and a context length of 8k tokens.

Second stage: mid-training (Mid-train), which is also CWM's most distinctive step.

In this stage, 5T tokens of world-modeling data train the model to recognize how program state changes during code execution.

The core data types of this part include:

  • Python execution trajectory data: drawn from tens of millions of function calls and code commits, recording how variable values change as each line executes (see the collection sketch after this list);
  • ForagerAgent data: model-driven agents run code in real Docker environments, fixing bugs and executing tasks, producing 3 million real interaction trajectories;
  • Natural-language trace descriptions: execution processes rendered as natural language for easier generalization and transfer.
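As a concrete illustration of how line-level trajectories can be collected in plain Python, here is a minimal sketch using the standard-library sys.settrace hook. The article does not describe FAIR's actual tracing pipeline, so this only shows the general technique:

```python
# Hedged sketch of collecting line-level execution traces with Python's
# built-in sys.settrace. It illustrates the kind of data described above,
# not FAIR's actual pipeline.
import sys

def trace_locals(func, *args):
    """Run func(*args), snapshotting local variables before each line runs."""
    steps = []

    def tracer(frame, event, arg):
        # Only record 'line' events inside the target function's frame.
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, steps

def count_r(word):
    count = 0
    for ch in word:
        if ch == "r":
            count += 1
    return count

result, steps = trace_locals(count_r, "strawberry")
print(result)      # 3
print(steps[:2])   # e.g. [(lineno, {'word': 'strawberry'}), (lineno, {...})]
```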

It is also at this stage that CWM's context window is extended to 131k tokens, supporting complete modeling of large-scale projects and code workflows.

Third stage: post-training (SFT + multi-task RL).

Finally, CWM undergoes supervised fine-tuning (SFT) on 100B tokens and multi-task reinforcement learning (RL) on 172B tokens.

The training tasks cover real software-engineering tasks (such as SWE-bench), competitive-programming problems (CodeContests, etc.), and mathematical reasoning (such as AIME mock questions and MathQA).

In this stage, the FAIR team uses an asynchronous RL mechanism, distributed environments, and bootstrapping to improve the model's generalization across environments and tasks (a generic reward sketch follows).
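The article does not specify CWM's reward design, but RL on code commonly uses a verifiable test-pass-rate reward. Here is a generic, hedged sketch of that idea, not FAIR's implementation:

```python
# Hedged sketch: a test-pass-rate reward of the kind commonly used for
# RL on code. Everything here is a generic illustration.
import subprocess
import sys
import tempfile

def test_pass_reward(code: str, tests: list[str], timeout: float = 10.0) -> float:
    """Return the fraction of test snippets that run without error."""
    if not tests:
        return 0.0
    passed = 0
    for test in tests:
        # Write the candidate code plus one test snippet to a temp script.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + test)
            path = f.name
        try:
            proc = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            passed += int(proc.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # a hung rollout counts as a failure
    return passed / len(tests)

reward = test_pass_reward(
    "def add(a, b):\n    return a + b",
    ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"],
)
print(reward)  # 1.0
```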

In terms of infrastructure, CWM training uses FlashAttention-3, FSDP + TP parallel strategies, and fp8 low-precision acceleration.

Meta FAIR emphasizes that the training process follows the advanced AI safety practices of its Frontier AI Framework.

Its assessments show that CWM does not pose abuse risks in high-sensitivity fields such as cybersecurity, chemistry, and biology.

Also worth noting: CWM's world-modeling data currently covers only Python, not yet mainstream languages such as C++ and Java, nor symbolic-execution tasks.

However, the research team says it will explore multi-language expansion in the future, which could develop into a general framework for automated programming assistants.

Two More Things

BTW, if you want to use CWM, there are two points that deserve special attention:

First, CWM is mainly intended for research on code understanding and complex reasoning, and it has not undergone RLHF.

Therefore, it is not suitable for dialogue tasks or as a chatbot.

Second, CWM is explicitly positioned as "for research use only", i.e., for non-commercial research.

In short, by open-sourcing the model, making its data transparent, and fully opening up training reproduction, the CWM team also poses an important question to the research community: