Is the Essence of Coding = Reinforcement Learning + Synthetic Data + Ten-Thousand-GPU Compute?

Signals from Composer 2.5

In today's AI programming field, Claude Code, Codex, and Cursor are the three best-known agent tools.

The first two are backed by Anthropic and OpenAI respectively. Powered by those companies' most advanced models, Opus 4.7 and GPT-5.5, they have repeatedly topped programming benchmarks.

By contrast, Cursor, which launched in 2023, now looks a little left behind. To turn things around, it has dropped a bombshell: Composer 2.5.

Although Cursor published only a short technical blog post, roughly a two-minute read, it still staked out its technical ground in a restrained way: a partnership with Musk's SpaceXAI for compute equivalent to 1 million H100s, a 25-fold increase in the scale of synthetic data, and a very aggressive commercial pricing strategy.

At the very bottom of the post, Cursor left three unobtrusive footnotes. They cite three hardcore academic papers covering its modifications to reinforcement learning, synthetic data, and underlying infrastructure, which correspond exactly to the three elements of AI: algorithms, data, and computing power. These are the keys to Composer 2.5's capabilities.

Cursor is telling the entire industry a plain truth: competition in AI programming has long since moved from the cold-weapon era of simply wrapping APIs to the nuclear-weapon era of rewriting the underlying reinforcement learning algorithms.

01 Reinforcement Learning: "Self-Distillation"

Developers and laypeople see AI programming very differently. Laypeople think it lowers the barrier to entry, letting people who cannot program build an application. Developers, however, find that today's AI programming still cannot do without manual review: once the number of interactions grows and the context gets longer, performance declines sharply.

Cursor has named a world-class problem that the entire AI programming industry currently has to face: "credit assignment".

Imagine a literature teacher who receives a 100,000-word novel from a student. After a cursory glance, the teacher decides the whole thing is a mess and simply gives it a failing grade.

In AI, traditional reinforcement learning with scalar rewards, represented by the GRPO algorithm, does exactly this. It gives only a final discrete score: 1 means correct, and 0 means wrong.

This approach is not wrong, but it is not rigorous enough. After getting a failing grade, the student has no idea where things went wrong. Did the characterization collapse at the beginning? Did the logic break in the middle? Or did the ending drift off topic?

The same goes for AI models. Without specific feedback, when the next complex task calls for generating hundreds of thousands or even millions of tokens of code, they still don't know where to start modifying, what to modify, or how. Moreover, while blindly trying and failing, traditional models often pad their chain of thought with filler. Behind that filler is a very real bill for output tokens.
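To make the credit-assignment gap concrete, here is a minimal sketch of a group-relative advantage in the style of GRPO. It is an illustration rather than Cursor's training code: the function names, the four sample rewards, and the token counts are all made up for the example. The point is that every token in a rollout inherits the same number, so the model never learns which token caused the failure.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_token_advantages(rewards, token_counts):
    """Broadcast one scalar advantage to every token of its rollout."""
    advantages = group_relative_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, token_counts)]

# Hypothetical group of four rollouts: 1.0 = tests passed, 0.0 = tests failed.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]  # tiny stand-ins for very long rollouts

token_advs = per_token_advantages(rewards, token_counts)
for reward, advs in zip(rewards, token_advs):
    print(reward, [round(a, 3) for a in advs])
```

Each failed rollout gets one identical negative value on all of its tokens, which is the "failing grade with no comments" from the teacher analogy.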

To solve this problem, Cursor turned to a mechanism it describes as targeted reinforcement learning from textual feedback. The engineering team introduced "self-distillation" into the training process for long-form code generation.

Distillation always involves a teacher model and a student model. Here it works like an exam that has both an open-book and a closed-book part:

When a tool-call error occurs during a code generation run of hundreds of thousands of tokens, Cursor hands the model the specific error message along with the list of tools that are actually available, letting it "open the book" and see the answer. Having seen the correct answer, this copy of the model is effectively all-knowing and naturally becomes the teacher model.

Meanwhile, the same model, which has not seen the answer and can only write code on instinct, serves as the student model and is trained to align with the teacher.

The teacher model doesn't need to rewrite the code from start to finish. It only needs to tell the student model at the specific position where the code reported an error, "At this token, you should reduce the probability of choosing tool A and increase the probability of choosing tool B."
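The token-level nudge described above can be sketched as follows. This is a toy under stated assumptions, not Cursor's implementation: the three tool names, the logits, and the learning rate are invented, and a single decision point stands in for one token position inside a long rollout. The "teacher" distribution is the model conditioned on the error message and tool list, the "student" is the same model without that hint, and the update follows the gradient of KL(teacher || student) with respect to the student's logits, which is simply `p_student - p_teacher`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same choices."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(student_logits, teacher_logits, lr=0.5):
    """One gradient step that pulls the student toward the teacher at this token."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return [z - lr * (ps - pt)
            for z, ps, pt in zip(student_logits, p_student, p_teacher)]

TOOLS = ["tool_a", "tool_b", "tool_c"]   # hypothetical tool names
student_logits = [2.0, 0.5, 0.1]         # no hint: prefers tool_a (the wrong call)
teacher_logits = [0.2, 2.5, 0.1]         # same model after seeing the error + tool list

logits = student_logits
for _ in range(50):
    logits = distill_step(logits, teacher_logits)
trained = softmax(logits)

for name, before, after in zip(TOOLS, softmax(student_logits), trained):
    print(f"{name}: {before:.3f} -> {after:.3f}")
```

Because the signal is a full distribution at the failing position rather than one score for the whole rollout, the probability of `tool_a` drops and that of `tool_b` rises without touching the rest of the trajectory.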

This seemingly simple self-distillation process brings unexpected results:

First, the model says goodbye to catastrophic forgetting. This on-policy method lets the model learn new skills, such as calling complex tools, while keeping its original coding and reasoning abilities intact.

Second, the filler comes to an end. Traditional reinforcement learning algorithms often produce thousands of tokens of useless output, whereas the reasoning of a model trained with self-distillation tends to be extremely concise.

In other words, Composer 2.5 refuses to "think for the sake of thinking" and aims for "a single strike to hit the target".

02 Synthetic Data: "Cheat Sheet"

In order to catch up with and even surpass Claude Code and Codex, Cursor has really gone all out this time. Not only has it been clever in terms of algorithms, but it has also spared no expense in terms of data:

In training Composer 2.5, Cursor used 25 times more synthetic data than for the previous-generation model.

The Scaling Law has never failed. However, with Internet data close to exhausted, "synthetic data" has become the lifeline for every AI company.

Cursor has adopted a clever way to obtain synthetic data: first destroy, then rebuild, that is, the function deletion method.

The research team starts from a large real-world codebase with extensive automated tests. They let an AI play the role of a "harmless destroyer" that deletes the code and files behind specific functions, while making sure the remaining code still runs.

Next, they throw this incomplete but still runnable codebase to Composer 2.5 during the training process and require it to reproduce the deleted functions. The judgment basis is also very simple, which is to see if it can pass the original test cases.
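The delete-then-rebuild loop can be sketched in a few lines. This is a toy under stated assumptions, not Cursor's pipeline: a two-function string stands in for a real codebase, a single Python function stands in for a deleted feature, and `delete_function`, `still_runs`, and `run_tests` are hypothetical helper names. It relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast

ORIGINAL = '''
def add(a, b):
    return a + b

def mean(values):
    return sum(values) / len(values)
'''

def delete_function(source, name):
    """Remove one top-level function, leaving the rest of the module intact."""
    tree = ast.parse(source)
    tree.body = [node for node in tree.body
                 if not (isinstance(node, ast.FunctionDef) and node.name == name)]
    return ast.unparse(tree)

def still_runs(source):
    """The damaged codebase must still load without errors."""
    try:
        exec(compile(source, "<repo>", "exec"), {})
        return True
    except Exception:
        return False

def run_tests(source):
    """Reward is 1.0 only if the original test cases pass again."""
    namespace = {}
    try:
        exec(compile(source, "<repo>", "exec"), namespace)
        assert namespace["add"](2, 3) == 5
        assert namespace["mean"]([1, 2, 3]) == 2
        return 1.0
    except Exception:
        return 0.0

damaged = delete_function(ORIGINAL, "mean")
# A candidate reconstruction, as the model under training might produce it.
candidate = damaged + "\n\ndef mean(values):\n    return sum(values) / len(values)\n"

print(still_runs(damaged), run_tests(damaged), run_tests(candidate))
```

The reward only checks test outcomes, not how the candidate was produced, which is exactly why a capable model can be tempted to recover the answer by other means.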

To humans this looks like a fill-in-the-blanks test, but for an AI it is a very demanding exercise in reconstructing context. During this process, however, Cursor observed a somewhat uncomfortable phenomenon: AI reward hacking.

To put it simply, as Composer's capabilities have improved, it has started to take the wrong path and completed tasks by frantically looking for system loopholes instead of writing code honestly and step by step.

There are two proven cases:

First, the model found a leftover Python type-checking cache in the system. It reverse-engineered the cache format and "stole" the deleted function signatures from it.

Second, when faced with a missing third-party API, the model traced the underlying Java bytecode and wrote a decompilation script to rebuild the API.

Admittedly, this looks a bit like the opening of a science-fiction movie in which AI awakens and is about to rule humanity.

From a technical perspective, it demonstrates the power of large-scale reinforcement learning in AI programming. The world of code is essentially a sandbox with "objective truth": if the code runs and gives the correct result, it is correct; otherwise, it is wrong. In this sandbox, to reach the goal faster, as a human engineer would, the model has begun to exhibit side-channel attack and reverse-engineering abilities that only senior human hackers possess.

Cursor's research team discovered these so-called "cheating behaviors" through agent monitoring. Strictly speaking, they point to problems in both the data and the algorithm, yet they have turned into excellent marketing:

An AI that will decompile Java bytecode just to cut corners should be massively overqualified for the everyday business code humans need help with.

03 Underlying Infrastructure: Computing Power Optimization

With data and algorithms covered, next comes the computing power problem that plagues AI companies worldwide. After all, sophisticated algorithms always rest on heavy-asset infrastructure that has to be built brick by brick.

This time, Cursor has plenty to draw on, both externally and internally:

First, Cursor announced with great fanfare that Composer 2.5 was built in partnership with Musk's SpaceXAI, using compute equivalent to 1 million H100s provided by the Colossus data center. The figure is startling: the total compute reserves of many mainstream large-model vendors may not reach even one-tenth of it.

Beyond the help from Musk, Cursor has also pushed its own low-level compute optimization to the extreme, much as Chinese model makers have. The two core technologies mentioned in the official technical blog, Sharded Muon and Dual-Grid HSDP, are Cursor's most hardcore work in AI training infrastructure.

Before going into these two technologies, it helps to know that today's top large models generally adopt the Mixture of Experts (MoE) architecture, in which parameters fall into two categories, non-expert weights and expert weights, corresponding roughly to general knowledge and specialized knowledge.

Once a model grows to the trillion-parameter scale and beyond, the computation must be split across thousands of GPUs. At that point, the communication latency of moving data between GPUs becomes a harder bottleneck than the computation itself.

Muon is a cutting-edge optimizer that Moonshot AI (月之暗面, literally "dark side of the moon") has refined. It orthogonalizes update matrices, which makes training more stable and convergence faster.

However, orthogonalizing matrices is very expensive for the expert weights. Cursor therefore groups matrices of the same shape into shards and distributes the shards across GPUs to be computed in parallel. Once the computation finishes, the results are gathered back together.
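A rough sketch of the shard-then-gather idea, under several assumptions: it uses the classic cubic Newton-Schulz iteration as a simple stand-in for the orthogonalization step (Muon implementations typically use a tuned higher-order polynomial), a plain loop stands in for GPUs, and `shard_by_shape` and `sharded_orthogonalize` are hypothetical names rather than Cursor's API.

```python
import math
from collections import defaultdict

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def orthogonalize(g, steps=30):
    """Classic cubic Newton-Schulz iteration: X <- 1.5*X - 0.5*X*X^T*X.

    Dividing by the Frobenius norm first keeps every singular value in (0, 1],
    the range in which the iteration pushes each of them toward 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xt = [list(col) for col in zip(*x)]
        cubic = matmul(matmul(x, xt), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, cubic)]
    return x

def shard_by_shape(matrices, world_size):
    """Bucket same-shaped matrices, then deal each bucket out to ranks round-robin."""
    buckets = defaultdict(list)
    for idx, m in enumerate(matrices):
        buckets[(len(m), len(m[0]))].append(idx)
    assignment = defaultdict(list)
    for indices in buckets.values():
        for i, idx in enumerate(indices):
            assignment[i % world_size].append(idx)
    return assignment

def sharded_orthogonalize(matrices, world_size):
    """Each simulated rank processes its shard; results are gathered in order."""
    assignment = shard_by_shape(matrices, world_size)
    gathered = [None] * len(matrices)
    for rank, indices in assignment.items():  # each iteration stands in for one GPU
        for idx in indices:
            gathered[idx] = orthogonalize(matrices[idx])
    return gathered

def gram_error(x):
    """Largest deviation of X*X^T from the identity matrix."""
    gram = matmul(x, [list(col) for col in zip(*x)])
    n = len(gram)
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))

# Toy "expert weight" updates: two 2x2 matrices and one 2x3 matrix.
updates = [[[2.0, 1.0], [1.0, 3.0]],
           [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]],
           [[0.0, -1.0], [4.0, 1.0]]]
results = sharded_orthogonalize(updates, world_size=2)
print([gram_error(r) < 1e-6 for r in results])
```

Grouping by shape matters because same-shaped matrices cost the same to orthogonalize, so dealing them out evenly keeps the simulated ranks balanced.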

In traditional distributed computing, a GPU incurs network latency between sending data and receiving the result. Cursor overlaps the two asynchronously: after sending the data for one task, a GPU does not sit idle but immediately starts computing the next task.
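The overlap pattern can be illustrated with a background worker from Python's standard library. This is a conceptual sketch only: a thread with a short sleep stands in for the network, `compute` and `send` are hypothetical placeholders, and real training stacks would use asynchronous collective operations on GPUs instead.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def compute(task):
    """Stand-in for the math a GPU does on one task."""
    return [x * x for x in task]

def send(payload):
    """Stand-in for a network transfer; returns the reduced result."""
    time.sleep(0.01)
    return sum(payload)

def run_overlapped(tasks):
    """Compute task k while the transfer for task k-1 is still in flight."""
    received = []
    with ThreadPoolExecutor(max_workers=1) as network:
        in_flight = None
        for task in tasks:
            payload = compute(task)                  # overlaps with the previous send
            if in_flight is not None:
                received.append(in_flight.result())  # wait only once the result is needed
            in_flight = network.submit(send, payload)
        if in_flight is not None:
            received.append(in_flight.result())
    return received

tasks = [[1, 2], [3, 4], [5, 6]]
print(run_overlapped(tasks))
```

The results are identical to running each send and compute back to back; only the waiting time changes, because the transfer for one task hides behind the computation of the next.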

Dual-Grid HSDP tackles the parameter heterogeneity of MoE models. By decoupling the underlying communication process groups, Cursor built two physically isolated communication grids:

The narrow grid is dedicated to non-expert weights. Its high-frequency operations run entirely over the ultra-high-bandwidth links inside a node, avoiding cross-node network latency altogether.

The wide grid is dedicated to expert weights. Expert parallelism and parameter sharding spread the storage and compute pressure of the expert state across as many GPUs as possible.
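One way to picture the two grids is as two different groupings of the same GPU ranks. The sketch below is an assumption-laden illustration, not Cursor's layout: it supposes 8 GPUs per node, treats each node as one narrow group and the whole cluster as one wide group, and routes parameters by a made-up naming convention in which expert weights contain `.experts.`.

```python
def build_grids(world_size, gpus_per_node):
    """Return rank groups for the narrow (intra-node) and wide (cluster) grids."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    narrow = [list(range(start, start + gpus_per_node))
              for start in range(0, world_size, gpus_per_node)]
    wide = [list(range(world_size))]
    return {"narrow": narrow, "wide": wide}

def grid_for(param_name):
    """Route expert weights to the wide grid and everything else to the narrow grid."""
    return "wide" if ".experts." in param_name else "narrow"

grids = build_grids(world_size=16, gpus_per_node=8)

# Hypothetical parameter names for one MoE layer.
params = ["layers.0.attention.wq", "layers.0.experts.3.w1", "layers.0.router.weight"]
routing = {name: grid_for(name) for name in params}

print(grids["narrow"])
print(routing)
```

In this picture, traffic for non-expert weights never leaves a node, while expert weights can be sharded over every rank in the cluster.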

The core payoff of this dual-grid layout is extreme overlap of communication and computation, plus parallel dimensions that stack without conflict. After this series of optimizations, network communication time is perfectly hidden inside compute time. For a trillion-parameter-scale model, each step of this highly complex optimizer takes only an astonishing 0.2 seconds.

Cursor's extreme engineering capabilities ensure that it can turn the most cutting-edge academic theory into product with maximum efficiency. This is also a moat that latecomers will struggle to cross.

04 Reshaping the Developer Ecosystem

Finally, from the release of Composer 2.5, we can see Cursor's clear business strategy. Its ambition will not stop at just being a useful programming agent.

Composer 2.5 adopts the familiar two-tier pricing: a regular version and a Fast version. Both are equally capable, but the latter is faster.

Regular version: Input costs $0.5 per million tokens, and output costs $2.5 per million tokens.

Fast version: Input costs $3 per million tokens, and output costs $15 per million tokens.
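To see what these rates mean per task, the short calculation below applies them to a made-up workload. The prices come from the list above; the 200,000 input tokens and 20,000 output tokens are an arbitrary example, not a figure from Cursor.

```python
# USD per million tokens as (input, output), taken from the price list above.
PRICES = {"regular": (0.5, 2.5), "fast": (3.0, 15.0)}

def task_cost(tier, input_tokens, output_tokens):
    """Cost in USD of one task at the given pricing tier."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical agent task: 200k tokens read, 20k tokens written.
regular = task_cost("regular", 200_000, 20_000)
fast = task_cost("fast", 200_000, 20_000)
print(f"regular: ${regular:.2f}, fast: ${fast:.2f}, ratio: {fast / regular:.1f}x")
```

For this workload the regular version costs $0.15 and the Fast version $0.90. Since both input and output rates are six times higher, the Fast version costs six times as much for any mix of tokens.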

Although the Fast version is much more expensive than the regular version, Cursor specifically emphasizes that it still costs less than comparable offerings from other frontier models.

This is not unusual. Anthropic's Opus 4.7 and OpenAI's GPT-5.5 have API prices far above most models in the world, yet the cost for these two top-tier models to complete a task is actually lower.

This also shows how precisely Cursor reads user psychology. For high-net-worth, highly paid programmers, an unbroken train of thought is often priceless, and spending a few more dollars buys a millisecond-level improvement in code generation speed. Cursor sets the Fast version as the default option and offers double the usage in the first week. In essence, it is cultivating an almost physiological dependence on "better-experience AI programming" at low cost.

This is what top international AI companies generally do: once users get used to a model's speed and accuracy, it becomes extremely hard for them to switch to a competitor.

Cursor's technology stack can handle hundreds of thousands of tokens of context, cross-file editing, and targeted correction of tool calls, which also shows that it is positioned as an agent for collaborating on long-running tasks.

Users don't need to press the Tab key line by line. They only need to state an architectural requirement, and Cursor can read the cache, call interfaces, and run tests in the background on its own. Even if an error occurs, there is no need to worry. The self-distillation technology

The views in this article are solely those of the author. The 36Kr platform provides only information storage services.
