
Fix real bugs in 4 steps without an Agent. Ant CGM tops the SWE-Bench open-source list.

机器之心 (Synced) · 2025-06-27 19:28
"Less is more."

An agentless pipeline plus an open-source model can also complete repository-level code repair tasks with high quality, with performance comparable to the industry's state of the art (SOTA).

1. Agentless, 44%, and Ranked No. 1

When it comes to AI's ability to write code, the question everyone cares about most is: can it really fix bugs?

The first fully automated AI software engineer, Devin, made a splash in tech circles as soon as it emerged. Its status in the industry was further confirmed on the authoritative benchmark SWE-Bench —

It independently solved 13.86% of the problems, far outperforming GPT-4's mere 1.7% and Claude 2's 4.8%.

Not long after, Genie directly raised the score to 30.08% in the same test and once topped the list of the world's strongest AI programmers.

Why has SWE-Bench gained wide attention from industry, academia, and startup teams? Because it is realistic.

This test set, proposed by Princeton University, consists of tasks drawn from real GitHub projects —

The problems are either bugs developers hit in production or typical feature-development requirements. They are difficult and come with complex contexts, closely reproducing the conditions programmers face in real development work.

In other words, a model that scores high on SWE-Bench must possess the complex skills and experience of a seasoned software engineer — capabilities that traditional code-generation benchmarks can hardly cover.

Because SWE-Bench is so difficult, the team also proposed a slightly easier subset, SWE-Bench Lite. Even so, the difficulty remains high.

All current industry SOTAs are based on closed-source models, and most of the top performers on the leaderboard are "luxury combinations":

Closed-source large models (such as GPT-4o and Claude 3.5) + agent architectures (such as SWE-Agent), with capability "stacked up" through sheer model scale and complex scheduling systems.

Recently, Ant Group found a completely different solution: the Code Graph Model (CGM), which achieves performance comparable to closed-source systems while being built on an open-source model —

On the public SWE-Bench Lite leaderboard, CGM successfully solves 44% of the problems, outperforming all open-source models to rank first among them — and sixth among all systems, open or closed.

Results on the SWE-Bench Lite leaderboard

Specifically, the open-source CGM achieved three breakthroughs on SWE-Bench —

First, it breaks the closed-source monopoly. For the first time, an open-source model (Qwen, i.e. Qianwen) reaches performance comparable to SOTA, and the code-graph data used for training is open-sourced alongside it.

Second, it abandons the complex agent architecture. A lightweight, 4-step GraphRAG process is all that is needed for efficient problem localization and repair.

Third, it innovatively enables the large model to directly understand the code graph structure at the repository level, linking the code and graph modalities, and allowing the model to fully understand the repository - level context.

Currently, CGM has been officially open - sourced. The model, code, and dataset can be obtained on HuggingFace and GitHub:

Paper: https://arxiv.org/abs/2505.16901

Model: https://huggingface.co/codefuse-ai/CodeFuse-CGM-72B

Code: https://github.com/codefuse-ai/CodeFuse-CGM

Data: https://huggingface.co/datasets/codefuse-ai/CodeGraph

In fact, CGM has consistently stayed ahead of strong competitors.

As early as October 2024, it ranked first among open-source models on SWE-Bench Lite with a problem-solving rate of 35.67%;

It topped the list again two months later, with the rate rising to 41.67%.

The latest version breaks the record once more: at 44%, CGM completes a "triple crown" on the open-source track.

2. LLM + Agent Architecture? Promising, but...

Writing code can be regarded as the "innate skill" of large AI models. After the popularity of ChatGPT, various AI code assistants have accelerated their integration into programmers' daily work.

In September 2023, Ant launched the AI code assistant CodeFuse, which supports the entire software development lifecycle, covering key stages such as design, requirements, coding, testing, deployment, and operations.

After two years of development, CodeFuse has gradually built a fairly complete ecosystem. Within it, CGM (Code Graph Model), which handles repository-level tasks, has become one of the key pillars.

In real development, what truly tests a code model is not writing a few functions, but repository-level tasks such as issue fixing and code review. A large project often has tens of thousands of lines of code, thousands of files, and hundreds or thousands of functions. The inheritance and call relationships among classes and modules are intricate — changing one line can ripple across a wide area. It may look as though only one function is being modified, when in fact an entire forest has to be untangled.

To solve such complex tasks, the current mainstream approach in the industry is based on the LLM Agent architecture.

For example, when a user asks "How to add a delete button" or "In which function is the password verification logic", the system will automatically dispatch multiple Agents to perform their respective duties, and at the same time perform operations such as code slicing, embedding calculation, and semantic retrieval on the code in the repository. Finally, it will recall relevant code and generate responses or modification suggestions.
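The retrieval step in such a pipeline can be sketched in a few lines. The toy example below uses a bag-of-words vector as a stand-in for a real embedding model (all function names and snippets are illustrative, not from CodeFuse) to show the "embed, then rank by similarity" idea behind semantic code retrieval:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, snippets: list[str], top_k: int = 1) -> list[str]:
    """Rank code snippets by similarity to the query (the semantic-retrieval step)."""
    q = embed(query)
    return sorted(snippets, key=lambda s: -cosine(embed(s), q))[:top_k]

snippets = [
    "def verify_password(user, pw): ...",
    "def render_delete_button(): ...",
    "def parse_config(path): ...",
]
print(retrieve("in which function is the password verification logic", snippets))
# → ['def verify_password(user, pw): ...']
```

A production system replaces the bag-of-words vector with a learned code embedding, but the ranking logic is the same.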

However, beyond the limited accessibility of closed-source models, this approach has exposed many hidden flaws in real-world scenarios.

First, software development tasks are often complex.

The seemingly simple requirement of "how to add a delete button" involves multiple agents ("nodes"). The more nodes there are, the less controllable the process becomes: any error (such as misjudging a file location or recalling irrelevant code) propagates downstream and accumulates.

Moreover, the more agents there are, the longer the execution path, and the higher the communication and computation costs.

Second, the training data cannot keep up with the system complexity.

Evaluation datasets like SWE-bench, although real and authoritative, provide only end-to-end samples — they label the starting point (the problem) and the end point (the fix), while the trajectory of how agents should decompose tasks and collaborate is usually missing.

In other words, the tasks are fine-grained, but the data remains coarse-grained, which actually makes training harder.

Furthermore, the way language models "read code linearly" has its own limitations.

The traditional approach usually "flattens" an entire file into a long string of tokens, ignoring the natural structure of code. In essence, a code repository is more like a graph — the call relationships between functions, the inheritance relationships between classes, and the dependency relationships between modules are complex but follow clear rules.

To enable large models to truly have the ability to understand at the repository level, a feasible technical path is to directly feed in the structure.

3. The Agentless Route with "Structure Awareness"

Can an open-source large model efficiently complete repository-level code tasks without relying on agents? Ant's All-modal Code Algorithm Team found an answer and proposed the CGM (Code Graph Model) architecture —

Instead of relying on complex agent scheduling, it innovatively uses the code repository graph structure as a modality input and directly integrates it into the large model, capturing complex relationships such as function calls, module dependencies, and class inheritance at once.

This is like giving the large model a pair of "engineering glasses", making the various relationships between code entities (files, classes, functions, variables, etc.), which were originally hidden, immediately clear.

The realization of this ability depends on three key breakthroughs.

1. Multi-granularity Code Graph Modeling to Capture Structural Information

CGM models the code repository as a graph data structure. To capture the structural information of the repository graph, the team first uses program analysis technology to convert the entire code repository into a corresponding code graph (as shown in Figure 1). The node types and edge types in the code graph are as follows:

Node types: Cover 7 types of code entities (REPO / PACKAGE / FILE / TEXTFILE / CLASS / FUNCTION / ATTRIBUTE)

Edge types: Include 5 types of dependency relationships (contains / calls / imports / extends / implements)

Figure 1. Repository code graph

In the code graph, the contains edges capture the hierarchical dependencies between code entities, and the other edge types capture the semantic dependencies between code entities. When constructing the code graph, the handling of complex dependencies is also included.

Inheritance: supports parsing of multiple inheritance (based on the CHA algorithm).

Calls: dynamic calls are resolved conservatively to preserve the completeness of semantic dependencies.

This modeling method currently supports Python and Java.

Through modeling, the originally scattered code will be organized into a structured and directional network. CGM can quickly generate a "code dependency graph" in its "mind", just like a programmer reading an unfamiliar repository for the first time, and clearly see who calls whom and who affects whom.
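As a rough illustration of this kind of modeling (a sketch, not CGM's actual implementation — CGM's program analysis is far more complete), Python's standard `ast` module is enough to extract FILE/CLASS/FUNCTION nodes with `contains` and `extends` edges from a single file:

```python
import ast

def build_code_graph(filename: str, source: str):
    """Extract a tiny code graph from one Python file:
    FILE/CLASS/FUNCTION nodes plus 'contains' and 'extends' edges."""
    nodes = {filename: "FILE"}
    edges = []

    def visit(node, parent):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                nodes[child.name] = "CLASS"
                edges.append((parent, "contains", child.name))
                for base in child.bases:  # multiple inheritance yields multiple edges
                    if isinstance(base, ast.Name):
                        edges.append((child.name, "extends", base.id))
                visit(child, child.name)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                nodes[child.name] = "FUNCTION"
                edges.append((parent, "contains", child.name))
                visit(child, child.name)
            else:
                visit(child, parent)

    visit(ast.parse(source), filename)
    return nodes, edges

src = "class Animal:\n    def speak(self):\n        pass\n\nclass Dog(Animal):\n    pass\n"
nodes, edges = build_code_graph("zoo.py", src)
# contains: zoo.py -> Animal -> speak; extends: Dog -> Animal
```

A full pipeline would additionally resolve `calls` and `imports` edges and link nodes across files — precisely the part where CGM's conservative handling of dynamic calls matters.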

2. Two-stage Training for Structure-Semantic Bimodal Alignment

With the graph structure in place, the next step is to teach the LLM to "read" it: not only understanding the semantics of individual nodes, but also reasoning efficiently over the graph structure, achieving deep integration of structure and semantics.

First, use CodeT5+ to encode the semantic information of each node, and map it to the input space of the large model through an adapter to ensure that the large model can understand the text content of the nodes (semantic alignment);

Second, convert the graph's adjacency matrix into a Graph-aware Attention Mask, which replaces the standard causal attention mask when the LLM processes node tokens.

This change cleverly simulates the "message passing" mechanism in graph neural networks, allowing the attention calculation to only focus on the information flow between adjacent nodes in the graph, so that the LLM can directly perceive and utilize the structural dependency relationships of the code.
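A minimal sketch of that idea: starting from an adjacency matrix, build an additive attention mask in which a node token may attend only to itself and its graph neighbours. (Details such as symmetrization and self-loops are assumptions here, not CGM's exact recipe.)

```python
import numpy as np

def graph_attention_mask(adj: np.ndarray) -> np.ndarray:
    """Turn an adjacency matrix into an additive attention mask:
    each node token may attend only to itself and its graph neighbours,
    mimicking message passing in a GNN. Blocked positions get -inf so
    the softmax assigns them zero weight."""
    allowed = (adj | adj.T) | np.eye(len(adj), dtype=bool)  # symmetrize + self-loops
    return np.where(allowed, 0.0, -np.inf)

adj = np.array([
    [0, 1, 1, 0],  # node 0 (a file) contains nodes 1 and 2
    [0, 0, 0, 0],
    [0, 0, 0, 1],  # node 2 calls node 3
    [0, 0, 0, 0],
], dtype=bool)

mask = graph_attention_mask(adj)
# node 1 may attend to itself and node 0, but not to nodes 2 or 3
```

Adding this mask to the attention logits before the softmax restricts information flow to graph edges, which is exactly the "message passing" analogy the article describes.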

The training process includes two stages — pre-training and fine-tuning — which strengthen "understanding ability" and "generalization ability" respectively:

Subgraph reconstruction pre-training reconstructs source code from an input subgraph, establishing a mapping from the code graph into the LLM's semantic space and laying the foundation for integrating structure and semantics;

Noise-enhanced fine-tuning. At this stage, real GitHub issue-fix patch data is used to fine-tune CGM. To improve robustness, the team deliberately introduced 10% noisy inputs into the prompts: a prompt may contain an irrelevant file that does not actually need to be modified, or omit at least one key file that should be modified. Introducing such controlled noise during training helps the model generalize to scenarios where the input is incomplete or contains interference.
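One way to implement such controlled noise injection — a sketch under assumed names and an assumed 50/50 split between the two noise types, not the paper's exact procedure — is to perturb the retrieved file list while building training samples:

```python
import random

def add_prompt_noise(context_files, distractor_pool, noise_rate=0.10, rng=None):
    """With probability `noise_rate`, perturb the retrieved file list:
    either append an irrelevant distractor file or drop one key file,
    so the model learns to tolerate imperfect retrieval."""
    rng = rng or random.Random()
    files = list(context_files)
    if rng.random() < noise_rate:
        if rng.random() < 0.5 and distractor_pool:
            files.append(rng.choice(distractor_pool))   # inject an irrelevant file
        elif len(files) > 1:
            files.pop(rng.randrange(len(files)))        # omit one key file
    return files
```

Passing a seeded `random.Random` makes the augmentation reproducible across training runs.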

3. GraphRAG Framework: An R4 Pipeline for Efficient Patch Generation

To put the ability into use, the team also designed a lightweight GraphRAG framework.

Compared with existing agentless frameworks, GraphRAG streamlines the pipeline from 10 core modules down to 4 —

Rewriter, Retriever, Reranker, and Reader.

The modules execute sequentially and cooperate efficiently, mirroring the way programmers think and operate when fixing bugs day to day, so that problems are located and repair patches generated efficiently and accurately in real scenarios.
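The four-stage flow can be sketched as a simple function chain. Each stage below is a trivial stand-in (in CGM, the Retriever works over the code graph and the Reader is the fine-tuned LLM; all names and data here are illustrative):

```python
def rewriter(issue: str) -> str:
    """Turn the raw issue text into a focused query (keywords, suspect symbols)."""
    return issue.lower()

def retriever(query: str, code_graph: dict[str, str]) -> list[str]:
    """Collect candidate files whose text matches any query word
    (stand-in for subgraph retrieval over the code graph)."""
    words = query.split()
    return [name for name, text in code_graph.items()
            if any(w in text.lower() for w in words)]

def reranker(candidates: list[str]) -> list[str]:
    """Order candidates by estimated relevance (trivially alphabetical here)."""
    return sorted(candidates)

def reader(ranked: list[str]) -> str:
    """Generate the repair patch from the top-ranked context (placeholder output)."""
    return f"patch touching: {ranked[0]}" if ranked else "no candidate files"

def r4_pipeline(issue: str, code_graph: dict[str, str]) -> str:
    """R4: Rewriter -> Retriever -> Reranker -> Reader, executed sequentially."""
    return reader(reranker(retriever(rewriter(issue), code_graph)))

code_graph = {
    "auth.py": "def verify_password(pw): ...",
    "ui.py": "def render_delete_button(): ...",
}
print(r4_pipeline("Password check raises TypeError", code_graph))
# → patch touching: auth.py
```

The appeal of this shape is that each stage has one job and a plain input/output contract, so errors do not cascade through an open-ended agent loop.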

Of course, for enterprises with SWE needs, the appeal of CGM goes far beyond leaderboard results.

While ensuring the security and controllability of core data, CGM brings greater freedom to enterprises —

It not only avoids the risk of privacy leakage but also eliminates the burden of continuously paying high API fees. Enterprises can deeply customize and optimize the deployment of the model based on their own business needs.

Open-source high-performance large models like DeepSeek-V3 have become the preferred choice for many private deployments, and the CGM architecture will likewise attract the attention of enterprises with these requirements.

As OpenAI CEO Sam Altman said: "By the end of 2025, software engineering will change dramatically." CGM is undoubtedly a resounding step in this change.

If you are interested in the Ant All-modal Code Algorithm Team's earlier research on large code models and code graphs, you are welcome to read further: