
For the first time, a large model directly understands code graphs: it fixes bugs automatically without an Agent and tops the SWE-bench open-source model list.

QbitAI (量子位) · 2025-06-27 16:32
Completely based on open-source models.

AI automatically fixes bugs with a resolution rate of 44%! This represents the latest top-tier performance among open-source models worldwide.

The new open-source model from Ant Group outperforms all open-source solutions on SWE-bench Lite, with performance comparable to closed-source models.

Here are the specific results on SWE-bench Lite:

  • Ranks first among all open-weight model methods (Open Weight Model);
  • Ranks sixth among all open-source system methods (Open Source System);
  • Holds the 14th overall rank;
  • Outperforms "KGCompass", the current best open-source model on the leaderboard, by 7.33%.

The team is the first to integrate the repository code-graph modality into a large model (Code Graph Model, CGM), enabling the LLM to directly understand code graphs and to fix bugs and complete code more efficiently.

This completely eliminates the reliance on black-box models (such as GPT-4 or Claude 3.7) and complex Agent workflows, achieving more controllable, transparent, and secure software-engineering automation.

Moreover, CGM is built entirely on open-source models. Notably, open-source models generally perform poorly on SWE-bench, and almost all previous SOTA-level solutions were built on closed-source models; CGM, based on the Qwen model, nonetheless reaches a level comparable to closed-source models.

CGM can quickly locate and generate patches in just 4 steps, eliminating the complex orchestration process in Agent solutions and significantly improving efficiency.

Enabling AI to truly understand large code repositories

Since the rise of large models, AI programming has rapidly taken off, excelling especially at small tasks like writing individual functions: on benchmarks such as HumanEval, many models now exceed 90% accuracy.

However, real-world software engineering is far more complex than "writing a single function". Tasks such as bug fixing and feature enhancement often require cross-file and cross-module operations, and demand that the model understand a project's complex structure, dependencies, and class-inheritance hierarchy.

The current mainstream approach typically uses Agents built on closed-source models. They simulate the behavior of human programmers: observing code, invoking tools, and conducting multi-round interactions to complete tasks.

But this approach also has several issues:

  • The behavior path is uncontrollable, and reasoning errors are likely to accumulate;
  • It relies on closed-source models like GPT-4 and Claude, making private deployment or customization difficult;
  • The engineering cost is high, and the efficiency is low.

Meanwhile, current solutions using open-source models struggle to achieve SOTA-level results.

Therefore, the research team posed the question: Can we solve repository-level tasks using only open-source models, without relying on Agents? This is how CGM was born.

Deep integration of graph structure and large models

CGM adopts a cross-modal modeling approach similar to a Vision-Language Model (VLM): it combines the text-understanding ability of a traditional LLM with the structural graph of the code repository to form a graph-language multimodal model. The core of the model integrates two modalities:

  • Graph modality: the repository is built into a structured graph, with 7 node types (such as functions, classes, files, and packages) and edges representing dependencies such as calls, containment, and inheritance;
  • Language modality: the user's natural-language description and code prompts drive the model to generate patches or answers.

The model takes a code graph and a text-form prompt as inputs and aligns the two modalities, structure and semantics, inside the LLM.
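To make the graph modality concrete, here is a minimal sketch of such a repository graph in Python using networkx. The attribute names ("kind", "rel", "text") and the sample nodes are illustrative assumptions, not CGM's actual schema:

```python
import networkx as nx

# Toy repository code graph: typed nodes plus dependency edges.
g = nx.DiGraph()

# The article lists 7 node types; a few of them are shown here.
g.add_node("repo", kind="REPO")
g.add_node("pkg/utils", kind="PACKAGE")
g.add_node("pkg/utils/io.py", kind="FILE")
g.add_node("Reader", kind="CLASS", text="class Reader: ...")
g.add_node("Reader.read", kind="FUNCTION", text="def read(self): ...")
g.add_node("open_file", kind="FUNCTION", text="def open_file(path): ...")

# Edges encode structural dependencies: containment, calls, inheritance.
g.add_edge("repo", "pkg/utils", rel="contains")
g.add_edge("pkg/utils", "pkg/utils/io.py", rel="contains")
g.add_edge("pkg/utils/io.py", "Reader", rel="contains")
g.add_edge("pkg/utils/io.py", "open_file", rel="contains")
g.add_edge("Reader", "Reader.read", rel="contains")
g.add_edge("Reader.read", "open_file", rel="calls")

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```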

The specific structure-integration method is as follows:

Each node's text is first segmented into chunks of up to 512 tokens; a small encoder (CodeT5+) then encodes each node, compressing it into a single "node token".

The encoded node representations are then mapped into the LLM's input embedding space through an adapter (a two-layer MLP). Because each node of up to 512 tokens occupies a single token position, this effectively expands the LLM context by up to 512 times, enabling better handling of the massive context of a code repository.
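A minimal PyTorch sketch of this encode-then-project step, assuming hypothetical dimensions and mean-pooling over chunk embeddings (the real encoder is CodeT5+; the pooling choice and sizes here are assumptions):

```python
import torch
import torch.nn as nn

ENC_DIM = 256    # hypothetical encoder hidden size
LLM_DIM = 4096   # hypothetical LLM embedding size

class NodeAdapter(nn.Module):
    """Two-layer MLP projecting encoder output into LLM embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, node_repr: torch.Tensor) -> torch.Tensor:
        return self.mlp(node_repr)

def encode_node(chunk_embeddings: torch.Tensor) -> torch.Tensor:
    # Each node's text is split into chunks of up to 512 tokens; assume
    # the encoder already produced one vector per chunk, mean-pooled
    # here into a single node representation (pooling is an assumption).
    return chunk_embeddings.mean(dim=0)

adapter = NodeAdapter(ENC_DIM, LLM_DIM)
chunks = torch.randn(3, ENC_DIM)            # 3 encoded chunks of one node
node_token = adapter(encode_node(chunks))   # one "token" in LLM space
print(node_token.shape)                     # torch.Size([4096])
```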

A graph-aware attention mask replaces the original causal attention in the LLM, so that attention acts only between adjacent nodes. Much like the message-passing mechanism of GNNs, this lets the LLM directly perceive and exploit the structural dependencies of the code.
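A minimal sketch of what such a graph-aware mask could look like, assuming node tokens may attend to (undirected) graph neighbours plus themselves; exactly how CGM combines this with the text prompt is detailed in the paper:

```python
import torch

def graph_attention_mask(adj: torch.Tensor) -> torch.Tensor:
    """adj: (N, N) 0/1 adjacency over node tokens.
    Returns an additive attention mask: 0 where attention is allowed
    (graph neighbours + self), -inf elsewhere."""
    n = adj.size(0)
    allowed = (adj + adj.T + torch.eye(n)).clamp(max=1).bool()
    mask = torch.zeros(n, n)
    mask[~allowed] = float("-inf")
    return mask

# Toy graph: node 0 - node 1 - node 2 (0 and 2 are not adjacent).
adj = torch.tensor([[0., 1., 0.],
                    [0., 0., 1.],
                    [0., 0., 0.]])
print(graph_attention_mask(adj))  # entries (0,2) and (2,0) are -inf
```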

Two-stage training: Structure understanding + Problem generalization

Based on this model architecture, the team enables the LLM to understand the topological structure of the code graph through two-stage training.

Stage one: Subgraph reconstruction pre-training

To train CGM to effectively capture the semantic and structural information of the code graph, the team designed a "Graph-to-Code" task: subgraphs are randomly sampled from large code graphs (capping the number of nodes to control the length of the output code), and the model must reconstruct the original code snippets from these input subgraphs, which contain only node types and connection relationships, not the full code content.

A hierarchical method then keeps the reconstructed code structurally consistent and readable, concatenating the repository context by topological order and line number: high-level nodes (such as REPO and PACKAGE) are placed at the beginning of the output sequence; file nodes are ordered by topological sort; nodes within a file (such as CLASS and FUNCTION) are concatenated in line-number order.
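A minimal sketch of this hierarchical serialization, assuming each node carries "kind", "text", and (for in-file entities) "line" attributes; it illustrates the ordering rule, not the paper's exact implementation:

```python
import networkx as nx

def serialize_subgraph(g: nx.DiGraph) -> str:
    """Flatten a sampled subgraph into code text, hierarchically."""
    order = list(nx.topological_sort(g))  # follows containment edges
    parts = []
    # 1) High-level nodes (REPO, PACKAGE) go first.
    parts += [g.nodes[n]["text"] for n in order
              if g.nodes[n]["kind"] in ("REPO", "PACKAGE")]
    # 2) Files follow in topological order; 3) entities inside each
    #    file are concatenated in line-number order.
    for f in (n for n in order if g.nodes[n]["kind"] == "FILE"):
        parts.append(g.nodes[f]["text"])
        members = sorted(g.successors(f), key=lambda m: g.nodes[m]["line"])
        parts += [g.nodes[m]["text"] for m in members]
    return "\n".join(parts)

# Tiny demo graph (attribute names are assumptions).
g = nx.DiGraph()
g.add_node("repo", kind="REPO", text="# repo: demo")
g.add_node("a.py", kind="FILE", text="# file: a.py")
g.add_node("C", kind="CLASS", line=1, text="class C: ...")
g.add_node("f", kind="FUNCTION", line=3, text="def f(): ...")
g.add_edge("repo", "a.py"); g.add_edge("a.py", "C"); g.add_edge("a.py", "f")
print(serialize_subgraph(g))  # repo header, file header, then C, then f
```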

Stage two: Noise-enhanced fine-tuning

In this stage, CGM is fine-tuned on real GitHub issue-fix patch data.

The model learns to generate code patches from two inputs: (i) a relevant code subgraph; (ii) a text prompt indicating the files that, according to the patch, may need to be modified. To improve robustness, noise is deliberately introduced into 10% of the prompts: a prompt may contain an irrelevant file that does not actually need modification, or omit at least one key file that should be modified. Introducing this controlled noise during training helps the model generalize to cases where the input information is incomplete or contains interference.
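A minimal sketch of how such controlled noise could be injected into the file hint of a training prompt; only the 10% rate comes from the description above, while the 50/50 split between adding a distractor and dropping a key file is an assumption:

```python
import random

def noisy_file_hint(gold_files: list[str], repo_files: list[str],
                    noise_rate: float = 0.1) -> list[str]:
    """Perturb the 'files to modify' hint for a fraction of samples."""
    hint = list(gold_files)
    if random.random() < noise_rate:
        if random.random() < 0.5:
            # Add a distractor file that does not actually need changes.
            distractors = [f for f in repo_files if f not in hint]
            if distractors:
                hint.append(random.choice(distractors))
        elif len(hint) > 1:
            # Omit one key file that should have been modified.
            hint.remove(random.choice(hint))
    return hint

gold = ["src/io.py", "src/cli.py"]
repo = gold + ["README.md", "tests/test_io.py"]
print(noisy_file_hint(gold, repo))
```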

Inference stage: Graph-RAG framework replaces the Agent

Finally, to further enhance practical applicability, CGM is wrapped in a lightweight, Agent-free framework: Graph-RAG.

It replicates the bug-fixing workflow of human programmers while being more efficient than existing Agent solutions.

The number of core modules is further streamlined from 10 to 4: Rewriter → Retriever → Reranker → Generator (CGM model).

  • Rewriter: rewrites the problem description and extracts keywords and relevant files;
  • Retriever: extracts connected subgraphs from the code graph through semantic and structural retrieval;
  • Reranker: ranks the retrieval results and selects the most critical files for generation;
  • Generator: generates the final fixed code by combining the subgraph and the prompt.
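Putting the four stages together, here is a toy end-to-end skeleton of the Graph-RAG flow. Every component is a deliberately naive stand-in (keyword matching instead of learned retrieval, a stubbed generator instead of the CGM model); the real implementation lives in the CodeFuse-CGM repository:

```python
def rewriter(issue: str) -> list[str]:
    # Toy stand-in: keywords = longer words from the issue text.
    return [w.strip(".,:") for w in issue.split() if len(w) > 4]

def retriever(keywords: list[str], graph: dict[str, str]) -> dict[str, str]:
    # graph maps node id -> source text; keep nodes matching any keyword.
    # (The real Retriever also expands to a *connected* subgraph.)
    return {n: t for n, t in graph.items() if any(k in t for k in keywords)}

def reranker(keywords: list[str], subgraph: dict[str, str],
             k: int = 2) -> list[str]:
    # Rank retrieved nodes by keyword hits; keep the top-k.
    def score(text: str) -> int:
        return sum(text.count(kw) for kw in keywords)
    return sorted(subgraph, key=lambda n: score(subgraph[n]), reverse=True)[:k]

def generator(subgraph: dict[str, str], files: list[str], issue: str) -> str:
    # Placeholder for the CGM call: node tokens + prompt -> patch.
    return f"<patch for {files}, addressing: {issue}>"

def fix_issue(issue: str, graph: dict[str, str]) -> str:
    kws = rewriter(issue)
    sg = retriever(kws, graph)
    return generator(sg, reranker(kws, sg), issue)

repo = {"io.py": "def read_file(path): ...", "net.py": "def fetch(url): ..."}
print(fix_issue("read_file crashes on an empty path", repo))
```

The point is the shape of the pipeline: a single fixed four-stage pass with no multi-round Agent loop.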

Based on the above, CGM has achieved leading results in multiple test benchmarks. The details are as follows:

Experimental results

The research team systematically evaluated the performance of CGM on multiple mainstream benchmarks, covering two main task categories: (1) Code fixing and (2) Code completion.

Repository-level code fixing

On the SWE-bench Lite leaderboard, CGM ranks first among open-weight models with a result of 44.00%.

On SWE-bench Verified, CGM improves by 10.20% over the best open-source baseline, reaching 50.40%.

For Java projects, CGM reaches 14.29% on SWE-bench-java Verified, an improvement of 4.4% over the best open-source baseline.

These results indicate that CGM can handle cross-language, cross-project, large-scale repository-level bug-fixing tasks, demonstrating strong structural understanding and generalization abilities.

Repository-level code completion

In complex code-generation tasks, CGM also significantly outperforms open-source models of the same size on ComplexCodeEval and CrossCodeEval, especially in scenarios requiring cross-file reasoning and completion.

In addition, the research team deployed CGM on different base models (CodeLlama-7B and DeepSeek-Coder-7B) and compared it with recent RAG systems. The results show that CGM has good versatility, can adapt to multiple base models, and outperforms traditional RAG methods.

In summary, CGM does not rely on complex Agent systems. For the first time, it integrates the code-graph modality into a large model, enabling AI to understand the complex dependencies between text and code in a repository like a human, "truly understanding a project".

More importantly, it can be implemented based on open-source models and is not limited to specific models. It provides a flexible, transparent, and controllable solution for enterprises and developers.

Finally, the technical paper, core code, model weights, and training data of CGM are all open-source; interested readers can explore the details below.

  • Technical paper: https://arxiv.org/abs/2505.16901
  • Open-source code: https://github.com/codefuse-ai/CodeFuse-CGM
  • Model weights: https://huggingface.co/codefuse-ai/CodeFuse-CGM-72B
  • Training data: https://huggingface.co/datasets/codefuse-ai/CodeGraph

Previous work of the team:

  • Code LLM review: Awesome-Code-LLM (TMLR)

https://github.com/codefuse-ai/Awesome-Code-LLM

  • Previous research on Graph + LLM: GALLa (ACL 2025)

https://github.com/codefuse-ai/GALLa

  • Efficient attention architecture: Rodimus (ICLR 2025)

https://arxiv.org/abs/2410.06577

  • Code multi-task fine-tuning framework: MFTCoder (KDD 2024)

https://arxiv.org/abs/2311.02303

This article is from the WeChat official account QbitAI (量子位).