The First Long-Range Doc2Repo Training Set: Code Agents Evolve from Bug

DeNovoSWE is a dataset designed to train code agents to generate complete repositories from scratch, containing 4,818 real task instances. Through structured documentation and a rigorous validation mechanism, it helps agents learn to build complex systems rather than merely fix existing code. This gives code agents critical support for moving on to more demanding software engineering tasks.

As the capabilities of LLM code agents continue to improve, more and more researchers have concluded that it is time to move on to the next stage: long-horizon tasks that are closer to real-world requirements. Benchmarks for evaluating long-horizon tasks have emerged as a result, such as NL2RepoBench and BeyondSWE.

Recently, the Gaoling School of Artificial Intelligence at Renmin University of China completed relevant research and released the DeNovoSWE dataset, which focuses on long-horizon software engineering tasks, especially the task of generating code for an entire repository from scratch.

Paper link: https://arxiv.org/pdf/2606.10728

Repository link: https://github.com/AweAI-Team/DeNovoSWE

Data link: https://huggingface.co/collections/AweAI-Team/denovoswe

Using the Divide & Conquer and Critic & Repair mechanisms, the team scaled up the construction of long-horizon SWE tasks and built an open-source, high-quality dataset containing 4,818 real task instances. This provides large-scale data for training the long-horizon capabilities of code agents and significantly improves their performance on long-horizon tasks.

The paper also proposes a filtering method based on task difficulty, which eases the trade-off between the proportion of difficult tasks and the quality of the trajectories.

Experiments show that Qwen3-30B-A3B-Instruct, trained on DeNovoSWE, improved from 5.8% to 47.2% on BeyondSWE-Doc2Repo and from 4.3% to 23.0% on NL2RepoBench, demonstrating that long-horizon data significantly improves repository-level code generation.

Rebuild an entire repository from a single document

In the past year, as works like Scale-SWE scaled up SWE training data, code agents have made rapid progress on real software engineering tasks such as SWE-bench. However, as models become more proficient at "fixing an issue" or "patching a few buggy lines", a more critical question emerges: do agents really possess long-horizon software engineering capabilities? Judging from the performance of cutting-edge models on BeyondSWE-Doc2Repo and NL2RepoBench, the answer so far is not encouraging.

Real-world software development is rarely about modifying a single function or adding a conditional statement. Instead, it involves understanding requirements, planning the architecture, creating files, designing APIs, handling dependencies, integrating modules, and ultimately making the entire repository pass the tests.

In other words, the challenge lies in long-horizon repository-level generation: starting from a task document, generating a complete, executable, and verifiable software repository. This is exactly the problem that DeNovoSWE aims to solve.

High-quality "repository generation from scratch" task documents

In document-to-repository generation, the document is not just a README, nor is it a simple list of APIs. It is essentially the only task entry point for the agent to rebuild the entire repository.

A high-quality task document must meet at least two core criteria.

First, it must be well-organized.

Repository-level tasks are inherently complex, involving multiple modules, interfaces, configurations, data structures, and interaction processes. If the document simply piles up function descriptions, the agent can easily get lost in the fragmented information. Therefore, the document should first provide a clear overview of the repository and then break down the chapters according to capabilities or workflows, so that each part corresponds to a clear functional boundary.

Second, it must be designed from the perspective of reliable evaluation.

The document should not be too brief; otherwise, the task becomes underspecified, and the model may have to guess at random to pass the evaluation. Nor should it be too detailed; otherwise, it directly reveals implementation details and makes the task less challenging.

A truly high-quality document should describe the key behaviors on which the evaluation depends, including import paths, public APIs, input and output, default parameters, exception behaviors, configuration items, pattern strings, return fields, etc., and also outline the functions to be completed. In other words, the document should be sufficient for the agent to reproduce testable behaviors but should not be a copy of the implementation code.

This is also the core idea of DeNovoSWE: to make the document readable, implementable, and verifiable.

DeNovoSWE method

DeNovoSWE formulates "generating a complete repository from a document" as a large-scale, verifiable long-horizon software engineering task. Instead of relying on manually written documents, it automatically constructs high-quality instances through a sandboxed multi-agent workflow. The entire method can be summarized in two phases: Divide and Conquer.

In the Divide phase, the system first analyzes the target repository and breaks it down into multiple repository capabilities.

Each capability corresponds to a core ability or workflow in the repository, such as authentication and connection, data reading and writing, batch processing, and export processes. In this way, the originally large-scale repository generation problem is broken down into several well-structured document chapters.

Meanwhile, DeNovoSWE runs the original unit tests and collects execution traces to identify which functions, classes, and interfaces truly affect the evaluation. It further distinguishes between direct components, core indirect components, and non-core indirect components: interfaces directly called by the tests must be documented in detail; core indirect components that affect observable behaviors also need to be covered; non-core internal implementations can be left to the agent's discretion.
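The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the call-graph format, the function name `classify_components`, and the `affects_observable` set (a stand-in for whatever analysis decides that a component influences observable behavior) are all assumptions made for this example.

```python
from collections import deque


def classify_components(call_graph, test_entry_points, affects_observable):
    """Split traced components into direct / core indirect / non-core indirect.

    call_graph: dict mapping a caller name to the set of callee names,
        as might be reconstructed from execution traces (assumed format).
    test_entry_points: names of the test functions that were executed.
    affects_observable: hypothetical set of components judged to influence
        behavior that the tests can observe.
    """
    tests = set(test_entry_points)

    # Direct components: anything a test calls itself.
    direct = set()
    for test in tests:
        direct |= set(call_graph.get(test, ()))
    direct -= tests

    # Indirect components: everything reachable from the direct ones.
    indirect = set()
    seen = direct | tests
    queue = deque(direct)
    while queue:
        node = queue.popleft()
        for callee in call_graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                indirect.add(callee)
                queue.append(callee)

    core = {name for name in indirect if name in affects_observable}
    return {
        "direct": direct,
        "core_indirect": core,
        "non_core_indirect": indirect - core,
    }


# Toy trace with made-up component names.
example_graph = {
    "test_export": {"api.export"},
    "api.export": {"fmt.render", "util._log"},
    "fmt.render": {"util._pad"},
}
example_split = classify_components(
    example_graph, ["test_export"], affects_observable={"fmt.render"}
)
```

In this toy trace, `api.export` would be documented in detail, `fmt.render` would still be covered, and the two `util` helpers would be left to the agent.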

In the Conquer phase, DeNovoSWE uses the Draft-Critic-Repair mechanism to generate documents for each capability. The Draft agent first writes a draft; the Critic agent checks whether the document misses key APIs, behavioral contracts, or structural information; the Repair agent then fixes the document based on the feedback. The cycle repeats until each capability chapter is clear, complete, and aligned with the evaluation.
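The control flow of such a loop can be sketched as follows. This is only an assumed shape: the three agents are modeled as plain Python callables, the toy agents below replace real LLM calls, and the `max_rounds` cap is a safeguard added for the example rather than something the article specifies.

```python
def draft_critic_repair(capability, draft, critic, repair, max_rounds=5):
    """Iterate Draft -> Critic -> Repair for one capability chapter.

    draft(capability) -> str
    critic(capability, doc) -> list of issue strings (empty means accepted)
    repair(doc, issues) -> str
    max_rounds is a hypothetical safety cap on the number of iterations.
    """
    doc = draft(capability)
    for _ in range(max_rounds):
        issues = critic(capability, doc)
        if not issues:
            break  # the critic found nothing missing
        doc = repair(doc, issues)
    return doc


# Toy agents: the critic only checks that required topics are mentioned.
REQUIRED_TOPICS = ["import path", "default parameters", "exception behavior"]


def toy_draft(capability):
    return f"Chapter: {capability}. Covers the import path."


def toy_critic(capability, doc):
    return [topic for topic in REQUIRED_TOPICS if topic not in doc]


def toy_repair(doc, issues):
    return doc + " " + " ".join(f"Adds {issue}." for issue in issues)


chapter = draft_critic_repair("data export", toy_draft, toy_critic, toy_repair)
```

A real Critic would compare the chapter against the traced components rather than a fixed keyword list; the loop structure is the part this sketch is meant to show.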

Finally, the documents for different capabilities are merged into a single complete task document, which serves as the only basis for the agent to generate the repository from scratch.

Difficulty: Why is this a long-horizon task?

The difficulty of the DeNovoSWE task stems from a fundamental change: it is no longer issue-level fixing but whole-repository generation.

In traditional SWE tasks, agents usually face an existing repository and only need to locate bugs, modify local code, and pass the tests.

In DeNovoSWE, the agent faces a cleaned environment: the original source code and tests are removed, the git history is reset, and potential leakage channels such as caches, site-packages residues, pip wheels, and temporary compilation products are also cleared. This means that the agent must truly rely on the document to rebuild the entire repository. It needs to plan the project structure, create module files, define public interfaces, implement cross-file interactions, handle dependencies and configurations, and continuously fix errors through multiple rounds of editing and test feedback.
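As a rough illustration of what a leakage check on such a workspace might look like, the sketch below scans a directory tree and lists paths that commonly hold build or cache residue. The directory names and file suffixes are assumptions chosen for this example; the article does not list the exact paths DeNovoSWE clears, and the function only reports candidates without deleting anything.

```python
from pathlib import Path

# Illustrative patterns only; not taken from the DeNovoSWE pipeline.
LEAK_DIR_NAMES = {"__pycache__", ".pytest_cache", "build", "dist"}
LEAK_SUFFIXES = {".pyc", ".whl", ".so"}


def find_leak_candidates(root):
    """Return relative paths under root that may leak the original code."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip anything inside a directory that is already flagged.
        if any(part in LEAK_DIR_NAMES for part in relative.parts[:-1]):
            continue
        if path.is_dir() and path.name in LEAK_DIR_NAMES:
            hits.append(relative.as_posix())
        elif path.is_file() and path.suffix in LEAK_SUFFIXES:
            hits.append(relative.as_posix())
    return hits
```

Calling `find_leak_candidates("/path/to/workspace")` on a prepared environment should ideally return an empty list before the agent starts.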

Any deviation in an API signature, return field, exception type, or default behavior may lead to test failures. Errors can also accumulate over the long-horizon process: a poorly designed early module may affect multiple subsequent files and call chains.

To handle the differences in difficulty between repositories, DeNovoSWE also proposes difficulty-aware trajectory filtering. Simply put, easier tasks should be held to a higher pass rate, while difficult tasks should not be discarded just because they do not achieve a perfect score. DeNovoSWE sets different filtering thresholds for different difficulty intervals, based on structural complexity and LLM-judged difficulty, thus balancing quality and diversity.
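The idea of a per-interval threshold can be made concrete with a small sketch. Everything numeric here is invented for illustration: the difficulty score in [0, 1], the three intervals, and the minimum pass rates are assumptions, not the values used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    difficulty: float  # hypothetical combined difficulty score in [0, 1]
    pass_rate: float   # fraction of the task's tests that passed


# (upper bound of the difficulty interval, minimum pass rate to keep).
# Illustrative values only.
THRESHOLDS = [(0.33, 1.0), (0.66, 0.8), (1.0, 0.5)]


def min_pass_rate(difficulty):
    """Look up the pass-rate threshold for a difficulty score."""
    for upper_bound, threshold in THRESHOLDS:
        if difficulty <= upper_bound:
            return threshold
    return THRESHOLDS[-1][1]


def filter_trajectories(trajectories):
    """Keep trajectories whose pass rate meets their interval's threshold."""
    return [
        t for t in trajectories
        if t.pass_rate >= min_pass_rate(t.difficulty)
    ]


candidates = [
    Trajectory("easy-perfect", difficulty=0.1, pass_rate=1.0),
    Trajectory("easy-partial", difficulty=0.1, pass_rate=0.9),
    Trajectory("hard-partial", difficulty=0.9, pass_rate=0.6),
]
kept_ids = [t.task_id for t in filter_trajectories(candidates)]
```

With these made-up thresholds, the partially successful easy trajectory is dropped while the partially successful hard one is kept, which is the behavior the paragraph above describes.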

This is especially important for long-horizon tasks: the more complex the repository, the harder it is to pass all tests in one attempt. However, trajectories from difficult repositories, even those with low scores or only partial success, still demonstrate valuable long-horizon planning and implementation behavior.

Experimental results

In total, DeNovoSWE comprises 4,818 high-quality document-to-repository task instances. It is an executable, evaluable, and trainable long-horizon software engineering environment.

Experimental results show that DeNovoSWE significantly improves the model's long-horizon repository generation capabilities. With Qwen3-30B-A3B-Instruct, the original model achieved only 5.8% on BeyondSWE-Doc2Repo and 4.3% on NL2RepoBench. Scale-SWE-Agent, trained on regular issue-level SWE data, raised these scores to 29.2% and 18.3%, indicating that ordinary SWE data does have a transfer effect. However, when the model was trained on DeNovoSWE, the scores rose further to 47.2% and 23.0%.

This shows that data for "bug fixing" cannot completely replace long-horizon data for "generating a complete repository". To enable agents to truly learn repository-level engineering, a training environment specifically designed for long-horizon tasks is needed.

On the more powerful Qwen3.5-35B-A3B backbone, DeNovoSWE also brought consistent gains: performance increased from 43.8% to 50.0% on BeyondSWE-Doc2Repo and from 23.5% to 27.1% on NL2RepoBench. This further indicates that the benefits of DeNovoSWE do not come from accidental adaptation to a single model but from the high-quality long-horizon data itself.
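For readers who want the gains side by side, the reported scores can be tabulated and differenced with a short script. The numbers are the ones quoted in this article; the dictionary layout and the `gain` helper are just one convenient way to arrange them.

```python
# Scores (in %) as quoted in the article.
SCORES = {
    "Qwen3-30B-A3B-Instruct": {
        "BeyondSWE-Doc2Repo": {"base": 5.8, "Scale-SWE-Agent": 29.2, "DeNovoSWE": 47.2},
        "NL2RepoBench": {"base": 4.3, "Scale-SWE-Agent": 18.3, "DeNovoSWE": 23.0},
    },
    "Qwen3.5-35B-A3B": {
        "BeyondSWE-Doc2Repo": {"base": 43.8, "DeNovoSWE": 50.0},
        "NL2RepoBench": {"base": 23.5, "DeNovoSWE": 27.1},
    },
}


def gain(backbone, benchmark, trained="DeNovoSWE", reference="base"):
    """Absolute improvement in percentage points, rounded to one decimal."""
    row = SCORES[backbone][benchmark]
    return round(row[trained] - row[reference], 1)


for backbone, benchmarks in SCORES.items():
    for benchmark in benchmarks:
        print(f"{backbone} / {benchmark}: +{gain(backbone, benchmark)} points over base")
```

The same helper also gives the margin over issue-level training on the smaller backbone, for example `gain("Qwen3-30B-A3B-Instruct", "BeyondSWE-Doc2Repo", reference="Scale-SWE-Agent")`.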

Conclusion

The next stage for code agents is not just to fix individual issues more quickly but to understand documents, plan architectures, organize modules, implement interfaces, and ultimately generate a complete and runnable software repository.

DeNovoSWE systematically turns this goal into a trainable, verifiable, and scalable dataset. It answers a key question: what kind of data can truly train agents with long-horizon software engineering capabilities?

The answer is not more fragmented code or simpler tasks but high-quality, structured, evaluation-aligned, and leakage-resistant whole-repository generation tasks.

Rebuilding an entire repository from a single document is the threshold that long-horizon code agents need to cross.

Reference: https://arxiv.org/pdf/2606.10728

This article is from the WeChat official account “New Intelligence Yuan”. Editor: LRST. Republished by 36Kr with permission.