Is OpenClaw's code getting worse with each modification? New research, EvoClaw, reveals that agents' success rate at continuous development is only 13.37%.
By the end of 2025, AI programming had fully transitioned from the Copilot era of auxiliary tools to the Agent era, in which AI takes the lead and humans supervise.
If it's just about writing a function or fixing an isolated bug, today's top models can almost always provide satisfactory answers.
However, with the rise of OpenClaw in early 2026, agents began to evolve from single-task sessions into long-cycle operating systems. To go from merely usable and useful to ultimately replacing and surpassing humans, AI must be able to continuously and autonomously iterate on every software interface that touches the real world, driven by requirements and the environment.
The biggest obstacle to realizing this vision, however, is precisely that real-world software development is not one-shot code generation but a long-term game against time and complexity. The codebase keeps expanding as requirements change, and hidden problems buried early on can be magnified into systemic risks months later. When development spans multiple major versions, can AI really stay reliable through this continuous evolution?
Recently, Deng Gangda from USC, Chen Zhaoling from UCR, Cong Le from Stanford, Wang Mengdi from Princeton, Tang Xiangru from Haven, Wang Xingyao from OpenHands, and others jointly released an important new evaluation benchmark, EvoClaw. The research team extracted real code-evolution histories from open-source projects and reconstructed them into a Milestone DAG (Milestone Directed Acyclic Graph), which aggregates scattered commits into functionally cohesive milestones while strictly preserving the temporal code dependencies between tasks.
Based on this new benchmark, the research found that once AI leaves "single-point repair" and enters the real development scenario of "continuous evolution", its performance drops precipitously (from scores above 80% to at most under 40%). A significant gap therefore remains before AI can truly handle long-term, continuous, autonomous software-evolution work.
Overview of the comprehensive performance (y-axis) and inference overhead (x-axis) of each model framework in EvoClaw
Programming Evaluation 2.0: From Single-Point Repair to Continuous Evolution
Why do existing AI programming evaluations (such as SWE-bench) often overestimate the real capabilities of coding agents?
In the past few years, most programming benchmarks have focused on independent tasks: fix an issue or complete a large PR, then verify the result against a relatively static code snapshot. This style of evaluation is certainly valuable, but it sidesteps a key fact: software engineering is never a collection of independent tasks but a continuous evolution process. Implementation choices made earlier constrain the subsequent design space, and small problems left behind early can be amplified across later versions.
The paper points out that existing evaluations generally ignore the time dimension of software evolution, which inflates many models' leaderboard scores; the moment they face actual development, their true capability, and the gap, become obvious.
EvoClaw makes a key adjustment in its evaluation design: it requires the AI to execute multiple interdependent tasks consecutively in the same codebase. Apart from the externally provided task requirements, all development and maintenance is completed autonomously by the AI. This persistent development environment directly exposes how fragile AI still is under continuous, autonomous iteration.
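The contrast with independent evaluation can be pictured as a loop over one shared workspace instead of a fresh snapshot per task. A minimal hypothetical sketch (none of these function names come from the paper):

```python
def evaluate_continuous(tasks, agent, score):
    """Continuous mode: every task runs against the SAME evolving workspace,
    so mistakes made on earlier tasks persist into later ones."""
    workspace = {}  # stands in for the shared, persistent codebase
    results = []
    for task in tasks:
        agent(task, workspace)           # the agent edits the workspace in place
        results.append(score(task, workspace))
    return results
```

In independent mode, by contrast, `workspace` would be rebuilt from a clean snapshot at every iteration, which is exactly what hides the compounding cost of earlier mistakes.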
Comparison of evaluation paradigms between independent tasks (e.g., SWE-bench) and continuous evolution (EvoClaw)
DeepCommit: Rebuilding the "Milestones" of Software Evolution with Agents
An important question is how to extract an evolution history from existing software projects. Rather than dividing tasks at the existing commit or release granularity, the research team proposed a milestone-level granularity.
The reason is simple: a single commit is often too fragmented, full of trivial modifications, and rarely represents a complete development goal; a release, by contrast, is too coarse, compressing away many intermediate dependencies and evolution paths.
Comparison of three granularities: Commit, Release, and Milestone. Milestone strikes a balance between semantic integrity and dependency preservation
A milestone is defined as a semantically complete functional unit that retains its evolution dependencies, making it a better task granularity for evaluating an AI's code-evolution ability.
Based on this idea, the paper further proposes DeepCommit, an Agent-driven automated pipeline that reconstructs the originally noisy git history into a verifiable Milestone DAG.
Phase 1: Static Analysis and Denoising (Preprocessing)
First, modifications unrelated to core functionality (documentation, CI/CD configuration, etc.) are filtered out, and static analysis extracts line-level and symbol-level dependencies between commits, laying the foundation for the subsequent graph construction.
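The path-based side of this denoising can be sketched in a few lines; the patterns below are hypothetical illustrations, not the paper's actual filter list:

```python
import fnmatch

# Hypothetical noise patterns: paths whose changes rarely affect core behavior.
NOISE_PATTERNS = ["docs/*", "*.md", "*.rst", ".github/*", ".ci/*"]

def is_noise(path: str) -> bool:
    """Return True if a changed file matches a known non-functional pattern."""
    return any(fnmatch.fnmatch(path, pat) for pat in NOISE_PATTERNS)

def filter_commit(changed_paths):
    """Keep only the paths that plausibly touch core functionality."""
    return [p for p in changed_paths if not is_noise(p)]

# A commit touching docs and CI config alongside one source file:
print(filter_commit(["docs/index.md", ".github/workflows/ci.yml", "src/core.py"]))
# → ['src/core.py']
```

The symbol-level dependency extraction the paper describes would sit on top of a filter like this, operating only on the commits that survive it.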
Phase 2: Agent-Driven DAG Construction
A large-model agent acts as the "architect", reconstructing history in four sub-steps: first, locate the pioneering "seed" commits; next, gather and merge semantically related commits into complete milestones; then infer the dependencies between them; finally, dynamically split oversized milestones to keep the granularity uniform.
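Those four sub-steps can be organized as a pipeline skeleton like the one below. The `Milestone` structure, the step functions, and the granularity cap are all hypothetical stand-ins for the agent's work, shown only to make the flow concrete:

```python
from dataclasses import dataclass, field

@dataclass
class Milestone:
    commits: list                       # commit SHAs aggregated into this milestone
    depends_on: set = field(default_factory=set)

MAX_COMMITS = 8  # hypothetical cap used when splitting oversized milestones

def build_dag(commits, find_seeds, gather, infer_deps):
    """Skeleton of the agent-driven construction: seed -> gather -> link -> split."""
    seeds = find_seeds(commits)                                   # 1. seed commits
    milestones = [Milestone(gather(s, commits)) for s in seeds]   # 2. merge related commits
    infer_deps(milestones)                                        # 3. add dependency edges
    # 4. split milestones that grew too large, keeping granularity uniform
    out = []
    for m in milestones:
        for i in range(0, len(m.commits), MAX_COMMITS):
            out.append(Milestone(m.commits[i:i + MAX_COMMITS], set(m.depends_on)))
    return out
```

In DeepCommit each of the three callbacks is an agent reasoning over commit messages and diffs; here they are left as parameters precisely because that reasoning is not mechanical.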
DeepCommit pipeline architecture diagram, showing how to extract the Milestone DAG from the original commits
Phase 3: Agent - Based Runtime Environment Resolution and Verification (Runtime Resolution)
A purely theoretical graph is of little use. The real difficulty lies in making the reconstructed evolution history actually run in a real environment.
Because the new Milestone DAG disrupts the original Git timeline, reapplying commits in the new topological order easily triggers interface mismatches and large-scale compilation errors. Simply skipping the failing modules would cause the test-collection rate to plummet, destroying the evaluation's value.
The research team therefore designed an iterative repair cycle. When compilation fails, the agent actively analyzes the error logs and dynamically modifies the Dockerfile to keep the build executable; more importantly, it traces implicit dependencies that the original Milestone DAG missed and fully resolves interface conflicts by adjusting the ordering constraints between milestones.
Through continued iterative repair, at least 85% of the original test cases are ultimately guaranteed to be collected, providing a sufficient test basis for evaluation.
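Structurally, this phase amounts to walking the DAG in topological order and retrying failed builds through a repair hook. A minimal sketch using Python's standard `graphlib`; the `apply_milestone`, `build_ok`, and `repair` hooks are hypothetical stand-ins for the agent's actions:

```python
from graphlib import TopologicalSorter

def replay(dag, apply_milestone, build_ok, repair, max_attempts=3):
    """Apply milestones in dependency order; on build failure, let the
    repair hook inspect logs and patch the environment, then retry."""
    # dag maps each milestone to the set of milestones it depends on
    order = list(TopologicalSorter(dag).static_order())
    for m in order:
        apply_milestone(m)
        attempts = 0
        while not build_ok(m) and attempts < max_attempts:
            repair(m)   # e.g. adjust the Dockerfile, or add a missing ordering edge
            attempts += 1
    return order
```

The key property the paper relies on is exactly what `static_order()` guarantees: every milestone is applied only after all of its prerequisites.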
Example of the DeepCommit Milestone DAG for the 1.5.2 → 1.6.0 release interval of the scikit-learn repository
Notably, the authors also compared the evolution graph generated automatically by DeepCommit with one annotated manually by human experts. The result is interesting: human experts tend toward top-down semantic divisions based on strategic intent (e.g., refactoring, compatibility), while DeepCommit follows a completely different underlying logic: grounded entirely in real code-level dependencies, it reorganizes the skeleton of software evolution with bottom-up topological thinking.
This strict automated pipeline ensures that every selected milestone task has its prerequisite dependencies fully annotated and is genuinely executable.
EvoClaw: Truly Incorporating "Continuous Evolution" into Evaluation
To keep the evaluation requirements clear, the authors used an agent to reverse-generate a Software Requirements Specification (SRS) for each milestone from the correct code (the gold patch), strictly aligned with the acceptance tests. Human experts then checked each one to ensure that no implementation details leaked, every task was actually solvable, and all flaky test cases were removed.
The final EvoClaw covers 5 mainstream programming languages, and each project spans a real development cycle across multiple release intervals (up to 750 days). The overall cost also stays within a reasonable range: for Claude Opus 4.5, one full run over the dataset costs about $500.
Overview of the EvoClaw dataset
EvoClaw therefore does not just report a single pass rate; it adds two core dimensions for evaluating an agent's performance:
Recall: measures the completeness of feature implementation, i.e., how many of the changes required by the target task the agent actually completed.
Precision: measures the reliability of the agent's modifications, quantifying how much old code it "broke" while implementing new features.
Finally, the evaluation computes each milestone's comprehensive score Score_m as the F1 score, the harmonic mean, of these two metrics.
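Under the standard F1 definition, the per-milestone score works out as follows; this is a sketch of the textbook formula, and the paper's exact aggregation across milestones may differ:

```python
def milestone_score(recall: float, precision: float) -> float:
    """F1: harmonic mean of recall (completeness) and precision (reliability)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# An agent that implements 80% of the required changes while 60% of its
# modifications leave existing behavior intact:
print(round(milestone_score(0.80, 0.60), 4))  # → 0.6857
```

The harmonic mean punishes imbalance: an agent that bulldozes old code to ship new features (high recall, low precision) scores much worse than its recall alone suggests.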
Model Performance Drops Significantly in the Continuous Evolution Scenario
Main experimental results of EvoClaw
The paper tested various framework and model combinations, such as Claude Code and OpenHands. When tasks are evaluated independently, top models generally score between 80% and 90%. But once they enter EvoClaw's continuous-evolution mode:
Comprehensive Score (Score): The highest score (Claude Opus 4.6) is only 38.03%.
Complete Resolution Rate (Resolve Rate): the highest is only 13.37% (Gemini 3 Pro), and most of the tasks models solve correctly are those without prerequisites.
Score comparison of each software project, continuous evaluation (Continuous, bar chart) vs. independent evaluation (Independent, scatter plot)
GPT-5.3-codex ranks second with a comprehensive score of 28.88%, behind only Claude Opus 4.6, at less than one-third of the latter's cost. Broken down by repository, however, 5.3-codex is weak on the Rust projects (Nushell, ripgrep), which drags down its overall result, while it approaches or even exceeds Opus 4.6 on the other repositories.
Inevitable Evolution Stagnation: Recall Keeps Growing, but Precision Saturates Quickly
Fitting the evolution dynamics of each model with