AI forecasting expert: I still underestimated the pace of AI; "AI R&D automation" could genuinely arrive by the end of this year
The rate at which artificial intelligence capabilities are leaping forward is catching even the most meticulous forecasters off guard.
Well-known AI forecasting researcher Ajeya Cotra recently and publicly admitted that her prediction of AI progress through 2026, released just two months earlier, was significantly too conservative. What triggered this self-correction was the performance of Anthropic's latest model, Claude Opus 4.6, on the benchmark run by METR, an authoritative evaluation organization. The model's "time horizon" on software engineering tasks has reached roughly 12 hours, well ahead of the trajectory implied by Cotra's earlier forecast of about 24 hours by the end of 2026. In other words, actual AI progress in software engineering is running nearly ten months ahead of her prediction.
Even more striking, Cotra subsequently reaffirmed her probability estimate for "full automation of AI R&D". She kept the probability that AI completely takes over research ideation and implementation, without human involvement, by the end of this year at 10%, and stated plainly: "This is the first time I can't find any solid trends to extrapolate from to claim that this won't happen soon." The remark has drawn wide attention in AI forecasting circles.
Cotra previously led AI safety research funding at Coefficient Giving, one of the world's largest AI safety funders, and currently works at METR, an organization focused on evaluating AI capabilities.
01 A Missed Forecast: The Judgment from Two Months Ago Is Already Outdated
On January 14th this year, extrapolating the historical trend that the time horizon doubled slightly less than twice a year from 2019 to 2025, Cotra predicted that the most advanced model's 50%-success-rate time horizon by the end of 2026 would be about 24 hours, with an 80th-percentile forecast of 40 hours.
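The arithmetic behind this kind of forecast is simple exponential extrapolation. The sketch below is illustrative only: the 6-month doubling period and the starting values are assumptions for the example, not figures from Cotra's actual model.

```python
def extrapolate_horizon(current_hours: float,
                        doubling_months: float,
                        months_ahead: float) -> float:
    """Project a time-horizon metric forward under exponential growth.

    Under a constant doubling period, the horizon grows by a factor
    of 2 for every `doubling_months` that pass.
    """
    return current_hours * 2 ** (months_ahead / doubling_months)


# Illustrative: starting from a 12-hour horizon and assuming the
# metric doubles every 6 months, ten more months of progress gives
# roughly a 38-hour horizon.
projected = extrapolate_horizon(current_hours=12,
                                doubling_months=6,
                                months_ahead=10)
print(f"{projected:.1f} hours")
```

The same function also shows why a forecast is so sensitive to the assumed doubling period: shortening it from 6 to 4 months more than doubles the ten-month projection.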
However, only about two months after she published that forecast, Opus 4.6 was measured at a time horizon of roughly 12 hours. In the METR task suite, of the 19 software engineering tasks estimated to take a human more than 8 hours, Opus 4.6 made at least partial progress on 14 and reliably solved 4. Cotra conceded that, with a full ten months of progress still to come, the claim that AI agents will still be failing half the time on 24-hour tasks at the end of 2026 "is no longer credible".
Notably, Cotra also pointed out that uncertainty in current time-horizon estimates has grown substantially. Opus 4.6's 95% confidence interval runs from 5.3 hours to 66 hours, partly because long tasks are scarce in the benchmark, the human completion times for them are mostly estimates rather than measurements, and the benchmark itself is approaching saturation.
02 Capability Boundaries: The Traditional Evaluation Framework Is Failing
As AI agents approach, and even exceed, tasks on the scale of dozens of hours, Cotra argues that the applicability of the "time horizon" concept itself is coming under strain.
She pointed out that a task's decomposability rises sharply with its scale. A one-hour debugging task can hardly be split up and parallelized. A one-day development task can just about be divided, though the boundaries between pieces are blurry. A project spanning a month or several months, by contrast, naturally breaks into multiple parallel subtasks. Once an AI agent can reliably complete tasks on the order of 80 hours, it could in principle drive projects of any scale forward, with an "AI management layer" assigning tasks and an "AI execution layer" working on them in parallel.
Cotra's colleague Tom therefore proposed measuring "intrinsic difficulty" by the calendar time a large team needs to complete a task, rather than by single-person hours. Cotra believes that as AI enters this new regime, the single-person-time metric may begin to grow super-exponentially, making the upper limit of software engineering capability by year's end extremely hard to estimate.
She also acknowledged that this kind of large-scale task decomposition will not work perfectly in practice: a project participant's intuitive grasp of the overall context is hard to fully replace with Jira tickets or Asana tasks. But for a fairly broad class of software projects, she believes the model "may be unexpectedly effective".
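The management-layer/execution-layer pattern described above can be sketched as a minimal orchestration loop. Everything here is a hypothetical stand-in: `manager_decompose` and `worker_execute` are placeholder names for what would really be calls to planning and coding agents, not any actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def manager_decompose(project: str, n_subtasks: int) -> list[str]:
    """Stand-in for an AI 'management layer' splitting a project
    into independently workable subtasks (hypothetical interface)."""
    return [f"{project}::subtask-{i}" for i in range(n_subtasks)]


def worker_execute(subtask: str) -> str:
    """Stand-in for an AI 'execution layer' agent completing one
    subtask; a real system would dispatch this to a coding agent."""
    return f"done:{subtask}"


def run_project(project: str, n_subtasks: int = 4) -> list[str]:
    """Decompose a project, then run the subtasks in parallel."""
    subtasks = manager_decompose(project, n_subtasks)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves subtask order in the results
        return list(pool.map(worker_execute, subtasks))


results = run_project("monthly-release")
print(results[0])  # prints "done:monthly-release::subtask-0"
```

The sketch also makes the article's caveat concrete: the scheme only works to the extent that `manager_decompose` can produce subtasks that are genuinely independent, which is exactly the shared-context problem Jira tickets struggle to capture.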
03 Key Milestone: AI R&D Automation May Become Reality This Year
Of all her forecasts, the one drawing the most attention is Cotra's probability estimate for "full automation of AI R&D".
She defines the event as: AI systems fully take over both research ideation and implementation, with no human involvement. In her January forecast she put the probability at 10%; after publishing it, several peers in the AI forecasting field told her the number was too high. But once Opus 4.6's results came out, she said 10% "again feels within a reasonable range".
Still, Cotra remains cautious. Fully automated AI R&D requires not just software engineering capability but also breakthroughs in "research judgment" and "creativity", precisely the areas where current AI systems still lag human researchers significantly. She considers the goal far more likely to be reached within the next three to five years than within this year.
But her wording has fundamentally changed: "This is the first time I can't find any solid trends to extrapolate from to claim that it won't happen soon."
This article is from the WeChat official account "Hard AI", author: Focused on technology R&D. Republished by 36Kr with permission.