The timeline for AI evolution is coming into focus. The capabilities of large language models (LLMs) double every seven months. Will the workplace as we know it still exist by 2030?
LLMs are evolving at an unprecedented pace: METR has found that the length of tasks they can complete doubles roughly every seven months. By 2030, a single model may be able to finish in a few hours what takes human engineers months. Don't blink: your job might already be on the countdown.
As the capabilities of large models skyrocket, evaluation benchmarks have proliferated everywhere:
from the classic MMLU and HellaSwag, to the multimodal MMMU and MathVista, to AGI-style Arena duels, agent tasks, and tool-use tests.
What remains crucial is scientifically measuring how LLMs perform on long-horizon, complex, real-world tasks.
In March this year, METR released a significant study titled "Measuring AI Ability to Complete Long Tasks", proposing for the first time an eye-catching new metric:
the 50% task-completion time horizon
That is: how long does it typically take a human to complete the tasks that an AI can finish with a 50% success rate?
Paper link: https://arxiv.org/pdf/2503.14499
Building on this metric, METR conducted a series of studies: designing tasks across difficulty levels, measuring human baseline times, running multi-model comparison experiments, and fitting hierarchical statistical regressions.
Finally, the team quantified the speed at which AI capabilities evolve and made a striking prediction:
At the current growth rate, five years from now, large models may be able to automatically complete, in a single day, complex tasks that would currently take humans months.
Don't blink! The capabilities of LLMs double every seven months!
The METR team selected the strongest model from each time period and built a "chronology" to quantitatively analyze how model capabilities have grown over time.
The results show a clear exponential trend: over the past six years, model capabilities have doubled every seven months.
The shaded area in the figure represents the 95% confidence interval, computed by hierarchical bootstrap over task families, tasks, and task attempts.
Moreover, because this exponential trend is so steep, the forecast is tolerant of measurement error:
even a 10x error in the absolute measurements shifts the predicted arrival time of a given capability by only about two years.
The team's predictions about when different capabilities will appear are therefore fairly robust.
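The arithmetic behind this robustness is simple exponential extrapolation. Below is a minimal sketch; the one-hour baseline horizon is a hypothetical illustration, not METR's published figure, and only the seven-month doubling time comes from the article:

```python
import math

DOUBLING_MONTHS = 7  # METR's estimated doubling time

def months_to_reach(target_hours, current_hours):
    """Months until the time horizon grows from current_hours to target_hours."""
    return DOUBLING_MONTHS * math.log2(target_hours / current_hours)

# Illustrative baseline: a one-hour horizon today; one "human month"
# of work is roughly 167 hours (40-hour weeks).
print(round(months_to_reach(167, 1.0), 1))  # ~4.3 years to a one-month horizon

# Sensitivity: a 10x error in the baseline shifts every forecast by
# log2(10) * 7 ≈ 23 months, i.e. roughly two years.
print(round(DOUBLING_MONTHS * math.log2(10) / 12, 1))
```

Because the growth is exponential, a multiplicative error in the measured horizon becomes only an additive shift in the timeline, which is why even a 10x error moves the date by about two years.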
Model vs. Human: Measuring the Intelligence of Large Models in Units of Human Time
The core of METR's research is the metric they proposed: the "task-completion time horizon".
This metric effectively builds a mapping between humans and AI completing the same tasks:
Imagine a set of tasks that take humans different amounts of time to complete.
Assign these tasks to an AI model, and find the difficulty level at which the AI succeeds 50% of the time (regardless of how long the AI itself takes).
Then check how long it typically takes a human to complete tasks at that level.
That human time is the model's 50% task-completion time horizon, i.e. the "task-completion time horizon".
To prove the effectiveness of this benchmark, the METR team conducted detailed statistical analysis.
The results show a negative correlation between the human baseline time for a task and models' average success rate on that task.
In short, the longer a task takes humans, the more likely a model is to fail at it.
Moreover, this negative trend is fit very well by an exponential model.
Regressing model success rate on the logarithm of human completion time gives an R² of approximately 0.83 and a correlation coefficient of 0.91, higher than the correlation between different models' average success rates.
"Measuring task difficulty in human time" is therefore a well-founded metric.
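This kind of validity check boils down to a correlation between log human time and mean model success rate across tasks. A minimal sketch follows; the data is fabricated for illustration and will not reproduce METR's R² ≈ 0.83:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-task data: human baseline minutes and each task's
# success rate averaged over models.
human_minutes = [2, 5, 10, 30, 60, 180, 480]
mean_success  = [0.95, 0.9, 0.8, 0.55, 0.4, 0.2, 0.05]

r = pearson_r([math.log(t) for t in human_minutes], mean_success)
print(round(r, 2), round(r * r, 2))  # strongly negative r, high R-squared
```

A strongly negative correlation on the log-time scale is what justifies using human time as a single difficulty axis.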
The Newer the Model, the Harder the Task: The Evolution of Capabilities Follows a Pattern
Having validated the metric, the next step is to examine each model's performance on it.
The team looked at the human time corresponding to the tasks that different models can complete.
The results are quite intuitive:
Models from before 2023 (such as GPT-2 and GPT-3) can only complete simple tasks that require writing a few sentences.
On tasks that take humans more than a minute, they quickly fail.
In contrast, the latest frontier models (such as Claude 3.5 Sonnet and o1) can complete some tasks that take humans several hours, and can even maintain some success rate on very long tasks of more than ten hours.
Outperforming Humans in Efficiency: The Alarm Has Been Raised for 2030
At the rate of "doubling every seven months", the METR team reached an astonishing conclusion:
By 2030, the most advanced LLMs are expected to complete, with 50% reliability, tasks that would take a human engineer working 40-hour weeks a full month to finish.
More strikingly, the LLM itself may need far less time than the human: perhaps a few days, or even a few hours.
By 2030, LLMs may already be able to easily start a company, write a decent novel, or significantly improve existing large models.
AI researcher Zach Stein-Perlman wrote in his blog that the emergence of LLMs with such capabilities will have a huge impact, in terms of both potential benefits and potential risks.
METR researcher Kinniment admits that the pace at which LLM capabilities double is frightening, like the prelude to a science-fiction disaster.
But she also notes that in reality many factors may slow this progress: however intelligent AI becomes, it may still be constrained by bottlenecks in hardware, robotics, and elsewhere.
References
https://spectrum.ieee.org/large-language-model-performance
This article is from the WeChat official account "New Intelligence Yuan", author: Beaver. It is published by 36Kr with authorization.