
Who says the Scaling Law has reached its limit? New research: Small improvements at each step can lead to exponential growth.

机器之心 · 2025-09-16 15:45
Even as a model's accuracy gains on "single-step tasks" slow down, the accumulation of these small improvements can still drive "exponential growth" in the length of tasks the model can complete.

Many people believe that the Scaling Law is facing diminishing returns, and the practice of training ever-larger models with ever more compute is therefore being questioned. A recent study reaches a different conclusion: even though accuracy gains on "single-step tasks" are slowing, the accumulation of these small improvements can still yield "exponential growth" in the length of tasks a model can complete, and that length may carry greater economic value in the real world.

If marginal returns diminish as compute keeps scaling up, is it still reasonable for companies to pour real money into training larger models? The AI field has been debating this question since at least last year.

Recently, a paper offered an interesting view: although scaling shows diminishing returns on metrics such as the test loss of large language models (LLMs), the real-world value of a model often comes from the length of tasks an agent can complete. From this perspective, larger models do not suffer diminishing returns; instead, they compound and amplify small improvements in single-step accuracy into an exponential leap in the length of tasks they can complete.

  • Paper title: The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
  • Paper link: https://arxiv.org/pdf/2509.09677
  • Code link: https://github.com/long-horizon-execution/measuring-execution
  • Dataset link: https://huggingface.co/datasets/arvindh75/Long-Horizon-Execution

This paper comes from institutions including the University of Cambridge. It points out that completing long-horizon tasks has long been a fatal weakness of deep learning. Autonomous driving demos look impressive, yet it took more than a decade for cars to really drive long distances on the road; AI can generate stunning images, yet producing a coherent, consistent long video remains difficult to this day. Now companies want AI to handle entire projects rather than just answer scattered questions, which raises a core question: how do we measure how many steps of work an LLM can reliably execute?

LLMs' failures on simple but long tasks are often taken as a fundamental defect in reasoning ability. Although LLMs have improved greatly on complex reasoning benchmarks, some papers still argue that thinking models merely give the "illusion of thinking" (arXiv:2506.06941), because they ultimately fail as tasks grow longer.

These results have sparked much debate in the community. The authors of this paper, however, believe the question can be resolved by decoupling the planning and execution requirements of reasoning and intelligent agent tasks.

Planning involves deciding what information to retrieve or which tools to use, and in what order, while execution means turning the plan into reality. In the paper "The Illusion of Thinking", the LLM clearly knows the plan, since it initially executes many steps correctly. The researchers of this paper believe the ultimate failure lies in execution: as the task grows longer, the model becomes more likely to make mistakes while carrying out the plan. Although much attention has been paid to LLMs' planning ability, execution remains an under-researched challenge, and it becomes increasingly important as LLMs are deployed on long reasoning and intelligent agent tasks.

In this paper, the authors measured the long-range execution ability of LLMs in a controlled environment. They isolated the execution ability of LLMs by explicitly providing the required knowledge and planning. By controlling the number of rounds and the number of steps per round (which together constitute the task length), they revealed insights into the long-range task execution ability of LLMs:

1. Are there diminishing returns in scaling?

The authors observed that although the gains in single-step accuracy are shrinking, these small improvements compound and amplify, leading to exponential growth in the length of tasks the model can complete.

In the past, people thought that scaling model size was useful because it improves the model's ability to store parametric knowledge or to search for plans.

However, the authors found in the experiment that after explicitly providing the required knowledge and planning, scaling the model size can still significantly increase the number of rounds that the model can successfully execute. This shows that the value of scaling the model is not only reflected in enabling the model to remember more knowledge or be better at finding problem solutions.

2. The self-conditioning effect

People may think that failure on long tasks is simply the accumulation of a small, constant per-step error rate. However, the authors found that as the task progresses, the per-step error rate itself increases. This contrasts with humans, who usually improve with practice when performing a task.

The authors speculated that, since much of model training amounts to predicting the most likely next token given the context, conditioning the model on its own error-prone history increases the likelihood of future errors. They tested this by controlling the error rate in the history shown to the model. As the error rate in the history increased, they observed a sharp decline in the accuracy of subsequent steps, confirming that the model self-conditions on its own errors.
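To make this probe concrete, here is a minimal, hypothetical sketch (in Python; not the authors' released code), assuming a running-sum task and an abstract `model` callable standing in for an LLM API: the history shown to the model is corrupted at a controlled rate, and the next step is scored against the error-free ground truth.

```python
import random

def build_history(dictionary, keys, injected_error_rate, rng):
    """Build a shown history of (key, running sum) pairs in which each
    displayed sum is corrupted with probability `injected_error_rate`."""
    history, running_sum = [], 0
    for key in keys:
        running_sum += dictionary[key]
        shown = running_sum
        if rng.random() < injected_error_rate:
            shown += rng.choice([-2, -1, 1, 2])  # inject a plausible error
        history.append((key, shown))
    return history

def next_step_correct(model, dictionary, history, next_key):
    """Ask the (hypothetical) model for the next running sum and compare
    it against the error-free ground truth."""
    prompt = "History of keys and running sums:\n"
    prompt += "\n".join(f"{k} -> {v}" for k, v in history)
    prompt += f"\nNext key: {next_key}\nReply with the new running sum only."
    answer = model(prompt)  # stand-in for a real LLM call
    truth = sum(dictionary[k] for k, _ in history) + dictionary[next_key]
    return answer.strip() == str(truth)
```

Sweeping `injected_error_rate` upward from zero and measuring the fraction of correct next steps would trace out the shape of the authors' experiment: the more errors the model sees in its own history, the worse its next step becomes.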

The authors showed that, beyond the previously known long-context problem, self-conditioning also degrades model performance on long-horizon tasks, and unlike the long-context problem, this degradation cannot be alleviated by increasing model size.

3. The impact of thinking

The authors found that recent thinking models are not derailed by their earlier errors and can overcome the self-conditioning limitation. In addition, significantly scaling sequential test-time compute increases the length of tasks a model can complete in a single turn of dialogue. Without chain of thought (CoT), frontier large language models such as DeepSeek V3 cannot even complete two steps of execution, while its thinking counterpart R1 can execute 200 steps, highlighting the importance of reasoning before acting.

The authors benchmarked frontier thinking models and found that the thinking version of GPT-5 (code-named Horizon) can execute more than 1,000 steps, far ahead of its closest competitor, Claude-4-Sonnet, at 432 steps.

The "unevenness" of LLM capabilities is both fascinating and confusing. Unlike traditional machines, large language models are more likely to malfunction when performing repetitive tasks. Therefore, the authors believe that execution failure in long tasks should not be misinterpreted as a lack of reasoning or planning ability. They found that by increasing the model size and the computational volume of sequential test time, the long-range execution ability of the model will be significantly improved. If the length of tasks that a model can complete indicates its economic value, then continuous investment to increase the computational volume may be worthwhile, even if short-task benchmarks give the illusion of slow progress.

The paper has inspired many people; some have even proposed designing more benchmarks around models' execution depth, to better measure the benefits of model scaling.

Here is the detailed content of the paper.

Detailed explanation of the paper's method

In the paper, the authors detailed how they reached each of their conclusions.

Although the returns of single-step accuracy are diminishing, scaling still has value

The authors first analyzed the relationship between a model's single-step accuracy and the task horizon it can complete. To derive the mathematical relationship, they made two simplifying assumptions, similar to those of LeCun (2023). First, they assumed the model's step accuracy remains constant throughout the task. Second, they assumed the model does not self-correct, so any single error causes the task to fail. These assumptions are made only for this analysis, to provide useful intuition; the empirical analysis goes further, examining how, in practice, LLMs fail to maintain a stable step accuracy during long-horizon execution and how they may correct errors.

Proposition 1: Assuming a constant step accuracy p and no self-correction, the task length H at which the model achieves a success rate s is given by

$H_s = \frac{\ln s}{\ln p},$

since the task succeeds only if all H steps succeed, i.e., $s = p^{H}$.

The authors plot this growth function for s = 0.5 in Figure 2. Note that once step accuracy exceeds about 70%, a small improvement in step accuracy brings a faster-than-exponential improvement in task length. This derivation shows that even though accuracy improvements appear to slow down on question-answering benchmarks, which usually contain short tasks, one can still mathematically expect substantial gains on longer tasks.
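To make Proposition 1 concrete, here is a small Python sketch (ours, not from the paper) that evaluates $H_{0.5}$ for a few step accuracies; roughly, every tenfold reduction in the per-step error rate buys a tenfold longer horizon.

```python
import math

def horizon_length(p, s=0.5):
    """Proposition 1: task length solvable at success rate s, given
    constant step accuracy p and no self-correction."""
    return math.log(s) / math.log(p)

for p in [0.90, 0.99, 0.999]:
    print(f"step accuracy {p:.3f} -> H_0.5 = {horizon_length(p):.1f} steps")

# Expected output: 0.900 -> 6.6 steps, 0.990 -> 69.0 steps, 0.999 -> 692.8 steps
```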

For example, on software engineering tasks, Kwa et al. (2025) empirically observed that the length of tasks frontier models can complete at s = 0.5 is growing exponentially, doubling every 7 months. Using the result above, the authors show in Figure 1 that this exponential growth in task length can persist even under a regime of diminishing returns in step accuracy. Setting s = 0.5 gives

$H_{0.5} = -\frac{\ln 2}{\ln p}.$

Therefore, to sustain exponential growth of $H_{0.5}$ over time x (doubling every 7 months), the required step accuracy is

$p(x) = \exp\!\left(-\frac{\ln 2}{H_{0.5}(x)}\right), \quad H_{0.5}(x) \propto 2^{x/7},$

and the improvement in p required per unit time is indeed a decreasing function of x.
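As a quick numerical illustration (our sketch, with an assumed starting horizon of 10 steps at month 0), the accuracy gain needed every 7 months to keep the horizon doubling keeps shrinking:

```python
import math

H0 = 10.0  # assumed horizon (in steps) at month x = 0

def required_accuracy(x_months):
    """Step accuracy needed so that H_0.5 = H0 * 2**(x/7), per Proposition 1."""
    horizon = H0 * 2 ** (x_months / 7)
    return math.exp(-math.log(2) / horizon)

prev = required_accuracy(0)
for month in range(7, 43, 7):
    cur = required_accuracy(month)
    print(f"month {month:2d}: p = {cur:.5f} (gain {cur - prev:+.5f})")
    prev = cur
# The printed gains shrink each period (+0.033, +0.017, +0.009, ...),
# yet the task horizon keeps doubling.
```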

The authors noted that human labor is often paid by time. If the economic value of an intelligent agent also stems from the length of tasks it can complete, then single-round or short-task benchmarks may not be reliable references for evaluating the benefits of further investing in the computational resources of large language models. These benchmarks may give the illusion of slow progress, while the authors believe that the more economically valuable indicator - the length of tasks that the model can complete - is actually still growing rapidly.

Isolating execution by decoupling planning and knowledge

Next, the authors described how to empirically measure the long-range task execution ability of the model.

First, the team gives a motivating example: an agent for the popular and economically valuable task of booking flights.

After receiving the search results, the agent must evaluate the displayed flights to decide which one to book. The plan for evaluating a single flight option might involve a series of operations: viewing detailed information; verifying that the flight times, baggage allowance, and airline reviews meet the user's preferences; applying any available discounts or reward programs; and finally choosing based on cost and travel time. Each of these individual steps requires retrieving some information and combining it with the current information state in order to evaluate a flight option, and both operations require knowledge. Successfully evaluating multiple flight options constitutes executing the plan, up to the final booking decision.

This paper focuses on execution because the authors see it as a key component of long-horizon task completion. Traditionally, execution has received less attention than reasoning, planning, and world knowledge, which dominate discussions of LLM abilities. This relative neglect matters because execution failures end up being wrongly attributed to limitations in reasoning or planning. The view may stem from the idea that execution is simple or trivial; after all, this is what machines have always been good at, and once humans learn how to complete a task, they too are quite reliable at executing it and may even improve with practice. However, since LLMs come with no guarantee of correctness, the authors hypothesize that execution over long horizons may be surprisingly challenging for them. Their conjecture:

Even if reasoning, planning, and world knowledge are all perfect, LLMs will still make mistakes during long-term execution.

To prove this, they isolated execution failures by explicitly providing the necessary knowledge and plan. They chained together the "retrieve then combine" steps from the flight-selection agent example above. Each step consists of retrieving the relevant information or tool specified by the plan, then combining its output to update the current state. The plan decides what to retrieve and how to combine, while execution actually performs these operations. This fits a natural abstraction: a key-value dictionary. A key, as a step in the plan, specifies the knowledge to retrieve or the tool to call, while the value represents the output of that knowledge or tool, which must then be combined with the current state.

In this study, the authors provide the plan as the keys in each query, removing any need for the LLM's planning ability. They also provide the key-value dictionary in the context, eliminating any reliance on the model's parametric knowledge. With this design, the authors directly control two dimensions whose product gives the task length (the number of "retrieve then combine" steps): the number of rounds, and the round complexity K, adjusted by changing the number of keys queried per round.
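The construction can be illustrated with a short Python sketch (ours, not the authors' released code; the summation combine operator, key names, and value range are assumptions for illustration): the dictionary is given in context, each round names K keys to retrieve, and the ground-truth trace folds the retrieved values into a running state.

```python
import random

rng = random.Random(0)
# The key-value dictionary provided in context (no parametric knowledge needed).
dictionary = {f"key{i}": rng.randint(-9, 9) for i in range(100)}

def build_rounds(num_rounds, keys_per_round):
    """Each round lists K keys to retrieve; rounds * K = task length."""
    keys = list(dictionary)
    return [[rng.choice(keys) for _ in range(keys_per_round)]
            for _ in range(num_rounds)]

def reference_execution(rounds):
    """Ground-truth trace: retrieve each key's value and combine it
    (here, by summation) with the running state, round by round."""
    state, trace = 0, []
    for round_keys in rounds:
        for key in round_keys:
            state += dictionary[key]   # retrieve, then combine
        trace.append(state)            # answer expected after this round
    return trace

rounds = build_rounds(num_rounds=5, keys_per_round=2)  # task length = 10 steps
print(reference_execution(rounds))
```

The model under test would receive the dictionary and the per-round keys as context and be scored round by round against this reference trace.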

Experimental results

In the experimental part, the authors reached the following core conclusions:

  • Long-horizon task execution is challenging: significantly increasing the model size greatly increases the number of rounds the model can execute correctly.
  • The model conditions on its own earlier mistakes in the context (self-conditioning), which lowers per-step accuracy; increasing model size alone is not enough to alleviate this.
  • Thinking models overcome the self-conditioning limitation and can execute significantly longer tasks.