Cambridge Opens the Black Box of Large Model Failures: Stop Blaming Their Reasoning, It's the Execution That Goes Wrong
[Introduction] Why do large models so often fail on long-horizon tasks? The pattern has led some experts to question their reasoning ability, suspecting they offer only an "illusion of thinking". A recent study from the University of Cambridge and other institutions argues that the problem lies not in reasoning but in the execution ability of large models.
Large models have their meltdown moments, too.
For example, while debugging a compilation error in Cursor, Gemini fell into a self-blaming loop, repeating "I am a disgrace" 86 times.
Although large models have made great strides in complex reasoning, episodes like these still lead some experts to conclude that:
Thinking models offer only an "illusion of thinking", because they ultimately fail once tasks grow long.
Recently, a study from the University of Cambridge and other institutions set out to explain these "failures". The researchers argue that:
The problem lies not in the reasoning ability of large models but in their ability to execute plans.
Paper: The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Link: https://arxiv.org/pdf/2509.09677
In other words, the trouble with large models may stem not from an "illusion of thinking" but from "slippage" during the execution stage.
The researchers found that small gains in single-step accuracy compound: the total number of steps a model can complete grows much faster, so its task "mileage" increases exponentially.
As the number of steps grows, the model's per-step accuracy declines. This is not merely because "the context gets too long"; there is also a stranger phenomenon at work: the self-conditioning effect.
"Self-conditioning" means that when the context contains the model's own earlier mistakes, the model becomes more likely to make further mistakes later on.
Do large models fail when tasks get longer because they can't reason?
The industry is racing to build agents that can handle entire projects rather than isolated problems. This raises a fundamental question:
How do we measure the number of steps a large model can reliably execute?
Do large models fail when tasks get longer because they "can't reason"?
The researchers argue that since large models can correctly follow instructions through the early steps of a multi-step process, they clearly have the ability to execute according to a plan.
This suggests that large models fail not at reasoning but at execution:
As tasks get longer, models become increasingly likely to slip while carrying out their plans.
A large body of current research focuses on the reasoning ability of large models, while execution stability receives far less attention.
This gap matters more and more as large models are deployed on long-horizon reasoning and agentic tasks.
A little more stability per step, a much longer run on long-horizon tasks
Long-horizon tasks require many steps; task length is the number of steps required to complete the task.
The researchers evaluate performance with the following metrics (a computational sketch follows the list):
- Step Accuracy: the fraction of samples in which the state update from step i-1 to step i is correct, regardless of whether the model's state at step i-1 was itself correct;
- Turn Accuracy: one turn is a single interaction with the model, which may require executing several steps. Turn accuracy is the fraction of samples in which the state update from turn t-1 to turn t is correct, regardless of whether the state at turn t-1 was correct;
- Turn Complexity (K): the number of steps the model must execute in each turn;
- Task Accuracy: the fraction of samples in which the model completes all i steps without making a single mistake;
- Horizon Length (H_s): given a success threshold 0 ≤ s ≤ 1, the horizon length is the step count i at which the model's expected task accuracy first drops below s.
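To make these definitions concrete, here is a minimal sketch (not the paper's code) that computes step accuracy, task accuracy, and horizon length from simulated per-step outcomes; the sample count, step count, and per-step accuracy p below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_steps, p = 1000, 100, 0.98          # hypothetical values
# correct[n, i] says whether sample n's state update at step i was correct;
# here we simulate i.i.d. step outcomes with per-step accuracy p.
correct = rng.random((n_samples, n_steps)) < p

# Step accuracy at step i: fraction of samples whose update i is correct,
# regardless of earlier mistakes.
step_acc = correct.mean(axis=0)

# Task accuracy at step i: fraction of samples with no mistake up to step i.
task_acc = np.cumprod(correct, axis=1).mean(axis=0)

def horizon_length(task_acc, s=0.5):
    """First step at which task accuracy drops below the threshold s."""
    below = np.nonzero(task_acc < s)[0]
    return int(below[0]) if below.size else len(task_acc)

print(f"H_0.5 = {horizon_length(task_acc)} steps")  # ~ ln(0.5)/ln(0.98) ≈ 34
```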
As Figure 2 shows, the length of task a model can execute at over 50% accuracy grows faster than exponentially once single-step accuracy exceeds 70%.
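This follows from a back-of-the-envelope model consistent with the paper's setup: if each step succeeds independently with probability p, task accuracy after i steps is p^i, so the horizon length at threshold s solves p^(H_s) = s:

```latex
\text{TaskAcc}(i) = p^{\,i}
\qquad\Longrightarrow\qquad
H_s = \frac{\ln s}{\ln p}
```

At s = 0.5, raising p from 0.9 to 0.99 lifts H_s from about 7 steps to about 69, and p = 0.999 gives about 693: because ln p vanishes as p approaches 1, each extra point of step accuracy buys disproportionately more task length.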
Figure 3 shows how a "long-horizon task" can be abstracted into a series of controllable small steps, and how execution ability can be measured on its own, without involving planning ability.
In the left panel, the framework models the long-horizon task as a series of "retrieve-then-synthesize" steps.
In the right panel, the researchers design a simple task that decouples planning from execution:
In each turn, the plan supplies a key, and the model must retrieve the corresponding value and update a cumulative sum.
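A minimal sketch of one such task instance, assuming a toy dictionary and one key per turn (the paper's actual prompts and values differ):

```python
# Knowledge is handed to the model up front as a key-value dictionary;
# the "plan" in each turn is simply which key to add next, so only
# execution (retrieve the value, update the sum) is being tested.
knowledge = {"apple": 3, "pear": 5, "plum": 2, "fig": 7}

def run_episode(model_step, keys):
    """Feed one key per turn (turn complexity K = 1) and count how many
    turns the model executes before its reported running sum first slips."""
    running_sum = 0
    transcript = []
    for turn, key in enumerate(keys, start=1):
        running_sum += knowledge[key]          # ground-truth state update
        answer = model_step(key, transcript)   # model's reported sum
        transcript.append((key, answer))
        if answer != running_sum:              # first mistake ends the task
            return turn - 1
    return len(keys)

# An oracle "model" that never slips; real models degrade as turns pile up.
def oracle(key, history):
    return sum(knowledge[k] for k, _ in history) + knowledge[key]

print(run_episode(oracle, ["apple", "fig", "pear", "plum"]))  # -> 4
```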
The derivation above shows that even when accuracy gains on question-answering tasks appear to be slowing, substantial benefits can still be expected on longer tasks.
For example, on software-engineering tasks, the horizon length of frontier models at s = 0.5 has been growing exponentially, doubling roughly every 7 months.
The researchers argue that single-turn or short-task benchmarks can create an illusion of "slowing progress" when used to judge the returns on further investment in LLM compute. The length of task a model can complete is a better proxy for economic value, and it may be growing rapidly.
Test execution ability alone: strip out planning and knowledge first
The researchers feed the model both "what to do" (the plan) and "what it needs to know" (the knowledge), and test only whether it can carry out the steps in one run without stumbling.
This isolates and measures the long-horizon execution ability of an LLM.
Take booking a flight as an example.
In reality, booking a flight is not just saying "book one for me"; it is a series of steps:
- Open the details of a certain flight;
- Check the departure and arrival times, baggage allowance, transfer duration, on-time rate, and reputation;
- Apply mileage, membership, and coupons;
- Make a choice based on the trade-off among "price × duration × preference".
Each step requires first retrieving information or calling a tool, then integrating the new information into the current judgment.
Evaluating one flight is a single execution; evaluating multiple alternative flights until placing an order is long-horizon execution.
People often attribute execution failures to an "inability to reason or plan".
The researchers argue that even with perfect reasoning, planning, and world knowledge, an LLM can still make mistakes in long-chain tasks because its execution is unstable.
They therefore isolate and test execution by explicitly providing the plan and the knowledge, asking the model only to follow them.
The researchers first verified the following hypothesis:
Long-horizon execution is difficult even in tasks that require no world knowledge or planning. They then studied how scaling up model size pays off for long-horizon execution.
The researchers evaluated the Qwen3 and Gemma3 model families.
In the experiment, the researchers set turn complexity to its simplest form (K = 1), providing exactly one key per turn and varying the number of turns.
Result 1: Long-horizon execution is still very challenging.
As Figure 4 shows, every model except Gemma3-4B and Qwen3-4B achieved 100% accuracy on the first step, indicating that they have the knowledge and reasoning ability required to complete a single step of the task.
However, task accuracy declined rapidly over subsequent turns.
Even the best performer, Qwen3-32B, saw its accuracy drop below 50% within 15 turns.
This confirms the researchers' hypothesis:
Even after removing the need for planning and knowledge, long-horizon execution is still difficult.
In Figure 4, the researchers vary model scale and track full-task accuracy (a) and per-turn accuracy (b) as the number of turns increases.
The bold lines are 5-turn moving averages.
The per-turn accuracies (dotted lines in b) show that single-step accuracy is 100% for all models except the smallest.
However, as turns accumulate, the performance gap between small and large models widens (a), and the larger models sustain a significantly longer horizon length (c).
Result 2: The returns to scaling up model size do not diminish.
As Figure 4(a) shows, larger models maintain higher task accuracy over more turns, yielding a clear scaling trend in horizon length (Figure 4(c)).
This verifies two important conclusions:
Long-horizon execution is difficult;
Scaling up model size significantly increases the number of turns a model can execute correctly.
Self-conditioning effect: why does turn accuracy degrade?
One might expect the model's per-turn accuracy to stay constant across turns.
However, Figure 4(b) shows that per-turn accuracy declines steadily as the number of turns increases.
The researchers examined two competing hypotheses:
1. The model's performance degrades only because the context grows longer, regardless of its content.
2. The model self-conditions on its past mistakes: after seeing errors in earlier turns, it becomes more likely to err in later turns.
To disentangle these two factors, the researchers ran counterfactual experiments that manipulate the model's chat history.
They injected artificial output histories with chosen error rates, controlling the error rate while keeping the format consistent.
If the history is fully "healed" (an induced error rate of 0%) and the model's accuracy still degrades between the first turn and a later turn, the degradation can be attributed to the long-context problem.
If, holding the later turn fixed, the model's accuracy keeps deteriorating as the error rate of earlier turns increases, the model is conditioning on its past mistakes, making future mistakes more likely.
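A minimal sketch of how such a counterfactual history might be constructed; the message format, toy task, and injection scheme are illustrative assumptions, not the paper's exact protocol:

```python
import random

# Build a well-formatted synthetic chat history for the key-value summation
# task in which a chosen fraction of the assistant's past answers is wrong.
knowledge = {"apple": 3, "pear": 5, "plum": 2, "fig": 7}

def inject_history(keys, error_rate, rng):
    """Return a chat history whose assistant turns are deliberately wrong
    with probability error_rate, while the format stays identical."""
    messages, true_sum = [], 0
    for key in keys:
        true_sum += knowledge[key]
        shown = true_sum
        if rng.random() < error_rate:          # corrupt this past answer
            shown += rng.choice([-2, -1, 1, 2])
        messages.append({"role": "user",
                         "content": f"Add the value of '{key}' and report the running sum."})
        messages.append({"role": "assistant", "content": str(shown)})
    return messages

rng = random.Random(0)
keys = ["apple", "fig", "pear"]
healed = inject_history(keys, error_rate=0.0, rng=rng)     # isolates context length
corrupted = inject_history(keys, error_rate=0.5, rng=rng)  # adds past mistakes
# Querying the model with the same next turn under both histories, at matched
# length, separates the long-context effect from the self-conditioning effect.
```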