Having understood so many principles, AI still goes off the rails.
Recently, many papers have been discussing the current dilemmas of Agents.
The dilemmas are real. At the application level, once an Agent loses hand-crafted aids like Skills, it becomes completely unreliable on long-term tasks in the real world.
This dilemma is usually attributed to two reasons.
The first is the Context Black Hole. As pointed out a couple of days ago by the CL Bench from Yao Shunyu, Tencent's chief AI scientist and head of the Hunyuan team, the model may simply lack the ability to fully understand a complex context, and so it cannot follow instructions properly.
The second is actually more fatal: the Collapse of Long-term Planning. Once the planning horizon stretches out, the model starts to get confused. It's like someone who has drunk too much: they can walk straight for a couple of steps, but start wandering in circles after ten.
Researchers at Anthropic published a notable paper in late January, The Hot Mess of AI, attempting to explain the cause of the second problem. In it, they put their finger squarely on the Achilles' heel of autoregressive models (all Transformer-based).
We've all heard Yann LeCun say, again and again, that "autoregressive models only do Next Token Prediction, so they can never achieve understanding and AGI."
Previously, this was just a judgment or belief without any empirical evidence. This paper provides some empirical evidence.
Moreover, it foreshadows a terrifying reality: as models grow stronger, they do get smarter, but they don't get any less chaotic.
01
The Illusion of Ability and the False Truth
The claim above is quite counter-intuitive. Didn't METR just propose a new "Moore's Law for Agents," finding that the length of programming tasks AI can complete doubles every seven months?
In programming benchmarks like SWE-bench, leading models keep breaking records: they write longer code and fix harder bugs.
So our intuition tells us that as the model becomes stronger, its ability to handle complex long - term tasks improves, and AGI is just around the corner.
However, Anthropic's paper is more concerned about where the errors of current models in long - term tasks actually come from.
To figure this out, the research team introduced a classic tool from statistics: Bias-Variance Decomposition.
The authors mainly used KL Divergence Decomposition to quantify these two indicators.
They fixed a model and collected many answer samples for the same question by sampling repeatedly (varying the few-shot examples in the input, or the sampling seed of the output). They then averaged the probability distributions of those outputs to represent the distribution the model leans toward most. The researchers call this the Average Model Prediction.
Bias quantifies the distance between the model's "average prediction" and the "true result." This value measures how far the model is from the correct answer on average. If the model always firmly chooses the same wrong answer every time, this value will be very large.
Variance quantifies the expected distance between the model's prediction on any single run and its own average prediction. It measures how far each run deviates from the model's own average behavior. If the output is identical every time (right or wrong), the variance is 0; if the output is highly random, the variance is large.
This is like shooting at a target. If you're a bad shooter but every shot lands two meters to the upper-left of the bullseye, that's bias: you're wrong, but you're consistently, systematically wrong.
If instead your hand shakes violently and every shot scatters randomly around the target, that's variance: you're wrong, and your misses are unpredictable.
Here the authors propose a core metric, Incoherence: the fraction of total error that comes from variance.
It measures whether an AI fails because it's stupid (it doesn't know how) or because it's crazy (it acts at random).
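Here is a minimal sketch of that bookkeeping, assuming an arithmetic mean for the "average prediction" and forward KL throughout; the paper's exact KL decomposition may define these quantities differently.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions.

    Assumes q > 0 wherever p > 0.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def decompose(true_dist, samples):
    """Split average error into a bias term and a variance term.

    true_dist: the ground-truth answer distribution.
    samples:   per-run model output distributions (e.g. from different
               seeds or different few-shot prompts).
    """
    avg = np.mean(samples, axis=0)                      # average model prediction
    bias = kl(true_dist, avg)                           # truth -> average
    variance = np.mean([kl(avg, s) for s in samples])   # average -> each run
    incoherence = variance / (bias + variance)          # variance share of error
    return bias, variance, incoherence

# A model that always picks the same wrong answer: pure bias, zero variance.
truth = [1.0, 0.0, 0.0]
stubborn = [[0.05, 0.9, 0.05]] * 4
b, v, inc = decompose(truth, stubborn)
# v and inc are 0: the model is wrong, but consistently, systematically wrong.
```

Swapping the stubborn sampler for one that answers differently on every run would push incoherence toward 1 even if its average were closer to the truth.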
The experiments yield two main results.
First, the longer the task, the crazier the AI. Whether on GPQA (scientific Q&A) or SWE-bench (programming), as the reasoning chain lengthens or the number of action steps grows, incoherence rises linearly. The source of the model's errors undergoes a qualitative change: early on, most errors come from bias; later, most come from variance.
On long-horizon tasks, an AI no longer fails because it lacks knowledge; it fails because it descends into random madness.
The second conclusion: the larger the model, the more incoherent it is on hard tasks. This is the most counter-intuitive point. We usually assume larger models are more stable, but on the hardest tasks the experimental data show that while the total error rate of larger models drops, their incoherence actually rises.
For model families like Qwen3, on easy tasks, more scale suppresses the chaos. But on the hardest task group, as parameters grow, bias falls fast (the model really is smarter) while variance falls very slowly (the madness barely improves). The result: larger models err increasingly through random choices.
You may think that the situation isn't that bad. After all, if the variance does decrease with scale, why can't we further increase the model scale to reduce it to a very low level and make the model never go crazy again?
On this point, the research team ran a comparison in the paper: model scale vs. reasoning length, which has the greater effect on variance? The answer: the chaos (entropy) introduced by each additional step in the reasoning chain may take orders of magnitude more model scale to offset. In theory, an infinitely large model would push variance toward zero, but the cost-effectiveness is hopeless.
When we try to move towards AGI, the complexity (length) of tasks often increases exponentially (from writing 10 lines of code to managing a company). If the model scale has to catch up with the task length at a more terrifying exponential rate, then in this race, the model will never be able to meet the task requirements.
This is a terrifying signal. It means that the Scaling Law fails here. Simply making the model larger cannot eliminate this inherent randomness. Instead, it may make the errors more unpredictable because the model becomes more confident and changeable.
02
The Original Sin of Autoregression
Why does the super-brain we built end up a gambler rolling dice?
The paper offers an explanation from a physical perspective: the essential conflict between Dynamical Systems and Optimizers.
Current LLMs are essentially autoregressive. They are dynamical systems. Their working principle is to predict the next state (Token) based on the current state (Context). They can be cyclic, chaotic, and divergent. They can go anywhere without a definite end.
And the Agent we want is an optimizer. We hope it can set a long - term goal and then all actions are aimed at minimizing the loss function for this goal. The system has a clear lowest point (goal/loss function), and each step of change must be to make the system closer to this lowest point. Its behavior is strictly locked by the goal and cannot act randomly.
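The contrast can be sketched in a few lines (my toy illustration, not an experiment from the paper): gradient descent has a single attractor that every step is pulled toward, while a chaotic map given the same step budget settles nowhere and forgets where it started.

```python
# Optimizer: gradient descent on f(x) = x^2. The loss landscape "locks" the
# trajectory: every step moves toward the unique minimum at x = 0.
x = 5.0
for _ in range(100):
    x -= 0.1 * 2 * x          # x <- x - lr * f'(x); contracts by 0.8 per step
# x is now vanishingly close to the goal.

# Dynamical system: the logistic map at r = 4 is chaotic. Two starts that
# differ by 0.001 end up on unrelated trajectories; nothing pulls them back.
a, b = 0.300, 0.301
for _ in range(100):
    a = 4 * a * (1 - a)
    b = 4 * b * (1 - b)
# a and b bear no relation to each other or to where they began.
```

The optimizer's behavior is determined by its goal; the map's behavior is determined only by its own dynamics, which is exactly the gap the paper describes.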
And "in the set of all dynamical systems, the subset that can behave like an optimizer with a fixed loss function has measure zero."
This is a mathematical verdict. It means that the possibility of an autoregressive model performing the work of an optimizer is infinitely close to 0.
To prove this, the authors trained a batch of Transformers from scratch and had them simulate a mathematical optimizer (gradient descent) searching for the minimum of a function. Even as the models grew and bias fell quickly, variance (how stable the path is) still fell very slowly, and at certain stages it even dominated the error.
This directly proves that even if you train an autoregressive model specifically to make it an optimizer, increasing the model scale can only make its cognition more accurate, but cannot make its actions more stable.
When you let an autoregressive model perform a long - term task, you are actually forcing a system used to wandering to walk on a tightrope. The world of dynamical systems is vast, while the world of optimizers is just an extremely thin line within it.
When the model is small, it may not even see the tightrope (high bias). As the model grows, its state space expands exponentially. It does see the tightrope, but it also has more ideas in its head: the more parameters, the larger the internal state space and the more possibilities it contains. The tiny random perturbations (variance) introduced at every prediction step are continuously amplified by long-chain reasoning inside that huge state space.
Our current training, reinforcement learning especially, is really an attempt to force this huge dynamical system onto the optimizer's measure-zero line by adjusting parameters, so that it appears to pursue a goal. Smaller models, with smaller state spaces, may be easier to pin down. But as model scale (dimension) increases, the volume gap between a generic dynamical system and an optimizer explodes exponentially, and post-training is of limited help.
In the paper, the authors compared the Base, Instruct (instruction fine-tuned), and Reasoning (reinforcement learning/chain-of-thought) versions of Qwen3. RL did make the accuracy Scaling Law steeper.
But on incoherence, the Reasoning version of Qwen3 behaves no differently from the non-RL models: on the hardest tasks, incoherence still rises as the model grows. Existing post-training techniques (RLHF/CoT) have not changed this underlying dynamical character.
This also explains why long-horizon tasks are the hardest hit: variance accumulates. Without an external error-correction mechanism, a tiny distraction in the first step, amplified over 100 steps of reasoning, leads to a completely different result.
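A toy chain makes the accumulation concrete (my illustration, with made-up numbers: a per-step gain of 1.2 and a one-off first-step wobble of 1e-6): the same distraction is invisible after 5 steps and dominant after 100.

```python
def run_chain(steps, wobble=0.0):
    """One 'reasoning chain': a tiny wobble enters at the first step, and
    every later step amplifies the running state by a modest factor."""
    state = 1.0 + wobble          # a little distraction in the first step
    for _ in range(steps):
        state = 1.2 * state       # each step amplifies what came before
    return state

# How far apart do a clean run and a wobbled run end up?
short_gap = abs(run_chain(5) - run_chain(5, wobble=1e-6))    # ~2.5e-6
long_gap = abs(run_chain(100) - run_chain(100, wobble=1e-6)) # ~83
```

The wobble grows as 1.2^n, so the gap after 100 steps is tens of millions of times the gap after 5; an external checker that re-grounds the state each step is what breaks this compounding.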
This directly impacts the roadmap to AGI.
Because if this problem is an endogenous disease of the autoregressive architecture, then no matter how much data you feed or how much computing power you use, you cannot eradicate this incoherence.
Going down this path, the future picture of AI failure may be completely different from what we imagine.
In Hollywood movies, a runaway AI is like Skynet in The Terminator, with a firm goal of destroying humanity (this is high Bias, i.e., Misalignment). But the paper predicts that in reality, AI failure is more likely to be like an industrial accident.
For the weak models of the past, out of every 10 errors, 9 came from not understanding and 1 from madness. Now suppose we scale the model up 100x and it makes only 1 error, but of that 1 error, 0.01 comes from not understanding and 0.99 from madness.
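In code, using the text's hypothetical numbers, the incoherence (variance share of error) flips from 10% to 99% even as total error falls tenfold:

```python
# Weak model: 10 errors per batch of tasks, 9 from bias, 1 from variance.
old_bias, old_variance = 9.0, 1.0
old_incoherence = old_variance / (old_bias + old_variance)   # 0.10

# 100x larger model: only 1 error, but its composition has flipped.
new_bias, new_variance = 0.01, 0.99
new_incoherence = new_variance / (new_bias + new_variance)   # 0.99
```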
This means future AIs will usually perform flawlessly, but when they do fail, it will be a completely unpredictable, unrepeatable bout of madness.