Is the path of pre-training to AGI dead? Yann LeCun reveals the cognitive gap that large language models (LLMs) can't cross.
For years, Yann LeCun, one of the three giants in the field of artificial intelligence and the chief AI scientist at Meta, has been skeptical of the technical approach behind mainstream large language models (LLMs).
In his words: "Autoregressive models are terrible."
He believes that the current mainstream autoregressive models, whose core task is to generate text by predicting the next word, cannot in principle give rise to true intelligence: no matter how large the model grows, this mechanism alone cannot achieve genuine understanding, reasoning, or human-like intelligence.
For a long time, however, his view was dismissed as a "factional struggle" over research directions. Lacking direct empirical support, he was even accused of simply lobbying for resources for the "world model" research he leads.
Just this month, the strong results reported in the JEPA 2 paper finally vindicated him.
And a new, influential research paper co-authored by him, "From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning," now provides solid theoretical evidence for his long-standing criticism.
Paper source: arXiv:2505.17117, "From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning"
This research shows that although large language models are far from being "stochastic parrots" that merely imitate, the way they understand the world is profoundly, perhaps fundamentally, different from the way humans do.
More importantly, this difference may not be bridgeable by the "Scaling Law" recipe of simply expanding model scale and data volume. It touches the underlying foundation of the current AI paradigm.
If we follow the LLM path, achieving AGI may genuinely be impossible.
Create a ruler to measure the difference between human and LLM thoughts
So how did the researchers turn an almost philosophical question, "What's the difference between machine understanding and human understanding?", into a measurable, quantifiable scientific one?
Instead of directly defining the vague term "understanding," they took a different approach and chose to measure the information organization strategies behind "understanding."
Therefore, they designed a tool that serves as a "cognitive efficiency scorer" for different kinds of intelligence.
The task of this scorer is to evaluate the "work quality" of any intelligent system (whether it's the human brain or an AI) when organizing information. High-quality work requires achieving a perfect balance between two conflicting goals: extreme compression of information (Complexity) and faithful preservation of meaning (Distortion).
It's like organizing a large library. You hope that the classification labels (such as "science fiction," "history") are as few and precise as possible, making the whole system clear at a glance. A highly compressed system means you can grasp the whole picture with little information, and its "complexity cost" is low.
But while pursuing simplicity, you don't want to lose too many details. For example, you can't simply put "whales" and "tunas" into the same "fish" box just because they both live in water, ignoring the essential difference between mammals and fish. Any classification will cause "distortion" of the original information, and the "distortion cost" measures this kind of loss in meaning.
The final score of this scorer, which we call L, is the sum of the "complexity cost" and the "distortion cost."
For a perfect system, its L score should be as low as possible, indicating that it preserves the original meaning of things to the greatest extent in the most economical way.
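To make the idea concrete, here is a minimal sketch in Python of how such a score could be computed for a given grouping of items. It follows the article's informal description (complexity cost plus distortion cost); the entropy and centroid-distance choices below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l_score(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Toy 'cognitive efficiency' score: complexity cost + distortion cost.

    complexity: entropy (in bits) of the cluster-size distribution, i.e. how
                much information the labelling itself carries;
    distortion: mean squared distance of each item to its cluster centroid,
                i.e. how much meaning the grouping throws away.
    Both terms are illustrative stand-ins for the quantities the paper defines.
    """
    # Complexity cost: Shannon entropy of the cluster assignment.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    complexity = float(-np.sum(p * np.log2(p)))

    # Distortion cost: average squared distance to the assigned centroid.
    squared_error = 0.0
    for c in np.unique(labels):
        members = embeddings[labels == c]
        squared_error += float(np.sum((members - members.mean(axis=0)) ** 2))
    distortion = squared_error / len(embeddings)

    return complexity + distortion
```

A lower score means a grouping that is both compact and faithful to the original geometry; the point of such a scorer is that human and LLM groupings can be evaluated with one and the same objective and compared directly.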
LLMs and the human brain have fundamental differences in understanding
Armed with this powerful "ruler," which can measure both macro-level system complexity and micro-level category purity, the researchers designed three experiments to measure the gap between the human brain and LLMs.
They selected multiple well-known model families in the industry for these experiments: six Llama series models (1 billion to 70 billion parameters), five Gemma series models (2 billion to 27 billion), thirteen Qwen series models (500 million to 72 billion), four Phi series models, and a 7-billion-parameter Mistral model.
Conclusion of the first experiment: Models can form abstract "class" concepts
The first experiment was to see from a macro perspective whether the concept categories spontaneously formed by LLMs were similar in overall structure to human classification habits.
They let a series of LLMs process the classic words used in cognitive psychology experiments, clustered their word embeddings, and then compared the results with human classifications.
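A hedged sketch of that procedure: embed the stimulus words, cluster the embeddings, then compare the resulting partition with the human categories using a chance-corrected agreement measure such as adjusted mutual information (AMI, the metric the article mentions later). The embedding model name and word list below are placeholders, not the paper's actual materials.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
from sentence_transformers import SentenceTransformer  # any embedding source would do

# Placeholder stimuli: words together with their human-assigned categories.
words = ["apple", "banana", "cherry", "sofa", "table", "lamp", "car", "bus", "train"]
human_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # fruit / furniture / vehicle

# 1. Obtain embeddings from some model (the model name is a placeholder).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(words)

# 2. Cluster the embeddings into as many groups as there are human categories.
llm_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# 3. Compare the model's partition with the human one (1.0 = identical, ~0 = chance).
ami = adjusted_mutual_info_score(human_labels, llm_labels)
print(f"Adjusted mutual information vs. human categories: {ami:.2f}")
```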
The results showed remarkable consistency. Large or small, the models could generally identify and group the members of concepts such as "fruits," "furniture," and "vehicles" correctly. Their clustering results were very close to human judgments and significantly above chance. This shows that LLMs are not just talking nonsense: they really have learned deep semantic associations from massive amounts of text. The scene seems to suggest that AI is steadily approaching human intelligence.
The figure shows the similarity between each LLM's word clustering and the human clustering; most models score well above the chance baseline.
Among them, BERT's clustering is the most similar to the human one.
Conclusion of the second experiment: Models can't tell fine-grained differences within categories
Superficial similarity doesn't tell the whole story.
When the researchers delved into the interior of each category, problems emerged. The second question was: Can LLMs understand the fine semantic structure within a category, such as "typicality"?
For humans, a category has a "center of gravity." "Sparrows" are obviously more typical "birds" than "ostriches" or "penguins." This judgment comes from our rich, multi-modal, real-world experience. We know that birds usually fly, are small in size, and can chirp. But do LLMs have this "feeling"?
All concepts are grouped together without hierarchical distinction
The answer is no.
The research found that although the internal representations of LLMs group sparrows and penguins together, they cannot reliably reflect the key semantic detail that the former is more representative of the category than the latter. In the "eyes" of an LLM, the members of a category are more like a set of points at different distances from the center but of roughly equal status, lacking the strong "prototype" or "exemplar" structure of human cognition.
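One simple way to probe this, sketched below, is to treat the similarity between an item's embedding and the embedding of its category label as a proxy for model-internal typicality, then check how well those values correlate with human typicality ratings via a Spearman correlation. The model name is a placeholder and the human ratings are invented purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

birds = ["sparrow", "robin", "eagle", "ostrich", "penguin"]
# Hypothetical human typicality ratings (higher = more typical bird); invented for illustration.
human_typicality = [6.5, 6.3, 5.2, 2.1, 1.8]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
bird_vecs = model.encode(birds)
category_vec = model.encode(["bird"])[0]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Proxy for model-internal typicality: similarity of each member to the category label.
model_typicality = [cosine(v, category_vec) for v in bird_vecs]

rho, p_value = spearmanr(human_typicality, model_typicality)
print(f"Spearman correlation with human typicality ratings: {rho:.2f} (p = {p_value:.2f})")
```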
Conclusion of the third experiment: LLMs and the human brain adopt different compression strategies
These surface differences must stem from different underlying logics. The third experiment set out to answer what strategy each of the two intelligences adopts when facing the fundamental trade-off between compression and meaning.
At this point, the "efficiency scorer" L finally played its decisive role. The researchers put both the human classification data and the clustering results of all the LLMs into this single, unified scoring framework.
The results were quite clear. All LLMs, from the smallest to the largest, without exception, obtained extremely low L scores. They are natural "kings of efficiency."
Their internal operating mechanism seems to be driven by an invisible force to find the optimal statistical compression scheme in the data, organizing information with the minimum complexity cost and distortion cost. In contrast, human cognitive data obtained significantly higher L scores and "lost miserably" in this pure statistical efficiency competition.
Left: the information entropy of human categories is generally higher than that of LLMs. Right: the L scores of humans are much higher than those of LLMs, indicating far weaker statistical compression.
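Mechanically, the comparison amounts to scoring both partitions of the same items with one function, as in the toy sketch below. The 2-D points stand in for embeddings; the actual finding that humans score higher comes from the paper's real data, not from this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "embeddings": three loose groups of points, with "human" category labels.
embeddings = np.vstack([rng.normal(center, 0.6, size=(20, 2))
                        for center in ([0.0, 0.0], [3.0, 0.0], [1.5, 2.5])])
human_labels = np.repeat([0, 1, 2], 20)

def l_score(X, labels):
    """Same toy score as before: assignment entropy + mean squared distance to centroid."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    complexity = float(-np.sum(p * np.log2(p)))
    distortion = sum(float(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2))
                     for c in np.unique(labels)) / len(X)
    return complexity + distortion

# "LLM-style" partition: the statistically optimal grouping found by k-means.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

print("L (human-style partition):", round(l_score(embeddings, human_labels), 3))
print("L (k-means partition):    ", round(l_score(embeddings, kmeans_labels), 3))
```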
This is the most profound insight of the whole paper: the "inefficiency" of the human cognitive system is not a defect but a manifestation of its power. Our brains did not evolve to be perfect compression software. Their primary task is to survive and reproduce in a complex, dynamic, and uncertain real world.
For this reason, our concept system must be flexible, rich, and malleable, capable of supporting us in conducting complex causal reasoning, functional judgment, and effective social communication.
This "redundancy" and "ambiguity" retained for "adaptability" naturally appear as "inefficiency" on a pure statistical scorer.
So the question is, can an intelligence that can't tell whether a penguin or a sparrow is more like a bird really understand the world, even if it's efficient?
Has the Scaling Law failed?
You may ask: what about the Scaling Law approach? Can expanding the number of parameters make the model's compressed representations rich enough to capture more complex semantic structure and become more human-like?
But a core finding of the paper is that the number of parameters is not the decisive factor in this fundamental strategic difference.
In the task of "aligning with human concept classification" (RQ1), larger models don't necessarily perform better. The research clearly points out that relatively small (about 340 million parameters) encoder models like BERT-large often perform as well as, or even better than, much larger decoder models.
In the second experiment, the scale effect was also not obvious: In the chart measuring the alignment degree, you can see that the performance points (AMI scores) are scattered, and there is no clear, continuously rising curve as the model size (from 500 million to 70 billion parameters) increases. This shows that simply increasing the number of parameters cannot guarantee that the model can better capture the structure of human concepts.
Therefore, the scale effect (Scaling Law) completely fails here.
This perfectly confirms Yann LeCun's core argument over the years.
It shows that the current autoregressive training paradigm of LLMs cannot produce human-like intelligence that can understand the world.
LLMs and humans play by completely different rules. One is a master of compression, and the other is a master of adaptation.
Simply feeding more "food" (increasing the number of parameters) to this "compression beast" of LLMs will only make it bigger and stronger, but it won't make it evolve into an "adaptive hunter." The "genes" of a species (i.e., model architecture and training paradigm) determine its basic survival strategy.
Is the LLM sentenced to "death"?
After thoroughly analyzing the profound gap between human and machine intelligence revealed by this research, an inevitable question arises: Does this mean that the technical route of current large language models represented by the GPT series has been sentenced to "death"?
The answer may be no.
Currently, there are actually three possible paths to break through this bottleneck. The first one has even become the mainstream in the industry.
First, the most direct path, which we can call "software-level" fine-tuning: introducing richer reward signals.
This is the improvement plan that the industry has invested the most in and is closest to real - world applications.
The core idea: since the autoregressive model itself is a powerful but value-agnostic "statistical engine," can we guide its behavior through a smart enough "navigation system," namely the reward model in reinforcement learning?
Theoretically, we can design an extremely precise reward mechanism that rewards the traits of in-depth thinking; this is in fact the path reinforcement learning currently takes. When the model can recognize and explain the "typicality" of concepts, construct clear causal reasoning chains, and admit the limits of its knowledge by expressing uncertainty, it is given a high "reward."
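As a purely illustrative sketch of this idea (the sub-scores and weights below are hypothetical and would in practice come from a learned reward model or human raters, not from any existing library), such a signal might combine several judged traits into one scalar for the RL step:

```python
def thinking_reward(typicality_score: float,
                    causal_score: float,
                    uncertainty_score: float,
                    weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Hypothetical composite reward: higher when a response recognizes and
    explains typicality, builds clear causal chains, and is honest about the
    limits of its knowledge. The three weights are arbitrary placeholders."""
    w_typ, w_causal, w_unc = weights
    return w_typ * typicality_score + w_causal * causal_score + w_unc * uncertainty_score

# Example: a response with strong causal reasoning but weak typicality awareness.
print(thinking_reward(typicality_score=0.2, causal_score=0.9, uncertainty_score=0.7))
```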
However, the models used in Yann LeCun's experiments are non-reasoning models. So whether richer reward signals can truly change their internal "statistical compression" representation strategy is, at present, unknown.
At least from the progress of current reinforcement learning, this patch is quite effective.
Second, there is a more radical and revolutionary "hardware-level" architecture innovation:
That is, fundamentally change the generation granularity of the autoregressive model. Since the linear mode of "generating word by word" has the inherent defect of short-sighted planning, can we force the model to "think carefully" before "speaking"?
A perfect example is the "Large Concept Models (LCMs)" framework proposed by Meta earlier this year. The design of this framework jumps from predicting the next "word" (Token) to predicting the next "concept" (Concept).
The appeal of this idea is that it is no longer satisfied with local, chained linguistic fluency; it requires the model, at the architectural level, to carry out higher-level global planning. But to achieve this, the model needs to become a dual system.
The "Production Model" in LCMs plays the role of "System 2" (planner/thinker). This module is the "brain center" of the whole system, responsible for slow, conscious, and in - depth thinking.
It doesn't directly generate words and sentences. Instead, it first carefully plans a series of "concept vectors" representing the outline of thoughts and logical processes in an abstract "concept space."
This step is equivalent to conducting logical planning, constructing causal chains, and designing grand narratives.
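Schematically, and only as a sketch of the idea as described here (the function names and the single-vector planner are assumptions, not Meta's actual LCM code), the contrast with word-by-word generation looks like this:

```python
import numpy as np

# Token-level autoregression: plan and "speak" in the same step, one word at a time.
def autoregressive_generate(next_token_fn, prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        tokens.append(next_token_fn(tokens))  # each step commits to the very next word
    return tokens

# Concept-level planning (System 2, schematic): first lay out a sequence of abstract
# concept vectors (the outline of the thought) before any words are produced.
def plan_concepts(next_concept_fn, context_vec, n_concepts):
    concepts = []
    current = np.asarray(context_vec, dtype=float)
    for _ in range(n_concepts):
        current = next_concept_fn(current)  # slow, deliberate, global planning
        concepts.append(current)
    return np.stack(concepts)  # shape: (n_concepts, concept_dim)

# Toy usage with a stand-in "planner" that just shifts the context vector.
outline = plan_concepts(lambda v: v + 0.1, context_vec=np.zeros(8), n_concepts=4)
print(outline.shape)  # (4, 8)
```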
The "Realization Model" in LCMs perfectly corresponds to "System 1" (exec