
AI researcher TIAN Yuandong: The truth about "AI epiphany" and how large models learn to compress the world

Friends of 36Kr · 2025-10-31 18:38
Tian Yuandong on AI Epiphany: From Memory to Generalization, Reducing Data Requirement to O(M log M)

Meta CEO Mark Zuckerberg recently approved a layoff plan affecting roughly 600 employees in the AI department. It is the largest adjustment Meta has made to its artificial intelligence organization this year, and it mainly hits the company's core R&D groups.

Tian Yuandong, then the head of a team at Meta's FAIR, confirmed on X: "Some of my team members and I have also been affected by this layoff." As one of the core pillars of the research system under Meta Superintelligence Labs (MSL), his departure has drawn wide attention across the industry.

After the news broke, Tian Yuandong made his first public appearance in an exclusive in-depth interview with 'Class Representative Attention', a special contributor to Tencent Technology.

Facing the industry's doubts, Tian Yuandong set the record straight for his team: they made many substantial and important contributions to Meta's large-model development. The biggest challenge they faced, however, was not the technology itself but how to convince the product team.

The interview then turned to Tian Yuandong's recent research, in particular "grokking" in large AI models.

"Grokking", a term from science - fiction writer Robert Heinlein, means a profound understanding of the essence of things. A high score of a large - language model does not mean wisdom. The real critical point is the moment when it first learns to "think".

In September this year, Tian Yuandong published a single-author paper arguing that grokking is not a mysterious emergence but a computable energy-landscape dynamics.

  • Paper title: Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking
  • Paper link: arxiv.org/abs/2509.21519

Tian Yuandong's research points to a core breakthrough in how AI learns. In group-operation tasks, if the task size is M (for example, the number of vocabulary items or concepts), the traditional view is that the model must effectively see all M² combinations to learn the rule, so the data requirement grows quadratically with M. He proved mathematically that the model needs only O(M log M) samples to generalize, i.e., nearly linear growth. For M = 1000, that means roughly 7,000 samples instead of the million-scale coverage previously assumed.
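
To make the contrast concrete, here is a quick sanity check of those numbers. This is only a sketch: the O(·) notation hides constant factors and the base of the logarithm, so using the natural log with a constant of 1 is an assumption on my part, not a figure from the paper.

```python
import math

M = 1000
pair_coverage = M * M            # "see every combination": 1,000,000 samples
near_linear = M * math.log(M)    # M * ln(M) is about 6,908, i.e. roughly 7,000 samples

print(f"M^2 coverage:   {pair_coverage:,}")
print(f"M log M budget: {round(near_linear):,}")
```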

This means AI does not need to learn by brute force, "seeing the whole world". Like humans, it can grasp the deep structure from very few samples, which gives efficient training a theoretical basis in an era of limited data.

In this interview, Tian Yuandong walked through the grokking research and the key to AI learning that it reveals: the internal mechanism by which large models transition from "memory-based fitting" to "structured generalization".

Tian Yuandong also revealed that AI contributed a lot to this paper: some of the ideas came out of his conversations with GPT-5. He joked: "This sounds a bit like self-play. But in the conversation you have to give it some insights and thoughts of your own, and then it produces different outputs."

The core viewpoints of this interview are as follows:

  • Grokking reveals the mathematical mechanism behind the shift from memory to generalization. The transition is not a mysterious emergence but an optimization dynamics: when data is scarce, the "memory peak" dominates; as data grows, the "generalization peak" rises. Once the generalization peak is even slightly higher, the parameters collectively cross over, and the epiphany appears (a minimal toy reproduction of this dynamic is sketched after this list).
  • Representation learning is the foundation of all intelligent abilities. Whether it is chain-of-thought reasoning or intuitive judgment, everything ultimately depends on how the model "represents" and "understands" the world. Just as mathematical induction replaces exhaustive enumeration, the real leap comes from a change in the representation itself.
  • The loss function is only a proxy signal for optimization. Its role is to generate a suitable gradient flow that pushes the representation in the right direction. Different loss functions can learn similar representations if they induce similar gradient structures; the objective itself is not the goal but a "computable proxy" for it.
  • Black-box scaling emphasizes adding parameters and tuning configurations, which is efficient in the short term; understanding the mechanism pursues explanation and structure, and has a higher long-term ceiling. When data hits its limit and samples are scarce, the Scaling Law breaks down, and only mechanism-driven improvements can push past the limitation.
  • The essence of generalization is teaching the model to "compress" the world: to extract reusable structure from redundant memories. Real understanding has two criteria: giving correct answers in new situations, and reducing complex problems to simple, general logic. When the evidence and the inductive bias reinforce each other up to the critical point, the model "crosses the peak" and enters the generalized state.
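
The "memory peak versus generalization peak" dynamic in the first bullet can be reproduced on a toy task. The sketch below is not from Tian Yuandong's paper; it follows the standard grokking setup from the literature: train a small network on a subset of modular-addition pairs with strong weight decay, and watch test accuracy jump long after training accuracy saturates. The model shape, hyperparameters, and the `ModAddNet` name are illustrative assumptions, and whether or when the jump appears depends on the seed and settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97                      # modulus; plays the role of the task size M in the discussion
frac_train = 0.4            # fraction of the p*p pairs used for training

# Build all (a, b) pairs with labels (a + b) mod p, then split into train/test.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(p * p)
n_train = int(frac_train * p * p)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Tiny model: embed the two operand tokens, concatenate, pass through an MLP.
class ModAddNet(nn.Module):
    def __init__(self, p, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))
    def forward(self, x):
        e = self.emb(x)                # (batch, 2, d)
        return self.mlp(e.flatten(1))  # (batch, p) logits

model = ModAddNet(p)
# Strong weight decay is the usual ingredient that lets the generalizing solution win.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20000):              # full-batch training on the train split
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.2f}  test acc {accuracy(test_idx):.2f}")
```

With enough training pairs, test accuracy eventually jumps well after training accuracy has saturated; with too few, the run stays stuck at the memorization solution. That data threshold is exactly the behavior the O(M log M) result formalizes.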

What follows is the full interview, edited and organized by Tencent Technology without changing the original meaning:

01. Clarification after the Meta layoff event: Vindicating the team

Class Representative Attention: Recently, I've seen some news about your departure from Meta.

Tian Yuandong: Yes, now I'm relatively "free" and can do anything I want.

Class Representative Attention: Congratulations! I only noticed when preparing for this interview that you had worked at Meta for a full ten years.

Tian Yuandong: When I joined, the company had a bit over ten thousand people.

Class Representative Attention: Actually, Meta was not a small company at that time. I remember it went public in 2012, right?

Tian Yuandong: Yes, and now it's close to 80,000 people.

Class Representative Attention: We can start this interview by talking about your paper and also chat about your recent developments.

Tian Yuandong: That's fine, though I'd prefer to talk about the paper. The reason I spoke up on X recently is that I saw people speculating that our team had failed to deliver the results the company expected. I have to make this clear for my team: we did a lot of very important work, and the blame should not fall on us. That point must be stated plainly.

Class Representative Attention: Then what specific, key roles did your team play in large-model training?

Tian Yuandong: Our team was the first to discover key problems in the pretrained model design, such as chunk attention, and pushed solutions through to implementation, which effectively improved the stability of long-context RL. Other contributions include dataset generation and evaluation, and building and optimizing the RL infrastructure.

We also had in-depth discussions with several teams on the company side about design problems in the large-model architecture. At first, communication was very difficult, because they felt these problems were not serious, or were not problems at all.

Since we came in as a research team, the teams responsible for actually building the large models naturally trusted their own judgment more. So we could only verify things through a large number of experiments and use data and results to prove that our judgments and insights were correct. In the end, the facts showed that these problems did exist, and they accepted our conclusions. That whole process reflects the real value of our team.

We also worked through many difficulties in large-model training, for example how to make training at long context lengths more stable. In the process we solved the blow-up problem that commonly appears in training. Although these technical results were not directly visible in the official release, they laid a solid foundation for subsequent model R&D.

You could say our team played the role of "unsung heroes": not in the spotlight, but doing the connective, foundational work at the key links.

02. The core value of a researcher is insight, but the real difficulty is convincing others

Class Representative Attention: About the problems you just mentioned, I'd like to dig into two things:

First, as a research team you were not fully trusted. Was that because you lacked direct experience training large models, or for other reasons? What was the large-model team you were talking to like? Did they have deep experience in large-model training?

Second, how were you able to spot the problems so quickly once you came into contact with the large models on the product side?

Tian Yuandong: They do have very rich overall experience, but there were bugs in some of their experiments that led to wrong judgments. Our team was not directly involved in training the very largest models, but we had long been doing research on large models and had published many papers.

I had personally done research on Sparse Attention and was quite familiar with how the attention structure works and why it matters, so as soon as I saw certain design details I could immediately tell where the problems were.

Of course, this kind of judgment is not unique to me; many researchers could spot the problems. The real difficulty is convincing others. We had to spend a lot of time and energy explaining and proving that the problems existed. Usually, only when the other team recognizes the seriousness of the problems during their own internal review does their attitude begin to change.

Class Representative Attention: In other words, even without directly training very large models, the intuition and experience built up in research can still help you quickly locate problems, judge where things have gone off course, and propose directions for correction.

Tian Yuandong: Yes. That is the core value of a researcher: even with only "sparse data points", you can infer the key conclusions and apply them to more complex problems. Conversely, someone with no insight who just runs experiments and tunes parameters endlessly is very easy to replace. A researcher's advantage lies in identifying structural problems from limited signals, and thereby avoiding a large amount of wasted computation and resources.

Class Representative Attention: You just mentioned "sparse data points". What exactly does that mean? Scattered results from different papers or experiments?

Tian Yuandong: You can think of it that way. For example, a novice may need to run ten thousand experiments to get ten thousand numbers, but that data is "dead": there is no structural analysis or summary behind it.

An experienced person can judge whether a direction is viable from twenty or even ten data points, or just from a portion of the training curve, and then cut losses in time and adjust course.

That is also why AI researchers tend to command high salaries: a single truly high-quality "insight" can save the trial-and-error cost of hundreds, thousands, or even tens of thousands of GPUs. GPUs are of course important, since they support larger-scale experiments and offer more opportunities to observe; insight and compute are complementary.

Class Representative Attention: You used two words just now, "experience" and "insight". I'd like to go deeper: what do you think insight actually is? Some people call it taste, others intuition. What's your view?

Tian Yuandong: You have to observe how a person thinks through conversation and questioning. Let me give an example: in a PhD qualifying exam, the professors keep asking questions around a topic (say, partial differential equations) until the candidate can clearly explain the connections between the key concepts and state the relationship between the two most central elements in the most concise language.

If someone can only recite definitions but cannot explain the principles, such as when A→B holds and when A→C holds, it means they have not formed a truly transferable mental model. The worst habit in research is "concept-stacking" without mastering the relationships between concepts and the conditions under which they apply.

Current large language models also generally lack this ability to extrapolate robustly from "very little data". That is exactly where humans still hold an advantage on some cognitive tasks.

03. How does the "epiphany" occur?

Class Representative Attention: This echoes my original motivation for this conversation. One of your research focuses is grokking: explaining how models transition from "memory-based fitting" to "structured generalization". Your paper is centered on this mechanism.

Tian Yuandong: Yes. Grokking provides a dynamic path to observe the transition from "uncompressible" to "compressible representation". Understanding this path helps us obtain generalizable representations and stronger models with fewer samples and more reliable training signals in an environment with limited data and computing power.

Class Representative Attention: The "epiphany" you mentioned is not just an ability at the level of a specific task but a more underlying mechanism: At a certain point in time, the model completes a reorganization of representation, just like "learning" something.

I noticed your earlier interview, and in my discussion with Denny Zhou on X about chain-of-thought we touched on similar phenomena. In theory, if the logical chain can be fully expressed, chain-of-thought should be solvable; in practice, models often need a large amount of data to approximate the solution, whereas humans can grasp the key point instantly. That gap seems related to the underlying mechanism above. If you had to define this ability, would you call it reasoning, or something else?

Tian Yuandong: More precisely, it happens in the shared underlying mechanism beneath reasoning and other tasks: representation learning.

As training progresses, the model's representation keeps evolving. At first it is closer to rote memorization; but once there is enough accumulation and enough connection, the structure suddenly "clicks", producing a turning point much like "read a book a hundred times and its meaning reveals itself". In primary school, for example, teachers may first ask students to memorize certain material; after a while, new knowledge links up with it, and meanings that were vague become clear. That is part of the epiphany.

Class Representative Attention: So whether it is chain-of-thought or intuitive judgment, it ultimately comes down to the underlying mechanism of "how I represent and understand the world"?

Tian Yuandong: Yes. For example, primary-school students may solve a problem by exhaustive enumeration, while in secondary school mathematical induction is introduced, and a single concise proof covers infinitely many cases. The "representation" behind the method has fundamentally changed. The key difference in how neural networks learn likewise lies in the representation.
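
As a concrete illustration of that shift (my example, not from the interview), take the identity $1 + 2 + \cdots + n = n(n+1)/2$. Enumeration checks one value of $n$ at a time and never finishes; induction covers every $n$ with two short steps:

$$
\underbrace{1 = \tfrac{1 \cdot 2}{2}}_{\text{base case, } n = 1}
\qquad
\underbrace{\frac{n(n+1)}{2} + (n+1) = \frac{(n+1)(n+2)}{2}}_{\text{inductive step, } n \to n+1}
$$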

04. Two research paths: the Scaling Law and mechanism understanding, and choosing the harder of the two

Figure: neural networks achieve the best generalization by finding the "shortest program" (the most concise model) that can fit the training data. Image source: Class Representative Attention

Class Representative Attention: Ilya Sutskever raised two questions in his 2016 MIT talk: why does backpropagation work, and is the theoretically optimal hypothesis space equivalent to short programs? Do you mean that the model originally had to take many paths, then suddenly found a more efficient connection, achieved compression, and thereby gained stronger generalization?

Tian Yuandong: Yes, "compression" is a popular but appropriate way to describe it. However, we still don't know when compression is possible and when it isn't.

This is exactly the significance of researching Grokking: It provides a dynamic path to show how the system transitions from an "uncompressible" state to a "compressible" state.