
"First Principles Thinking of Large Models": A Record of the Conversation between LI Jianzhong, GPT-5, and Lukasz Kaiser, the Inventor of Transformer

CSDN | 2025-10-13 18:43
Li Jianzhong had a conversation with Lukasz Kaiser to discuss large models, reasoning paradigms, and the future of AI.

Dialogue Guests | Jianzhong Li, Lukasz Kaiser

Large intelligent systems are developing so rapidly that it is almost impossible to keep pace. New architectures and models appear in a constant stream, and each iteration can reshape the industry landscape. Sora 2, just released by OpenAI, is the latest example of this rapid evolution: it demonstrates not only the improvement of model capabilities but also how fast and complex the iteration of intelligent systems has become.

In this technological wave, "AI Evolution", the in-depth dialogue column planned by CSDN, is dedicated to analyzing the essence and development trajectory of cutting-edge technologies and presenting the industry's thinking and practice to the public. Recently, Jianzhong Li, Dean of the Singularity Intelligence Research Institute and Senior Vice President of CSDN, and Lukasz Kaiser, a senior research scientist at OpenAI, held an in-depth dialogue on "First-Principles Thinking of Large Models" for this column.

Lukasz Kaiser is one of the most influential scientists in the field of AI. In 2017, he co-wrote the groundbreaking paper "Attention Is All You Need" with seven Google colleagues (the group later became known as the "Transformer Eight"). That paper proposed the Transformer architecture, which laid the core foundation of today's large language models. He later joined OpenAI, where he has played a leading role in the research on GPT-4, GPT-5, and the reasoning models codenamed "o1" and "o3". As an AI researcher whose work has changed the world and directly defined the large language model technology we know today, he has an understanding of large-model architecture, the boundaries of the Scaling Law, and the new paradigm on the road to AGI, the reasoning model, that few others can match.

Facing such a leader at the technological frontier, Jianzhong Li, drawing on his own deep insight into model architecture, Agents, the Scaling Law, and future paradigms, posed sharp and penetrating questions. Their exchange is not only an analysis of technical details but also a bold projection of future directions.

The following are the ten most important topics of this dialogue:

  1. Dialogue 1 | What exactly does language mean for intelligence?
  2. Dialogue 2 | Challenges of multimodality and world models
  3. Dialogue 3 | AI programming: Is natural language the ultimate goal or a new "Tower of Babel"?
  4. Dialogue 4 | The generalization dilemma of Agents: Is it a method problem or a fundamental limitation?
  5. Dialogue 5 | Computing power and algorithms: Is the Scaling Law a belief or path dependence?
  6. Dialogue 6 | Challenges of embodied intelligence: Is it a data problem or a fundamental difference between bits and atoms?
  7. Dialogue 7 | Reinforcement learning: Is it a super-optimizer or an engine for scientific discovery?
  8. Dialogue 8 | Organizational leap of AI: How to achieve large-scale Agent collaboration?
  9. Dialogue 9 | Bottleneck of AI memory: How far is the model from true "native memory"?
  10. Dialogue 10 | How can large models get rid of transient learning and learn continuously like humans?

At this critical moment in the development of AI, we believe the in-depth deliberation on cutting-edge issues in this dialogue will offer important references and inspiration for understanding the next stage of AI's development.

Dialogue 1 | What exactly does language mean for intelligence?

Jianzhong Li: I'd like to start by talking about the roles of language and vision in AI. There are some views in the industry, represented by Yann LeCun and others, that using language models to reach AGI is a dead end. The reason is that language is a low-bandwidth, lossy description of the physical world; AI must learn from high-bandwidth data such as vision. However, if we look back at the development history of AI, before the emergence of large language models, neural networks had been widely used in the field of vision, but the intelligence level of AI at that time was quite low. It was not until the emergence of large language models like ChatGPT that the intelligence of AI really took off. What do you think of the roles of language and vision in the process of building intelligence?

Lukasz Kaiser: I think it is very useful to understand language from the dimension of time. There is a famous saying, although I've never verified its authenticity: there is an animal (the sea squirt) that swims in the sea and has a brain. But when it settles on a rock and never moves again, the first thing it does is eat its own brain, because a brain is useless for a non-moving creature. This story shows that if you don't take action, intelligence is actually not very useful.

Most of the vision models we used to talk about were static, such as answering questions like "Is there a cat in this picture?" There were no real video models at that time. Therefore, I believe that existence in the time dimension - which may mean taking action, even if it's just explaining changes over time - is crucial for intelligence. Language obviously has a time dimension. It is always generating the next word, and then the next, continuously.

What we now call language models were called sequence models when we developed the Transformer. It doesn't matter what kind of sequence it processes. Even now, it can process "protein sequences" or "audio sequences". Therefore, time sequences are an important part of expressing intelligence.
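To make "generating the next word, and then the next" concrete, here is a minimal, purely illustrative sketch of the autoregressive loop that such sequence models run. The toy_model stand-in below is invented for the example and is not the Transformer itself, but the same loop applies whether the tokens are words, protein symbols, or audio codes.

```python
# A minimal sketch of autoregressive sequence generation: extend the sequence
# one token at a time along the time axis. model_step is a stand-in
# (hypothetical) for a real Transformer forward pass.
from typing import Callable, List

def generate(prompt: List[int],
             model_step: Callable[[List[int]], List[float]],
             max_new_tokens: int = 10,
             eos_id: int = 0) -> List[int]:
    """Greedily extend a token sequence one step at a time."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        logits = model_step(seq)              # scores for every possible next token
        next_id = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_id)                   # the sequence grows along the time dimension
        if next_id == eos_id:                 # stop when the model emits end-of-sequence
            break
    return seq

# Toy stand-in "model": always favours token (last_token + 1) modulo a tiny vocabulary.
def toy_model(seq: List[int], vocab: int = 5) -> List[float]:
    return [1.0 if i == (seq[-1] + 1) % vocab else 0.0 for i in range(vocab)]

print(generate([1], toy_model))  # e.g. [1, 2, 3, 4, 0]
```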

Jianzhong Li: Personally, I tend to think that language has been encoded and compressed by humans, and it is more efficient in representing intelligence than vision. Even videos with time sequences are often less effective in representing intelligence than language. Yuval Noah Harari proposed in his book "Sapiens: A Brief History of Humankind" that the biggest difference between humans and animals is that we can use language to describe things that don't exist in the world. The famous philosopher Ludwig Wittgenstein also had a famous quote: "The limits of my language mean the limits of my world." I once expressed the view that looking back at the past decade, the milestone development in the field of AI is due to our realization of the core role of language in intelligence. The success of ChatGPT and the Transformer both stem from this.

Lukasz Kaiser: I also believe that language is the key to endowing intelligence with a special power, although many animals without language have a certain level of intelligence, so intelligence can develop without language. Technically, it is very convenient to train with language: we have a vast amount of language data on the Internet, and training on language is much cheaper than on video. Some of these advantages are at the practical level. In the future, to obtain better intelligent models, we will still need to train on video and audio. Technically, this will be different from pure language models, but on the other hand, sequence processing and attention mechanisms also apply when processing such data.

Jianzhong Li: Some people think that current large language models are just "parroting", and they believe that the models don't really understand the text they learn and generate. But if we carefully observe the learning mechanism of large models, it is very similar to the human learning process. For example, a paper by Anthropic in March showed that when a model is trained on language, it will form "abstract concepts" internally. The paper talked about how a model learns words in different languages, such as "apple". It creates an independent "abstract concept of apple" in the neural network that is not bound to any specific language. And during the training process, the model was never explicitly taught an "abstract concept of apple". This seems very similar to the process of humans building a complex system of abstract concepts in the brain when learning a language.

Lukasz Kaiser: We can now prove in practice that language models do form concepts, especially since current models are trained in multiple languages in parallel, which is easy to observe. You can give a model a math problem and rephrase it in five different languages. Although the model generates answers token by token, and the tokens of different languages are completely different and have nothing in common, the answers are basically the same. If the model makes a mistake in English, it will make the same mistake in Chinese. If the model uses a certain problem-solving method, then the answer in another language is basically a translation of the previous answer.
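This observation is easy to try yourself. The sketch below assumes the OpenAI Python SDK's chat interface; the model name and the sample prompts are placeholders, not anything prescribed in the conversation. If the model reasons in a language-independent space, the numeric answer and even the solution strategy should agree across languages.

```python
# Pose the same arithmetic problem in several languages and compare the answers.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = {
    "en": "A train travels 120 km in 1.5 hours. What is its average speed? Answer with a number in km/h.",
    "zh": "一列火车在 1.5 小时内行驶了 120 公里，它的平均速度是多少？请用 km/h 的数字回答。",
    "fr": "Un train parcourt 120 km en 1,5 heure. Quelle est sa vitesse moyenne ? Répondez par un nombre en km/h.",
}

answers = {}
for lang, question in prompts.items():
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    answers[lang] = resp.choices[0].message.content.strip()

# If the reasoning happens in an abstract, language-independent space,
# these answers should all come out to 80 km/h.
for lang, answer in answers.items():
    print(lang, "->", answer)
```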

This clearly shows that somewhere in the activation state of the network, the model is solving problems and thinking about concepts in a very abstract space, and then expressing them in a particular language in the upper-layer network. In this sense, there are clearly abstract concepts independent of language in the model, and some people have already studied this. You can even see concepts corresponding to specific topics or behaviors.

But we should also remember that, at least for models that have not been trained with a large amount of multimodal data, they may not have concepts for certain physical experiences in the way humans do. Take concepts like "pain" or "love": the model knows these words and can tell you beautiful stories, but this is different from our concepts, which are rooted in real feelings in the physical world.

So, the model does have concepts, but we should also understand that at least some of these concepts may be different from our human concepts. Although the words used by the model seem similar because they come from our language and the Internet, it doesn't mean that their connotations are exactly the same. In many fields, such as mathematics, this difference may not matter much. Because mathematics is also very abstract to us, and we mainly learn it through symbols and pictures, just like the model. But in things closely related to the body and the physical world, the situation is a bit different. We may be confused by the model's words because it uses the same words as us, but the connotations are not exactly the same.

Dialogue 2 | Challenges of multimodality and world models

Jianzhong Li: Currently, multimodality is developing very rapidly. There is a trend in the industry to pursue "unified model, unified modality", that is, using one general architecture to handle all modalities and tasks. However, different modalities seem to suit different models: for example, language suits autoregressive models, while vision suits diffusion models. I noticed that in June 2017, the same month the "Transformer Eight" published "Attention Is All You Need", seven of you also published the paper "One Model To Learn Them All". Eight years later, how do you view the relationship between "unified modality" and "unified model"? What are the biggest challenges here?

Lukasz Kaiser: From a practical perspective, modern large language models like GPT-4 are already multimodal models. They can receive image and audio inputs and also generate images and audio. In a sense, I could say that we have solved this problem. But I also admit that the degree of transfer between modalities is still not satisfactory.

When the model is large enough and there is enough data, they can manage to complete multimodal tasks. You can enable the voice mode in ChatGPT, and it will talk to you. When necessary, it will also transcribe the voice into text, think, and answer, and even sing. So from a practical perspective, great progress has been made on this issue.

But I admit that when you look carefully at video, there are some unsatisfactory aspects. The current way large language models handle multimodality is usually through a VQ-VAE. Each part of an image or audio clip gets a special code through an encoder. This encoder is usually pre-trained and fixed; sometimes it may be trained together with the large language model, but the amount of such training is usually small, and the code rate is fixed: for audio, one code may correspond to a given stretch of time, and for images, to a given number of pixels. This method is effective, and we have successfully made it work. But it doesn't feel very satisfactory, because our eyes are not like sensors with a fixed resolution. Of course, in a sense they are, but we can move our eyes around to dynamically obtain information.
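As a rough illustration of the scheme described here, below is a simplified, hypothetical sketch of VQ-VAE-style tokenization: each image patch is encoded to a vector and snapped to the nearest entry of a fixed codebook, and the resulting indices are the discrete symbols a language model consumes. The patch size, codebook size, and stand-in linear encoder are all invented for illustration; real systems use learned neural encoders.

```python
# Simplified VQ-VAE-style image tokenization: one discrete code per image patch.
import numpy as np

rng = np.random.default_rng(0)

patch_size = 16          # 16x16 pixel patches, one code per patch (assumed)
codebook_size = 512      # number of discrete codes in the vocabulary (assumed)
code_dim = 64            # dimensionality of each code vector (assumed)

codebook = rng.normal(size=(codebook_size, code_dim))                # fixed, "pre-trained" codes
encoder = rng.normal(size=(patch_size * patch_size * 3, code_dim))   # stand-in encoder weights

def tokenize_image(image: np.ndarray) -> np.ndarray:
    """Map an HxWx3 image to a grid of discrete token ids, one per patch."""
    h, w, _ = image.shape
    tokens = np.empty((h // patch_size, w // patch_size), dtype=np.int64)
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = image[i:i + patch_size, j:j + patch_size].reshape(-1)
            z = patch @ encoder                              # encode the patch to a vector
            dists = np.linalg.norm(codebook - z, axis=1)     # distance to every codebook entry
            tokens[i // patch_size, j // patch_size] = int(dists.argmin())
    return tokens

image = rng.random((64, 64, 3))      # dummy 64x64 RGB image
print(tokenize_image(image))         # a 4x4 grid of code indices the LLM would consume
```

The point of the sketch is the fixed, uniform code rate: every patch becomes exactly one symbol, regardless of how informative it is, which is what makes the approach feel less flexible than a moving eye.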

So, I think we can integrate multimodality more deeply into the model. This requires the VQ-VAE codes we currently use to become more trainable and be able to interact more with language. There is excellent research going on in this area, and as people get more used to models handling multimodal tasks, it will promote the in-depth integration of this research into large language models.

Jianzhong Li: I don't understand why many vision-oriented researchers often deny the importance of language. As you said, interacting with language is very important for multimodality. Without language, vision seems to be just some pixel signals. Language plays an important role in assigning semantic meanings to each object in vision. Personally, I think if some vision-oriented researchers continue to deny the value of language in intelligence, they may fall into the wrong path again, just like before the release of ChatGPT in 2022. At that time, the vision-oriented approach was very popular, but recognition ability is a very low-level ability in intelligence. True cognition and understanding seem to be inseparable from language.

Now let's talk about world models. Some scholars, including Yann LeCun and Fei-Fei Li, believe that it is impossible to achieve artificial general intelligence (AGI) with large language models, because they see the world model as the core of AGI. They think AI must first learn the rules of the physical world by observing the world before it can really reason. However, I highly doubt whether AI can understand all the laws of the physical world just by observing it.

Lukasz Kaiser: I believe that modern large language models are, to some extent, world models. The question is, are they good enough world models? To answer this question, we need to ask ourselves what aspects of the world they should describe?

I think when it comes to text and mathematics, they are amazing models. If you ask "What's the next word?", they are almost unparalleled excellent language models and can accurately tell you what people usually say after this sentence on the Internet. But their performance as physical models is not as good as their performance as language models. There are several reasons behind this.

First, as we said, they have not been trained on enough video data. Second, the video data formats commonly used on our computers are very different from the way we experience the world, because we also take actions and move our eyes. Our experience is never like pure images playing in front of us. Maybe it is like that in early infancy, but that stage passes quickly. So, both the quantity and quality of the data are not good enough. Moreover, as I said before, I think the current architecture is still not sufficient for this, although the multimodal capabilities of large language models have been steadily improving and I think they will continue to improve.

So I think improving the architecture and loss function, along with better and more data, will help bridge the gap between what people think of as a "world model" and a "language model". In addition, models like Sora, Genie, and Veo show that if you learn from videos, even using current methods, you can get very close to a world model. Maybe it's not completely there yet, and in terms of data efficiency, the learning process is definitely not as efficient as humans, but we are making significant progress in bridging the gap.

Jianzhong Li: Personally, I feel that a real world model needs to integrate language models and other modalities, as well as language-based reasoning. Simply observing the world cannot form intelligence. Just as before the scientific revolution of the 16th-17th centuries, people who only observed the world could come away with the wrong idea that "the earth is the center of the universe". Today every educated child knows that the sun, not the earth, is the center of the solar system. This obviously cannot be obtained by simply observing the world, but through training based on text.

Dialogue 3 | AI programming: Is natural language the ultimate goal or a new "Tower of Babel"?

Jianzhong Li: Let's talk about programming. AI programming seems to have become a killer application for large language models. When you created the Transformer architecture, did you think it could not only handle human language but also handle programming languages so well?

Lukasz Kaiser: Of course, Ilia Polosukhin, the co - inventor of the Transformer, even left Google before the publication of the "Attention Is All You Need" paper to found a company dedicated to automated programming. I almost became a co - founder of that company, but I thought the timing was a bit too early at that time. Later, the company successfully transformed into the cryptocurrency field, but it may return to the field of automated programming in the future. So, this was indeed within our expectations. Compared with foreseeing the emergence of products like ChatGPT so soon, we may have been more confident in the feasibility of automated programming at that time because it