
A Transformer co-author reveals the inside story of GPT-5.1, and OpenAI's internal naming rules have become chaotic.

量子位 | 2025-12-01 09:21
The underlying paradigm of AI is undergoing a transformation.

We are experiencing a quiet but fundamental transformation in the AI paradigm.

Its significance is no less than that of the Transformer itself.

In the past year, two opposing views of AI development have emerged:

  • On one hand, the claim that “AI growth has slowed down, models have peaked, and pre-training is useless.”
  • On the other hand, a run of frequent “AI big weeks”: GPT-5.1, Gemini 3, Grok 4.1.

Recently, Łukasz Kaiser, one of the authors of the Transformer and currently a research scientist at OpenAI, gave a first-hand account in an interview.

The interview is extremely informative, covering topics such as the underlying paradigm shift in AI, the naming rules of GPT-5.1, the future development trends of AI... and some anecdotes behind the birth of the Transformer.

  • AI is not slowing down; it is undergoing a generational change.
  • GPT-5.1 is not just a simple minor version update; the internal version naming rules at OpenAI have changed.
  • Multimodal reasoning will be the next breakthrough point.
  • AI will not make humans completely unemployed.
  • Household robots are the most visible AI revolution after ChatGPT.

Let's take a look at the detailed content below:

AI development is not slowing down but growing steadily

In the past year, there have been constant voices claiming that “model progress has slowed down,” but Łukasz believes this view is incorrect.

His explanation is straightforward:

From an internal perspective, the growth of AI capabilities follows a very smooth exponential curve.

This is similar to Moore's Law, which has held for decades and has even accelerated with the rise of GPUs, ultimately because it has run through several generations of technological iteration.

Likewise, from an external perspective the trend of AI is stable, while from an internal perspective its progress depends on the combined effect of new techniques, improvements in computing power, and engineering optimizations.

As for why some people think it has “slowed down,” there is only one reason: the underlying paradigm of AI has quietly shifted from pre-training to reasoning models.

This is another crucial turning point after the birth of the Transformer.

If the process of technological development is described as an S-shaped curve (startup → rapid growth → stable period), then pre-training is in the later stage of the upward slope of the S-curve, while reasoning models are still in the initial stage.
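To make the S-curve point concrete, here is a minimal sketch (illustrative only, not from the interview) using a standard logistic function: its growth rate is still large early on the curve and shrinks near the plateau, which is why a late-stage paradigm can look like it is "slowing down" even though the curve itself is smooth.

```python
import math

def logistic(t, k=1.0, t0=0.0):
    """A standard S-curve: slow start, rapid middle, flat plateau."""
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

def growth_rate(t, k=1.0, t0=0.0):
    """Derivative of the logistic; it peaks at the midpoint t0."""
    f = logistic(t, k, t0)
    return k * f * (1.0 - f)

# Late on the curve (like pre-training): gains per unit of effort shrink.
print(round(growth_rate(3.0), 3))   # ~0.045
# Early on the curve (like reasoning models): gains are still large and rising.
print(round(growth_rate(-1.0), 3))  # ~0.197
```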

However, this does not mean that the Scaling Laws of pre-training have become invalid. They still play a role, but compared with the new reasoning paradigm, they require more capital investment.

So, for economic reasons, industry insiders have broadly shifted their focus to smaller, cheaper models of the same quality. This is also one reason the outside world thinks pre-training has stopped.

As for reasoning models, since this paradigm is still in its emerging stage, progress will be quite fast.

Taking ChatGPT as an example, GPT-3.5 would directly give answers based on the memory of training data without using any external tools or reasoning. In contrast, the latest ChatGPT will actively browse websites, conduct reasoning and analysis, and then give accurate answers.

For ordinary users, if they don't make a careful comparison, they may think the difference between the two is not significant. But in fact, there is a qualitative leap in performance behind this.

Another example is Codex. In recent months, programmers' workflow has shifted to “let Codex take a first pass, then fine-tune by hand.” This change is actually quite radical, but if you are not a professional programmer, you naturally won't notice this fundamental shift.

So, generally speaking, all these changes have happened so fast that people haven't even noticed them.

The essence of reasoning models is actually similar to that of large foundation models. The only difference is that before giving the final answer, they will think first, which is the so-called chain of thought.

During the thinking process, the model is allowed to use tools, such as browsing the web, to give more accurate answers. The reasoning process itself is also treated as part of the model's output and is trained.
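As a rough illustration of this think-then-act loop, here is a toy sketch. `model_step` and `web_search` are hypothetical stand-ins (not OpenAI's actual interfaces); the point is that tool calls are folded into the chain of thought before the final answer.

```python
def web_search(query: str) -> str:
    """Stub tool: a real system would actually browse the web here."""
    return f"(search results for: {query})"

def model_step(question: str, thoughts: list[str]) -> tuple[str, str]:
    """Stub policy: pick an action ('search' or 'answer') and its argument.
    A real reasoning model generates this as part of its chain of thought."""
    if not thoughts:
        return "search", question   # no evidence yet: gather some first
    return "answer", "42"           # evidence collected: commit to an answer

def reason(question: str, max_steps: int = 5) -> str:
    thoughts: list[str] = []        # the chain of thought, kept as model output
    for _ in range(max_steps):
        action, arg = model_step(question, thoughts)
        if action == "search":
            thoughts.append(web_search(arg))  # tool result joins the trace
        else:
            return arg                        # final answer after thinking
    return "no answer within budget"

print(reason("What is the answer to everything?"))
```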

Compared with traditional deep neural network gradient-descent training, reasoning models use more reinforcement learning.

Specifically, reinforcement learning pushes the model toward better answers through a reward mechanism. It also requires researchers to prepare data in more detail in order to tune the reinforcement-learning setup.

Then, through reinforcement learning, the model can learn to correct its own mistakes.
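To make the reward mechanism concrete, here is a toy REINFORCE-style sketch on a three-option answer task. The setup and numbers are made up for illustration; it is not OpenAI's training procedure, only the basic idea of pushing up the probability of rewarded answers.

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.zeros(3)   # toy policy: softmax over three candidate answers
correct = 2            # index of the correct answer (made up)
lr = 0.5

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=probs)                 # sample an answer
    reward = 1.0 if a == correct else 0.0      # the reward mechanism
    grad = -probs                              # REINFORCE: grad of log-prob
    grad[a] += 1.0                             # = onehot(a) - probs
    logits += lr * reward * grad               # reinforce rewarded answers

print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))
# Nearly all probability mass ends up on the correct answer.
```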

In the future, the industry will continue to shift to more complex reinforcement learning, such as using a large model to judge the correctness or preference of answers, or integrating more human preferences.

In short, the scope of reinforcement learning will broaden. It will not only apply to specific fields but also handle more general data, such as multimodal reasoning. Although Gemini has recently been able to generate images during the reasoning process, this is overall still at an early stage, and further improvements are expected with the help of reinforcement learning.

GPT-5.1 is by no means a simple minor version update

Regarding the recently released GPT - 5.1, Łukasz also revealed more details.

Although GPT-5.1 looks like just a minor version update, from an internal perspective it is in fact a huge stability iteration.

First, looking back at the transition from the original GPT-4 to GPT-5: simply put, thanks to the application of reinforcement learning and synthetic data, GPT-5's reasoning ability improved significantly.

The improvements in GPT-5.1 are concentrated in the post-training stage, such as improving safety, reducing hallucinations, and adding multiple style options such as nerdy and professional.

Version naming is no longer tied to technical details but is instead oriented toward user experience. For example, GPT-5 is the model with stronger base capabilities, GPT-5.1 is an improved version of it, Mini is a smaller, faster, and cheaper model with slightly weaker performance, and the reasoning model focuses on complex tasks.

This change in the naming method has also given OpenAI more internal flexibility. Multiple projects, such as reinforcement learning, pre-training, and slide optimization, now run in parallel, and through distillation the results of multiple projects can be integrated into one model.
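As a sketch of what the distillation step could look like in its classic soft-label form (the temperature trick and the logits below are illustrative assumptions, not details from the interview):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer targets."""
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence from teacher to student distributions: the student
    is trained to match the teacher's softened output."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Made-up logits for one token: the loss shrinks as the student matches.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([1.0, 1.0, 1.0])
print(round(distill_loss(teacher, student), 4))
```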

This has greatly shortened model iteration time and better serves user-experience needs. So although GPT-5.1 seems to be a minor version update, it is in fact a strategic adjustment OpenAI made based on users' expectations of its capabilities and goals.

However, to be honest, GPT-5.1 still has some shortcomings in certain capabilities.

For example, Łukasz gave an example using his 5-year-old daughter:

GPT-5.1 can handle Olympiad problems with ease, but it makes a lot of mistakes on odd-even questions meant for first-grade primary school students.

The picture shows two groups of dots with one shared dot in the middle, and the question is whether the total number of dots is odd or even.

A 5-year-old child can calculate the answer within 10 seconds (because the existence of the shared dot makes the total number of dots odd), but both GPT-5.1 and Gemini 3 will automatically ignore this shared dot and misjudge it as an even number.
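For concreteness, the arithmetic the puzzle turns on (the group sizes here are made up, since the interview doesn't give them):

```python
# Two groups of dots that share one dot in the middle: the shared dot
# must be counted once, not twice.
left, right, shared = 6, 6, 1
total = left + right - shared                 # 11, not 12
print(total, "odd" if total % 2 else "even")  # -> 11 odd
```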

This is mainly because the model lacks sufficient multimodal capabilities and fails to transfer the reasoning experience of one problem to similar scenarios. Therefore, in the future, they will further strengthen multimodal reasoning and context-based reasoning transfer capabilities during training.

From Google's Transformer to OpenAI

As one of the authors of the Transformer, Łukasz also added many details about its birth in the interview.

Łukasz himself was originally a scholar focused on theoretical computer science. He had been interested in mathematics and computers since high school and earned a doctorate in theoretical computer science and mathematics in Germany.

He has always been curious about questions such as “how does thinking work” and “what is the essence of intelligence.” He also obtained a tenured position in France and engaged in research on logic and programming.

It wasn't until the rise of deep learning that he joined Google.

He first became a member of Ray Kurzweil's team and then transferred to Google Brain, where he began to cooperate with Ilya Sutskever and others.

During the development of the Transformer, Łukasz was mainly responsible for coding and system work and participated in the development of the TensorFlow framework.

Interestingly, according to his recollection, the eight co-authors of the Transformer paper never appeared together in the same physical room.

Although they were never all together in person, they jointly built the model from different angles:

Some focused on the attention mechanism itself, some studied how to store knowledge through feed-forward networks, and others, like himself, were responsible for solving engineering implementation problems.

From today's perspective, the Transformer is undoubtedly a milestone in AI architecture. But at the time, many people did not understand the idea of using the same model to handle multiple tasks; the general belief was that different tasks should be trained with different dedicated models.

However, the eight of them stood firm in their choice, and later events proved them right.

One of the reasons he left Google and joined OpenAI was Ilya.

Ilya was Łukasz's direct supervisor at Google. After co-founding OpenAI, he repeatedly invited Łukasz to join. As it happened, Łukasz was struggling to adapt to the expansion of the Google Brain team and its remote-work atmosphere, so the two hit it off and he came to OpenAI.

OpenAI didn't disappoint him. There is no strict organizational structure; teams form spontaneously around projects, are adjusted flexibly as projects progress, and expand gradually only as a project matures.

Of course, there will also be resource competition between different projects, because the GPU resources inside OpenAI are limited.

From a technical perspective, pre-training currently consumes the most GPU resources, followed by reinforcement learning and video models. Resource allocation is largely determined by technical requirements.

So, competition is inevitable, and Łukasz himself is no exception.