
Has AI reached its peak? OpenAI's chief research scientist says no, and the industry is shifting from stacking compute to pursuing intelligence density.

硅基观察Pro · 2025-12-01 08:12
AI isn't slowing down. You just don't understand it.

Has artificial intelligence reached its peak? Claims that AI progress is slowing down have come up repeatedly over the past year.

Lukasz Kaiser, a co-author of the Transformer paper, chief research scientist at OpenAI, and one of the key originators of reasoning models, recently offered the opposite view on the "Mad" podcast.

He argues that AI development has not slowed down; it is still accelerating along a stable, continuous exponential curve. The "stagnation" the outside world perceives actually stems from a change in the form of breakthroughs: the industry has shifted from simply building ever-larger models to building smarter models that can think.

In his view, pre-training is still crucial, but it is no longer the only engine. The emergence of reasoning models is like adding a "second brain" to base models, enabling them to derive, verify, and self-correct rather than merely predict the next word. At the same cost, capabilities leap further and answers become more reliable.

However, the "intelligence topographic map" of AI is still extremely uneven. Lukasz admitted that the most powerful models can solve Olympiad math problems but may not be able to count objects in a children's puzzle; they can write code better than professional programmers but may still misjudge the spatial relationship in a photo.

At the same time, the new paradigm has brought new business realities. With hundreds of millions of users, cost-efficiency now outweighs stacking compute. Model distillation has gone from optional to necessary: whether small models can reproduce the intelligence of large ones determines whether AI can truly reach everyone.

In this interview, Lukasz not only rejected the "AI slowdown" theory but also described a more refined, intelligent, and multi-layered future: base models are still scaling, the reasoning layer keeps evolving, multimodality awaits its breakthrough, and the efficiency war on the product side has only just begun.

Below is the edited transcript of the interview. Enjoy.

/ 01 / AI isn't slowing down; you just don't understand it

Host: Since the start of this year, there has been a view that AI development is slowing down, pre-training has hit its ceiling, and the scaling laws seem to have run their course.

But just as we were recording this episode, the industry saw a flurry of major releases. Models such as GPT-5.1, Codex Max, GPT-5.1 Pro, Gemini Nano Pro, and Grok-4.1 were unveiled almost simultaneously, which seems to undercut the "AI stagnation" theory. As experts working at the frontier inside AI labs, what signals of progress have you seen that the outside world misses?

Lukasz: The progress of AI technology has always been a very steady exponential improvement in capability; that is the overall trend. New techniques keep emerging, and progress comes from new discoveries, more computing power, and better engineering.

For language models, the emergence of the Transformer and of reasoning models are the two major turning points, and each development follows an S-shaped curve. Pre-training is in the upper part of its S-curve, and the scaling law has not failed: loss decreases log-linearly as computing power increases, which has been verified by Google and other labs. The question is how much money you need to invest and whether it is worth it relative to the benefits.
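
To make the "log-linear" claim concrete, compute scaling is commonly modeled as a power law in compute, which appears as a straight line in log-log coordinates. The parameterization below is a standard illustration, not a formula from the interview; L_inf denotes the irreducible loss and alpha a fitted exponent.

```latex
% Illustrative power-law form of compute scaling (standard parameterization, not from the interview).
% Loss falls as a power law in compute C, i.e. log-linearly:
L(C) \approx L_{\infty} + \frac{a}{C^{\alpha}},
\qquad
\log\bigl(L(C) - L_{\infty}\bigr) \approx \log a - \alpha \log C
```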

The new reasoning paradigm is in the lower part of its S-curve. At the same cost it yields more gains, because there is still a large number of discoveries waiting to be unlocked.

From ChatGPT 3.5 to today, the core change is that the model no longer relies only on memorized weights to produce answers; it can search the web, reason and analyze, and then give a correct answer.

For example, when asked something like "What time does the zoo open tomorrow?", the old version would make up an answer from memory. It might have read an opening time posted on the zoo's website five years ago and repeat that outdated information. The new version can access the zoo's website in real time and cross-verify.
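
The "look it up and cross-verify" behavior described here can be sketched as a simple tool-calling loop. The `model`, `search_web`, and step object below are hypothetical placeholders for illustration, not OpenAI's actual API.

```python
# Minimal sketch of a tool-using answer loop (hypothetical interfaces, not a real API).
def answer_with_tools(question, model, search_web, max_steps=5):
    context = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model.generate(context)            # model proposes a tool call or a final answer
        if step.tool_call == "search":
            results = search_web(step.query)      # fetch current information instead of trusting memory
            context.append({"role": "tool", "content": results})
        else:
            return step.text                      # answer grounded in freshly retrieved evidence
    return "Could not verify within the step budget."
```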

ChatGPT or Gemini already has many capabilities that are not fully appreciated. You can take a picture of something broken and ask how to repair it, and it will tell you; give it a university-level assignment, and it can complete that too.

Host: I really agree with that. There are indeed many obvious areas for improvement, "low-hanging fruit" that is easy to spot and address. For example, the model sometimes reasons inconsistently, makes mistakes when calling tools, or forgets long conversations. These are problems the industry has recognized and is working hard to solve.

Lukasz: Yes, there is a large number of glaringly obvious areas that need improvement. Most of them are engineering-level problems: lab infrastructure and code optimization. Python code usually works, but inefficiency affects the quality of the results. In terms of training methods, reinforcement learning (RL) is harder and trickier to do well than pre-training. Data quality is also a bottleneck.

In the past, we used raw Internet crawls like Common Crawl and had to invest a lot of work in cleaning and refining the raw web data. Now large companies have dedicated teams for data quality, but extracting high-quality data is still very time-consuming and labor-intensive. Synthetic data is on the rise, but every step of its generation, the choice of models, and the specific engineering all matter a great deal.

On the other hand, multimodal capabilities also face challenges. Models are currently far less mature at processing images and sound than at processing text. The direction of improvement is clear, but a substantial breakthrough may require training a new generation of base models from scratch, which means months of time and a huge investment of resources.

I often wonder how powerful these improvements can make the models. Perhaps this is an underestimated question.

/ 02 / AI learns to "self-doubt", and GPT starts to correct its own mistakes in advance

Host: I want to talk about reasoning models, because they are genuinely new. Many people still haven't fully grasped how they differ from base models. Can you explain the difference in the simplest terms?

Lukasz: Before giving its final answer, a reasoning model thinks things over in its "mind", forming a chain of thought, and it can use external tools such as search to clarify its thinking. That way it can actively look up information while it thinks and give you a more reliable answer. That is the ability visible on the surface.

Its more powerful aspect is that the focus of the model's learning is "how to think" itself, with the goal of finding better reasoning paths. Previously, models were trained mainly by predicting the next word, but that method is not very effective for reasoning, because the reasoning steps cannot be used directly to compute gradients.

So we now train it with reinforcement learning. It's like setting a reward and letting the model try repeatedly to work out which ways of thinking are more likely to lead to good results. This kind of training is much harder than the previous one.

Traditional training is not very picky about data quality and generally just runs, but reinforcement learning demands extra care, careful parameter tuning, and careful data preparation. A basic approach right now is to use data where right and wrong can be judged unambiguously, such as math problems or code, so models perform especially well in those fields. There are improvements in other fields too, but they are not as striking.
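
The "data where right and wrong can be judged unambiguously" idea boils down to rewards that can be checked automatically. The two helpers below are a rough illustration of such verifiable rewards; the file layout and the use of pytest are assumptions for the sketch, not OpenAI's pipeline.

```python
# Illustrative verifiable rewards for RL on reasoning (assumes pytest is installed
# and that test_file imports the written candidate module; not an actual training pipeline).
import subprocess

def math_reward(model_answer: str, reference_answer: str) -> float:
    # Reward 1.0 only if the final answer matches the known solution.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_source: str, test_file: str) -> float:
    # Reward 1.0 only if the generated code passes the task's unit tests.
    with open("candidate.py", "w") as f:
        f.write(candidate_source)
    result = subprocess.run(["python", "-m", "pytest", test_file, "-q"],
                            capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```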

How do we do reasoning with multimodality? I think that is just beginning. Gemini can generate images during the reasoning process, which is very exciting but still very primitive.

Host: A common view is that pre-training and post-training are separate, and that post-training is almost synonymous with reinforcement learning. But in fact, reinforcement learning has been involved with pre-trained models for a while; we just didn't frame it that way before.

Lukasz: Pre-trained models existed before ChatGPT, but they could not hold real conversations. ChatGPT's key breakthrough was applying RLHF to a pre-trained model. RLHF is reinforcement learning from human feedback: it trains the model by having it compare different answers and learn which options humans prefer.

However, if RLHF is overdone, the model may over-"please" the user, which makes its core seem fragile. Even so, it remains central to achieving conversational ability.

The current trend is a shift toward larger-scale reinforcement learning. Although the data scale is still smaller than for pre-training, it can build models that judge correctness or preference. For now this works in fields where evaluation is clear-cut, and it can be combined with human preferences for more stable long-term training, to keep the scoring system from breaking down.
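
The "compare different answers and learn which options humans prefer" step is usually implemented as a pairwise preference loss on a reward model. The sketch below shows the standard Bradley-Terry-style formulation; `reward_model` and its signature are placeholders, not OpenAI's code.

```python
# Pairwise preference loss for a reward model, as typically used in RLHF
# (standard formulation; reward_model is a hypothetical scorer returning tensors).
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # score for the answer humans preferred
    r_rejected = reward_model(prompt, rejected)  # score for the answer humans rejected
    # Maximize the probability that the chosen answer outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```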

Looking ahead, reinforcement learning is expected to extend to more general data and broader fields. The question is: do certain tasks really require a lot of thinking? Maybe they do, or maybe we need even more thinking and reasoning than we use now.

Host: To improve the generalization of reinforcement learning, is the key a better way of evaluating? For example, the evaluation across economic fields that you launched earlier to test performance in different scenarios. Is that kind of systematic measurement really necessary?

Lukasz: People usually think before they write. It's not as rigorous as solving a math problem, but there is always a rough plan. Models currently have difficulty fully simulating that process, but they have started to try. Reasoning ability can transfer: after learning to search the web for information, for instance, a model can apply that strategy to other tasks as well. However, models' training in visual thinking is still far from sufficient.

Host: How does the chain of thought work specifically? How does the model decide to generate these thinking steps? Are the intermediate inferences we see on screen the model's real, complete thinking process, or is there a longer and more complex reasoning chain behind them?

Lukasz: The chain-of-thought summary you see in ChatGPT is actually a condensed version of the complete thinking process, produced by another model; the original chain of thought is usually quite verbose. If, after pre-training, we simply ask the model to think step by step, it can indeed generate some reasoning steps, but that alone is not the key.

We can train it like this: first, let the model try various ways of thinking. Some attempts will reach correct results, and some will make mistakes. Then we select the thinking paths that led to correct answers and tell the model, "This is the kind of thinking you should learn." That is where reinforcement learning comes in.

This training has genuinely changed how the model thinks, and the effects are already visible in mathematics and programming. There is real hope it can be extended to other fields. Even when solving math problems, the model has begun to learn to catch its own mistakes before committing to them. This self-verification ability emerges naturally from reinforcement learning: essentially, the model has learned to question its own output and rethinks when it suspects it may be wrong.
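
In code, the "keep only the thinking paths that reached correct answers" idea looks roughly like the rejection-sampling-style loop below; `model.generate`, `trace.final_answer`, and `problem.reference` are hypothetical names, and the real training pipeline is certainly more involved.

```python
# Minimal sketch of collecting successful reasoning traces for further training
# (rejection-sampling flavor; hypothetical interfaces, not OpenAI's pipeline).
def collect_good_traces(model, problems, reward_fn, samples_per_problem=8):
    good_traces = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = model.generate(problem, temperature=1.0)     # sample a chain of thought plus answer
            if reward_fn(trace.final_answer, problem.reference) > 0:
                good_traces.append((problem, trace))             # keep traces that reached a correct answer
    return good_traces  # fine-tune on these: "this is the thinking you should learn"
```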

/ 03 / Pre-training is still a power-hungry behemoth, and RL and video models are frantically competing for GPU resources

Host: Talk about the transition from Google to OpenAI and the differences between the two cultures.

Lukasz: Ilya Sutskever was my manager at Google Brain. Later he left and co-founded OpenAI, and over the years he asked me several times whether I wanted to join. Then the Transformer came out, and then the pandemic hit: Google shut down completely and restarted very slowly.

Google Brain was a small team inside a large company, and its working atmosphere was very different from that of a startup.

Ilya told me that although OpenAI was still in its early stages, it was working on language models, which might be a good fit for my direction. I thought, "Okay, let's give it a try." Before that I had only worked at Google and in universities, so joining a small startup was indeed a big change.

Overall, I think different technology labs are more alike than people assume. There are of course differences between them, but viewed from a French university, the gap between a university and any technology lab is much larger than the gap between labs. Large company or startup, they resemble each other in that they have to deliver.

Host: How is the internal research team at OpenAI organized?

Lukasz: Most labs do similar work, such as improving multimodal models, strengthening reasoning, optimizing pre-training, or building infrastructure, and there are usually dedicated teams for each of these directions. People move around, and new projects get launched, such as diffusion models. Some exploratory projects grow in scale, like video models, which need more people.

GPU allocation is driven mainly by technical need. Pre-training currently consumes the most GPUs, so resources go to it first, but demand from reinforcement learning and video models is also growing rapidly.

Host: What will happen to pre-training over the next year or two?

Lukasz: I think pre-training has entered a stable stage technically. Investing more computing power still improves the results, and that is very valuable. The returns are not as dramatic as with reasoning techniques, but it genuinely strengthens the model and is worth continued investment.

Many people overlook a real-world change. A few years ago OpenAI was just a research lab, and all of its computing power was concentrated on training; it could build GPT-4 without hesitation. Now the situation is different: ChatGPT has a billion users generating a huge number of conversation requests every day, which takes a large amount of GPU resources to serve. Users are not willing to pay much per conversation, which forces us to develop more economical small models.

This change affects every lab. Once the technology becomes a product, cost has to be considered. We no longer chase only the largest models; we strive to deliver the same quality with smaller, cheaper models. The pressure to cut costs and raise efficiency is very real.

This has also brought renewed attention to distillation. By distilling the knowledge of large models into small ones, we can preserve quality while controlling cost. The technique has been around for a long time, but we only truly appreciated its value once we faced real economic pressure.
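
The classic distillation objective trains a small "student" to match a large "teacher" model's output distribution while still fitting the training labels. The snippet below is the textbook formulation for single-token logits; the temperature and mixing weight are illustrative choices, not OpenAI's production recipe.

```python
# Knowledge distillation loss: the student mimics the teacher's softened distribution
# and still fits the hard labels (logits shaped [batch, vocab] for simplicity).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # match the teacher's soft targets
    hard = F.cross_entropy(student_logits, labels)     # match the ground-truth labels
    return alpha * soft + (1 - alpha) * hard
```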

Of course, training ultra-large models is still important, because they are the basis for distilling high-quality small models. With the industry's continued investment in GPUs, a new round of pre-training progress can be expected. But in essence these shifts are adjustments along the same technological path, depending on the resources and needs of each stage.

The most important thing to see is this: pre-training is always effective and complements reinforcement learning. Running reasoning on a more powerful base model naturally yields better results.

Host: Modern AI systems have evolved by combining the work of many labs, RL, and many other techniques. In the deep-learning era, people often said they understood AI at the micro level, such as matrix multiplication, but they didn't fully understand what happens when those elements are combined. A lot of interpretability work has been done in the past few years, especially on complex systems. Is model behavior becoming clearer, or is there still a black-box element?

Lukasz: I think both views are reasonable. Fundamentally, we have made great progress in understanding models. A model like ChatGPT converses with countless people, and its knowledge comes from the entire Internet; obviously we cannot fully understand everything that happens inside it, just as no one can understand the entire Internet.

But we have indeed made new discoveries. For example, a recent OpenAI paper shows that if we make many of the model's connections sparse, removing the unimportant ones, we can much more clearly track its specific activity when it processes a task.

So if we focus on studying the inside of the model, we can indeed gain a lot of understanding. There is already plenty of research into models' internal working mechanisms, and our understanding of their high-level behavior has advanced a great deal. However, most of that understanding comes from smaller models. It's not that the findings don't apply to large models, but large models process so much information at once that our ability to understand them is ultimately limited.

/ 04 / Why can GPT-5 solve Olympiad problems but fail at a 5-year-old's math problem?

Host: I want to talk about GPT-5.1. What has actually changed from GPT-4 to 5 to 5.1?

Lukasz: That's a hard question. From GPT-4 to 5, the most important changes were the addition of reasoning ability and synthetic data, and a significant reduction in the cost of pre-training. By GPT-5 it had become a product used by a billion people. The team has been constantly balancing safety and friendliness so the model responds more sensibly to all kinds of questions, neither overly sensitive nor refusing at random. Hallucinations still exist, but they have improved a lot compared with before, through tool-based verification and training optimization.