A Half-Year Review of Large Models in 2025: o3, Agents, and Scaling Laws
At mid-year or year's end, scientists, entrepreneurs, and industry opinion leaders often publish reviews of and predictions for the fields they work in. In an era of large models where "one day in AI equals three years in the real world," such retrospectives and outlooks carry real reference value.
Recently, Nathan Lambert, a machine learning researcher and head of post-training at the Allen Institute for Artificial Intelligence, published a personal blog post digging into topics such as o3's search capability, the progress of agents and models, and the slowdown in scaling.
He wrote, "As the release speed of new models slows down, it's time to review what achievements we've made this year and the direction of future development."
Photo | Nathan Lambert
In his view, the distinctive search ability demonstrated by o3 shows that OpenAI has made a technical breakthrough in making search, and the use of other tools, reliable in reasoning models. "The best description I've heard of how relentlessly it pursues a specific piece of information is that it behaves like a 'well-trained hound that has caught the scent of its target.'"
He also argued that more AI models in the future will resemble Anthropic's Claude 4: the gains on benchmarks are small, but the progress in practical applications is significant. "Making minor adjustments to the model can make an agent like Claude Code feel much more reliable."
In addition, on the "basic stagnation" of the pre-training scaling law, he said that "new tiers of scale may arrive only every few years, or not at all," depending on whether the commercialization of artificial intelligence goes as smoothly as expected.
Nevertheless, he doesn't think that "pre-training as a science is no longer important." Gemini 2.5 is a counterexample.
Academic Headlines has refined the overall content without changing the original meaning, as follows:
Original link: https://www.interconnects.ai/p/summertime-outlook-o3s-novelty-coming
Summer has always been a relatively quiet stretch for the technology industry, and OpenAI seems to be following the pattern: its open model "needs more time" for polish, and the release of GPT-5 also appears to keep slipping. Both will obviously be major news, but I'm not sure we'll see them before August.
I'll use this brief lull in AI releases to review where we've been and look ahead to where we're going. Here's what you need to know.
o3: A Technological Breakthrough Beyond Scaling
Regarding OpenAI's o3 model, the mainstream take is that it "scaled up compute for reinforcement learning training," which produced some strange, brand-new over-optimization problems. That is true, and the launch presentation still marks the real breakthrough: scaling up the data and training infrastructure for Reinforcement Learning with Verifiable Rewards (RLVR).
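To make RLVR concrete, here is a minimal sketch of what a verifiable reward can look like: instead of a learned reward model scoring responses, a program checks the model's final answer against ground truth (or runs unit tests) and returns a binary reward. The answer format and function names below are illustrative assumptions, not OpenAI's actual implementation.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the answer out of \\boxed{...} if present, otherwise take the last
    non-empty line. (An illustrative convention, not a standard.)"""
    match = re.search(r"\\boxed\{(.*?)\}", completion)
    if match:
        return match.group(1).strip()
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """RLVR-style reward: 1.0 if the extracted answer exactly matches the
    ground-truth answer, 0.0 otherwise. No learned reward model is involved."""
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0

# This scalar reward is what gets scaled up with more data and infrastructure:
# during training it would be fed to an RL algorithm such as PPO or GRPO.
print(verifiable_reward("Reasoning...\nThe answer is \\boxed{42}", "42"))  # -> 1.0
```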
What hasn't been discussed much is how different o3's search experience feels. For an ordinary query, o3 can search dozens of websites. The best description I've heard of how relentlessly it pursues a specific piece of information is that it behaves like a "well-trained hound that has caught the scent of its target." o3 gives the impression of finding information in a way entirely unlike any existing model.
It's worth noting that several months have passed since its release in April 2025, and no other leading laboratory has launched anything comparable. In a context where releases across laboratories (especially OpenAI and Google) otherwise seem almost perfectly synchronized, o3's persistent search ability continues to impress me.
The core question is when another laboratory will release a model of the same quality. If the gap holds through the end of summer, it will confirm that OpenAI has made a technical breakthrough in making search, and the use of other tools, reliable in reasoning models.
For comparison, consider a basic problem facing the open and academic communities: how to build an o3-inspired model (even if its actual search ability ends up closer to that of GPT-4o or Claude 4). Two things matter:
1. Finding RL data that genuinely motivates the model to search is crucial. In RL experiments it's easy to get the model to try searching by way of the system prompt, but as training progresses, if the tool isn't useful enough, the model should quickly learn to stop calling it. OpenAI is very good at this, especially given its RL training experience with Deep Research (which, as I understand it, was trained on top of o3). A research paper showing extended RL training in the style of DeepSeek R1 that maintains consistent tool use across a large subset of the data would really impress me.
2. The underlying search index matters a great deal as well. OpenAI's models run on a Bing backend; Anthropic uses Brave's API, and the results are worse (lots of SEO spam). Building an academic baseline on these APIs adds some extra cost to experiments, but once a reliable open baseline exists, we can do some interesting science, such as studying which model generalizes best to unseen data sources, a key property when deploying models on locally sensitive data (for example, in medicine or banking). A minimal sketch of such a search tool follows this list.
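To give a sense of what an open baseline's search tool could look like, here is a minimal sketch that wraps a web-search API as a tool an agent can call during rollouts. The endpoint and header follow Brave's publicly documented Web Search API, but the result parsing, function name, and environment variable are assumptions for illustration; check the current API docs before relying on them.

```python
import os
import requests

BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"

def web_search(query: str, count: int = 10) -> list[dict]:
    """Call a web-search API and return simplified results.

    In an RL setup, the agent emits a search query as a tool call, and the
    returned snippets are appended to its context as an observation.
    """
    response = requests.get(
        BRAVE_ENDPOINT,
        params={"q": query, "count": count},
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": os.environ["BRAVE_API_KEY"],  # hypothetical env var
        },
        timeout=10,
    )
    response.raise_for_status()
    results = response.json().get("web", {}).get("results", [])
    return [
        {
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            "snippet": item.get("description", ""),
        }
        for item in results
    ]

if __name__ == "__main__":
    for hit in web_search("reinforcement learning with verifiable rewards"):
        print(hit["title"], "-", hit["url"])
```

The point of a thin wrapper like this is that the index behind it can be swapped out, which is exactly what makes comparisons across search backends, and generalization to unseen data sources, testable.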
If you haven't used o3 for searching, you really should give it a try.
Agent Performance Will Significantly Improve
The product-market fit of Claude Code (along with Claude 4) is excellent. It's a wonderful package: stable and efficient in operation, with a user experience (UX) that fits the domain perfectly. It's a real pleasure to use.
Given that, I've been looking for more ways to write about it. One problem is that I'm not in the core user group for Claude Code and other coding assistants (such as Codex and Jules). I don't often work in complex codebases; I'm more of a research manager and problem solver within an organization than a developer who lives in a single repository. So I don't have practical guidance on how to get the most out of Claude Code, nor the experience to help you "feel the AGI."
What I do know is models and systems, and some basic facts about frontier models make the trajectory of these agents look quite promising.
What's novel about LLM-based agents is that they involve many model calls, sometimes across multiple models and a variety of prompt configurations. Previously, the models behind chat windows were designed to complete a linear task and return the result to the user, without managing complex memory or environments.
Giving the model a real environment asks it to do more, and over a broader range of tasks. When building these agentic systems, there are two kinds of bottleneck:
(1) the model simply cannot do some of the tasks we want the agent to perform, and (2) the model fails on small details within the tasks it is deployed on.
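To see why the second kind of failure matters so much, consider a stripped-down agent loop (a hypothetical sketch; call_model and run_tool are placeholders, not any vendor's API): the task only succeeds if every one of many intermediate model calls and tool calls goes right.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Minimal agent memory: the growing transcript of one episode."""
    messages: list = field(default_factory=list)

def call_model(messages: list) -> dict:
    """Placeholder for an LLM call. Assume it returns either a tool request,
    e.g. {"tool": "search", "args": {...}}, or a final {"answer": "..."}."""
    raise NotImplementedError("wire up a model provider here")

def run_tool(name: str, args: dict) -> str:
    """Placeholder for executing a tool (search, shell, file edits) in an environment."""
    raise NotImplementedError("wire up tools here")

def run_agent(task: str, max_steps: int = 20) -> str:
    state = AgentState(messages=[{"role": "user", "content": task}])
    for _ in range(max_steps):
        action = call_model(state.messages)   # one of many model calls per task
        if "answer" in action:                # the model decides it is finished
            return action["answer"]
        observation = run_tool(action["tool"], action["args"])
        # Every intermediate step is a chance to fail on a "minor detail":
        # one malformed tool call or bad query derails the whole trajectory.
        state.messages.append({"role": "tool", "content": observation})
    return "gave up after max_steps"
```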
For agents that have already gotten off the ground, such as Claude Code and Deep Research, most of the problems they exhibit fall into the second category. Laboratories address this by finding recurring, odd failures in real-world usage, which might show up as only 50% reliability on some long-tail everyday task. In such cases, a lab can usually generate new data with relative ease and fold it into the model's continued training, pushing reliability on that subtask toward 99%. Because labs now get most of their gains from post-training rather than large-scale pre-training, the turnaround time for shipping these improvements is much shorter than it has been in recent years.
The key is how this all compounds. Many complex tasks can be blocked by a handful of small failures, so minor adjustments to the model can make an agent like Claude Code feel much more reliable even though the model's peak performance barely changes. The same goes for Deep Research.
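As a toy illustration of that loop (an assumed workflow, not any lab's documented pipeline), one could mine logged agent runs for a flaky subtask, keep the successful trajectories, and format them as continued-training data. The log schema, file name, and subtask label below are hypothetical.

```python
import json

def build_patch_dataset(runs: list[dict], subtask: str) -> list[dict]:
    """Keep successful examples of one unreliable subtask and format them as
    supervised pairs for continued (post-)training.

    Each run is assumed to look like:
      {"subtask": str, "prompt": str, "response": str, "success": bool}
    """
    return [
        {
            "messages": [
                {"role": "user", "content": run["prompt"]},
                {"role": "assistant", "content": run["response"]},
            ]
        }
        for run in runs
        if run["subtask"] == subtask and run["success"]
    ]

if __name__ == "__main__":
    with open("agent_logs.json") as f:   # hypothetical dump of real usage logs
        runs = json.load(f)
    dataset = build_patch_dataset(runs, subtask="long_tail_report_formatting")
    print(f"{len(dataset)} new training examples for the weak subtask")
```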
I therefore expect the agents we're already using to receive sporadic but significant performance boosts.
What I'm less sure about is when new agent platforms will emerge; that is partly a product question and partly a question of performance bottlenecks. The arrival of new agents that find product-market fit (PMF) may look somewhat random, but once a platform has PMF, it can ride the steady improvements in frontier models that we've grown used to.
This is a new path for the industry, and word of it will spread differently. More AI models in the future will resemble Anthropic's Claude 4: small improvements on benchmarks, but significant progress in practical applications. This trend has implications for policy, evaluation, and transparency. Judging whether technical progress is continuing will require more detailed analysis, especially as critics seize on stagnant benchmark numbers to claim that AI is no longer getting better.
As with o3, even if you don't program often, you should try Claude Code. It can quickly spin up fun demos and standalone websites, and compared with fully autonomous agents like Codex, it currently has a big advantage in ease of use.
Model Scaling Is Slowing Down
In 2025, most models released by the leading AI laboratories are no longer growing in total parameter count. Take Claude 4: its API pricing is the same as Claude 3.5's. OpenAI released GPT-4.5 only as a research preview. Gemini has yet to ship an Ultra tier. Larger, undisclosed models certainly exist inside these laboratories.
To be clear, many of these models are probably slightly smaller than their predecessors; Claude 4 Sonnet, for example, may be a bit smaller than Claude 3.5 Sonnet thanks to efficiency gains in pre-training. That kind of incremental progress has a big impact on price and inference speed, especially over the long run, but it isn't the core of my argument.
The key point is that GPT-5's capability gains will come mainly from scaling compute at inference time rather than simply from "a single bigger model." For years we were told that "the lab with the biggest training cluster will win because of its scaling advantage"; that is why Musk built xAI's giant cluster. Now, the biggest cluster only confers an advantage on the overall pace of R&D.
On the user-facing side, scaling model size has largely stopped paying off. Labs may return to it when they run into extremely hard problems that users need solved, but for now the economics are unconvincing: GPT-4.5's training compute was roughly 100 times that of GPT-4, yet its gains on everyday user-facing metrics were only marginal.
What we're seeing instead is a large-scale efficiency push at the model sizes users actually like, and the industry has settled into several standard tiers:
1. Tiny models, such as Gemini Flash-Lite or GPT-4.1 Nano;
2. Small models, such as Gemini Flash and Claude Haiku;
3. Standard models, such as GPT-4o and Gemini Pro;
4. Large models, such as Claude Opus and Gemini Ultra.
These tiers come with fairly predictable price points, latency, and capability levels. As the industry matures, such standards are crucial!
Over time, efficiency gains will give rise to new standard tiers; models on the scale of Gemini Ultra and GPT-4.5 (i.e., GPT-5) will become commonplace, but the path beyond that is not yet clear. For now, new tiers of scale may arrive only "every few years," or not at all, depending on whether the commercialization of AI goes as smoothly as expected.
Scaling stopped working as a product differentiator back in 2024, but that doesn't mean pre-training as a science no longer matters. The recent Gemini 2.5 report makes this point clearly:
The Gemini 2.5 series makes considerable progress in large-scale training stability, signal propagation, and optimization dynamics, resulting in a considerable boost in performance straight out of pre-training compared with previous Gemini models.
This article is from the WeChat official account "Academic Headlines." Compiled by Academic Jun. Republished by 36Kr with authorization.