2025 AI Annual Review: After Reading 200 Papers, Looking at DeepMind, Meta, and DeepSeek, What AGI Narratives Are the Chinese and American Giants Describing?
Editor's Note: Forge ahead with resolve, reimagine the future through reconstruction. Elephant News and Elephant Wealth, in collaboration with Tencent News and Tencent Technology, present the 2025 year-end project "Resolve and Reconstruction"—looking back at 2025 and forward to 2026, letting insights illuminate the essence and seeking certainty amid transformation.
In the just-concluded 2025, I read through approximately two hundred papers in the field of artificial intelligence.
If I had to sum up the technical texture of this year in one phrase, it would be the end of the era of "brute-force aesthetics." The days of simply stacking parameters to pick low-hanging fruit are over; in 2025, technological evolution returned to basic research.
In this article, I want to clarify three conclusions by sorting out the technical thread of this year:
First, in 2025, technological progress focused mainly on four areas: Fluid Reasoning, Long-term Memory, Spatial Intelligence, and Meta-learning. The reason is that the Scaling Law ran into diminishing marginal returns on pure parameter scale. To break through the bottleneck on the road to AGI, the industry was forced to find new growth points: shifting from "making models larger" to "making models smarter."
Second, the current technical bottleneck is that models need to "not only be knowledgeable but also know how to think and remember." Through the AGI framework proposed by Yoshua Bengio (based on CHC cognitive theory), we can see that earlier AI had a severely lopsided ability profile: it scored extremely high on general knowledge (K), but was almost blank on immediate reasoning (R), long-term memory storage (MS), and visual processing (V). This imbalance is the biggest obstacle to AGI.
Third, these bottlenecks actually found new solutions in 2025, making it a year of successfully shoring up weaknesses. The three most important are:
● Reasoning Ability: Through the revolution sparked by Test-Time Compute (TTC), AI learned to think slowly, achieving a qualitative leap in reasoning ability from 0 to 8.
● Memory Ability: The Titans architecture and Nested Learning broke the stateless assumption of Transformers, giving models an internalized "hippocampus" and promising to finally cure their goldfish memory.
● Spatial Intelligence: Video generation is no longer just stacking pixels; it has begun to grasp physical laws and move toward a true world model.
Next, based on the papers I read this year, I will walk you through how these key pieces were put together.
(Due to space constraints, I only briefly describe the papers in each direction. If you want to dig deeper, see the paper references at the end of the article, organized by chapter.)
01 Evolution of Fluid Reasoning: The Birth and Development of Test-Time Compute
In 2024, AI's obvious shortcoming was its immediate reasoning (R) ability. In the GPT-4 era, AI relied only on probabilistic intuition and had essentially no reasoning ability. But in 2025, Test-Time Compute (TTC) traded time for intelligence by extending inference time. The core idea of TTC is: intelligence is not only a function of parameters but also a function of time. Represented by OpenAI o1 and DeepSeek R1, AI learned to "think slowly." By investing more computing resources in the inference phase, it conducts internal self-debate and deduction for seconds or even minutes before outputting an answer.
This is the most important paradigm innovation of 2025, turning AI from a parrot that recites to a machine that thinks.
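To make the "trade time for intelligence" idea concrete, here is a minimal sketch of one simple form of test-time compute, self-consistency voting: sample many reasoning chains and take the majority answer. The `generate` call and the `ANSWER:` convention are placeholders I introduce for illustration, not any particular vendor's API.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical LLM call returning a chain of thought that ends with 'ANSWER: ...'."""
    raise NotImplementedError("plug in your model API here")

def extract_answer(chain_of_thought: str) -> str:
    # Keep whatever follows the final 'ANSWER:' marker.
    return chain_of_thought.rsplit("ANSWER:", 1)[-1].strip()

def answer_with_test_time_compute(prompt: str, n_samples: int = 16) -> str:
    # More samples = more inference-time compute; the weights never change.
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Spending 16 samples instead of 1 is exactly the time-for-intelligence trade: accuracy rises with inference compute, without touching the parameters.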
Because the model's thinking process cannot be directly guided during pre-training, post-training, especially reinforcement learning (RL), has become the most important lever for improving reasoning ability.
But it was not all smooth sailing. In 2025, the paper "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" sparked an academic debate that lasted about half a year. The study found that in many cases, the correct reasoning paths produced by RLVR-trained models already existed in the base model's sampling distribution. The role of RL was merely to sharpen that distribution, significantly raising the probability of sampling those paths, rather than truly "creating" reasoning abilities the base model did not have.
After half a year of follow-up debate, the current consensus is that the base model may indeed contain all the necessary atomic reasoning steps (addition and subtraction, basic logical transformations), while the role of RL is to screen out, through tens of thousands of trials and errors, strategic paths that can stably maintain long-range dependencies.
In addition, CMU research pointed out that RL training passes through distinct stages. The first stage is "sharpening," which only raises the probability of already-known paths; but as training deepens, the model enters a "chaining" stage, starting to link up asymmetric skills (such as verification and generation) that individually had extremely low probability in the base model, thereby solving problems it has never seen. This suggests RL is not just sharpening: it can also effectively compose new ways of reasoning.
However, this somewhat metaphysical academic debate did not dampen the industry's enthusiasm for engineering optimization, because benchmark gains do not lie.
The essence of reinforcement learning is to obtain feedback through interaction with the environment, balance exploring the unknown against exploiting the known, and learn an optimal decision-making policy that maximizes long-term cumulative reward. Its engineering can therefore be split into three core components: the exploration (sampling) strategy, the scoring system (what to score and how), and the parameter update algorithm.
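To make that three-way split concrete, here is a schematic skeleton of one RL training step, assuming a hypothetical `policy` object with `sample` and `policy_gradient_loss` methods; it is a sketch of the structure, not any lab's actual pipeline.

```python
def rl_training_step(policy, prompts, reward_fn, optimizer, group_size=8):
    # 1) Exploration / sampling strategy: draw a group of candidate answers
    #    from the current policy for every prompt.
    samples = [policy.sample(p, temperature=1.0, n=group_size) for p in prompts]

    # 2) Scoring: assign each sampled answer a reward (ORM, PRM, rubric, ...).
    rewards = [[reward_fn(p, s) for s in group] for p, group in zip(prompts, samples)]

    # 3) Parameter update: convert rewards into a policy-gradient loss
    #    (PPO, GRPO, or any other estimator plugs in here) and step.
    loss = policy.policy_gradient_loss(prompts, samples, rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```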
In 2025, reinforcement learning made significant progress on two of these three parts. Sampling strategies still revolve around three methods: Monte Carlo methods (expanding new branches step by step), brute-force temperature sampling (raising sampling temperature so the model explores multiple possibilities), and the STaR mode popularized in 2023 (the model critiques its own conclusions and finds other paths based on the critique). In 2025, thanks to the success of DeepSeek R1, brute-force temperature sampling clearly became the mainstream, because it is simple to engineer and produces good results.
Innovation in Scoring Systems
First, in 2025, reinforcement learning with verifiable rewards (RLVR) and sparse, outcome-based reward models (ORM) rose across the board.
With the success of DeepSeek R1, everyone found that as long as the model is given a simple right/wrong verdict on the final answer as the reward signal, it will spontaneously explore the reasoning process on its own. This drove the rise of ORM.
In the ORM camp, in domains where the correctness of the result can be clearly verified against objective ground truth, such as mathematics, code, and logic, reinforcement learning is easy to implement and its gains come quickly. A reward mechanism built on such objective ground truth is called a verifiable reward. In the first half of 2025, the recipe of RLVR (verifiable outcomes) + GRPO (group-relative exploration) advanced rapidly, basically becoming the mainstream method, and brought a marked improvement in models' math and coding ability.
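As a concrete illustration of what a verifiable reward means, here is a minimal outcome-only reward for math, assuming answers are wrapped in a \boxed{...} marker (one common convention I use here for illustration, not a fixed standard): the reward checks only the final answer and ignores the reasoning.

```python
import re

def math_verifiable_reward(model_output: str, ground_truth: str) -> float:
    # Outcome reward in its simplest form: 1.0 if the final boxed answer
    # matches the ground truth exactly, else 0.0.
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0                       # no parsable answer, no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```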
However, with extended use, everyone found that when the reasoning chain gets too long, as in complex math and code, ORM tends to collapse. Some companies therefore mix in elements of process reward models (PRM), such as Qwen's code-interpreter verification, which focuses on catching wrong steps inside the reasoning process. KL-regularization theory for keeping ORM-trained policies from collapsing or drifting also made further progress this year.
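The KL term mentioned above usually enters as a penalty that keeps the RL policy close to a frozen reference model. A minimal sketch, assuming per-token log-probabilities are already available as tensors:

```python
import torch

def kl_regularized_reward(outcome_reward: torch.Tensor,
                          policy_logprobs: torch.Tensor,
                          ref_logprobs: torch.Tensor,
                          beta: float = 0.05) -> torch.Tensor:
    # Sequence-level log-ratio between the current policy and the reference
    # model, summed over tokens; beta controls how hard we pull back.
    log_ratio = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return outcome_reward - beta * log_ratio
```

Too small a beta and the policy drifts or collapses into reward hacking; too large and it can barely move away from the reference model.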
Another problem: RLVR is very useful, but not every field has a verifiable right or wrong. In literature, or even in medicine, which is largely statistical, there is often no single ground truth to check against. So what do we do? We may need a more ambitious Universal Verifier to solve this problem.
There are currently two ideas. One is external: since the standard is not unique, humans or models draft complex scoring rubrics, and the model is then rewarded according to the rubric. The other is internal: trust the model's own intuition, using the model's confidence to shape training in fields without clear rewards.
For example, Kimi K2's joint RL stage combines RLVR with a self-critique rubric reward.
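As a sketch of the external, rubric-based route (not Kimi's actual implementation), one can weight a few rubric criteria and let a judge model score each; the `judge` call and the rubric items below are hypothetical.

```python
# Hypothetical rubric for a domain without a single verifiable answer.
RUBRIC = [
    ("cites evidence relevant to the question", 0.4),
    ("states uncertainty instead of guessing",  0.3),
    ("conclusion is internally consistent",     0.3),
]

def judge(response: str, criterion: str) -> float:
    """Hypothetical judge-model call returning a score in [0, 1] for one criterion."""
    raise NotImplementedError("plug in a judge model here")

def rubric_reward(response: str) -> float:
    # Weighted sum of per-criterion judge scores becomes the RL reward.
    return sum(weight * judge(response, criterion) for criterion, weight in RUBRIC)
```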
Innovation in Parameter Update Algorithms
The second RL shock brought by DeepSeek R1 was the popularization of the GRPO algorithm. Previously, the mainstream RL method was PPO. In that framework there are two roles: the Actor model, which writes the answers, and the Critic model, which scores each of the actor's steps. This setup suits PRM particularly well, scoring step by step, but it is very expensive, because the critic has to be trained online the whole time, with the model trying and the critic scoring on the fly.
GRPO is different. It cuts the Critic model out entirely: the model generates a group of answers, and the group's average score replaces the critic as the yardstick for who did well and who did poorly. This saves roughly half the GPU memory in one stroke, and paired with ORM it is extremely simple, very cost-effective, and works well.
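The core trick fits in a few lines. A minimal sketch of the group-relative advantage (the piece that replaces the critic), assuming one scalar reward per sampled answer:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one scalar per sampled answer.
    # Each answer is judged against its own group's mean and spread,
    # so no separately trained Critic network is needed.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)   # above-average answers get a positive advantage
```

Answers better than their group average pull the policy toward them; worse ones push it away, which is the whole "who did well, who did poorly" comparison.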
As a result, basically all the domestic Chinese labs built on the GRPO framework, developing various variants in 2025. For example, Qwen's GSPO introduced score weighting: it looks not only at whether you are above the group average but also at your absolute score, letting GRPO pick the better answers from among the correct ones and exclude the clearly wrong ones from the gradient, making training more stable. MiniMax's CISPO found that traditional GRPO/PPO training would brutally truncate overly long CoT contexts, losing the core thinking, so it uses importance sampling to retain the more important parts for the update.
Beyond these specific updates, the industry also tried to find reinforcement learning's own Chinchilla law.
For example, Meta's ScaleRL found, across a battery of ablation experiments, that RL's growth curve has a ceiling. They showed that RL performance versus compute does not follow a power law (the scaling law in which more compute keeps yielding more capability) but a sigmoid curve (slow to start, rapid in the middle, flat at the end).
This is not good news: it means RL has a ceiling. We cannot expect RL to raise a model's intellectual ceiling indefinitely; it can only "squeeze out" the potential the base model already has (endowed by pre-training). Once that potential is fully extracted, RL stops helping, and to break through again we have to go back and innovate on the base model or the architecture.
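To make the sigmoid-versus-power-law distinction concrete, here is a hedged sketch of fitting both curve families to (compute, score) measurements; the data points are placeholders, not ScaleRL's numbers.

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder measurements: RL training compute (FLOPs) vs. benchmark pass rate.
compute = np.array([1e19, 1e20, 1e21, 1e22, 1e23])
score   = np.array([0.22, 0.41, 0.58, 0.66, 0.68])
log_c   = np.log10(compute)

def sigmoid(x, ceiling, midpoint, slope):
    # Saturating curve: performance flattens out at `ceiling`.
    return ceiling / (1.0 + np.exp(-slope * (x - midpoint)))

def power_law_loglog(x, intercept, exponent):
    # Unbounded power law, written as a straight line in log-log space.
    return intercept + exponent * x

sig_params, _ = curve_fit(sigmoid, log_c, score, p0=[0.7, 21.0, 1.0])
pow_params, _ = curve_fit(power_law_loglog, log_c, np.log10(score))
print("fitted ceiling:", sig_params[0])   # if a ceiling exists, it shows up here
```

If the sigmoid fits better and its fitted ceiling sits close to the data, that is the "RL only extracts existing potential" reading of the curve.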
But the good news is that we are still far from that ceiling and plenty of engineering innovation remains. Moreover, improvement of the base models themselves has not completely stagnated either.
ScaleRL also proposed a set of engineering best practices, including using long chain of thought (Long CoT) as a key driver and using large batch sizes (e.g., 2,048 prompts) to reach a higher performance ceiling. This research turned RL from "alchemy" into a more precise engineering science, letting researchers predict the effect of large-scale training from small-scale experiments.
All these explorations in RL engineering have let this year's models steadily improve their overall capability without growing their parameter counts, repeatedly pushing up scores on ARC and Humanity's Last Exam while driving marked gains in math and code.
02 Memory and Learning: Curing the Model's Amnesia
If Test-Time Compute was the most important model-side change of the first half of the year, the most important change of the second half was the improvement of memory. After all, this is the only sub-ability that still scored zero on the AGI scale in the GPT-5 era: the shortest of the short boards and the biggest source of lost points.
What's the problem with a model having no memory? First, a model without memory cannot learn on its own in the real world; it must be retrained in a compute factory. That retraining is expensive, and the training data may be completely disconnected from everyday use, so continual learning becomes extremely difficult. Second, it is very hard to get an AI that remembers who you are and what you prefer. My Gemini 3, for example, accumulates a little memory about me only through system-level prompts, and most of it is still wrong.
RAG (Retrieval-Augmented Generation), which became popular in 2024, alleviated this problem as an external "hippocampus," but in that form it was just a database plus a retrieval mechanism, and it was not very pleasant to use. In 2025, research on memory actually made a lot of progress, but most of it appeared in the second half of the year and has not yet been truly folded into engineering practice.
Memory can be handled in three ways: the raw context as memory, RAG-processed context as memory, and internalized memory that folds context into the parameters. The difficulty rises step by step.
This year, both RAG and parameter-level memory made great research progress, but the most dazzling results are the Titans architecture and Nested Learning released by Google Research: the biggest breakthroughs in the memory field in 2025, and architecture-level improvements that fundamentally challenge the stateless assumption of Transformers.
Let's look at them one by one.
Models Gain Living Memory
Titans is a deep neural long-term memory module that updates its own parameters in real time at test time (i.e., during inference). This is completely different from traditional Transformer layers, which are frozen after training. Titans starts as an empty container with only initial weights and learns historical information into its neural memory. Compared with traditional compression approaches (such as Mamba), this learning is lossless.
How does it decide what to remember and what to forget? It relies on a Surprise Metric: the model decides whether to store an input into long-term memory according to how surprising that input is (measured by gradient magnitude). This is much like humans: the fresher and more striking something is, the easier it is to remember.
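To show the mechanism (and why the next question is cost), here is a toy sketch of a surprise-gated neural memory in plain PyTorch; this is my illustrative approximation of the idea, not the actual Titans code.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """Toy memory module whose weights are updated at test time."""
    def __init__(self, dim: int, lr: float = 1e-2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.lr = lr

    def read(self, query: torch.Tensor) -> torch.Tensor:
        return self.net(query)

    def write(self, key: torch.Tensor, value: torch.Tensor) -> float:
        # Surprise = how badly the current memory predicts the new association;
        # its gradient w.r.t. the memory weights drives an immediate update.
        loss = ((self.net(key) - value) ** 2).mean()
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p -= self.lr * g          # bigger surprise, bigger weight change
        return loss.item()                # scalar surprise, usable for gating or forgetting
```

Every write is a backward pass plus a weight update, which is exactly where the cost question below comes from.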
Titans updates its weights on the fly, which means backpropagation and gradient updates during inference. Isn't that expensive? Indeed,