
AI guru Andrej Karpathy releases his 2025 annual review: LLMs enter a new era of "ghost intelligence" and "vibe coding"

Friends of 36Kr · 2025-12-22 16:50
Karpathy: In 2025, LLMs shift toward logical reasoning, and RLVR becomes the new engine.

On December 21st, Beijing time, Andrej Karpathy, a founding member of OpenAI and noted AI expert, released an in-depth annual review titled "2025 LLM Year in Review".

In this review, Karpathy thoroughly analyzed the underlying paradigm shifts that occurred in the field of large language models (LLMs) in the past year. He pointed out that 2025 marked a decisive leap in AI training philosophy from simple "probabilistic imitation" to "logical reasoning".

The core driving force behind this shift is the maturation of Reinforcement Learning with Verifiable Rewards (RLVR). By training in environments with objective feedback, such as mathematics and code, it pushes models to spontaneously generate "reasoning traces" that resemble human thinking. Karpathy believes this long-horizon reinforcement learning has begun to eat into the share of traditional pre-training and has become the new engine for improving model capabilities.

Beyond the change in technical path, Karpathy also offered a deeper take on the nature of this intelligence. He described today's AI with the metaphor of "summoning ghosts" rather than "growing animals", explaining why current large language models show "jagged" performance: genius-level in frontier domains, yet as fragile as a child when it comes to basic common sense.

Karpathy also discussed the rise of "vibe coding", the trend of agents moving into practical use, and the evolution of the large language model graphical user interface (LLM GUI). He emphasized that despite rapid progress, the industry has so far tapped less than 10% of the potential of this new computing paradigm, leaving enormous room for future development.

Karpathy lays out a sobering yet hopeful reality: we are at the inflection point between "simulating human intelligence" and "pure machine intelligence". As technologies like RLVR become widespread, the AI race in 2026 will no longer be just an arms race in compute; it will shift toward a deeper exploration of the core question of how to make AI think efficiently.

The following is the full text of Karpathy's annual review:

"2025 LLM Year in Review"

2025 was a year of great leaps and considerable uncertainty for large language models. Below is a list of paradigm shifts that I think are worth recording and that were, to some degree, unexpected. They have profoundly reshaped the industry landscape and delivered real shocks at the conceptual level.

01 Reinforcement Learning with Verifiable Rewards (RLVR)

At the beginning of 2025, the production stacks of large language models in all laboratories were basically as follows:

  • Pretraining (GPT-2/3 in 2020)
  • Supervised Fine-Tuning (SFT, InstructGPT in 2022)
  • Reinforcement Learning from Human Feedback (RLHF, 2022)

For a long time this was the stable, proven recipe for training production-grade large language models. In 2025, Reinforcement Learning with Verifiable Rewards emerged as the de facto new core stage in this stack.
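
To make the ordering of these stages concrete, here is a minimal, purely illustrative Python sketch of the pipeline; the Model class and the stage functions are hypothetical placeholders, not any lab's actual training code.

```python
# Illustrative sketch of the 2025-era LLM training pipeline described above.
# The Model class and stage functions are hypothetical placeholders.

from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    stages: list[str] = field(default_factory=list)

    def apply(self, stage: str) -> "Model":
        self.stages.append(stage)
        return self


def pretrain(m: Model) -> Model:   # next-token prediction on web-scale text
    return m.apply("pretraining")

def sft(m: Model) -> Model:        # imitate curated instruction/response pairs
    return m.apply("supervised fine-tuning")

def rlhf(m: Model) -> Model:       # optimize against a learned human-preference reward
    return m.apply("RLHF")

def rlvr(m: Model) -> Model:       # optimize against automatically verifiable rewards (math, code)
    return m.apply("RLVR")


if __name__ == "__main__":
    model = rlvr(rlhf(sft(pretrain(Model("llm-2025")))))
    print(model.stages)  # ['pretraining', 'supervised fine-tuning', 'RLHF', 'RLVR']
```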

By training large language models in environments with large numbers of automatically verifiable rewards, such as math and code puzzles, the models spontaneously develop strategies that, from a human perspective, look like "reasoning". They learn to break complex problems into intermediate steps and pick up all kinds of habits of repeated deliberation and solution-seeking (see the examples in the DeepSeek R1 paper).

Strategies like these were hard to obtain under earlier paradigms. The core reason is that nobody knows the optimal reasoning traces or problem-solving procedures in advance, so the model must discover effective solutions on its own by optimizing against the reward objective.
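
To make "automatically verifiable reward" concrete, here is a minimal sketch, assuming a toy math environment in which the model ends its reply with a line like "Answer: 42"; the extraction convention and function names are illustrative, not taken from Karpathy's post.

```python
# A minimal sketch of a verifiable reward for a math environment. The
# "Answer: <value>" convention is an assumption made for illustration.

import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the last 'Answer: <value>' line out of a model completion."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 iff the extracted answer exactly matches the reference.

    Because the check is a deterministic comparison rather than a learned
    preference model, the reward cannot be flattered or argued with: the only
    way to score is to actually solve the problem.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


if __name__ == "__main__":
    good = "Let me think step by step... 6 * 7 = 42.\nAnswer: 42"
    bad = "This is clearly a trick question.\nAnswer: 41"
    print(verifiable_reward(good, "42"))  # 1.0
    print(verifiable_reward(bad, "42"))   # 0.0
```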

Unlike the comparatively compute-light fine-tuning stages such as supervised fine-tuning and RLHF, Reinforcement Learning with Verifiable Rewards trains against objective (non-gameable) reward functions, which allows it to sustain a much longer optimization process.

In practice, Reinforcement Learning with Verifiable Rewards has proved to have an extremely high capability-to-cost ratio, and it has even started absorbing large amounts of compute that would previously have gone to pre-training. Much of the improvement in LLM capabilities in 2025 therefore came from labs exploring and cashing in the untapped potential of this new stage.

Overall, model parameter counts did not change much this year, but reinforcement learning runs became significantly longer. RLVR also introduced a new tuning knob (with its own scaling laws): by generating longer reasoning traces and giving the model more "thinking time", test-time compute can be dialed up or down, trading inference cost for capability.
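
As a rough illustration of this test-time knob, here is a minimal sketch of best-of-N sampling over reasoning traces; generate is a hypothetical stand-in for a model API that accepts a thinking-token budget, not a real library call.

```python
# A minimal sketch of trading test-time compute for capability via best-of-N
# sampling over reasoning traces. `generate` is a hypothetical stand-in for a
# model API that accepts a thinking-token budget.

import random
import re


def generate(prompt: str, max_thinking_tokens: int) -> str:
    """Hypothetical sampler: a larger thinking budget makes a correct trace more likely."""
    p_correct = min(0.9, max_thinking_tokens / 10_000)
    answer = "42" if random.random() < p_correct else "41"
    return f"(reasoning trace using up to {max_thinking_tokens} tokens)\nAnswer: {answer}"


def verifies(completion: str, reference: str) -> bool:
    """Objective check: does the final 'Answer:' line match the reference?"""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return bool(matches) and matches[-1].strip() == reference


def best_of_n(prompt: str, reference: str, n: int, budget: int) -> str:
    """Sample up to n traces at the given budget; return the first that verifies."""
    trace = ""
    for _ in range(n):
        trace = generate(prompt, max_thinking_tokens=budget)
        if verifies(trace, reference):
            break  # an externally verified answer; stop spending compute
    return trace


if __name__ == "__main__":
    # More samples and a larger thinking budget both raise the chance that at
    # least one trace passes the verifier, i.e. test-time compute buys capability.
    print(best_of_n("What is 6 * 7?", reference="42", n=8, budget=4000))
```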

OpenAI's o1, launched at the end of 2024, was the first public appearance of RLVR, but the release of o3 in early 2025 was the clear turning point: only then could people directly feel the qualitative leap in LLM capabilities.

02 "Ghosts" vs. "Animals" / Jagged Intelligence

In 2025, I (and I think the entire industry) began to intuitively understand the "shape" of large language model intelligence. What we are facing is not "gradually evolving animals" but "summoned ghosts".

Every component of the LLM technology stack, from the neural network architecture, the training data, and the training algorithms to, above all, the optimization objectives, differs completely from the evolutionary logic that produced biological intelligence. Large language models are therefore a new kind of entity in the space of possible intelligences, and interpreting them through a biological lens inevitably leads to cognitive bias.

In terms of the supervision signal, the human brain's neural network was optimized for tribal survival and coping with life in the wild; an LLM's neural network is optimized for imitating human text, collecting rewards on math problems, and earning human upvotes on the LM Arena leaderboard.

[Figure: human intelligence shown in blue, AI intelligence in red]

As RLVR spreads across verifiable domains, LLM capabilities in those specific domains grow explosively, producing an interestingly "jagged" overall profile: the same model can be a genius polymath fluent in many fields and, at the same time, a confused "primary-school student" riddled with cognitive gaps, one that a single jailbreak prompt might coax into leaking user data.

Relatedly, in 2025 I lost essentially all interest in, and trust of, benchmarks. The core problem is that almost every benchmark is built on the logic of a "verifiable environment", which makes them extremely vulnerable to being "attacked" via RLVR training or synthetic data generation.

In the typical leaderboard-chasing workflow, labs inevitably build mini training environments in the region of feature space surrounding a benchmark, cultivating spikes of "jagged intelligence" that cover exactly the points being tested. "Training to the test set" has, in effect, become a new kind of engineering practice.

03 Cursor and the New Hierarchy of Large Language Model Applications