
Large Models in 2025: Six Key Insights

Tencent Research Institute, 2025-12-23 19:38
We are at a critical juncture in the transition from "simulating human intelligence" to "pure machine intelligence."

On December 21, Beijing time, Andrej Karpathy, a founding member of OpenAI and a renowned AI expert, released an in-depth annual review titled "2025 LLM Year in Review".

In it, Karpathy analyzed the underlying paradigm shifts in the large language model (LLM) field over the past year, arguing that 2025 marked a decisive leap in AI training philosophy from simple "probability imitation" to "logical reasoning".

The core driving force behind this shift is the maturing of Reinforcement Learning with Verifiable Rewards (RLVR). By training in environments with objective feedback, such as mathematics and code, RLVR pushes models to spontaneously produce "reasoning traces" resembling human thought. Karpathy argues that this long-horizon reinforcement learning has begun to eat into the share of compute once devoted to pre-training and has become a new engine for improving model capabilities.

Beyond the change in technical path, Karpathy also offered deeper insights into the nature of intelligence.

He described the way current AI comes into being as "summoning ghosts" rather than "evolving/growing animals", which explains why today's large language models show "jagged" performance: they can perform like geniuses in cutting-edge fields yet stumble like children on basic common sense.

Karpathy also discussed the rise of "Vibe Coding", the trend toward agents that run locally on users' machines, and the evolution of the large language model graphical user interface (LLM GUI). He stressed that although the industry has advanced rapidly, humans have so far tapped less than 10% of the potential of this new computing paradigm, leaving extremely broad room for future development.

Karpathy's review points to a harsh yet hopeful reality: we are at the critical point of the transition from "simulating human intelligence" to "pure machine intelligence". As technologies such as RLVR spread, the AI competition of 2026 will no longer be just an arms race over compute but an in-depth exploration of a core question: how to make AI think efficiently.

The following is the full text of Karpathy's annual review:

"2025 LLM Year in Review"

The year 2025 was a year of great leaps and great uncertainty in the large language model field. Below is a list of paradigm shifts that I find worth recording and that were somewhat unexpected. They have profoundly changed the industry landscape and were striking at the level of ideas.

Reinforcement Learning with Verifiable Rewards (RLVR)

At the beginning of 2025, the production stack of large language models in all laboratories was basically as follows:

  • Pretraining (GPT-2/3, 2020)
  • Supervised Fine-Tuning (SFT; InstructGPT, 2022)
  • Reinforcement Learning from Human Feedback (RLHF, 2022)

For a long time, this was the stable, proven recipe for training production-grade large language models. In 2025, Reinforcement Learning with Verifiable Rewards emerged as the de facto new core stage in this pipeline.

By training large language models in environments with a large supply of automatically verifiable rewards, such as mathematics and code puzzles, models spontaneously develop strategies that look like "reasoning" from a human perspective: they learn to break complex problems into intermediate steps and acquire various skills for repeated deliberation and solution-finding (see the examples in the DeepSeek R1 paper).

Such strategies were hard to obtain under earlier paradigms. The core reason is that the model cannot know the optimal reasoning traces or problem-solving processes in advance; it must discover effective strategies on its own by optimizing against the reward objective.

Unlike fine-tuning stages with relatively small compute budgets, such as supervised fine-tuning and reinforcement learning from human feedback, reinforcement learning with verifiable rewards trains against objective (non-gameable) reward functions, which lets it sustain a much longer optimization process.
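
To make the idea of a "verifiable, non-gameable reward" concrete, here is a minimal Python sketch (my illustration, not from Karpathy's text): the reward comes from an objective check, such as matching a known answer or passing unit tests, rather than from a learned preference model. The `solve` contract and all function names are assumptions for the example.

```python
# Minimal sketch of a "verifiable reward": the reward comes from an objective
# check, not from a learned human-preference model. All names here are
# illustrative assumptions, not Karpathy's or any lab's actual code.

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Reward 1.0 iff the model's final answer matches the known solution."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(model_code: str, tests: list) -> float:
    """Reward = fraction of unit tests passed by the generated function.
    Assumed contract: the model's code defines a function named `solve`."""
    namespace = {}
    try:
        exec(model_code, namespace)   # run the model's code (sandbox in reality!)
        fn = namespace["solve"]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass                      # a crashing test earns no reward
    return passed / len(tests)

print(math_reward("42", "42"))                        # 1.0
print(code_reward("def solve(x): return x * 2",
                  [((3,), 6), ((5,), 10)]))           # 1.0
```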

In practice, reinforcement learning with verifiable rewards has proven to have an extremely high capability-per-cost ratio and has even taken over a large share of the compute once reserved for pre-training. The capability gains of large language models in 2025 therefore came mainly from the labs mining the untapped potential of this new stage.

Overall, model parameter counts did not change much this year, but reinforcement learning training runs grew significantly longer. In addition, reinforcement learning with verifiable rewards introduced a new tuning dimension (and associated scaling laws): by generating longer reasoning traces and increasing the model's "thinking time", test-time compute can be flexibly scaled to improve capability.
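
As a hedged illustration of this test-time knob, the sketch below shows one widely known form of test-time scaling (self-consistency voting over several sampled traces); this is my example, not necessarily what any lab does internally, and `generate_trace` is a hypothetical placeholder for a budgeted model call.

```python
# Hedged sketch of one test-time-compute knob: sample several independent
# reasoning traces and majority-vote over the final answers ("self-consistency").
# `generate_trace` is a hypothetical stand-in for a budgeted model call.
from collections import Counter
import random

def generate_trace(question: str, thinking_tokens: int) -> str:
    # Placeholder: a real system would call an LLM with this token budget.
    return random.choice(["4", "4", "5"])   # noisy single-sample answers

def answer_with_budget(question: str, n_samples: int, thinking_tokens: int) -> str:
    answers = [generate_trace(question, thinking_tokens) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# More samples (or longer traces) = more test-time compute = more reliable output.
print(answer_with_budget("What is 2 + 2?", n_samples=1, thinking_tokens=256))
print(answer_with_budget("What is 2 + 2?", n_samples=9, thinking_tokens=2048))
```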

OpenAI's o1 model, launched at the end of 2024, was the first public appearance of reinforcement learning with verifiable rewards, but the release of o3 in early 2025 was the clear turning point: only then could people intuitively feel the qualitative leap in large language model capabilities.

The Debate between "Ghosts" and "Animals"

In 2025, I (and, I believe, the entire industry) began to intuitively grasp the "shape" of large language model intelligence. What we are dealing with are not "gradually evolved animals" but "summoned ghosts".

Every component of the large language model stack (the neural network architecture, the training data, the training algorithms, and especially the optimization objectives) is completely different from the evolutionary logic of biological intelligence. Large language models are therefore a new kind of entity in the space of intelligences; interpreting them through the lens of biology inevitably produces cognitive biases.

In terms of the supervision signal, the human brain's neural networks were optimized for tribal survival and coping with the jungle, while the neural networks of large language models are optimized for imitating human text, earning rewards on math problems, and collecting human upvotes on the LM Arena leaderboard.

[Figure: human intelligence shown in blue, AI intelligence in red]

As reinforcement learning with verifiable rewards spreads across verifiable domains, large language model capabilities in those specific domains grow explosively, producing an interesting overall "jagged" performance profile: the same model can be a genius polymath across many fields and yet, like a confused primary-school student, suffer basic cognitive gaps; it may even be induced by a jailbreak instruction to leak user data.

Relatedly, in 2025 I completely lost both interest in and trust in benchmarks. The core problem is that benchmarks are almost all built as "verifiable environments", which makes them extremely vulnerable to "attacks" such as RLVR training or synthetic data generation.

In a typical leaderboard-gaming workflow, every lab inevitably builds a micro training environment near the feature space of the benchmark, cultivating "jagged spikes" of intelligence that precisely cover the test points. "Training for the test set" has effectively become a new kind of engineering practice.

Cursor and the New Hierarchy of Large Language Model Applications

The most notable thing about Cursor (besides its explosive growth in 2025) is that it clearly reveals a new layer in the hierarchy of large language model applications; people now routinely speak of "the Cursor for X".

As I emphasized in my talk at Y Combinator this year, the core value of large language model applications like Cursor lies in integrating and orchestrating calls to large language models for a specific vertical, concretely in the following ways:

- Handling "context engineering", optimizing prompt design and context management;

- Orchestrating multiple large language model calls into increasingly complex directed acyclic graphs (DAGs) behind the scenes, precisely balancing performance against cost (a minimal sketch follows this list);

- Providing a graphical user interface adapted to specific scenarios for the human-in-the-loop;

- Offering an adjustable "autonomy slider" to flexibly control the scope of AI's autonomous decision-making authority.
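
Here is a minimal sketch of the orchestration point above, under assumptions: `call_llm` is a hypothetical placeholder, not any provider's real API, and the DAG is a simple three-node pipeline that spends cheap-model calls on context work and reserves the strong model for the final step.

```python
# Hedged sketch of the orchestration idea: a three-node DAG that spends cheap
# model calls on context work and reserves the strong model for the final step.
# `call_llm` is a hypothetical placeholder, not any provider's real API.

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for a real provider call.
    return f"[{model} output for: {prompt[:40]}...]"

def edit_file(user_request: str, file_text: str, repo_files: list) -> str:
    # Node 1 (cheap): summarize each repository file.
    summaries = [call_llm("small-model", f"Summarize:\n{name}") for name in repo_files]
    # Node 2 (cheap): select the context relevant to this request.
    context = call_llm("small-model",
                       f"Pick parts relevant to '{user_request}':\n" + "\n".join(summaries))
    # Node 3 (expensive, runs once): produce the actual edit with curated context.
    return call_llm("big-model",
                    f"Request: {user_request}\nContext: {context}\nFile:\n{file_text}")

print(edit_file("rename foo to bar", "def foo(): ...", ["a.py", "b.py"]))
```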

In 2025, there was a lot of discussion in the industry about the "thickness" of this new application layer: Will large language model laboratories monopolize all application scenarios? Or is there still a vast blue ocean for large language model applications in vertical fields?

Personally, I think the large language model labs produce something like "college graduates with extremely strong general knowledge", while large language model applications organize and fine-tune these graduates, through private data, sensors, actuators, and feedback loops, into "professional teams" for specific verticals.

The "Agents" "Residing" in Users

The arrival of Claude Code (CC) was the first convincing demonstration of the core capabilities of large language model agents: it chains tool use and reasoning in a loop to solve long-horizon problems. Beyond that, CC's most remarkable feature, for me, is its local mode of operation: it is deployed directly on the user's computer, with access to the local private environment, data, and context.
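
That cyclic structure can be sketched as follows; this is an illustration under assumptions, not Claude Code's actual implementation, and `call_llm` is a hypothetical placeholder for a model call that returns either a tool request or a final answer.

```python
# Hedged sketch of an agent loop: the model alternates between reasoning and
# local tool calls until it declares the task done. Not CC's real internals.
import subprocess

def run_tool(command: str) -> str:
    """Run a shell command on the user's own machine and capture its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def call_llm(history: list) -> dict:
    # Placeholder for a model call; a real agent would send the full history
    # and parse a structured reply, e.g. {"tool": "ls"} or {"done": "summary"}.
    return {"done": "stub answer"}

def agent_loop(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(history)
        if "done" in step:                        # the model declares completion
            return step["done"]
        observation = run_tool(step["tool"])      # act on the local environment
        history.append({"role": "tool", "content": observation})
    return "step limit reached"

print(agent_loop("find the failing test and explain why"))
```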

In my view, OpenAI's early exploration of code agents went off-track: they focused on orchestrating cloud containers through ChatGPT rather than directly using the local environment (localhost). Clusters of agents running in the cloud may look closer to the "ultimate form of Artificial General Intelligence (AGI)", but in today's reality of uneven AI capability and gradual progress, agents running directly on developers' machines are clearly more practical.

To be clear, the core difference is not where the AI computation runs (cloud or local) but everything else: the machine the agent starts from, with its pre-installed environment, local context, private data, keys, and system configuration, plus the low-latency human-machine interaction.

Anthropic got this priority exactly right and packaged CC as an extremely simple, elegant command-line interface (CLI), completely reshaping how users perceive AI: no longer a website you actively visit (like the Google search engine) but an intelligent entity "residing" on your computer. This marks the official birth of a new and distinctive AI interaction paradigm.

Vibe Coding: Revolutionizing Software Development

In 2025, AI crossed a critical capability threshold: people can now build genuinely powerful programs in plain English, largely ignoring the code itself. Amusingly, when I first coined "Vibe Coding" in an offhand tweet, I never expected it to have such far-reaching impact.

In the era of Vibe Coding, programming is no longer an exclusive skill demanding years of training but a general ability ordinary people can pick up. This confirms a view I laid out earlier in "Power to the people": large language models are reversing the traditional logic of technology diffusion.

Unlike every previous technology, large language models benefit ordinary people far more than professionals, enterprises, and governments. Vibe Coding not only gives ordinary people the power of technological creation but also lets professional developers efficiently ship software projects they would never have attempted because of skill barriers or cost.

Take the nanochat project I worked on as an example: I vibe-coded an efficient BPE tokenizer in Rust without ever systematically learning Rust's deeper technical details.
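
For readers unfamiliar with BPE, here is a short Python sketch of the core merge loop (nanochat's real tokenizer is written in Rust and heavily optimized; this toy version only illustrates the idea): repeatedly find the most frequent adjacent token pair and replace it with a new token id.

```python
# Toy sketch of byte-pair encoding (BPE) training: repeatedly merge the most
# frequent adjacent token pair into a new token. Illustration only.
from collections import Counter

def bpe_train(ids, num_merges):
    merges = {}
    next_id = 256                             # raw byte values occupy 0..255
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))    # count adjacent pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]     # most frequent pair
        merges[pair] = next_id
        out, i = [], 0
        while i < len(ids):                   # greedy left-to-right replacement
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges

print(bpe_train(list("aaabdaaabac".encode("utf-8")), num_merges=3))
```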

In 2025 I also completed several demo projects through Vibe Coding (such as menugen and llm-council), and even dashed off an entire throwaway application to chase down a bug. Under Vibe Coding, code becomes cheap, instant, and malleable, supporting lightweight "use-and-discard" scenarios. Going forward, Vibe Coding will thoroughly transform the software development ecosystem and redefine the core value of the related professions.

The Emerging Interaction Form of Large Language Models

Google's Gemini Nano Banana is one of the most breakthrough, paradigm-shifting models of 2025. In my mental framework, large language models are the next major innovation in computing paradigms after the personal computer of the 1970s-1980s.

We will therefore see innovations replay along similar underlying logic: large language model equivalents of personal computing, of microcontrollers (the "cognitive core"), and of the Internet (networks of agents) will gradually emerge.

In user interface/user experience (UI/UX) especially, today's text-based dialogue with large language models resembles typing commands into a 1980s computer terminal. Text is the native, preferred data format for computers (and for large language models), but it is not the form of interaction humans accept best, especially on the input side.

Humans are simply not built for reading long passages of text; it is inefficient and tiring. Humans prefer to take in information visually and spatially, which is precisely why the graphical user interface (GUI) was invented in traditional computing.

By the same logic, large language models should interact in the formats humans prefer: images, infographics, slides, whiteboards, animation/video, and web applications. The early signs of this trend are emoji and markup, which give text visual structure through headings, bold, and lists.

But who will build the real "large language model GUI"? Seen from this angle, Nano Banana is an early prototype of that future form. More importantly, its core value lies not just in image generation but in the joint modeling of text generation, image generation, and world knowledge, deeply integrated in the model weights.

Core summary: 2025 was a year of surprises and breakthroughs for large language models. Today's models exhibit intelligence far beyond expectations while also harboring unexpected cognitive deficits. Either way, they are already enormously useful in practice; even at current capability levels, I think the industry has tapped less than 10% of their potential.

At the same time, countless new ideas remain to be explored, and conceptually the room for development is still vast. As I said on Dwarkesh's podcast this year: I believe the large language model field will keep moving fast, and I also know a great deal of foundational work remains to be done. Fasten your seatbelts and get ready for the next wave of change.

* Special translator Wuji also contributed to this article.

This article is from the WeChat official account "Tencent Research Institute", author: Xiaojing. It is published by 36Kr with authorization.