
2025: The Year of Large Language Models (LLMs)

God Translation Bureau · 2026-01-29 07:18
This year's trends were so numerous they were dazzling.

God Translation Bureau (神译局) is 36Kr's compilation team, focusing on fields such as technology, business, the workplace, and lifestyle, and chiefly introducing new technologies, ideas, and trends from abroad.

Editor's note: AI is no longer just a chat toy but an intelligent agent taking over decision-making. A $200 monthly fee has become the norm, Chinese models have quietly reached the top, and the OpenAI myth is rapidly unraveling amid the reasoning battles of 2025. This article is a compiled translation.

Year-end Summary of 2025

  • The Year of "Inference"

  • The Year of Agents

  • The Year of Programming Agents and Claude Code

  • The Year of Command-line LLMs

  • The Year of YOLO Mode and "Normalization of Deviance"

  • The Year of the $200 Monthly Subscription Fee

  • The Year of Chinese Open-weight Models Reaching the Top

  • The Year of Long - term Tasks

  • The Year of Prompt-driven Image Editing

  • The Year of Models Winning Gold in Academic Competitions

  • The Year of Llama Losing Its Way

  • The Year of OpenAI Losing Its Leading Position

  • The Year of Gemini

  • The Year of Pelicans Riding Bicycles

  • The Year I Built 110 Tools

  • The Year of the "Snitch"!

  • The Year of Vibe Coding

  • The Year of MCP (A Flash in the Pan?)

  • The Year of AI Browsers with Astoundingly Powerful Features

  • The Year of the Lethal Trifecta

  • The Year of Mobile Programming

  • The Year of Conformance Test Suites

  • The Year of Local Models Getting Stronger, but Cloud-based Models Getting Even Stronger

  • The Year of AI Slop

  • The Year of Data Centers Becoming Extremely Unpopular

  • My Word of the Year

The Year of "Inference" #

In September 2024, OpenAI kicked off the "reasoning" revolution with o1 and o1-mini, built on a technique also known as inference-time scaling or reinforcement learning with verifiable rewards (RLVR). At the beginning of 2025, they extended that lead with o3, o3-mini, and o4-mini. Since then, "reasoning" has become a signature feature of almost every major AI lab's models.

My favorite explanation of the importance of this technique comes from Andrej Karpathy:

By training LLMs against automatically verifiable rewards in various environments (such as math or code puzzles), the models spontaneously develop strategies that look like "reasoning" to humans: they learn to break problems down into intermediate calculation steps, and they learn multiple trial-and-error strategies for solving them (see examples in the DeepSeek R1 paper). [...]

It turns out that running RLVR is extremely cheap compared with the compute consumed by pre-training. As a result, most of the capability improvements in 2025 stem from LLM labs' deep exploration of this new stage. Overall, although the scale of LLMs has not changed much, the amount of reinforcement learning (RL) compute has increased significantly.
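To make the shape of that training loop concrete, here's a deliberately toy sketch in Python. Everything in it is illustrative: the "model" is a random-guess stub, and names like `sample_completion` and `verifiable_reward` are invented for this example rather than taken from any real RLVR framework.

```python
# Toy RLVR loop: sample completions, score them with an automatically
# checkable reward, and collect the scored samples for a policy update.
import random
import re

PROBLEMS = [("What is 17 + 25?", 42), ("What is 9 * 8?", 72)]

def sample_completion(prompt: str) -> str:
    """Stand-in for sampling from the policy (the LLM)."""
    guess = random.randint(0, 100)
    return f"Let me think step by step... the answer is {guess}"

def verifiable_reward(completion: str, expected: int) -> float:
    """Reward is 1.0 iff the final number is correct: no human judge needed."""
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and int(numbers[-1]) == expected else 0.0

def rlvr_step(batch_size: int = 8) -> None:
    scored = []
    for _ in range(batch_size):
        prompt, expected = random.choice(PROBLEMS)
        completion = sample_completion(prompt)
        scored.append((prompt, completion, verifiable_reward(completion, expected)))
    # A real trainer would now run a policy-gradient update (e.g. GRPO or PPO)
    # that up-weights high-reward completions; here we just report the mean.
    mean_reward = sum(r for _, _, r in scored) / len(scored)
    print(f"mean reward this step: {mean_reward:.2f}")

if __name__ == "__main__":
    rlvr_step()
```

The point of the "verifiable" part is visible in `verifiable_reward`: because the answer can be checked mechanically, the loop needs no human labels, which is what makes RLVR so cheap to scale.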

In 2025, every well-known AI lab released at least one reasoning model. Some released hybrid models that can switch between reasoning and non-reasoning modes. Many API models now expose dials to raise or lower the reasoning effort for a given prompt.
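As one concrete example of such a dial, here's a sketch using the OpenAI Python SDK's Responses API. The model name and the exact set of supported effort levels are assumptions; they vary by model and SDK version.

```python
# Requesting more (or less) reasoning effort for a single prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",                 # assumed reasoning-capable model
    reasoning={"effort": "high"},  # the dial: e.g. "low", "medium", "high"
    input="Find the flaw in this proof that 1 = 2: ...",
)
print(response.output_text)
```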

It took me a while to understand what reasoning is actually useful for. The initial demonstrations showed it solving math logic puzzles or counting the number of 'R's in "strawberry", neither of which comes up in my daily use of these models.

It turns out that the real value of reasoning lies in driving tools. A reasoning model with access to tools can plan a multi-step task, execute it, and keep reasoning about the results to update its plan and better achieve its goal.

A notable result is that AI-assisted search has finally become genuinely useful. Previously, bolting search engines onto LLMs was underwhelming, but now I find that even for complex research questions, ChatGPT's GPT-5 Thinking can usually produce an answer.

Reasoning models also excel at code generation and debugging. Their reasoning skill means they can start from an error and progressively dig through different levels of a codebase to find the root cause. I've found that a sufficiently good reasoning model, given the ability to read and execute code in a large, complex codebase, can diagnose even the most stubborn bugs.

Combining reasoning with tool usage leads to...

The Year of Agents #

At the beginning of the year, I predicted that agents wouldn't amount to much. Throughout 2024, everyone was talking about agents, but there were few successful cases. What's more confusing is that everyone using the term "agent" seemed to have their own definition.

By September, tired of tiptoeing around a term with no agreed definition, I decided to define it myself: "an LLM that runs tools in a loop to achieve a goal." That let me have meaningful conversations about agents, which has always been my goal for such terms.
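Here's a minimal toy sketch of that definition. The `call_llm` stub stands in for a real chat API with tool calling; the message format and tool registry are invented for illustration, not any vendor's SDK.

```python
# An agent: an LLM that runs tools in a loop to achieve a goal.
import json

TOOLS = {"add": lambda a, b: a + b}  # toy tool registry

def call_llm(messages: list) -> dict:
    """Stub for the model: it requests one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The result is {messages[-1]['content']}"}

def agent(goal: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):                          # the loop
        reply = call_llm(messages)
        if "answer" in reply:                           # goal reached
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # run the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "gave up"

print(agent("What is 2 + 3?"))
```

Swap the stub for a real chat call with tool definitions, and this same dozen-line loop is the skeleton of every agent discussed below.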

I thought agents wouldn't work because I believed the prompt-injection vulnerability problem couldn't be solved, and because the idea of replacing human employees with LLMs still struck me as a ridiculous science-fiction concept.

My prediction was half right: the science-fiction version, an all-powerful computer assistant that can fulfill any request (like the one in "Her"), didn't appear...

But if you define an agent as "an LLM system capable of performing useful work through multi-step tool calls," then agents have arrived, and they've proven to be very useful.

Programming and search are the two most prominent application categories for agents.

The "Deep Research" mode - where an LLM collects information and spends over 15 minutes generating a detailed report for you - was very popular in the first half of this year. However, with GPT - 5 Thinking (and Google's "AI mode," which is much better than their terrible "AI overview") being able to generate similar results in a very short time, this mode is no longer in vogue. I consider this an agent mode, and it works very well.

The "Programming Agent" mode has had a much greater impact.

The Year of Programming Agents and Claude Code #

The most influential event of 2025 occurred in February with the low-key release of Claude Code.

I say "low - key" because it didn't even have a dedicated blog post! Anthropic only mentioned it in passing as the second item in the announcement of Claude 3.7 Sonnet.

(Why did Anthropic jump straight from Claude 3.5 Sonnet to 3.7? Because they shipped a major update to Claude 3.5 in October 2024 without changing the name, and the developer community started calling that unnamed 3.5 Sonnet v2 "Claude 3.6". Anthropic burned a version number by failing to give the new model a proper name!)

Claude Code is the most outstanding representative of what I call "programming agents": an LLM system that can write code, execute it, check the results, and then iterate further.

Major labs launched their own command-line (CLI) programming agents in 2025:

  • Claude Code Programming Agent

  • Codex CLI Programming Agent

  • Gemini CLI Programming Agent

  • Qwen Code Programming Agent

  • Mistral Vibe Programming Agent

Third-party options include GitHub Copilot CLI, Amp, OpenCode, OpenHands CLI, and Pi. IDEs such as Zed, VS Code, and Cursor have also invested heavily in programming agent integration.

I first encountered the programming agent mode in early 2023 with OpenAI's ChatGPT Code Interpreter - a system built into ChatGPT that allows it to run Python code in a Kubernetes sandbox.

I was very happy when Anthropic finally launched a comparable product in September this year, although it initially had a strange name, "Create and edit files with Claude."

In October, they reused the container sandbox infrastructure to launch the web version of Claude Code, and I've been using it almost every day since then.

The web version of Claude Code is what I call an "asynchronous programming agent": you give it a prompt and walk away. It solves the problem on its own and submits a pull request when it's done. OpenAI's "Codex cloud" (renamed "Codex web" last week) launched in early May 2025. Gemini's entry in this category, Jules, was also released in May.

I really like asynchronous programming agents. They neatly sidestep the security challenges of running arbitrary code on a personal laptop. And it's an amazing experience to kick off multiple tasks at once (usually from my phone) and get good results back a few minutes later.

I documented my experiences in detail in "Conducting Code Research Projects with Asynchronous Programming Agents like Claude Code and Codex" and "Embracing the Lifestyle of Parallel Programming Agents."

The Year of Command-line LLMs #

In 2024, I spent a lot of time developing my LLM command-line tool for accessing models in the terminal. It seemed strange to me at the time that so few people valued CLI access to models: they're a perfect match for Unix mechanisms like pipes.

Maybe the terminal is too quirky and niche to ever become the mainstream tool for accessing LLMs?

Claude Code and its counterparts have eloquently proven that developers are very willing to use LLMs in the command line as long as the models are powerful enough and have the right framework support.

Moreover, when an LLM can directly generate the correct commands for you, obscure terminal tools like *sed*, *ffmpeg*, and even *bash* itself are no longer barriers to entry.
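As a sketch of that pipe-friendly pattern, here's the Python library behind the LLM CLI tool mentioned above (pip install llm); the model alias and API-key setup are assumptions about your environment.

```python
# explain.py: pipe anything into an LLM for an explanation.
import sys
import llm

model = llm.get_model("gpt-4o-mini")  # any model alias you have configured
response = model.prompt("Explain this terminal output:\n" + sys.stdin.read())
print(response.text())
```

Saved as explain.py, this composes with pipes in exactly the Unix spirit described above, such as piping a failing command's output straight into it.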

As of December 2nd, Anthropic said Claude Code's annualized revenue had reached $1 billion! I never imagined a CLI tool could reach numbers that large.

In hindsight, maybe I should have promoted my LLM CLI tool from a side project to a core focus!

The Year of YOLO Mode and "Normalization of Deviance" #

The default setting of most programming agents is to seek user confirmation for almost every operation. In a world where an agent's mistake could wipe out your personal home folder or a malicious prompt injection attack could steal your credentials, this default setting is completely reasonable.

Anyone who has tried running an agent in auto-confirmation mode (i.e., YOLO mode; Codex CLI even aliases the *--dangerously-bypass-approvals-and-sandbox* flag to *--yolo*) has experienced this trade-off: an agent without safety guards feels like a completely different product.

One of the major advantages of asynchronous programming agents like the web version of Claude Code and Codex Cloud is that they can run in YOLO mode by default because they don't have a personal computer to damage.

Despite being fully aware of the risks, I often run in YOLO mode. So far, nothing has gone wrong...

... And that's exactly the problem.

One of my favorite articles about LLM security this year is "Normalization of Deviance in AI" written by security researcher Johann Rehberger.

Johann describes the "Normalization of Deviance" phenomenon, where individuals and organizations start to accept risky behavior as normal due to repeated exposure to it without negative consequences.

This concept was first proposed by sociologist Diane Vaughan in her analysis of the 1986 Challenger Space Shuttle disaster. The accident was caused by an O-ring failure; engineers had known about the problem for years, but repeated successful launches led NASA's culture to stop taking the risk seriously.

Johann believes that the longer we get away with running fundamentally unsafe systems, the closer we are to our own "Challenger disaster."

The Year of the $200 Monthly Subscription Fee #

ChatGPT Plus's original $20 monthly price was actually a hasty decision by Nick Turley, based on a Google Forms poll on Discord. That price point has held steady ever since.

This year, a new pricing precedent emerged: the Claude Max 20x plan, priced at $200 per month.

OpenAI also has a similar $200 plan, ChatGPT Pro. Google's equivalent, AI Ultra, costs $249 per month, discounted to $124.99 per month for the first three months.

These plans seem to be bringing in substantial revenue, although no lab has yet released specific data on the number of subscribers at each level.

I personally used to pay $100 per month to subscribe to Claude. After my current free quota (from previewing one of their models - thanks, Anthropic) runs out, I'll upgrade to the $200 plan. I've also heard many people say they're willing to pay this price.

Generally, you'd have to use the models extremely heavily to burn through $200 of API credit, so you might assume that paying by token would be cheaper for most people. It turns out, though, that once you point Claude Code or Codex CLI at genuinely hard tasks, they consume tokens at an astonishing rate, and the $200 monthly subscription starts to look like a bargain.
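A back-of-envelope calculation shows why. The per-million-token prices below are illustrative assumptions (roughly the rates of a top-tier model), not quoted list prices.

```python
# How quickly can an agent burn through $200 of API credit?
INPUT_PRICE = 15.0   # $ per million input tokens (assumed)
OUTPUT_PRICE = 75.0  # $ per million output tokens (assumed)

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * INPUT_PRICE + output_tokens / 1e6 * OUTPUT_PRICE

# One hard agentic task can re-read large chunks of a codebase many times:
per_task = session_cost(input_tokens=2_000_000, output_tokens=100_000)
print(f"one heavy task: ${per_task:.2f}")       # $37.50
print(f"tasks per $200: {200 / per_task:.1f}")  # ~5.3
```

At those assumed rates, a handful of serious agentic sessions a month already exceeds the subscription price.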

The Year of Chinese Open-weight Models Reaching the Top #

In 2024, Chinese AI labs showed early signs of vitality, mainly thanks to Qwen 2.5 and the early DeepSeek models. They were decent, but not yet globally dominant.

This situation changed dramatically in 2025. There were 67 posts under my blog's "Chinese AI" tag in 2025 alone, and I still missed a series of high-profile year-end releases (notably GLM-4.7 and MiniMax-M2.1).

Here is the Artificial Analysis ranking of open-weight models as of December 30, 2025:

GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, and MiniMax-M2.1 are all Chinese open-weight models. The highest-ranked non-Chinese model in the chart is OpenAI's *gpt-oss-120B (high)*, which only places sixth.

The Chinese model revolution truly began with the release of DeepSeek V3 on Christmas Day 2024, reportedly trained for only about $5.5 million. DeepSeek followed with DeepSeek R1 on January 20th, which promptly triggered a major crash in AI and semiconductor stocks: panicked by the prospect that AI might no longer be an American monopoly, investors wiped roughly $593 billion off NVIDIA's market value.

The panic didn't last long; NVIDIA quickly rebounded, and its stock now trades far above its level before the DeepSeek R1 release. But it was still an epoch-making moment. Who would have thought the release of an open-weight model could have such a huge impact?

After DeepSeek, a group of powerful Chinese AI labs quickly followed suit. I've been paying special attention to these:

  • DeepSeek Lab

  • Alibaba Qwen (Qwen3)

  • Moonshot AI (Kimi K2)

  • Z.ai (GLM)