
Top Anthropic researcher: AI enters exponential growth, here are three things you need to understand

AI Deep Researcher | 2025-10-28 09:38
Top AI researcher Julian Schrittwieser emphasized that the core indicator of AI progress is no longer how many questions a model answers correctly, but how long a task it can carry through on its own. That ability is growing exponentially, doubling every three to four months.

On October 25, 2025, a top AI researcher who rarely makes public appearances gave a judgment in a podcast.

There is no sign of AI slowing down. Every three or four months, the model can handle tasks twice as long as before.

The speaker was Julian Schrittwieser, a core researcher at Anthropic who previously helped develop AlphaGo Zero and MuZero at Google DeepMind.

This is not a pop-science interview. He works at one of the most cutting-edge laboratories and is watching a reality unfold that most people have not yet noticed:

What the public sees: Answering a few questions correctly

What he sees: The model starts to complete a whole day's work

Why do people fail to notice?

The answer Julian gave is: Human intuition cannot understand exponential changes.

(Image source: Julian Schrittwieser's blog post "Failing to Understand the Exponential Again", link: https://www.julian.ac/blog/2025/09/27/failing-to-understand-the-exponential-again/)

Just as people underestimated the spread speed of the pandemic in the early stage, AI is following the same path. When the model's capabilities double every three or four months, the key is not how powerful it is, but whether you can understand the three things that are happening.

Section 1 | The Key Is How Long the Model Can Run

Julian Schrittwieser's career traces the main thread of artificial intelligence over the past decade.

He was part of the team behind AlphaGo's victory over Lee Sedol and is the first author of MuZero. Now he is in charge of reasoning research for the Claude models at Anthropic.

"The key to AI is not how many questions it can answer, but how long it can continuously complete tasks."

In his eyes, AI progress is not a series of "feature upgrades" but an extension of task duration: from seconds and minutes to continuous tasks that now run for hours or even days.

Julian explained in the interview that this indicator is called task length, which is the core standard they use internally at Anthropic to measure the "productivity level" of the model. They found that every three or four months, the task length doubles. Unlike humans, the model is not affected by fatigue, can think and execute continuously, and the error rate actually decreases in long-term tasks.
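To make the pace concrete, here is a back-of-the-envelope projection in Python. The starting task length (one hour) and the exact doubling period (about 3.5 months) are illustrative assumptions, not figures quoted in the interview:

```python
from datetime import date, timedelta

# Project task length under "doubles every three or four months".
# The starting value and the doubling period are illustrative assumptions.
doubling_period_days = 105            # ~3.5 months
task_length_hours = 1.0               # assumed starting point
start = date(2025, 10, 1)

for months_ahead in (0, 6, 12, 18, 24):
    days = months_ahead * 30
    length = task_length_hours * 2 ** (days / doubling_period_days)
    print(f"{start + timedelta(days=days)}: ~{length:.1f} h of autonomous work")
```

Under those assumptions, one hour of autonomous work grows to roughly a full working day within about a year, which is the dynamic Julian is pointing at.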

He gave an example: Previously, when a model wrote a program, it needed continuous human prompts; now, Claude can independently write a complete application module, including planning the structure, calling interfaces, testing, and fixing.

This is not about being smarter, but being able to work longer and more stably.

Just like a marathon runner, the key is not the sprint speed, but the endurance to complete the whole journey.

Julian believes that the improvement of this "task endurance" is more worthy of attention than the number of parameters. Because this means that the model is starting to change from a tool to an executor. When the model can work independently for a whole day, it can be assigned tasks, have its progress tracked, and its results verified like a colleague.

Anthropic's internal continuous-task evaluations show that Claude can work for 6 to 8 hours straight without human intervention, completing an entire workflow from writing code to summarizing documents.

Julian emphasized:

"We are not waiting for the 'super - intelligence' to come. We are just watching the task length change from one minute to a whole day."

While the outside world is still discussing whether AI will replace humans, the laboratories are already asking: How long can it work today?

Section 2 | The Underlying Ability of Claude Is Not About Remembering More

"Not every model can independently complete tasks, and even fewer can work continuously for a whole day."

Julian explained that the essence of Claude's ability is not simply being a larger language model, but an added capacity to "preview the future".

"The key behind Claude is not the number of parameters, but that it has a 'world model' inside, which can simulate what might happen in the next few steps."

This "world model" is not about memorizing data or predicting words. It is more like a person imagining in their mind: If I say this, how might the other person react? What should I do next?

Julian said that this kind of model is no longer "answering", but "thinking".

This ability is actually a technical route he started exploring during the MuZero period.

MuZero is a reinforcement learning model proposed by DeepMind in 2020. Its biggest breakthrough is that it doesn't need to know the complete rules or environment. It can learn to predict the next few steps in its mind based on experience and continuously correct itself.

When summarizing this method, Julian said:

Humans don't remember the whole world in advance. Instead, they decide their actions by imagining the results of the next step. AI should do the same.
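The flavor of that idea fits in a few lines of Python. The "world model" below is a hand-written simulator for a toy number game (reach 11 from 1 using "+3" and "*2"); it is not MuZero or Claude, only an illustration of "imagine a few steps ahead, then commit to the first move of the best imagined future":

```python
from itertools import product

# Toy lookahead planning with a hand-written (not learned) world model.
ACTIONS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2}
TARGET = 11

def imagine(state, plan):
    """Roll the world model forward and report how close this future ever gets to the target."""
    best = abs(TARGET - state)
    for name in plan:
        state = ACTIONS[name](state)
        best = min(best, abs(TARGET - state))
    return best

def choose_action(state, depth=3):
    """Score every imagined 3-step future, then commit only to its first move."""
    futures = product(ACTIONS, repeat=depth)
    return min(futures, key=lambda plan: imagine(state, plan))[0]

state = 1
while state != TARGET:
    action = choose_action(state)
    state = ACTIONS[action](state)
    print(f"imagined ahead, chose {action}, state is now {state}")
```

MuZero's actual world model is learned from experience rather than hand-written, which is precisely the breakthrough described above.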

This is where Claude is different: It is no longer just a tool for generating sentences, but an actor that can simulate causality, conduct trials, and correct paths.

Achieving this "preview" depends not on pre-training alone, but on reinforcement learning applied afterward. Reinforcement learning is like letting the model practice repeatedly until it learns to make its own judgments and follow the correct process.

Pre-training enables the model to master knowledge, and reinforcement learning enables it to learn to execute tasks.

In other words, one is "knowing the answer", and the other is "finding the path to the answer". Without reinforcement learning, even if the model knows the answer, it cannot find the path to it on its own.
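The difference between the two signals can be caricatured in a toy numpy sketch. This is not how Claude is trained; it only contrasts "imitate the observed answer" with "reinforce whatever actually worked", using a made-up three-word vocabulary and a made-up task in which only "test" succeeds:

```python
import numpy as np

# Toy contrast between the two training signals described above (purely illustrative).
rng = np.random.default_rng(0)
vocab = ["plan", "code", "test"]
logits = rng.normal(size=3)           # stand-in for the whole "model"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 1) Pre-training-style signal: imitate the observed next token (a cross-entropy step).
observed = vocab.index("code")
probs = softmax(logits)
grad = probs.copy()
grad[observed] -= 1.0                 # gradient of cross-entropy w.r.t. logits
logits -= 0.5 * grad                  # move probability toward what the data says

# 2) RL-style signal: act, run the task, and reinforce what actually worked.
def run_task(action):                 # hypothetical environment: only "test" succeeds here
    return 1.0 if vocab[action] == "test" else 0.0

probs = softmax(logits)
action = rng.choice(len(vocab), p=probs)
reward = run_task(action)
logits += 0.5 * reward * (np.eye(len(vocab))[action] - probs)  # REINFORCE-style update

print(dict(zip(vocab, softmax(logits).round(3))))
```

The first update teaches the model what the data says; the second only rewards it when the path it chose actually worked.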

He mentioned an experiment with Claude: Give the model a complex task, such as writing a piece of API code with tests, and require it to:

  • Plan the writing method by itself;
  • Decide when to use which function;
  • Debug itself when there is an error;
  • Finally, output a piece of runnable code.

Claude achieved this, and several times during the process, it realized the problem by itself and rewrote the code.

This ability comes from the combination of the world model and reinforcement learning: The model no longer just answers questions, but can internally deduce paths, break down tasks, predict results, and correct errors.
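A minimal sketch of that write-test-fix cycle might look like the following. The ask_model function is a hypothetical stand-in for a call to a code-writing model, not a specific Anthropic API; the loop around it (run the tests, feed failures back, retry) is the part described above:

```python
import pathlib, subprocess, tempfile

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Write candidate code plus its tests to a temp file and execute them."""
    path = pathlib.Path(tempfile.mkdtemp()) / "candidate.py"
    path.write_text(code + "\n\n" + tests)
    result = subprocess.run(["python", str(path)], capture_output=True, text=True, timeout=60)
    return result.returncode == 0, result.stderr

def solve(task: str, tests: str, ask_model, max_attempts: int = 5) -> str | None:
    """ask_model is a caller-supplied, hypothetical model call: (task, feedback) -> code."""
    feedback = ""
    for _ in range(max_attempts):
        code = ask_model(task, feedback)      # plan and write (or rewrite using the errors)
        ok, errors = run_tests(code, tests)   # execute and test the candidate
        if ok:
            return code                       # deliverable: code that actually runs
        feedback = errors                     # feed the failure back for the next attempt
    return None
```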

It has evolved from a language model to an action model.

Section 3 | From Answering to Taking on Tasks: Claude Can Do Things

What is the difference between Claude and previous language models?

Julian's answer is very simple:

Claude is no longer a chatbot. It is an executor that you can assign tasks to.

He said that within Anthropic, they no longer use Claude as a "question-answering machine". Instead, they let it handle real-world tasks, such as:

  • Write a piece of runnable API code
  • Read a PDF file of several thousand words, summarize it, and list the key points
  • Execute an entire document-processing workflow, including rewriting, formatting, and generating summaries

More importantly, these tasks are completed autonomously by Claude in stages without human intervention.

Julian pointed out that the "prompt engineering" popular in the industry in the past few years is essentially humans setting a path for the model to follow. But today, the core ability of Claude is "taking on tasks": You don't need to command it step by step. Instead, you give it a goal, and it will break it down, execute, review, and complete it by itself.

This is exactly the key characteristic of an emerging intelligent agent.

It doesn't rely on memory to solve problems, but on continuous thinking and action to complete tasks.

He gave the examples of Claude Code and the Claude Agent SDK, two key modules recently refactored internally at Anthropic with the goal of enabling the model to handle long-running, multi-step tasks like a digital employee.

Claude Code can:

  • Infer how to build a feature even when you haven't written a complete requirements document
  • Add debugging statements to the code on its own to locate bugs
  • Generate test cases for you after writing the code
  • Automatically rewrite the logic based on the test results

And Claude Agent SDK goes a step further. It can execute more complex multi-step tasks, such as:

  • Open a tool → Search for information → Write into a document → Check the output → Clean up the intermediate results
  • If the process fails midway, it will automatically record the reason for the failure and try to retry

Julian described it like this: Now what you give Claude is not a one-sentence question, but a task list.
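As a concrete (and deliberately simplified) picture of a "task list", here is a sketch of a runner in that spirit. It is not the actual Claude Agent SDK: each step is a plain Python function, failures are logged with their reason, and a failed step is retried before the run gives up:

```python
import logging, time

logging.basicConfig(level=logging.INFO)

def run_task_list(steps, context, retries=2):
    """Run named steps in order; log and retry failures, then stop if a step keeps failing."""
    for name, step in steps:
        for attempt in range(1, retries + 2):
            try:
                context = step(context)       # e.g. search, write, check, clean up
                logging.info("step %s succeeded", name)
                break
            except Exception as exc:
                logging.warning("step %s failed (attempt %d): %s", name, attempt, exc)
                time.sleep(1)                 # back off briefly before retrying
        else:
            raise RuntimeError(f"step {name!r} kept failing; see the log for reasons")
    return context
```

In practice the steps might be pairs like ("search", search_web) or ("draft", write_doc), where the step functions stand for whatever tools the agent is wired to; the names here are purely illustrative.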

This is the most fundamental difference between Claude and traditional models: traditional models only answer questions, rely on prompt instructions, and complete single-round interactions, while Claude can autonomously break down tasks, run multiple rounds, and correct itself.

It has changed from a tool to a collaborator that can deliver results.

Section 4 | It's Easy to Do It Right Once, but Hard to Do It Right Ten Times

If Claude can already do work, then the next question is: Can it always complete the tasks smoothly?

Julian's answer is: Not necessarily.

He said that this is the most realistic challenge in developing intelligent agents today:

We're not worried that the model isn't smart enough. We're worried about whether it can stably complete the tasks without errors or deviations.

AI is not lacking in ability, but is too easily interrupted by small problems.

For example:

  • In a document-processing workflow, the model handles the first half well, but the formatting suddenly goes wrong in the second half;
  • In a code-rewriting task, the model understands the goal correctly at first, but later loses track of it;
  • When one step fails, the model cannot judge where the error is and keeps compounding the mistake.

The core problem is that although the model has absorbed a great deal of knowledge through pre-training, pre-training never tells it "when to stop" or "whether this step is right".

That is to say, the model doesn't really know what it is doing.

Anthropic's approach here is to introduce reinforcement learning and behavioral rewards, giving the model feedback and a sense of direction at every step of execution.

But this is much more difficult than expected.

Reinforcement learning has a "feedback loop": The model you train will be used to generate new training data. If there is a deviation in a certain link, the whole chain will go astray.

This is completely different from pre-training. Pre-training is like filling in blanks with a definite goal; reinforcement learning is more like walking through a maze, constantly correcting your direction. One wrong step can push the model off track.

So Anthropic has started to try several solutions.

The first is process-based reward.

Instead of just looking at whether the final result is correct, a reference point is set for each step of the model.

Instead of only rewarding the model for arriving at a good answer at the end, feedback is given at each inference step and every intermediate result. It's like a teacher who looks not only at whether you get the right answer, but also at your problem-solving process.
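A toy sketch makes the distinction concrete. The "trace" below is a made-up list of intermediate steps, each tagged with whether it checks out; in a real system the per-step judgment would come from a learned or rule-based grader, not a hand-written flag:

```python
# Outcome-only reward vs. a per-step (process) reward, on a fabricated trace.
trace = [
    {"step": "restate the problem", "ok": True},
    {"step": "choose an approach",  "ok": True},
    {"step": "apply the formula",   "ok": False},   # the mistake happens here
    {"step": "final answer",        "ok": False},
]

outcome_reward = 1.0 if trace[-1]["ok"] else 0.0            # only sees the end result
process_rewards = [1.0 if s["ok"] else 0.0 for s in trace]  # feedback at every step

print("outcome-only:", outcome_reward)    # 0.0 -- but where did it go wrong?
print("per-step:    ", process_rewards)   # [1.0, 1.0, 0.0, 0.0] -- points at step 3
```

The outcome-only signal says the run failed; the per-step signal also says where it went wrong, which is what makes the feedback usable.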

The second method is self - verification.

In some mathematical and code tasks, Anthropic asks the model to verify its own answers after generating them. For example, when writing a proof, the model must be able to check for logical loopholes by itself to get a score.

This greatly reduces cases where the model looks right on the surface but is actually wrong.

The third is to add an error-correction mechanism to the model's "behavior chain".

"The real sign of a powerful model is not that it never makes mistakes, but that it knows it has made a mistake and corrects it actively."

Anthropic enables Claude to actively pause, record the reason for failure, and retry the process if an abnormal result occurs during the task. It's a bit like leaving backups while working so that you can roll back if there is an error.
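A minimal sketch of that "leave backups" idea, under the assumption that the agent's working state is an ordinary Python object, might look like this; the rollback-and-retry loop is the whole point, everything else is scaffolding:

```python
import copy

def run_with_rollback(steps, state, max_retries=1):
    """Snapshot state before each step so a failed step can be rolled back, logged, and retried."""
    failure_log = []
    for name, step in steps:
        checkpoint = copy.deepcopy(state)            # backup before acting
        for _ in range(max_retries + 1):
            try:
                state = step(state)
                break                                # this step is done, move on
            except Exception as exc:
                failure_log.append((name, repr(exc)))  # record why it failed
                state = checkpoint                     # roll back to the backup
        else:
            break                                      # give up on this run, keep the log
    return state, failure_log
```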

Julian admitted that these attempts are still in the early stage: We are still exploring how to make these methods more stable and scalable. This is the hurdle that intelligent agents need to cross. The key lies not in ability, but in reliability and execution stability.

Today's challenge is not that the model isn't smart enough, but that it is too easily knocked off track by small mistakes.

Section 5 | The Pace Is Accelerating, and the Window Period Has Begun

In this in-depth conversation, Julian repeatedly emphasized three key facts:

Tasks are getting longer: every three or four months, the length of task the model can complete independently doubles.

The model is doing real work: AI has evolved from answering questions to executing tasks.

The pace is accelerating: the change is not ten years away; the time to restructure how we work is now.

So, how should we judge this trend? His answer is:

Don't judge the development stage of AI based on emotions, hype, or feelings. Look at the tasks, the data, and what it has actually done.

In his opinion, much of the current discussion about AI in the market stays on vague topics like "is it a bubble" or "is it a breakthrough". But the cutting-edge laboratories are looking at:

  • Can the model complete real-world tasks?
  • Is its rate of completion improving?
  • Do people keep using it after assigning tasks to it?

These are the evaluation dimensions that Anthropic, OpenAI, and Google are currently truly concerned about internally.

For example, GDPval, launched by OpenAI, has real-world industry experts design tasks for the model to complete and then compares the results with human work. It's not about the model's "test scores" but its ability to actually get work done.

Julian specifically pointed out two indicators that are currently the most valuable for reference:

One is the task length

How long can AI work continuously? Is it 10 minutes or a whole day?

The longer the tasks the model can complete, the broader the scope of work you can entrust to it, and the more labor it saves.

The other is user retention and reuse

It's not about whether the model can be used, but whether people are willing to use it continuously and start to form a dependence on it.

If users stop using a newly released model after a few days, the model only seemed powerful. An AI that genuinely keeps generating productivity will see its usage and retention climb.
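As a toy illustration of that retention signal, here is a sketch over a made-up usage log; the users and dates are invented, and a real product would use proper cohort analysis rather than a simple two-week split:

```python
from datetime import date

# Hypothetical usage log of (user, day of use).
log = [
    ("ana", date(2025, 10, 1)), ("bo", date(2025, 10, 2)), ("cy", date(2025, 10, 3)),
    ("ana", date(2025, 10, 9)), ("cy", date(2025, 10, 10)),
]
cutoff = date(2025, 10, 8)
week1 = {user for user, day in log if day < cutoff}    # users seen in week 1
week2 = {user for user, day in log if day >= cutoff}   # users seen again in week 2
print(f"week-over-week retention: {len(week1 & week2) / len(week1):.0%}")  # 2 of 3 -> 67%
```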

When the task length gets longer and the user usage frequency gets higher, it means that AI is no longer just a "function", but is starting to become a "labor force".

So, what should you do?

Don't just make judgments. Do experiments.

Assign a task that usually takes you 4 hours to AI and see how much it can do and how well it can do it. Do this several times, and you'll naturally know where AI stands now.

He said that he conducts such experiments every day, and each time the performance is improving: more tasks are completed, and there are fewer failures.

That's why he believes that 2025 is not about the arrival of super-intelligence, but about our ability to restructure tasks: we can hand over processes that once had to be completed step by step by humans to AI.