Claude Opus 4.8 launched in the early hours of this morning, along with a capital raise of 65 billion US dollars, and something even stronger is on the way.
Yesterday evening, Anthropic released its latest model, Claude Opus 4.8.
As usual in the industry, there has been a wave of benchmark screenshots. Opus 4.8 leads in all aspects and dominates in coding capabilities.
To be honest, I'm no longer excited about benchmarks. Almost every model is promoted with benchmarks when it's released.
What I think is really worth focusing on are two 0% figures.
One is the "false alarm rate": how often the model runs into problems while processing data but acts as if everything is fine and tells you the job is done.
For Opus 4.5, this value was 0.40. For Opus 4.7, it dropped to 0.25, and for 4.8, it has dropped directly to zero.
The other is the "laziness rate": how often the model, faced with a problem that needs in-depth investigation, hands you a vaguely plausible result and muddles through.
On this metric, Opus 4.7 still cut corners 25% of the time, while Opus 4.8 is also at 0%.
Two 0% figures, both of them firsts.
If you've ever written code with AI, run a data analysis, or done some research with it, you know that the thing to fear most is an AI that lacks the ability but looks very competent.
Opus 4.8 solves exactly this problem. It no longer pretends it can do everything, and when it isn't sure about something, it says so: "I'm not sure here."
In my opinion, this improvement is much more important than a 10% increase in benchmarks.
So, what is the essential difference between Opus 4.8 and its predecessors 4.6 and 4.7?
Based on my initial experience, here are the most important ones.
First: Honesty and reliability.
I'm currently using Opus 4.6, which I believe, without a doubt, is the best model for content creation.
In version 4.6, Claude was already more honest than the competition, but there were still many problems with "arrogance".
The later version 4.7 improved this, and in 4.8, the problem is finally solved.
Put simply: 4.6 is like a capable but rather vain employee, 4.7 is like a rigorous and very competent employee, and 4.8 is like a genuinely reliable senior engineer.
Second: Efficiency.
For the same task, 4.8 requires 15% fewer steps and outputs 35% fewer tokens than 4.7.
For developers, tokens mean costs. Doing better while consuming less is the real progress from one generation to the next.
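To make that arithmetic concrete, here is a rough sketch of how a 35% drop in output tokens carries over to the bill. The price per million tokens and the baseline token count are made-up numbers for illustration only, not Anthropic's actual rates or measurements; only the 35% figure comes from the text above.

```python
# Hypothetical price, chosen only for illustration; NOT an actual Anthropic rate.
PRICE_PER_MILLION_OUTPUT_TOKENS_USD = 10.0


def output_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_OUTPUT_TOKENS_USD) -> float:
    """Cost in USD of emitting `tokens` output tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


def tokens_after_reduction(baseline_tokens: int, reduction: float) -> int:
    """Token count after cutting the baseline by `reduction` (0.35 means 35% fewer)."""
    return round(baseline_tokens * (1 - reduction))


baseline_tokens = 200_000  # assumed output of 4.7 on some task (illustrative)
reduced_tokens = tokens_after_reduction(baseline_tokens, 0.35)

baseline_cost = output_cost(baseline_tokens)
reduced_cost = output_cost(reduced_tokens)

print(f"4.7: {baseline_tokens} tokens, ${baseline_cost:.2f}")
print(f"4.8: {reduced_tokens} tokens, ${reduced_cost:.2f}")
```

At a fixed price per token, the saving on output is simply proportional to the token reduction, which is why the 35% figure matters more to developers than it might seem at first glance.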
Third: The leap in coding capabilities.
The increase in coding capabilities from 4.6 to 4.7 was incremental, while the one from 4.7 to 4.8 was a leap.
In some extreme tests, for example when the model is given a compiled binary, forbidden from decompiling it, and asked to recreate the source code from scratch, 4.8 with a budget of 1M tokens matches what 4.7 can only achieve with 5M tokens.
Fourth: The essential difference in agent capabilities.
The agent capabilities of 4.6 were still at the stage of "it works, but it's not stable enough". 4.7 improved stability, but in complex situations it still tended to drift off course.
With 4.8, there is a significant change: it really develops judgment.
The company published an example: a developer was using Claude Code to migrate code and stepped away partway through. Claude continued to work in the background.
Halfway through the process, its attempt to push the code was rejected because a colleague had checked in an urgent fix in the meantime. Claude informed the developer, and the developer just said: "Can you just overwrite it?"
But Claude refused.
It recognized that a forced overwrite would delete the colleague's urgent fix. So it combined the changes from both sides, ensured that the code was consistent and the commit history remained clean, and then pushed.
This goes beyond "executing commands": it knows when it has to refuse. That is a crucial step in the evolution of the agent from a tool to an employee.
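The judgment call in this story can be boiled down to one check: does the remote branch contain commits that the local branch lacks? The sketch below is my own toy model of that check, not how Claude Code actually works. The names `PushPlan` and `plan_push` are hypothetical, and commits are represented as plain ID strings rather than real git objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PushPlan:
    action: str  # "push" or "merge_then_push"
    reason: str


def plan_push(local_commits: list[str], remote_commits: list[str]) -> PushPlan:
    """Decide how to publish local work without destroying remote work.

    Both lists hold commit IDs of the same branch. A force push replaces
    the remote history with the local one, so any remote commit missing
    locally would be lost.
    """
    local_ids = set(local_commits)
    missing_locally = [c for c in remote_commits if c not in local_ids]
    if not missing_locally:
        return PushPlan("push", "remote has nothing we lack; a normal push is safe")
    return PushPlan(
        "merge_then_push",
        "a force push would discard remote commit(s): " + ", ".join(missing_locally),
    )


# The colleague's urgent fix landed on the remote while the migration ran locally.
local = ["a1", "b2", "migration-1", "migration-2"]
remote = ["a1", "b2", "hotfix-9"]

plan = plan_push(local, remote)
print(plan.action)
print(plan.reason)
```

In this toy model the answer to "Can you just overwrite it?" is always no as long as the remote holds work that exists nowhere else; the changes have to be combined first.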
Among the new features of this release, there is one that I think is very powerful, called "Dynamic Workflows".
Put simply: When Claude gets a large task, it writes its own script and distributes it to many parallel sub-agents.
After they're done, they check each other, find each other's mistakes, and finally pass the result on to you.
This feature is currently still in preview, and its token consumption is much higher than in normal conversations. It's not suitable for daily use.
However, I think this ability will become Claude Code's trump card in the future.
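The shape of such a workflow (split the task, run the pieces in parallel, cross-check, then merge) can be sketched in a few lines. This is only an illustration of the pattern under my own assumptions: the "sub-agents" here are plain Python functions doing arithmetic, not model calls, and `worker`, `reviewer`, and `run_workflow` are hypothetical names, not part of any Anthropic API.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items: list[int], n_chunks: int) -> list[list[int]]:
    """Deal the items round-robin into at most n_chunks non-empty chunks."""
    chunks = [items[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def worker(chunk: list[int]) -> int:
    """Stand-in for a sub-agent: computes the sum of squares of its chunk."""
    return sum(x * x for x in chunk)


def reviewer(chunk: list[int], claimed: int) -> bool:
    """Stand-in for a peer check: recomputes the result a different way."""
    total = 0
    for x in chunk:
        total += x ** 2
    return total == claimed


def run_workflow(items: list[int], n_agents: int = 4) -> int:
    chunks = split(items, n_agents)
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(worker, chunks))             # parallel work
        verdicts = list(pool.map(reviewer, chunks, results))  # parallel review
    if not all(verdicts):
        raise ValueError("a sub-agent result failed its peer review")
    return sum(results)  # merge the verified partial results


print(run_workflow(list(range(1, 11))))  # 385
```

Note that the review pass roughly doubles the work in this sketch, which fits the observation that this mode consumes far more tokens than a normal conversation.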
I'll tell you an interesting story.
Shortly after the release of 4.8, someone asked via the API: "Who are you?" Sometimes it answered that it was Qwen, sometimes that it was DeepSeek.
The speculation in the technical community is that distillation is the cause.
That is, outputs of other models may have been used for knowledge distillation during the training of Opus 4.8.
This in itself has no impact on the model's capabilities, but it's still interesting.
The knowledge exchange between AI models is more complicated than we think. What you're using may not be a pure model, but a mixture of different intelligences.
Finally, a summary.
First: Opus 4.8 is the first AI model that can genuinely be called honest.
Among all the leading models, it's the first to have zero defects in the reliability indicators. For enterprise customers, this is ten times more important than a 5% performance increase.
Second: Efficiency.
It's more powerful and consumes fewer tokens, which directly affects the cost structure. It's still expensive, but it delivers better performance than its predecessors at the same price.
Third: The evolution of the agent form.
From one-off answers to the execution of long-running tasks and finally to the parallel cooperation of multiple agents, Claude as a product is no longer limited to a chat window, but is evolving into a work system.
In addition, Anthropic announced a $6.5 billion funding round on the same day, at a valuation of $96.5 billion, just short of $100 billion.
In the next few weeks, Anthropic will release Claude Mythos. No one knows what kind of super-evolution this giant will trigger.
According to current information, Mythos is a higher-tier model than Opus. Some people suspect that Opus 4.8 itself is a distilled version of Mythos.
If that's true, then the day of the official release of Mythos will be the beginning of a real turning point.
I'm really looking forward to it.
This article is from the WeChat account “Tang Ren” (ID: RyanTang007), author: Tang Ren, published by 36Kr with permission.