
Claude Opus 4.7 is here. It's the SOTA among publicly available models, but using it feels a lot like using GPT.

QbitAI 2026-04-17 16:08
Anthropic proves to be a player who can "catch things steadily"

After a feint toward the outside world with Mythos, Anthropic unexpectedly launched Claude Opus 4.7.

Many friends stayed up late and had a great time!

I bolted upright in bed and started trying it out. After a night of browsing, I've summarized one piece of bad news and a lot of good news about Opus 4.7.

Let's start with the bad news - Opus 4.7 somehow seems to have the shadow of an old acquaintance.

It always wants to "catch me steadily" with the same safe, buttoned-up answers.

Many netizens reported the same thing: it's billed as an upgrade, but the more you use Opus 4.7, the more it feels like GPT???

If true, that's not a good thing (helplessly closing eyes.jpg).

Against that single piece of bad news, there is plenty of good news.

It beats its predecessor in many areas, such as Agentic coding, Agentic terminal coding, Scaled tool use, and Visual reasoning. However, a few individual abilities, such as Agentic search, have declined.

Anthropic also said rather proudly:

Opus 4.7 is our most powerful publicly available model at present. However, this is not our most powerful model~~

It seems that the most powerful one is still the unannounced Mythos.

Looking at the table above, Mythos' overall performance in those tests is about 10% to 15% better.

There is no doubt that Mythos Preview is the strongest card in Anthropic's hand right now. Its capabilities are at full strength, but its price is also five times that of Opus 4.7.

In contrast, Opus 4.7 is more like the strongest mass-produced version: a fully verified safety system, an affordable price, and full-platform availability.

However... as the saying goes, even the wisest man makes mistakes.

The powerful Opus 4.7 still stumbled yesterday.

Claude Opus 4.7's surprise attack: Four core upgrade directions

Overall, this publicly available, most-powerful-yet Opus 4.7 stands out in four directions.

Advanced software engineering field: Worthy of trust

Opus 4.7's most significant progress is in advanced software engineering.

Let's look at this set of data:

The SWE-bench Verified test reached 78.2%;

SWE-bench Multimodal reached 72.7%;

Terminal-Bench 2.0 hit 68.8%;

the number of production tasks solved in Rakuten-SWE-Bench is three times that of Opus 4.6;

the GitHub 93-task coding benchmark also improved by 13%.

Michael Truell, the CEO of Cursor, gave a key evaluation:

On CursorBench, Opus 4.7 jumped from 58% to 70%. This leap is of great significance.

This improvement is reflected in three key features.

First, it follows instructions strictly.

Opus 4.7 no longer "flexibly interprets" the user's vague statements like early models, but executes them literally.

This means that in the past, a prompt like "Try to optimize this code if possible" might be selectively ignored.

Now, when you say "Optimize this code", it will definitely be executed.

This change requires users to readjust their prompt strategies: soft modifiers such as "if possible", "ideally", and "try to" now carry real weight (the model may treat them as genuinely optional), so hard requirements need to be stated explicitly.
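For teams recalibrating old prompts before migrating, one way to start is to audit them for soft modifiers. The helper below is a minimal sketch of my own; the function name and modifier list are illustrative assumptions, not anything Anthropic ships:

```python
# Audit old prompts for "soft" modifiers that a literal-minded model
# may now treat as genuinely optional. Illustrative sketch only.
SOFT_MODIFIERS = ["if possible", "ideally", "try to", "when convenient"]

def find_soft_modifiers(prompt: str) -> list[str]:
    """Return the soft modifiers present in a prompt, case-insensitively."""
    lowered = prompt.lower()
    return [m for m in SOFT_MODIFIERS if m in lowered]

old_prompt = "Try to optimize this code if possible, and ideally add tests."
print(find_soft_modifiers(old_prompt))  # → ['if possible', 'ideally', 'try to']

# A hard requirement, by contrast, states every constraint explicitly:
new_prompt = "Optimize this code. Add tests. Do not change the public API."
print(find_soft_modifiers(new_prompt))  # → []
```

Anything the audit flags is a candidate for rewriting as an explicit instruction.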

Second, it self-verifies before outputting.

Opus 4.7 devises ways to verify its own outputs before reporting results, just like a senior engineer who runs the tests before submitting code.

Third, it is good at complex multi-file changes, fuzzy debugging, and cross-service code review.

Sarah Sachs, the AI Lead at Notion, shared a piece of data:

On complex multi-step workflows, Opus 4.7 shows a 14% improvement over Opus 4.6 while consuming fewer tokens, and its tool error rate is only one-third as high. It is the first model to pass our implicit-requirement test.

Visual ability: Resolution × 3, see more details

In terms of visual ability, Opus 4.7 also has very good progress.

Official data shows that the maximum supported long-side length is 2576 pixels (≈3.75 million pixels), more than three times that of Opus 4.6; XBOW visual acuity reaches 98.5% (Opus 4.6: only 54.5%).

This covers almost all real application scenarios. It can directly read complete Figma design drafts and 1080p terminal screenshots (including small gray text), and accurately parse complex technical architecture diagrams and financial charts. In Computer Use scenarios, it clearly reads high-density UI elements; its visual processing is close to flawless.

In other words, tasks such as chemical structure analysis, complex technical chart recognition, and pixel-level precise UI element positioning, which used to require specialized models, can now be handled directly by Opus 4.7 alone.

After hearing this, Figma's stock price immediately plummeted. It's really a disaster.

Instruction following and reasoning: More controllable and reliable

Opus 4.7 has also made great progress in instruction following.

It no longer tries to guess the user's real intention, but strictly executes according to the literal meaning.

The core advantage of this upgrade lies in strict literal execution. If the user requires "Do not use TypeScript", the model will definitely not use it; if the user requires "Output JSON", the output will definitely have no additional prefix.

This change may take some adapting for existing users (old prompts can produce unexpected results and need recalibration), but it is a blessing for scenarios that require precise control.
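If the "Output JSON" guarantee holds as described, downstream code no longer needs heuristics to strip chatty prefixes like "Here is the JSON:". A minimal sketch, with an illustrative function name of my own:

```python
import json

def parse_strict_json(reply: str) -> dict:
    """Parse a model reply expected to be bare JSON, with no extra prefix.

    Under looser behavior you might have had to strip leading text first;
    strict literal execution, as described, removes that need.
    """
    return json.loads(reply)

reply = '{"status": "ok", "files_changed": 3}'
print(parse_strict_json(reply)["files_changed"])  # → 3
```

A prefixed reply such as `Here is the JSON: {...}` would raise `json.JSONDecodeError` here, which is exactly the failure you want to surface early when validating the new behavior.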

In terms of reasoning, it performs well in the 1-million-token long-context scenario. The BFS task score is 58.6% (Opus 4.6 scored 41.2%), and logical coherence in complex reasoning is significantly improved.

Enhanced Agent ability: A version born for Agents

If the previous Claude was born for conversations, Opus 4.7 is born for Agents.

This is reflected in several aspects.

First, Opus 4.7's core Agent abilities have been comprehensively improved.

Several well-known AI companies have shared data from actual use: the success rate of Notion's multi-step workflows increased by 14%, and the tool-call error rate dropped to one-third; in the long-horizon simulation of Vending-Bench 2, the final balance reached $10,937 (versus $8,018 for Opus 4.6), showing more stable long-term decision-making; in the Genspark scenario, the three production-level features of anti-infinite-loop protection, consistency, and error recovery are fully exercised.

It also has file-system memory and can reliably carry key information across multiple sessions, cutting repeated context input for new tasks by 40%.

Scott Wu, the CEO of Cognition, described it more vividly:

Opus 4.7 has raised long-term autonomy to a new level in Devin. It can work coherently for several hours, break through difficult problems instead of giving up, and unlocks a type of in-depth investigation work that we couldn't run reliably before.

At the same time, Opus 4.7 also provides developers with an exciting four-piece set of Agent-related features.

First, a new xhigh reasoning level is added, sitting between the high and max levels and serving as the new default.

This gives developers more precise control and allows them to find a balance between reasoning depth and latency, balance intelligence and token cost, and adapt to most coding/Agent tasks.

Second, a new adaptive thinking mode replaces the fixed-budget long-thinking mode. The model decides its own thinking depth, responding quickly to simple queries and concentrating effort on complex steps.

Third, task budget (public beta) allows developers to guide token consumption and optimize resource allocation for long tasks.

Fourth, the /ultrareview command is added to Claude Code, which can create a dedicated review session and mark minor errors and design problems.
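Pulling the first three features together, a request might look something like the sketch below. Every field name here (effort, the adaptive thinking mode, task_budget, even the model id) is inferred from the article's description rather than from a verified API reference, so treat the whole payload as hypothetical:

```python
import json

# Hypothetical request payload based solely on the article's description
# of the new features; field names and the model id are assumptions.
request = {
    "model": "claude-opus-4-7",              # hypothetical model id
    "effort": "xhigh",                       # new level between "high" and "max"
    "thinking": {"mode": "adaptive"},        # model picks its own thinking depth
    "task_budget": {"max_tokens": 50_000},   # public-beta token-budget guidance
    "messages": [
        {"role": "user", "content": "Refactor the payment module and add tests."},
    ],
}
print(json.dumps(request, indent=2))
```

The fourth feature, /ultrareview, lives in Claude Code rather than the API, so it has no field in this payload.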

Want to be a reliable model: first-launch protection, enhanced memory

Anthropic officially stated that Opus 4.7's cybersecurity ability is not as strong as Mythos Preview's.

However, this is their deliberate choice.

Behind this "self-restriction" is Anthropic's consistent insistence on AI safety.

Since its establishment in 2021, this company has spent four years carefully building its reputation, trying to shape an image of "paying more attention to security and responsible AI deployment than competitors such as OpenAI" to the outside world.

After Mythos Preview triggered heated discussions in the industry about the security risks of powerful AI models, Opus 4.7 is designed as a buffer zone.

Specifically, Anthropic deliberately attenuated Opus 4.7's cybersecurity capabilities during training, so that the model behaves more cautiously when facing cybersecurity-related tasks.

At the same time, Anthropic released safeguards that automatically identify and intercept requests indicating prohibited or high-risk cybersecurity uses.

For professionals with legitimate cybersecurity needs, Anthropic has launched the Cyber Verification Program.

Security professionals who need to use Opus 4.7 for legitimate purposes such as vulnerability research, penetration testing, and red-team exercises can apply through formal channels.

At the end of the blog post, Anthropic also noted that developers migrating from Opus 4.6 to 4.7 should pay attention to a few things.

First is the update of the tokenizer.

Opus 4.7 uses a new tokenizer. Although it improves text-processing efficiency, the same input may map to more tokens, roughly 1.0 to 1.35 times as many.

This means that the same prompt may consume more tokens, and a margin needs to be reserved in the cost budget.
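A back-of-the-envelope way to reserve that margin: budget for the worst case of the range. The 1.0-1.35x multiplier comes from the article; the helper name and the baseline figure below are illustrative assumptions.

```python
# Re-budget token usage after a tokenizer change, assuming the article's
# stated worst-case expansion of 1.35x. Sketch only.
def adjusted_budget(tokens_on_4_6: int, multiplier: float = 1.35) -> int:
    """Token budget for the same prompt under the new tokenizer."""
    return round(tokens_on_4_6 * multiplier)

baseline = 10_000  # measured prompt size on Opus 4.6 (hypothetical figure)
print(adjusted_budget(baseline))        # worst case → 13500
print(adjusted_budget(baseline, 1.0))   # best case  → 10000
```

Budgeting against the 1.35x end means a prompt that stays the same size is a pleasant surprise rather than an overrun.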

Second, more output tokens will be generated at a higher effort level.

Opus 4.7's thinking depth at the high and xhigh levels has increased significantly, especially in the later stages of multi-round conversations in the Agent scenario.

This "think more, be more reliable" behavior pattern improves the output quality, but it also means that token consumption will increase with the length of the session.

Same price as Opus 4.6, here are some things you need to know

Currently, Opus 4.7 is available on all platforms.

In addition to the official Claude channels, the new model has launched not only across all Claude Pro/Max/Team/Enterprise products and the official API but also simultaneously on three major cloud platforms: Microsoft Foundry, Google Cloud Vertex AI, and Amazon Bedrock.

Its pricing is the same as that of Opus 4.6: $5 per million input tokens and $25 per million output tokens.

As mentioned above, although Opus 4.7 requires reworking prompts and adjusting token-usage strategies, Anthropic's internal tests give positive signals.

In an internal Agent coding evaluation, the token usage efficiency at all effort levels has improved compared to Opus 4.6.

In other words, although the number of tokens per call may increase, the total number of tokens required to complete a task is often lower, because the model makes fewer mistakes.

It's a bit like hiring a senior engineer at a higher hourly rate: they finish tasks faster and rework less, so the final total cost may end up lower.

In addition, Opus 4.7 will be more cautious in subsequent rounds, especially in the Agent scenario.

This means more reliable output, but it also means more token consumption.

Developers can balance performance and cost by adjusting the effort parameter, setting the task budget, or optimizing the prompts.

Anthropic suggests starting with the high or xhigh effort level when testing the coding and Agent use cases of Opus 4.7 and adjusting gradually according to actual needs.

Anyway~

Generally speaking, the actual usage cost will vary depending on the usage method, but in most cases, the efficiency gain brought by the improved ability will offset the increase in token consumption.

For teams that rely on Claude for complex development work, this is likely to be a cost-effective deal.

Reference links:

[1]https://www.anthropic.com/news/claude-opus-4-7

[2]https://www.cnbc.com/2026/04/16/anthropic-claude-opus-4-7-model-mythos.html

[3]https://x.com/i/trending/2044560325509316766

This article is from the WeChat official account “QbitAI”, author: Heng Yu, published by 36Kr with authorization.