Claude Opus 4.5 Arrives: Single-Handedly Building "Minecraft" and Cracking High-Difficulty Agent Evaluations

It outperforms Gemini 3 Pro at programming and beats every human candidate on an interview exam.

According to a Zhidx report on November 25, Anthropic today released its flagship programming model, Claude Opus 4.5. Anthropic claims it is the world's most capable model for programming, agents, and computer use.

On SWE-bench Verified, a benchmark of real-world software engineering tasks, Claude Opus 4.5 became the first AI model to score over 80%. It outperformed not only Anthropic's own Claude Sonnet 4.5 but also Gemini 3 Pro and GPT-5.1-Codex-Max, both released last week.

Anthropic also gave Claude Opus 4.5 the difficult take-home exam that the company uses when interviewing human engineers. Within the prescribed two-hour limit, Claude Opus 4.5 scored higher than any previous human applicant, which suggests the model has surpassed strong human candidates in key technical skills.

Programming is not the only area in which Claude Opus 4.5 has improved. Its vision, reasoning, and mathematical abilities all exceed those of its predecessors, and it handles everyday tasks such as in-depth research, slides, and spreadsheets well.

Anthropic has also cut the price of the Claude Opus series substantially. Claude Opus 4.5 costs $5 per million input tokens and $25 per million output tokens, only one-third the price of its predecessor, Claude Opus 4.1. Anthropic has also removed the usage limits that applied specifically to the Opus series.
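To make the pricing concrete, the sketch below estimates the cost of a single request at the quoted rates. The Opus 4.1 rates ($15 and $75 per million tokens) are inferred from the "one-third" statement, and the function name, model labels, and token counts are illustrative assumptions rather than anything from Anthropic's API.

```python
# Prices in USD per million tokens. The Opus 4.5 figures come from the article;
# the Opus 4.1 figures are inferred from the "one-third" claim (assumption).
PRICES = {
    "opus-4.5": {"input": 5.0, "output": 25.0},
    "opus-4.1": {"input": 15.0, "output": 75.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 200k input tokens and 20k output tokens.
new = request_cost("opus-4.5", 200_000, 20_000)
old = request_cost("opus-4.1", 200_000, 20_000)
print(f"Opus 4.5: ${new:.2f}, Opus 4.1: ${old:.2f}")
```

Because both the input and output rates fall by the same factor, the ratio between the two costs stays at three regardless of the mix of input and output tokens.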

Claude Opus 4.5 is now available in the Claude application and through the API. However, users still need to subscribe to the highest-tier plan, at $200 per month, before they can use Opus. Claude Opus 4.5 has also launched on the three major cloud platforms: AWS, Google Cloud, and Microsoft Azure.

01. Another leap in front-end performance and a perfect one-shot replication of "Minecraft"

How does Claude Opus 4.5 perform in practice? In the replies to Anthropic's official release announcement, many users have already shared their first-hand experiences.

On the front-end side, Guillermo, the CEO of the front-end developer platform Vercel, used Claude Opus 4.5 to build an e-commerce website in a single generation.

Guillermo remarked that Claude Opus 4.5 is on a completely different level and incredibly good.

Another user shared four hero sections he built using Claude Opus 4.5. A hero section is the prominent area of a website or app designed to capture users' attention. In both typography and layout, these pages have a high-end feel.

Another user had Claude Opus 4.5 build a clone of "Minecraft" to test the model on a more complex project. Claude Opus 4.5 succeeded on the first try, generating 3,500 lines of code. The user took this to mean that Claude Opus 4.5 does not cut corners the way Gemini 3 Pro does.

The "Minecraft" clone built by Claude Opus 4.5 is convincing. It has distinct biomes (plains, deserts, and snow-covered areas), the transparency of leaf and water blocks is handled correctly, and it includes a working inventory and crafting system, all integrated into one game. It even renders clouds. The user said he had never seen any model achieve this before.

Dan Shipper, the co-founder and CEO of the AI subscription platform Every, remarked that every six months to a year a model emerges that truly changes the industry landscape, and that Claude Opus 4.5 is that model. Shipper said it is, without a doubt, the best programming model he has ever used.

02. Leading in 7 programming language tests and a significant improvement in security

Before the release, Anthropic conducted internal tests on the Claude Opus 4.5 model. The testers said that Claude Opus 4.5 can handle ambiguous situations and weigh the pros and cons without excessive guidance.

When encountering complex multi-system errors, Claude Opus 4.5 can find the fix on its own. Tasks that Claude Sonnet 4.5 could hardly complete a few weeks ago can now be easily handled by Claude Opus 4.5. Anthropic's testers told the model team that Claude Opus 4.5 really "knows its stuff".

Anthropic shared the performance of Claude Opus 4.5 on multiple benchmarks. On SWE-bench Multilingual, which tests mastery of multiple programming languages, Claude Opus 4.5 led in 7 out of 8 programming languages.

On BrowseComp-Plus, which tests deep-search agent capabilities, Claude Opus 4.5 showed an approximately 4.7% advantage over Claude Sonnet 4.5.

Claude Opus 4.5 also outwitted some commonly used benchmarks. For example, in τ2-bench, which measures agent capabilities, the model plays an airline customer service representative helping a passenger in difficulty.

The benchmark expects the model to refuse to modify the economy-class ticket, because the airline does not allow changes to tickets in this class. However, Claude Opus 4.5 found a clever and legitimate solution: first upgrade the ticket class, then modify the flight.

Technically speaking, because the way Claude Opus 4.5 helped the customer was unexpected, the benchmark scored it as a failure. But this creative approach to problem solving represents significant progress.
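The mismatch between what the policy permits and what the grader accepts can be illustrated with a toy sketch. Everything here (the fare rule, the action names, and the grader) is a hypothetical simplification written for this article, not the actual τ2-bench implementation.

```python
# Toy model of the scenario: the policy forbids modifying an economy ticket
# directly, but it does not forbid upgrading the ticket first.
def policy_allows(actions: list[str], cabin: str) -> bool:
    """Replay the actions in order and check each against the simplified policy."""
    for action in actions:
        if action == "upgrade_cabin":
            cabin = "business"
        elif action == "modify_flight" and cabin == "economy":
            return False  # direct changes to economy tickets are not allowed
    return True

def grader_passes(actions: list[str]) -> bool:
    """The grader accepts only the expected outcome: a refusal with no changes."""
    return actions == ["refuse"]

creative = ["upgrade_cabin", "modify_flight"]
print(policy_allows(creative, "economy"))  # True: no rule is broken
print(grader_passes(creative))             # False: scored as a failure
```

A grader that compares against one expected action sequence cannot give credit for a different sequence that still respects every rule, which is the situation described above.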

In other cases, finding a clever way around the intended constraints may be regarded as reward hacking, in which the model "games" the rules or goals in an unintended way.

Preventing such misaligned behavior is one of the goals of Anthropic's safety testing. In internal evaluations, the rate at which Claude Opus 4.5 showed concerning behavior was slightly over 10%, far lower than the 20% of GPT-5.1 and Gemini 3 Pro.

Claude Opus 4.5 has made significant progress in resisting prompt injection attacks, which smuggle deceptive instructions into the model's input to induce it to perform harmful actions. Opus 4.5 is harder to deceive with prompt injection than any other frontier model in the industry.

03. New thinking intensity control and the same context compression function as GPT

While releasing the latest model, Anthropic also announced a series of new features for the Claude developer platform.

As models become more intelligent, they can solve problems in fewer steps, with less backtracking, less redundant exploration, and less lengthy reasoning. Compared with its predecessors, Claude Opus 4.5 uses significantly fewer tokens to achieve the same or better results. However, different tasks call for different trade-offs: developers sometimes want the model to keep thinking about a difficult problem, and sometimes they need a quicker response.

Through the new "effort" parameter in the Claude API, developers can choose for themselves whether to minimize time and cost or to maximize the model's capability.

At the medium effort setting, Claude Opus 4.5 matched the best SWE-bench Verified result of Sonnet 4.5 while using 76% fewer output tokens.

At the highest effort setting, its performance exceeded that of Claude Sonnet 4.5 by 4.3 percentage points while using 48% fewer tokens.
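As a rough illustration of what the two data points above mean in practice, the sketch below converts the reported reductions into token counts. Only the 76% and 48% figures come from the article; the baseline of 100,000 output tokens, the labels "medium" and "high", and the function name are hypothetical.

```python
# Output-token reductions relative to Sonnet 4.5, as reported in the article.
REDUCTION = {"medium": 0.76, "high": 0.48}

def tokens_at_effort(baseline_tokens: int, effort: str) -> int:
    """Estimate output tokens at a given setting from a Sonnet 4.5 baseline."""
    return round(baseline_tokens * (1 - REDUCTION[effort]))

baseline = 100_000  # hypothetical Sonnet 4.5 output tokens for one task
for effort in ("medium", "high"):
    print(effort, tokens_at_effort(baseline, effort))
```

Under these assumptions, a task that took Sonnet 4.5 100,000 output tokens would take about 24,000 at the medium setting and about 52,000 at the highest one.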

By combining effort control, context compression, and advanced tool use, Claude Opus 4.5 can handle longer-running and more complex tasks with less manual intervention. It is worth noting that GPT-5.1-Codex-Max, launched by OpenAI last week, also introduced a context compression feature.

The Claude developer platform has achieved breakthroughs in context management and memory capabilities, significantly improving performance on agent tasks. Claude Opus 4.5 is particularly good at coordinating teams of sub-agents, supporting the construction of complex, well-coordinated multi-agent systems. Test data shows that the combination of these techniques improves the performance of Claude Opus 4.5 on in-depth research evaluations by nearly 15 percentage points.

Anthropic is continuously improving the composability of the developer platform. By providing building blocks such as effort control, tool use, and context management, it helps developers build exactly the functions they need.

On the product side, Claude Code has received two upgrades with Claude Opus 4.5. Plan mode now formulates more precise plans and executes them more thoroughly: it first asks clarifying questions, then generates a user-editable plan.md file before carrying out the work.

Claude Code is also now available in the desktop application, supporting parallel local and remote sessions and enabling several agents to work at once (for example, fixing code, researching GitHub, and updating documentation simultaneously).

For users of the Claude application, long conversations are no longer limited by the context length. The system will automatically summarize the early conversation content to maintain the continuity of communication.
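The general idea of summarizing early conversation content can be sketched as follows. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the character budget, the number of recent messages kept verbatim, and the placeholder summarizer are all hypothetical.

```python
from typing import Callable

def compact_history(
    messages: list[str],
    max_chars: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
) -> list[str]:
    """Replace older messages with one summary when the history is too long."""
    total = sum(len(m) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent

# Placeholder summarizer: a real system would ask a model to write the summary.
def naive_summary(older: list[str]) -> str:
    return f"[summary of {len(older)} earlier messages]"

history = [f"message {i}: " + "x" * 50 for i in range(10)]
compacted = compact_history(history, max_chars=300, summarize=naive_summary)
print(len(compacted), compacted[0])
```

Keeping the most recent messages verbatim preserves the immediate thread of the conversation, while the summary stands in for everything earlier.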

Claude for Chrome is now open to all Max users and supports task processing across browser tabs. Beta access to Claude for Excel, which launched in October, has been extended to all Max, Team, and Enterprise users. These updates all build on the improvements of Claude Opus 4.5 in computer use, spreadsheet processing, and long-running task management.
