
The strongest "workhorse" takes aim at the king of programming: OpenAI and Anthropic both made major moves late at night.

Alphabet AI, 2026-02-06 09:12
Claude Opus 4.6 vs. GPT-5.3 Codex: IPO Countdown

This day in 2026 is destined to be written into the history of AI development.

Claude Opus 4.6 and GPT-5.3 Codex were released within less than an hour of each other.

It seems that both companies are determined to present their answers at the same time.

Behind this "collision" is a contest for capital, technology, and market dominance.

Just two weeks ago, NVIDIA announced a $10 billion investment in Anthropic, sending Anthropic's valuation soaring to $350 billion.

Less than 72 hours after the news broke, NVIDIA turned around and injected $20 billion into OpenAI.

Jensen Huang has a clear plan: bet on both sides, and he won't lose no matter who wins.

But for Anthropic and OpenAI, it's not just about getting the money.

Both companies plan to start the IPO process between the second half of 2026 and around 2027. Now is a critical moment to prove their technological strength and compete for market pricing power.

Investors are not interested in promises on slide decks; they want to see tangible products.

The company with the stronger model and the more convincing real-world performance will command a higher price and hold more bargaining chips in its IPO.

As the saying goes, "one mountain cannot hold two tigers." Anthropic and OpenAI each need to show the other who is really in charge.

Therefore, this product release rhythm is not a coincidence but a well-timed confrontation.

Both companies are aware that at this time, every product release is a financing roadshow, and every technological breakthrough will directly influence investors' judgments and market expectations.

However, judging from the products themselves, both companies have truly shown their capabilities.

Claude Opus 4.6

This time, Anthropic's upgrade of the Claude Opus series focuses on "thinking more intelligently".

The most significant change in Opus 4.6 is that it has learned "adaptive thinking". The model automatically adjusts the depth of thinking according to the complexity of the task. It spends more time thinking about difficult problems and quickly gets through simple tasks.

In terms of coding ability, Opus 4.6 achieved the highest score on the Terminal-Bench 2.0 evaluation.

This test specifically examines an AI's operational ability in a terminal environment. The model needs to know when to use which command, how to combine different tools, and how to find problems from error messages.

It's like testing whether a programmer can skillfully use various development tools. It's not just about writing code but also about debugging, deploying, and finding bugs in logs.

More importantly, Opus 4.6 is Anthropic's first Opus-level model to offer a 1-million-token context window, enough to take in several full-length novels' worth of text at once.

In the long-text processing test, Opus 4.6 scored 76% on the 8-needle 1M variant of MRCR v2, while the previous generation, Sonnet 4.5, scored only 18.5%.

To put it simply, you give the model a large number of documents and then ask it a question that requires comprehensive information from multiple sources to answer.

Previous models would "forget" the previous content or fail to find the key information. Opus 4.6 can accurately locate the required information in a vast amount of text without a decline in performance due to the length of the document.

In the GDPval-AA evaluation of knowledge work ability, Opus 4.6 scored about 144 Elo points higher than OpenAI's GPT-5.2 and 190 points higher than its previous version, Opus 4.5. This test covers actual work tasks in fields such as finance and law, such as preparing financial analysis reports, drafting legal documents, and conducting market research.

Anthropic has also made many supporting updates at the product level.

Claude Code now supports the "agent teams" function, which can start multiple AI agents simultaneously, assign different subtasks to each of them, and then automatically coordinate their work.

This function is particularly useful for large codebases, as it can split the work among different agents for parallel processing.
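As a rough illustration of this fan-out pattern, splitting subtasks across parallel workers and merging the results might look like the sketch below. This is not Claude Code's actual implementation; the function names and the worker stub are invented, and a real agent would call a model API and operate on part of the codebase.

```python
# Illustrative sketch of the "agent teams" fan-out pattern:
# split a large task into subtasks, run one worker per subtask
# in parallel, then collect the results in order.
from concurrent.futures import ThreadPoolExecutor

def run_agent(subtask: str) -> str:
    # Stand-in for a real agent; here we just tag the subtask as done.
    return f"done: {subtask}"

def run_team(subtasks: list[str]) -> list[str]:
    # One worker per subtask; map() returns results in subtask order.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(run_agent, subtasks))
```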

In terms of office software integration, Anthropic has launched a research preview version of Claude in PowerPoint and significantly upgraded Claude in Excel.

Now, Claude can handle more complex tasks directly in Excel, supporting functions such as pivot table editing, chart modification, and conditional formatting. In PowerPoint, Claude can understand the existing layout, font, and master design and then create new slides in that style.

This means AI is truly integrated into your daily work tools. No more copying and pasting back and forth: just talk to Claude in the sidebar of Excel or PowerPoint, and it can modify tables, create charts, and generate presentations for you.

Moreover, it will learn your style, and the output won't seem out of place.

At the API level, Anthropic has introduced the "effort" parameter, offering four levels: low, medium, high, and highest.

Developers can choose the appropriate level according to the complexity of the task to balance cost, speed, and quality. There is also a "context compaction" function: when the conversation approaches the context window limit, it automatically summarizes and replaces the earlier content, so long-running tasks are not interrupted by exceeding the limit.

It can be understood as giving developers more control.

Use the low level for simple tasks to save money and time; use the high level for complex tasks to ensure quality. When a conversation runs too long, the system automatically compresses the earlier content so you can keep going.
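The effort-selection and compaction mechanics described above can be sketched client-side. A minimal illustrative sketch in Python follows: the four level names come from the article, but the helper functions, the step-count thresholds, and the compaction heuristic are invented here for illustration and are not Anthropic's actual API surface or algorithm.

```python
# Hypothetical sketch: pick an "effort" level and compact a long context.
# Level names ("low" ... "highest") are from the article; everything else
# (function names, thresholds, the summary stub) is an illustrative assumption.

EFFORT_LEVELS = ("low", "medium", "high", "highest")

def pick_effort(estimated_steps: int) -> str:
    """Map a rough task-complexity estimate to an effort level."""
    if estimated_steps <= 2:
        return "low"
    if estimated_steps <= 5:
        return "medium"
    if estimated_steps <= 10:
        return "high"
    return "highest"

def compact_context(messages: list[str], limit: int, keep_recent: int = 4) -> list[str]:
    """When total size nears the limit, replace older messages with a summary stub.

    A real implementation would have the model write the summary; here we
    just insert a placeholder so the recent turns survive intact.
    """
    total = sum(len(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent
```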

In terms of safety, Anthropic has conducted its most comprehensive safety assessment to date.

In automated behavior audits, Opus 4.6 showed a low rate of inappropriate behavior, including deception, sycophancy, encouraging user delusions, and facilitating abuse.

Because Opus 4.6's cybersecurity capabilities have improved significantly, Anthropic has developed six new cybersecurity "probes" to detect potential abuse.

At the same time, they are also using the model to help open-source software find and patch vulnerabilities, hoping to put the power of AI in the hands of defenders.

Advancing Finance: In-Depth Application in the Financial Field

Anthropic published a special article detailing the application of Claude Opus 4.6 in the financial field.

In financial work, professionals need AI to do three things: research, analysis, and creating deliverables. Opus 4.6 reaches the industry-leading level in all three dimensions.

In terms of research ability, Opus 4.6 has improved in both the BrowseComp and DeepSearchQA benchmarks.

These two tests examine the model's ability to extract specific information from a large amount of unstructured data.

For financial analysts, this means they can hand the AI a pile of company filings, industry reports, and news articles, ask a very specific question, and get a targeted answer instead of a generic summary.

Previously, if you gave it a financial report and asked "How is this company's profitability?", it might produce a long passage that mostly restated the report.

Now it can directly tell you what the key indicators are, how they compare with the industry average, and what the risk factors are.

In terms of analysis ability, Opus 4.6 achieved an accuracy rate of 60.7% on the Finance Agent external benchmark, a 5.47-percentage-point increase over Opus 4.5.

In the tax assessment TaxEval, Opus 4.6 also reached the industry's highest level of 76%.

Anthropic conducted a comparison using a commercial due-diligence task. They asked Claude Opus 4.6 to evaluate a potential acquisition target. This kind of work usually takes a senior analyst two to three weeks to complete.

However, the first-pass output of Opus 4.6 came closer to directly usable, in structure, content, and format, than that of Opus 4.5.

That is to say, you can use the output with only minor modifications. For financial professionals who need to quickly produce reports and presentations, this is a real improvement in efficiency.

Anthropic's internal "real-world finance" assessment covers about 50 investment and financial analysis use cases, including spreadsheet, slide, and document generation and review.

These are common tasks for analysts in investment banking, private equity, public market investment, and corporate finance. Opus 4.6 has improved by more than 23 percentage points compared to Sonnet 4.5 a few months ago.

With the new Cowork function, financial teams can start multiple analysis tasks simultaneously. Cowork allows Claude to access the local folder you specify and directly read, edit, and create files in it.

For financial teams, it means they can assign several analysis tasks at once and supervise the process of Claude creating each deliverable to ensure it meets their standards.

GPT-5.3 Codex: A Self-Training Model

Just a few dozen minutes after the release of Claude Opus 4.6, Sam Altman suddenly posted on X, announcing GPT-5.3 Codex.

Here, on behalf of Alphabet AI, I liked and reposted both Sam Altman's and Dario Amodei's posts as a show of respect.

The most remarkable thing about GPT-5.3 Codex is that it can work like a real colleague, discussing the task with you as it works.

Previous AIs just "did what they were told"; GPT-5.3 Codex "checks in with you whenever a question comes up".

When you give it a complex task, it can think about it for hours or even days on its own. It will also actively report the progress and ask for your opinion during the process, and you can interrupt and adjust the direction at any time.

Interestingly, OpenAI used an early version of GPT - 5.3 Codex to help develop subsequent versions. That is, let AI help debug the training process of AI, fix bugs, and optimize the system. The OpenAI team said that this has made the development speed incredibly fast.

GPT-5.3 Codex has set new industry records on multiple benchmarks. On SWE-Bench Pro, a rigorous real-world software engineering evaluation, it achieved an accuracy rate of 56.8%.

Unlike SWE-bench Verified, which only tests Python, SWE-Bench Pro covers four programming languages. It is more contamination-resistant, more challenging, more diverse, and closer to industry reality.

On Terminal-Bench 2.0, GPT-5.3 Codex reached 77.3%, far above the previous 64%.

This test measures the terminal skills required by coding agents, that is, the ability to complete various operations in a command-line environment. Notably, GPT-5.3 Codex uses fewer tokens than any previous model, which means users can do more at the same cost.

In the OSWorld-Verified test, GPT-5.3 Codex scored 64.7%, while GPT-5.2-Codex scored only 38.2%.

This is an agentic computer-use benchmark, where the AI must complete productivity tasks in a visual desktop environment. Humans score about 72% on this test, so GPT-5.3 Codex is approaching human level.

In web development, OpenAI presented a comparison: they asked GPT-5.3 Codex and GPT-5.2-Codex to each create a landing page for a SaaS product.

GPT-5.3 Codex displayed the annual plan as a discounted equivalent monthly price, making the savings clearer and more appealing, rather than simply quoting the annual total.

GPT-5.3 Codex

GPT-5.2 Codex

It also created an automatically switching user review carousel with three different user reviews instead of just one, making the whole page feel more complete and closer to the state of being ready to go live.

In short, it considers user experience and marketing impact. Instead of mechanically implementing features, it thinks about how to do them better. This attention to detail and understanding of the final result bring its output closer to professional grade.

The capabilities of GPT-5.3 Codex are not limited to coding.

It supports all tasks in the software life cycle, such as debugging, deployment, monitoring, writing product requirement documents, editing copywriting, user research, testing, and analyzing metrics.

In the GDPval test, GPT-5.3 Codex performed on par with GPT-5.2, achieving a win-or-draw rate of 70.9%. This test measures the model's performance on 44 well-defined knowledge-work tasks across different professions, including creating presentations, spreadsheets, and other work products.

An interesting detail is that both companies emphasize "using their own products". Anthropic says "We use Claude to build Claude", and OpenAI says GPT-5.3 Codex "played a key role in its own development."

This is actually the best