
ChatGPT and Claude just shipped major updates simultaneously. Workers who can't be the boss of AI will be left behind.

爱范儿 (ifanr) · 2026-02-06 07:00
When Mars collides with Earth.

Just now, a "Mars hitting the Earth" scenario played out in the Silicon Valley AI circle.

OpenAI and Anthropic, as if by prior agreement, simultaneously unveiled their heavyweight updates: Claude Opus 4.6 and GPT-5.3-Codex.

If before last night we were still discussing how to write better prompts to assist our work, then from today on we may need to learn how to manage AI employees like a boss.

AI creates AI, and takes over your computer while it's at it

Just yesterday, Sam Altman boasted on X about Codex hitting the milestone of one million active users. One day later, OpenAI kept its winning streak going and dropped a bombshell:

GPT-5.3-Codex.

There is a very significant sentence hidden in the technical documentation: "This is our first model that played a key role in creating itself."

In plain language, AI has learned to write code by itself, find bugs by itself, and has even started training the next generation of AI on its own. This self-evolving ability is also directly reflected in a series of benchmark scores.

Remember OSWorld-Verified, the benchmark that simulates human computer operations? The previous model managed only 38.2% accuracy, not even a passing grade. This time, GPT-5.3-Codex jumped straight to 64.7%.

For reference, the average human score is only 72%. That puts AI just a step away from matching you at moving the mouse, switching windows, and operating software.

In Terminal-Bench 2.0 (a command-line operation benchmark), it scored a high 77.3%, leaving GPT-5.2 (62.2%) far behind.

In SWE-Bench Pro, a contamination-resistant benchmark that covers four programming languages and consists entirely of real-world, hardcore engineering problems, GPT-5.3-Codex also delivered state-of-the-art (SOTA) performance while using fewer tokens than any previous model.

OpenAI even demonstrated its independent building ability:

Within a few days, it built from scratch a racing game v2 with multiple maps, and also created a deep-sea diving game with an oxygen-management system.

What impressed me most was GPT-5.3-Codex's understanding of vague intentions.

When building a landing page, it automatically converted the annual plan into a discounted monthly price and even thoughtfully added a user-review carousel, all without being asked.

OpenAI's ambition is obvious: Microsoft used to say AI would be humanity's co-pilot (Copilot), but now AI wants to be the driver, one that can hold the steering wheel and even repair the car on its own.

By the way, there is an interesting detail.

Previously, there were rumors that OpenAI was dissatisfied with NVIDIA's AI chips, but this time the official blog specifically emphasized that GPT-5.3-Codex was designed, trained, and deployed entirely on NVIDIA GB200 NVL72 systems.

This high-emotional-intelligence "thank you, NVIDIA" certainly gave Jensen Huang plenty of face.

Say goodbye to the "goldfish memory", Claude makes a comeback

Right around the release of GPT-5.3-Codex, Anthropic presented its own Spring Festival gift package.

The bad news is that the much-anticipated "medium cup" Claude Sonnet was not updated; the good news is that Anthropic went straight to the "extra-large cup": Claude Opus 4.6.

Compared with OpenAI's aggressive, action-oriented approach, the Claude Opus 4.6 that Anthropic released today focuses on thinking ability and reliability.

Many enterprise users share a pain point called Context Rot: although a model claims to support a 200K context, once too much data is fed in, the AI starts to lose the thread.

This time, the data presented by Claude Opus 4.6 is simply a "game-changer".

In the MRCR v2 (long-text needle-in-a-haystack) test, Claude Opus 4.6's recall rate was as high as 76%.

In contrast, the previous-generation Sonnet 4.5 managed a pitiful 18.5%. To some extent, this is a qualitative leap from basically unusable to highly reliable.
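For readers curious what a "needle in a haystack" test actually measures, here is a minimal, self-contained sketch of the idea: bury a few specific facts inside a mountain of filler text, ask the model about them afterward, and score what fraction it recovers. The function names and scoring are illustrative assumptions, not MRCR v2's actual harness.

```python
import random

def build_haystack(needles, filler_sentences, n_filler=200, seed=0):
    """Scatter 'needle' facts at random positions inside filler text."""
    rng = random.Random(seed)
    lines = [rng.choice(filler_sentences) for _ in range(n_filler)]
    positions = rng.sample(range(n_filler), len(needles))
    for pos, needle in zip(positions, needles):
        lines[pos] = needle
    return "\n".join(lines)

def recall(answers, expected):
    """Fraction of the expected facts that appear somewhere in the answers."""
    hits = sum(any(fact in ans for ans in answers) for fact in expected)
    return hits / len(expected)
```

A 76% score in this style of test means roughly three out of four buried facts survive the long context; Sonnet 4.5's 18.5% means most of them were simply lost.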

This is because Claude Opus 4.6 introduced a truly usable 1M context window for the first time.

What does this mean? It means you can throw hundreds of pages of financial reports or a codebase hundreds of thousands of words long straight at it. It can not only read everything but also tell you precisely that a number in the footnote on page 342 is wrong.

In addition, it now supports a maximum output of 128K tokens, which means you can have it write a long research report or a complex codebase in one go without being cut off by output limits.

Besides having a good memory, Opus 4.6 also achieved a crushing victory in terms of intelligence this time:

In GDPval-AA (an assessment of high-economic-value tasks in finance, law, and similar fields), Opus 4.6's Elo score was a full 144 points higher than the industry's second place (OpenAI's GPT-5.2) and 190 points higher than the previous generation.

In the complex multidisciplinary reasoning test Humanity's Last Exam, it led all cutting - edge models.

In the BrowseComp test, which measures the ability to find "hard-to-find" information on the Internet, it also performed the best.

Through this data, Anthropic seems to be sending a signal: if you want code written, go next door to OpenAI; if you want complex business decisions, legal documents, or financial analysis handled, Claude is the only choice.

What really caught the eye of office workers is its productivity features.

On the one hand, Anthropic integrated Claude directly into Excel and PowerPoint this time. It can generate a PPT straight from Excel data, preserving not only the layout style but also the fonts and templates. In the Claude Cowork collaborative environment, it can even multitask autonomously.

On the other hand, Anthropic launched an experimental Agent Teams feature in Claude Code, letting ordinary developers experience the feeling of "commanding thousands of troops":

Role division: you can designate one Claude session as the Team Lead. It doesn't do the dirty, tiring work itself; it breaks down tasks, assigns work orders, and merges code. The other sessions are teammates, each picking up tasks to complete.

Independent operation: each teammate has an independent context window (no need to worry about token blow-up). They can even message each other behind your back (inter-agent messaging) to discuss technical details, reporting only the results to the team lead at the end.

Parallel competition: what is this good for? Imagine hunting a stubborn bug. You can spawn 5 agents to verify 5 different hypotheses, "racing" them in parallel to rule out causes; or during code review, have one teammate act as a security expert checking for vulnerabilities and another as an architect checking performance, without the two interfering with each other.
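The parallel-competition pattern above is older than any particular product, and it can be sketched in a few lines of plain Python. Note this is a toy illustration of the orchestration idea, not Claude Code's actual Agent Teams API: each "agent" here is just a stub function run on its own thread, and `check(h)` stands in for a full investigation session.

```python
from concurrent.futures import ThreadPoolExecutor

def race_hypotheses(hypotheses, check):
    """Run one 'agent' per hypothesis in parallel and merge the verdicts.

    Each agent independently verifies its assigned hypothesis; the
    'team lead' (this function) keeps only the confirmed ones.
    """
    with ThreadPoolExecutor(max_workers=len(hypotheses)) as pool:
        results = dict(pool.map(lambda h: (h, check(h)), hypotheses))
    return [h for h, ok in results.items() if ok]
```

With a real agent backend, `check` would launch a session with its own context window; the structure of fan-out, independent verification, and a single merge point stays the same.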

To demonstrate the limits of Opus 4.6, Anthropic researcher Nicholas Carlini ran a crazy Agent Teams experiment.

Instead of writing code himself, he allocated $20,000 worth of API credits and let 16 Claude Opus 4.6 instances form a "fully automated software development team".

As a result, within just two weeks, these AIs independently ran more than 2,000 programming sessions and wrote, from scratch, a 100,000-line C compiler (implemented in Rust).

This AI-written compiler successfully compiled the Linux 6.9 kernel (covering the x86, ARM, and RISC-V architectures) and even ran the game Doom.

Although it is not perfect yet (the generated code, for example, is less efficient than GCC's), the case shows that we are no longer programming with AI; we are watching an AI team collaborate, find its own errors, and push a project forward on its own.

In addition, it has learned Adaptive Thinking and can decide how long to "think" based on task difficulty. With the newly added "intelligence intensity" control, you can switch between four levels, from Low to Max.
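The article doesn't specify how the four effort levels get chosen in practice, but a caller-side router is the obvious pattern: cheap, fast settings for trivial requests, maximum deliberation for hard ones. The heuristic below is a purely hypothetical sketch of such a router; the level names come from the article, while the scoring rules are invented for illustration.

```python
# Hypothetical task-to-effort router. The four level names follow the
# article's Low-to-Max control; the difficulty heuristic is an assumption.
EFFORT_LEVELS = ["low", "medium", "high", "max"]

def pick_effort(task: str) -> str:
    """Crude heuristic: longer, multi-part, high-stakes tasks get more effort."""
    score = 0
    score += len(task) > 200                                   # long brief
    score += task.count("?") > 1                               # multiple questions
    score += any(k in task.lower() for k in ("prove", "audit", "refactor"))
    return EFFORT_LEVELS[min(score, 3)]
```

Routing like this matters for cost as much as quality: paying Max-level thinking tokens to answer "what time is it?" is exactly the kind of waste an adaptive control is meant to remove.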

In terms of pricing, Anthropic stayed very reasonable this time, maintaining the base price of $5/$25 per million tokens. To capture the enterprise market, it seems determined to fight OpenAI to the end.

One is a radical genius, and the other is a reliable old-timer