OpenAI releases a new model to compete directly with Anthropic. Just as Claude Code has become popular, will it be outshone by GPT-5-Codex?
On September 15th, OpenAI officially launched a new model, GPT-5-Codex, a fine-tuned variant of GPT-5 designed specifically for its various AI-assisted programming tools. The company stated that the new model's "thinking" time is more dynamic than that of previous models: completing a coding task can take anywhere from a few seconds to seven hours. As a result, it performs better on agentic coding benchmarks.
OpenAI Launches the "Most Competitive" Coding Agent, GPT-5-Codex
One of the highlights of GPT-5-Codex is its enhanced code review function, which can detect potential critical errors before product release, helping developers avoid risks in advance.
How exactly does it achieve this?
Unlike static analysis tools, Codex checks a PR's declared intent against its actual diff, reasons about the entire codebase and its dependencies, and executes code and tests to verify behavior. Only the most meticulous human reviewers invest that much effort in every PR they review. Codex fills this gap - helping teams catch problems earlier, reducing the burden on reviewers, and shipping with more confidence.
Once Codex is enabled on a GitHub repository, it automatically reviews PRs when they move from draft to ready status and posts its analysis on the PR. If it suggests changes, users can stay in the same thread and ask Codex to implement them. Users can also explicitly request a review by mentioning "@codex review" in the PR and provide additional instructions, such as "@codex review security vulnerabilities" or "@codex review outdated dependencies".
Currently, GPT-5-Codex is the default model for Codex cloud tasks and code review. Developers can also use it in local development environments through the Codex CLI and IDE extensions.
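As a rough illustration of local usage - the exact package name, flags, and commands below are based on the publicly available Codex CLI and may differ by version, so treat this as a sketch rather than official instructions:

```shell
# Install the Codex CLI (distributed via npm)
npm install -g @openai/codex

# Start an interactive session in the current repository,
# explicitly selecting the GPT-5-Codex model
codex --model gpt-5-codex

# Or run a one-off, non-interactive task from a script or CI job
codex exec "fix the failing unit tests and summarize the changes"
```

The same ChatGPT account that powers the web and IDE experiences is used to sign in to the CLI, which is how tasks can move between local and cloud environments with shared context.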
At OpenAI, Codex now reviews the vast majority of the company's internal PRs, detecting hundreds of problems every day - often before human review even begins. This has been crucial for the Codex team to move projects forward quickly and confidently.
In addition, another technological breakthrough of GPT-5-Codex lies in its ability to dynamically adjust the thinking time according to task complexity. The model combines two core skills: on the one hand, it can perform agile pairing with developers in interactive sessions; on the other hand, it can also continuously execute independently in large tasks until a complete result is delivered.
In internal tests, GPT-5-Codex demonstrated its powerful ability to handle complex engineering tasks: it can work independently for more than 7 hours continuously, complete large-scale refactoring, continuously iterate, fix test errors, and finally achieve successful delivery. This means that whether it is a small, well-defined request or a large-scale project that requires long-term iteration, GPT-5-Codex is capable of handling it.
Since the launch of the Codex CLI in April and the Codex web version in May this year, Codex has gradually evolved into a more efficient collaborative coding tool. Two weeks ago, OpenAI integrated Codex into a unified product experience based on ChatGPT accounts, enabling developers to seamlessly migrate tasks between local environments and the cloud while maintaining complete context.
Today, Codex can run on multiple platforms, including terminals, IDEs, web pages, GitHub, and the ChatGPT iOS app. It is also included in ChatGPT Plus, Pro, Business, Edu, and Enterprise packages, providing a consistent experience for users at different levels.
OpenAI said that it plans to provide the model to API customers in the future.
Beats GPT-5 in Multiple Benchmark Tests
So, how does this model perform in various benchmark tests?
OpenAI said that GPT-5-Codex outperforms GPT-5 both on SWE-bench Verified, a benchmark measuring agentic coding ability, and on a benchmark of code refactoring tasks drawn from large, mature repositories.
It is worth mentioning that when OpenAI launched GPT-5, it reported results on only 477 SWE-bench Verified tasks. After Anthropic pointed this out, OpenAI quickly made adjustments and now reports results on the full set of 500 tasks.
According to OpenAI's usage data, across user turns sorted by generated tokens (including hidden reasoning and final outputs):

- For the bottom 10% of low-load tasks, GPT-5-Codex uses 93.7% fewer tokens than GPT-5, a significant efficiency gain.
- For the top 10% of high-complexity tasks, GPT-5-Codex spends roughly twice as long as GPT-5 on reasoning, editing, testing, and iteration, reflecting its deeper investment in complex engineering.

This flexibility allows the model to allocate resources appropriately across different task scenarios.
According to OpenAI, GPT‑5-Codex is a reliable partner for front-end tasks. Beyond building attractive desktop applications, it also shows significant gains in human preference evaluations when building mobile websites. When working in the cloud, it can view images or screenshots provided by users, visually check its own progress, and show users screenshots of its work.
Alexander Embiricos, the product lead of OpenAI Codex, said in a briefing that the performance improvement is largely due to GPT-5-Codex's dynamic "thinking ability". Users may be familiar with GPT-5's router in ChatGPT, which directs queries to different models according to task complexity. Embiricos said GPT-5-Codex works similarly, but without a built-in router: it can adjust how much time it spends on a task in real time.
Embiricos said this is an advantage over the router approach, because a router decides at the outset how much compute and time to spend on a problem, whereas GPT-5-Codex can decide, five minutes into a task, to spend an additional hour on it. He said he has seen the model work for up to seven hours in some cases.
What Do Netizens Think?
The release of GPT‑5-Codex has sparked heated discussions on the Internet.
Well-known blogger Dan Shipper said that he has experienced GPT-5-Codex and was shocked by its effects.
"It dynamically selects the 'thinking' time according to the task - it can work for a long time on difficult problems and give instant answers to simple questions.
In our production codebase test, it can run autonomously for up to 35 minutes - in contrast, GPT-5 is often too cautious, which is an obvious upgrade.
It supports seamless switching between local and web development environments. You can start a task in VS Code and then hand it over to Codex Web to continue when you go shopping.
It is equipped with a code review agent that will actually run your code, so it can detect more bugs.
Here is our overall impression after a large number of internal tests:
This is a very excellent upgrade, making Codex CLI a strong alternative to Claude Code.
However, it needs reasonable prompts to perform at its best. For example, @kieranklaassen can only make it run for up to 5 minutes, while @DannyAziz97 has found the trick.
Sometimes it's 'lazy' - it may not think enough on some tasks or directly refuse if it thinks the task is too large.
I've been using Codex CLI to submit a new PR for @CoraComputer all weekend, and the experience shows that it is very useful and easy to guide - it's a great model."
On Reddit, some users who have tried GPT-5-Codex also think that it is rewriting the rules of the game.
"Today, I encountered some simple bugs related to Electron rendering and JSON generation. These were bugs Codex couldn't solve three weeks ago (I had asked it about ten times before). Today I tried the new version, and it solved them in one go, following my instructions exactly.
I saw a post about what the CEO of Anthropic said, that 90% of the code will be generated by AI. I think he's right - but Anthropic hasn't achieved this. From my two-hour experience, I think Codex will ultimately write nearly 75% of my code, 15% will be written by myself, and 10% will be written by Claude, at least in situations where the context is controllable."
Some people even feel their jobs are under threat, given that GPT-5-Codex can work efficiently for 7 hours straight:
"When this service can run stably at night and on weekends, the rules of the game will change completely. Junior developers simply can't compete with it. After all, the cost of this service is only between $20 and $200, while hiring a junior developer costs a company between $5,000 and $10,000 per month. When you factor in the costs of sick leave, holidays, weekend overtime pay, insurance, etc., this service can save the company 500 to 1000 times the cost of hiring a junior developer.
It is foreseeable that the industry is about to undergo a huge transformation. If I could go back to college and choose a major again, I probably wouldn't consider majoring in computer science."
Some people sighed that in the current era of abundant AI programming tools, programming is no longer about writing code in the traditional sense but increasingly about architecture design. This netizen said:
"The focus of programming will shift more towards architecture design rather than simply writing raw code. The past model of hiring junior engineers just to implement certain functions envisioned by architects or senior engineers will gradually lose its meaning.
For me, even if software is developed by artificial intelligence in the future, programming will still be full of fun. Because I think the real fun lies in: when my ideas are implemented, they seem to 'come alive' in a sense. And making all the code work together smoothly involves many challenges and problem-solving, which are always difficult for artificial intelligence lacking general intelligence to overcome.
Therefore, I think the programming profession will basically not disappear completely until we truly achieve AGI (Artificial General Intelligence)."
Large Amounts of Capital Flowing into AI Coding Tools
This update is part of OpenAI's effort to make Codex more competitive with other AI coding products, such as Anthropic's Claude Code, Anysphere's Cursor, and Microsoft's GitHub Copilot.
Due to strong user demand, the AI coding tool market has become more crowded in the past year.
Anysphere, the manufacturer of Cursor, just completed a $900 million financing round in June, with a valuation of $9.9 billion. This round of financing was led by returning investor Thrive Capital, with participation from Andreessen Horowitz, Accel, and DST Global.
This large-scale financing is Anysphere's third financing round in less than a year. As first reported by TechCrunch, this three-year-old startup received a $100 million financing round at a valuation of $2.5 billion at the end of last year.
A person familiar with the matter told TechCrunch that Anysphere's annual recurring revenue (ARR) approximately doubles every two months. The source told Bloomberg that the company's ARR has exceeded $500 million, a 60% increase from the reported $300 million in mid-April.
At the beginning of this month, Anthropic, the manufacturer of Claude, announced that it had completed a new round of financing, raising $13 billion, making it one of the most valuable startups in the world, with its valuation almost tripling to approximately $183 billion. The artificial intelligence company initially planned to raise $5 billion but repeatedly raised its target due to strong investor demand.
Anthropic was founded in 2021 and has achieved explosive growth since then. Its recurring revenue increased fivefold from January to August this year alone. However, it also faces fierce competition from other rapidly growing artificial intelligence companies such as OpenAI and Meta.
Also this month, Replit, the fastest-growing agentic AI software creation platform, announced the completion of a $250 million financing round at a $3 billion valuation, nearly triple its previous round in 2023. The financing comes as Replit's annual recurring revenue has grown from $2.8 million to $150 million in less than a year - an increase of more than 50 times - thanks to its community of more than 40 million global users. Prysm Capital led the round, with strategic investors including Amex Ventures and Google AI Futures Fund. Existing backers such as YC, Craft, a16z, and Coatue, along with Paul Graham, are increasing their investments.
The similar code editor Windsurf, meanwhile, went through a chaotic acquisition that left its team split between Google and Cognition.
Against this backdrop of huge financing and fierce competition, the AI coding race is entering an unprecedented spotlight moment: tech giants are increasing their investment, startups are sprinting forward, and capital is pouring in. Behind the frenzy, however, whether the market can produce products with sustainable vitality remains to be verified by time. Whether they are star companies like OpenAI and Anthropic or emerging players like Replit and Anysphere, all face a common question - how to make AI coding tools truly integrate into the development process and improve productivity, rather than remain a "valuation game".
Reference Links:
https://openai.com/index/introducing-upgrades-to-codex/
https://www.reddit.com/r/OpenAI/comments/1nhuoxw/sam_altman_just_announced_gpt5_codex_better_at/
https://www.swebench.com/
This article is from the WeChat official account "AI Frontline". Compiled by Dongmei. Republished by 36Kr with authorization.