
OpenAI's most powerful programming model, GPT-5-Codex, is here, capable of working continuously for 7 hours without "getting tired".

Zhidx, 2025-09-16 10:05
The first specially optimized version of GPT-5 is here!

According to Zhidx on September 16, early this morning OpenAI released a new model, GPT-5-Codex: a version of GPT-5 specifically optimized for software engineering that further strengthens the agentic coding capabilities of Codex.

OpenAI mentioned in its blog that the training of GPT-5-Codex focuses on practical software engineering work. It can dynamically adjust the thinking time according to the task and can work independently for more than 7 hours on large and complex tasks.

Meanwhile, compared with GPT-5, GPT-5-Codex improves accuracy on multiple benchmarks and raises the share of high-impact comments in code reviews.

Just over two hours after the release, Sam Altman, co-founder and CEO of OpenAI, revealed on X that GPT-5-Codex already accounts for about 40% of all Codex traffic and is expected to exceed half of it within the day.

GPT-5-Codex is available everywhere developers use Codex. It is the default for cloud tasks and code reviews, and developers can also opt into it for local tasks through the Codex command-line interface (CLI) or the IDE extension.

OpenAI first launched the open-source programming agent Codex CLI in April this year and the web version of Codex in May. Two weeks ago, it integrated Codex into a single product experience connected through ChatGPT accounts, enabling developers to seamlessly migrate their work between the local environment and the cloud without losing context.

Codex is included in the ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. The Plus, Edu, and Business plans cover several focused coding sessions per week, while the Pro plan covers a full workweek of use across multiple projects. For developers using Codex CLI through API keys, OpenAI plans to make GPT-5-Codex available in the API soon.

In the replies to OpenAI's announcement on X, developers said the new release looks promising for complex projects, while some worried about their AI tool subscription budgets.

01 Thinking time adjusts dynamically to the task, with fewer incorrect comments and more high-impact ones

GPT-5-Codex is trained on complex, real-world engineering tasks such as building a complete project from scratch, adding features and tests, debugging, performing large-scale refactoring, and conducting code reviews. It follows the instructions in AGENTS.md more faithfully and generates higher-quality code, so developers only need to state their requirements without writing lengthy descriptions of code style or cleanliness.
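For illustration, AGENTS.md is a plain Markdown file kept in the repository that states conventions the agent should follow. The contents below are a hypothetical sketch, not taken from OpenAI's documentation:

```markdown
# AGENTS.md (hypothetical example)

## Code style
- Use 4-space indentation and type hints in Python code.

## Testing
- Run the test suite before proposing a change; all tests must pass.

## Boundaries
- Do not modify files under vendor/ or anything that is auto-generated.
```

With a file like this in place, a request as short as "add pagination to the orders endpoint" carries the style and testing expectations implicitly.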

In addition, GPT-5-Codex dynamically adjusts its thinking time to the complexity of the task: execution can range from a few seconds to 7 hours. The model combines two core skills of a programming agent: pairing with developers in interactive sessions, and executing longer tasks independently and persistently. In practice, Codex feels snappier on small, well-defined requests and casual exchanges, yet can keep working far longer on complex jobs such as large-scale refactoring.

Historically, including at GPT-5's launch, OpenAI reported results on only 477 of the tasks in SWE-bench Verified, a 500-task benchmark that measures a model's ability to solve real software engineering tasks, because some tasks could not run in its infrastructure at the time. That problem has since been fixed, and OpenAI now reports results on all 500 tasks: GPT-5-Codex scores 74.5% accuracy, versus 72.8% for GPT-5.

OpenAI tested the new model's refactoring ability on refactoring-style tasks drawn from large, mature codebases in languages including Python, Go, and OCaml. GPT-5-Codex reached 51.3% accuracy on this test, versus 33.9% for GPT-5.

In the test, researchers found that GPT-5-Codex can independently handle large and complex tasks for more than 7 hours, continuously iterate on the implementation, fix test errors, and finally deliver successfully.

Based on usage by OpenAI's internal employees, researchers sorted user interaction turns by the number of tokens the model generated and found that, in the bottom 10% of turns by generated tokens, GPT-5-Codex used 93.7% fewer tokens than GPT-5.

Conversely, in the top 10% of turns, GPT-5-Codex thinks harder, spending twice as long as GPT-5 on reasoning, code editing, testing, and iteration.

GPT-5-Codex can also conduct code reviews and find critical defects. During a review, it navigates the developer's codebase, reasons about dependencies, and runs the code and tests to verify correctness.

OpenAI evaluated code-review performance on recent commits in popular open-source repositories, with experienced software engineers rating the correctness and importance of the review comments on each commit.

About 13.7% of GPT-5's comments were incorrect, versus only 4.4% for GPT-5-Codex. High-impact comments made up 39.4% of GPT-5's output and 52.4% of GPT-5-Codex's. Per pull request, GPT-5 left an average of 1.32 comments, GPT-5-Codex 0.9.

They found that the comments of GPT-5-Codex are less likely to be incorrect or unimportant.

According to TechCrunch, Alexander Embiricos, product lead for OpenAI Codex, said in a briefing that GPT-5-Codex's large performance gains stem largely from its dynamic thinking ability. Users may be familiar with GPT-5's real-time router in ChatGPT, which directs queries to different models according to task complexity. GPT-5-Codex works similarly but without a built-in router: it adjusts how long it spends on a task in real time. That is an advantage over a router, which decides up front how much compute and time to devote to a problem, whereas GPT-5-Codex can decide, five minutes into a problem, to spend an additional hour on it.

OpenAI's official blog also notes that, unlike the general-purpose GPT-5, it recommends using GPT-5-Codex only for agentic coding tasks in Codex or Codex-like environments.

02 Three core improvements make the agent programming workflow more automated

In addition, OpenAI has recently made some updates, including an improved Codex CLI and a new Codex IDE extension.

First, the Codex CLI.

Based on feedback from the open-source community, OpenAI rebuilt Codex CLI around the agentic coding workflow. Developers can now attach and share images directly in the CLI, including screenshots, wireframes, and diagrams, to build shared context around design decisions and get exactly what they need.

When handling more complex work, Codex can now use a to-do list to track progress and includes tools such as web search and MCP for connecting to external systems, thereby improving the overall accuracy of tool use.

The terminal user interface has been upgraded with clearer, easier-to-read formatting for tool calls and diffs.

Approval modes have been simplified to three levels: read-only (edits and commands require explicit approval), auto (full access within the workspace, but approval required to act outside it), and full access (read files anywhere and run commands with network access). The CLI also supports compacting the conversation state, making longer conversations easier to manage.
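As a sketch of how this surfaces in the open-source Codex CLI, approval and sandbox behavior can be set in its configuration file (commonly `~/.codex/config.toml`). The key names and values below are assumptions based on the CLI's options and may differ across versions:

```toml
# Hypothetical ~/.codex/config.toml fragment (key names assumed; check your CLI version)

# When the agent must pause and ask before acting
approval_policy = "on-request"    # other values include "untrusted", "never"

# How far the sandbox lets the agent reach on disk and network
sandbox_mode = "workspace-write"  # e.g. "read-only", "workspace-write", "danger-full-access"
```

Roughly, `read-only` plus a strict approval policy corresponds to the article's first level, `workspace-write` to auto, and the unrestricted mode to full access.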

Second, the Codex IDE extension

The IDE extension connects the Codex agent to VS Code, Cursor, and other editors derived from VS Code, letting developers preview local code changes and edit code collaboratively with Codex.

When developers use Codex in the IDE, they can get results by entering shorter instructions because Codex can use context information, such as the files opened by the developer or the selected code snippet.

The Codex IDE extension allows developers to switch the workflow between the cloud environment and the local environment. Developers can create new cloud tasks, track ongoing work, and view completed tasks without leaving the editor.

If developers need to make final adjustments to the code, they can directly open cloud tasks in the IDE, and Codex will fully retain the relevant context information.

In addition, OpenAI has also been improving the performance of its cloud infrastructure. By caching containers, it has reduced the average completion time of new tasks and subsequent tasks by 90%. Codex can now automatically set up the environment by scanning and executing commonly used installation scripts; with configurable Internet access, it can execute commands like pip install at runtime to obtain dependencies as needed.
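The scanning idea described above can be sketched as a small shell function that checks a project directory for common dependency manifests and reports the install commands an agent could run. The file names and commands are generic ecosystem conventions, not OpenAI's actual detection logic:

```shell
#!/usr/bin/env bash
# Sketch: detect common dependency manifests in a project directory and
# print the install commands that would apply. Purely illustrative.
set -euo pipefail

plan_setup() {
  local dir="$1"
  if [ -f "$dir/requirements.txt" ]; then
    echo "pip install -r requirements.txt"
  fi
  if [ -f "$dir/package.json" ]; then
    echo "npm ci"
  fi
}

# Example: a throwaway project that only declares Python dependencies
tmp="$(mktemp -d)"
touch "$tmp/requirements.txt"
plan_setup "$tmp"   # prints: pip install -r requirements.txt
```

A real agent would then execute the detected commands inside its sandbox, which is why configurable internet access matters for fetching dependencies at runtime.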

As in the CLI and IDE extension, developers can now share front-end design specifications with Codex by uploading images such as interface prototypes and visual mockups, or upload screenshots of misaligned layouts and broken styles to illustrate UI bugs.

When Codex builds front-end features, it can launch a browser on its own to inspect the result and iterate. When finished, it attaches screenshots of the outcome to the task and to the GitHub pull request.

In code reviews, Codex can be used to find critical defects.

Unlike static-analysis tools, it checks the intent declared in the pull request against the actual diff, reasons over the entire codebase and its dependencies, and verifies real runtime behavior by executing code and test cases.

Once developers enable Codex in a GitHub repository, when a pull request changes from the draft state to the ready state, Codex will automatically review it and publish the analysis results on the pull request.

If Codex suggests making modifications, developers can ask Codex to directly implement these modifications in the same conversation thread.

Developers can also explicitly mention @codex review in the pull request to request a review, for example, @codex review for security vulnerabilities or @codex review for outdated dependencies.

Codex is currently used within OpenAI to review most of its pull requests, discovering hundreds of problems every day, often before manual reviews begin.

03 Conclusion: The competition among AI programming tools is intensifying

Competition among AI programming tools has grown increasingly fierce, with several major products in the fray: OpenAI's Codex, Anthropic's Claude Code, Anysphere's Cursor, and Microsoft's GitHub Copilot. Cursor's annual recurring revenue (ARR) exceeded $500 million in early 2025, and the AI code editor Windsurf went through a chaotic acquisition that split its team between Google and Cognition.

The Codex upgrade and the release of a model specifically optimized for agentic coding significantly strengthen its automated programming and collaboration capabilities, underscoring how competition among AI programming tools continues to heat up.

This article is from the WeChat official account “Zhidx” (ID: zhidxcom), author: Cheng Qian. It is published by 36Kr with authorization.