
OpenAI's most powerful programming model makes its debut, but in hands-on tests it is outpaced by Gemini 3 Flash.

Zhidx, 2025-12-19 11:48
Its benchmark scores surpass GPT-5.2's.

On December 19th, Zhidx reported that early this morning, OpenAI released its latest programming model, GPT-5.2-Codex. Built on GPT-5.2, the model has been deeply optimized for agentic programming, with specific improvements in long-horizon task execution, large-scale code changes, Windows-environment compatibility, and cybersecurity defense. OpenAI stated in its blog that this is its most powerful programming model to date.

According to the official OpenAI blog, GPT-5.2-Codex not only inherits the strengths of GPT-5.2 but also integrates the cutting-edge agentic programming and terminal-operation capabilities of GPT-5.1-Codex-Max. It is designed specifically for demanding professional work such as complex real-world software engineering and cybersecurity.

OpenAI has released GPT-5.2-Codex first in the Codex CLI, IDE extensions, the cloud, and code review. It is available to all paid ChatGPT users starting today, and API access will follow soon.

Notably, just before the release of GPT-5.2-Codex, Google announced its Gemini 3 Flash model. Some netizens pitted GPT-5.2-Codex against Gemini 3 Flash on the same task, and GPT-5.2-Codex came out behind. In a vulnerability review of 50 files, Gemini 3 Flash took 1 minute 2 seconds and found 5 issues, while GPT-5.2-Codex took 4 minutes 48 seconds and found only 2 issues, both of which Gemini 3 Flash had already identified.

GPT-5.2-Codex's performance may fall short of expectations. Some netizens note that its improvement on SWE-Bench Pro is less than 1%, and that no SWE-Bench Verified results have been released. This has fueled speculation that GPT-5.2-Codex does not reach the current state of the art, and that it has regressed on some benchmarks.

According to the official OpenAI blog, GPT-5.2-Codex adds native context compression and shows improvements in long-context understanding, tool invocation, and factual accuracy. Token efficiency during inference has been improved, and the model can more accurately interpret screenshots, technical diagrams, data charts, and user interfaces shared during coding. In native Windows environments, GPT-5.2-Codex further upgrades GPT-5.1-Codex-Max's capabilities, with more efficient and reliable agentic programming performance.

GPT-5.2-Codex also improves on real-world software engineering tasks, including codebase navigation, refactoring, and creating and reviewing pull requests.

In benchmark tests, GPT-5.2-Codex scored 56.4% on SWE-Bench Pro, which evaluates the ability to fix real-world code problems, surpassing GPT-5.2's 55.6% and GPT-5.1's 50.8%. On Terminal-Bench 2.0, which measures tasks such as compilation and server configuration, it scored 64.0%, well ahead of the previous GPT-5.1-Codex-Max at 58.1%, demonstrating progress in solving agentic tasks through the command line and terminal.

According to the official OpenAI blog, in cybersecurity, GPT-5.2-Codex set the best record of any model on Capture The Flag (CTF) challenges. The blog's line chart also shows OpenAI's models improving steadily on cybersecurity evaluations. OpenAI said it is comprehensively upgrading its cybersecurity safeguards and introducing a trusted-access mechanism to support defensive work.

OpenAI CEO Sam Altman said that last week a security researcher used GPT-5.1-Codex-Max to discover and disclose a vulnerability in React that could have led to source-code leakage, reflecting the practical value of applying model capabilities to cybersecurity. Altman added that these models are still being improved and will ultimately benefit cybersecurity.

Conclusion: Intensifying Competition in AI Programming Tools

GPT-5.2-Codex is another iteration of OpenAI's programming models. By improving long-horizon task handling, large-scale code changes, and performance in specific environments, it offers stronger support for complex development and security research, and is expected to become a powerful tool for discovering and fixing vulnerabilities.

Just before OpenAI's update, on the same day, Google released the low-cost Gemini 3 Flash model, and competition in AI programming remains fierce. For now, the real-world effectiveness of GPT-5.2-Codex, billed as OpenAI's most powerful programming model, and its performance against competing products may fall short of expectations. The model's practical results and performance tests will likely be the focus going forward.

This article is from the WeChat official account "Zhidx" (ID: zhidxcom), written by Wang Xinyi and edited by Cheng Qian. Published by 36Kr with permission.