Seven hours of continuous refactoring without a single disconnection. The long-unrivaled Claude finally meets a rival: Greg Brockman personally unpacks a major breakthrough in AI programming.
On September 16th, OpenAI officially launched a new model, GPT-5-Codex: a fine-tuned variant of GPT-5 built specifically for the company's various AI-assisted programming tools. According to OpenAI, the new model's "thinking" time is far more dynamic than previous models': a coding task can take anywhere from a few seconds to seven hours, which is why it performs better on agentic coding benchmarks.
The release of GPT-5-Codex caps what may be the most dramatic mood shift the "coding agents" field has seen recently.
For over a year, from Claude 3.5 Sonnet last June, to 3.7 Sonnet and Claude Code in February, to Claude 4 in May, Anthropic has been nearly unrivaled in coding, firmly holding the dominant position. Over that period the company's revenue soared to $5 billion (10% of it from Claude Code), and its valuation reached $183 billion, a jump of $122 billion.
All of this has clearly ignited OpenAI's fighting spirit. As early as 2021, OpenAI released the original Codex, which gave birth to GitHub Copilot, the world's first AI programming tool (182 developers still actively contribute to it today); GPT-3 likewise inspired Debuild, foreshadowing the entire wave of vibe-coding startups that followed. Since then, OpenAI has put coding capability back near the top of its priority list with o1 and GPT-4.1.
GPT-5-Codex scores 74.5% on SWE-bench, nearly matching GPT-5 thinking's 74.9% on the same 477-task subset. So what has driven such a major reversal in GPT-5's overall reputation?
One reason is that the Codex team has simply been putting in the work.
First, it is one unified agent with many surfaces. Greg explained on a podcast today:
"At the beginning of the year, we set a company goal: to create an agent-style software engineer by the end of the year. Figuring out what this really means, how to achieve it, and how to integrate all opportunities and computing power is a huge task shared by many people at OpenAI."
The first agentic-SWE harness was called 10X and ran in the terminal. Now, with the new Codex CLI, "ChatGPT Codex" (since renamed Codex Cloud), the IDE extension (over 800,000 installs in 2.5 weeks), and the GitHub code-review bot, OpenAI has assembled a complete set of surfaces covering every kind of need.
Second, it has better post-training characteristics. OpenAI has always emphasized pairing research tightly with product. Today's podcast highlighted several important traits, the most significant being a marked improvement on long-running agentic tasks.
Thibault Sottiaux said:
"This model demonstrates an ability to persevere for a longer time and has the 'tenacity' required for complex refactoring tasks.
But at the same time, for simple tasks, it responds very quickly and can give answers without much thought. This makes it a great collaborator - you can ask questions, locate code, and plan solutions; and once you let it handle a task, it can work continuously for a long time.
We've seen it work continuously for 7 hours internally to complete a very complex refactoring, something no other model has achieved before. We've also invested a great deal of effort in code quality, and it has been optimized specifically for the actual needs of Codex users."
This well-calibrated "tenacity" is the key to making GPT-5-Codex a more complete, practical agentic coding model: it isn't merely optimized for the hardest problems while forcing users onto a "dumber" model for simpler tasks.
We've translated the full content of this podcast interview to take you deep into how the OpenAI team built GPT-5-Codex and the technology and stories behind it.
1 Why Programming Is a Special Exception in AGI Research
Andrew Mayne: Today we're going to talk about Codex. I've actually used it since the earliest version when I was still working here. Now that you have a new version, I spent the whole weekend playing with it, and I'm really shocked. I really didn't expect this technology to have advanced to this level in just a few years. I'm really curious about the origin story: how did you first come up with the idea of using a language model to write code?
Greg Brockman: I remember in the GPT-3 era, I first saw signs that, given a docstring or the definition of a Python function, the model could complete the code. The first time you see that, you know it's going to work and it's going to be big. We also talked about idealized goals back then. For example, imagine a language model that could write a thousand lines of coherent code. That would be amazing. That was our big goal at the time. Now that goal has long since been met and surpassed; in fact, we're so used to it that it no longer seems strange. But during R&D you mostly see the model's holes and defects. Only occasionally, when you step back, do you realize how far the technology has actually come.
Thibault Sottiaux: Yeah, it's incredible how accustomed we humans are to this continuous improvement. It quickly becomes a daily tool. We use it every day, and then when we look back, something that was completely impossible a month ago has now become routine. It's really fascinating how quickly humans can adapt to new things.
Greg Brockman: Still, we've always faced a dilemma about whether to specialize in any one domain. Our mission is AGI, general intelligence, so the intuition is to improve every capability evenly. But programming has always been the exception.
We have a completely different research plan for programming, focusing on programming data, code metrics, and the model's performance on code tasks. Later, we also started trying this approach in other fields. But for programming, we've always given it special attention.
For example, alongside GPT-4, which finally shipped as one comprehensive large model, we in fact also trained a Codex model and another model skewed toward Python. Around 2021 we pushed hard to take coding ability to its limit. The Codex demo we did back then was perhaps the earliest prototype of what's now called vibe coding.
I still remember, while building the interactive interface, suddenly realizing that an ordinary language model's interaction is very simple: complete a sentence, continue a conversation. Code is different. Code has to be "alive"; it has to be executed and connected to tools. At that point you discover that the so-called interaction shell (harness) is just as important as the intelligence of the model itself, and it determines whether the model is actually usable. We understood that from that moment on.
This year, with a more powerful model, we're not just entering programming competitions and chasing raw capability; we're making it genuinely useful. So we introduced diverse environments into training, connected the model to real development scenarios, and paired it with a suitable harness. That's the direction Thibault and his team have been pushing.
Andrew Mayne: Can you explain "harness" in a simpler way?
Thibault Sottiaux: It's actually very simple. The model itself is just an input-output system. The harness is everything that integrates it with the rest of the infrastructure so it can actually act on its environment: the tools, plus the looping logic, what we call the "agent loop." It looks simple on the surface, but once you train these pieces end to end, you see some amazing behaviors: the model can act and create on your behalf and become a real collaborator. Think of it as an analogy: the brain is the model, and the harness is the body.
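To make the brain-and-body analogy concrete, here is a minimal sketch of what an agent loop can look like. Everything in it, the tool set, the `next_action` interface, the message format, is hypothetical; it illustrates the general shape of a harness, not OpenAI's actual implementation.

```python
# Minimal, illustrative agent loop. The tool set, the `next_action` interface,
# and the message format are all hypothetical -- this sketches the general
# shape of a harness, not OpenAI's actual implementation.
import subprocess

def run_shell(command: str) -> str:
    """One 'tool' the harness exposes: run a shell command, return its output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

def agent_loop(model, task: str, max_steps: int = 50) -> str:
    """Alternate between asking the model for its next action and executing
    that action against the environment, until the model says it's done."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model.next_action(transcript)    # hypothetical model interface
        transcript.append({"role": "assistant", "content": str(action)})
        if action.kind == "finish":               # model declares completion
            return action.summary
        if action.kind == "shell":                # model asked to run a command
            observation = run_shell(action.command)
            transcript.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```

The key design point Thibault describes is that this loop is not bolted on afterward: the model is trained end to end with the same tools and transcript format it will see at inference time.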
Andrew Mayne: Right, it's very interesting. Think back to the GPT-3 era: we still had to annotate code, adding # comments before a Python function to tell the model what the function does. Now the model writes code naturally and intuitively. You just mentioned the difference between a general model and a programming-specific model - is that driven by high user demand, or by your own desire to use it that way?
Greg Brockman: It's both. For example, in 2022 we worked with GitHub to launch Copilot. That was the first time we really felt what it was like for AI to enter the programming workflow: it sped you up. But there were plenty of problems too, such as interface design: should it be ghost-text autocompletion, or a drop-down menu of options? One thing was clear, though: latency is itself a product feature. The latency ceiling for autocompletion is 1,500 milliseconds; beyond that, nobody will wait, no matter how smart the model is. So the consensus then was to use the smartest model that fit within the latency limit. Later, GPT-4 was smarter but couldn't meet the latency requirement. So what do you do? We found the answer was to change the harness and the interface. You have to let the interaction pattern and the model's capability evolve together.
A fast, smart model is of course ideal, but an even smarter, slower model is still worth it, because the returns on intelligence always show up in the long run.
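As a toy illustration of the "smartest model within the latency limit" rule Greg describes - the model names, capability ranks, and latency figures below are all invented:

```python
# Toy illustration of "use the smartest model within the latency limit".
# Model names, capability ranks, and latency figures are all invented.
AUTOCOMPLETE_BUDGET_MS = 1500   # past this, users stop waiting for ghost text

# (name, capability rank, typical completion latency in ms)
MODELS = [
    ("big-slow-model",    3, 4000),
    ("mid-code-model",    2, 1100),
    ("small-fast-model",  1,  250),
]

def pick_model(budget_ms: int) -> str:
    """Return the most capable model whose typical latency fits the budget."""
    in_budget = [(rank, name) for name, rank, ms in MODELS if ms <= budget_ms]
    return max(in_budget)[1] if in_budget else "small-fast-model"

print(pick_model(AUTOCOMPLETE_BUDGET_MS))   # -> "mid-code-model"
```

Under this rule, a GPT-4-class model that blows the budget loses to a weaker model that answers in time - which is exactly why, as Greg says, the fix was to change the harness rather than keep shrinking the model.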
Andrew Mayne: When we were building GitHub Copilot back then, I really didn't get this. I thought it was enough for the model to complete code; I didn't realize how much difference the harness and tools make. Now there's the CLI - Codex CLI - which I can use at the command line, plus the VS Code plugin, and it can even run directly on the web. I didn't fully appreciate the value of all that at the time. How do you use these things yourselves? Where do you find them most useful?
Thibault Sottiaux: It goes back to an early observation: many developers were using ChatGPT to debug very complex problems, cramming in ever more context - code snippets, stack traces - and pasting it all to the model for help. As those interactions grew more elaborate, it hit us that instead of having the user drive, we should let the model find the context itself, reason and debug on its own, while the user simply watches it work. That shift in thinking made us take the harness seriously and give the model the ability to act autonomously.
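The shift Thibault describes amounts to handing the model tools like the two hypothetical ones sketched below, so it can pull in code snippets and trace through files on its own instead of waiting for the user to paste them:

```python
# Hypothetical context-gathering tools of the kind Thibault describes: rather
# than the user pasting snippets and stack traces, the model calls these itself.
from pathlib import Path

def grep_repo(pattern: str, root: str = ".") -> list[str]:
    """Let the model search the codebase for a symbol or an error string."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern in line:
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits[:50]   # cap the output so the transcript stays manageable

def read_file(path: str, start: int = 1, end: int = 200) -> str:
    """Let the model open exactly the slice of a file it wants to inspect."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    return "\n".join(lines[start - 1:end])
```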
2 CLI, IDE, and Terminal: Where Each Tool Fits
Greg Brockman: We were also trying different forms at that time. At the beginning of the year, we had several different implementations, such as an asynchronous agent harness and a local experience.
Thibault Sottiaux: We also ran a prototype in the terminal. At the time we decided it wasn't "AGI-like" enough: we wanted something scalable that ran remotely, so you could close your laptop and it would keep working, and you could even check in on it from your phone. That felt so cool that we pushed in that direction. But the terminal version was in fact fully usable, and many people inside OpenAI used it productively. The tool was called "10X" back then because it genuinely made you ten times more productive. In the end, though, we didn't ship it as a product because we felt it wasn't polished enough.
So we tried more forms and first focused on the asynchronous aspect. Now we've come back to bringing the agent back to the terminal and the IDE. The ultimate goal is to make it a collaborator by your side and embed it in the development tools you're already using.
Greg Brockman: We also made other attempts, such as connecting a local agent to a remote daemon process so that we could have the best of both worlds. In fact, we found that there's almost a "form matrix" here. It can be asynchronous, locally synchronous, or a hybrid. The problem is where we should focus our efforts: should we make it a general, externalized tool that can adapt to various environments, or should we first focus on the internal environment and make the experience of internal engineers perfect? Of course, we want to do both. But if we can't even use it well ourselves, how can we expect the whole world to use it well? This is our challenge, to find the right focus and make our engineering efforts generate the greatest value.
One of the company goals we set this year is to build an "agentic software engineer" by year's end. What does that mean? How do we get there? How much in resources and compute do we pour into it? It's a huge project, and a lot of people are hard at work on it.
Andrew Mayne: You just mentioned the internal tool 10X. It sounds very useful, yet you never released it publicly. Decisions like that must be hard, right? When do you ship something, and when don't you? The cloud-based Codex is now very powerful, and I guess it's a similar