OpenAI has also "distilled" engineers' experience into skills. The author of a viral article on Harness Engineering has revealed the internal practices: a system with over one million lines of code was developed with no manual coding and no human code review at any point in the process.
Recently, Ryan Lopopolo from OpenAI published a lengthy article titled Harness Engineering, which has become a hot topic in the industry.
In that article, Ryan systematically revealed for the first time how OpenAI's newly established Frontier team operates: this team is now the heaviest user of Codex within OpenAI. They maintain a codebase of over one million lines of code, and not a single line of code in the entire system is manually written. More importantly, there is no human code review before code merging.
Ryan is almost evangelical about this approach. He even bluntly stated that if you are not using over 1 billion tokens per day these days, you are practically "derelict in your duty". Based on current market prices and caching assumptions, this translates to roughly $2,000 to $3,000 in token costs per day.
Over the past few months, they conducted an extremely ambitious experiment: building and delivering an internal test product from scratch, with no human-written code at any stage of the process. Out of this experiment, they gradually developed a completely different approach to engineering work: when an agent fails, the team no longer thinks "let's try a different prompt" or tells it to "try harder"; instead, they ask: what specific capability, what category of context, or what layer of structure is missing?
This experiment ultimately produced Symphony. Ryan describes it as a "ghost library", a reference implementation in Elixir completed by Alex Kotliarskyi. Its purpose is to build an entire large-scale Codex agent system. Each agent is fed extremely detailed prompts, detailed enough to match a real product requirements document, without directly giving the full implementation.
As a result, the outline of the future is becoming increasingly clear: coding agents are no longer just a co-pilot sitting next to you; they are gradually becoming a "teammate" that anyone can actually rely on. Codex has also been doubling down on this direction, and even sending a strong message to the outside world: you just need to start building directly.
Ryan has long been pushing one question: what happens if you stop optimizing your codebase, workflow and organization around human usage habits, and instead start optimizing them for agent readability?
During an appearance on the Latent Space and AI Engineer podcast, he shared in detail how the concept of Harness Engineering originally came about, as well as the core constraint that launched the entire experiment: Ryan deliberately chose not to write any code himself from the start, forcing agents to complete the entire job from start to finish.
He also covered how OpenAI's internal team actually uses Codex; why human attention has become the real bottleneck in AI-native software development; and why they are so obsessed with build speed: why "one minute" is set as the hard upper limit for the inner loop, and how the team rebuilds its build system over and over just to keep agents consistently productive.
We have translated and edited this original conversation, with minor omissions made without altering the core meaning, for our readers.
Building Products With Unlimited Tokens, Zero Human-Written Code
swyx: You recently published a landmark article on harness engineering, which is very likely to become the most representative piece on this emerging direction.
Ryan Lopopolo: Thank you. I find it really interesting that we've actually sort of established the framework for this discussion early on.
swyx: This is your first time on a podcast, right? Let's start with this: which team are you on right now, and what do you do?
Ryan Lopopolo: Sure. I'm currently on the Frontier Product Exploration team, working on new product development for OpenAI's Frontier line. OpenAI Frontier is essentially our enterprise platform, designed to let enterprises deploy agents at scale, in a governable and secure way. Our team's job is to explore new approaches to packaging models into products and solutions that enterprises will actually pay for.
swyx: Let me add your background for context: you've previously worked at Snowflake, Brex, Stripe, and Citadel.
Ryan Lopopolo: Yeah, I've basically spent my entire career serving enterprise customers.
Vibhu: I have to say, when I followed you on Twitter before, I never expected this background. Your online vibe is totally all-in on AI, all-in on coding, like you'd have your laptop on your lap coding even while riding a Waymo. But when I saw your resume, it was a total surprise, and then I thought, "Oh, that actually makes perfect sense."
Ryan Lopopolo: I think if you really want to be an AI maximalist, OpenAI really is the best place for it.
swyx: The one thing you don't lack is tokens, right?
Ryan Lopopolo: That's right. No rate limits internally, that's a huge help. So I really can go all in, just like you said.
swyx: So that means, within Frontier, you're a relatively special team.
Ryan Lopopolo: That's right. The company really gave us space to experiment on our own, which is really exciting.
That's why I set what sounds like an extreme constraint for myself from the start: I wouldn't write any code myself. My reasoning was simple: if we're building agents that will actually be deployed in enterprises in the future, they should be able to do the work I do on a daily basis. After working with these coding models and coding harnesses for several months, I really do feel that both the model itself and the harness layer have evolved to the point where they're "isomorphic" to me, meaning their ability to get work done is close enough to mine.
So once I added the "no writing code myself" constraint from the start, the only way I could get my work done was to let agents do it for me.
When the Model Fails, The Problem Isn't Always The Prompt
Vibhu: This is really the core experiment from your article. Over several months, you developed an internal tool with zero manually written code, and the entire codebase adds up to over 1 million lines. You even said it's basically faster than if you wrote it all yourself. So that was your approach from the start?
Ryan Lopopolo: Yeah, that's exactly it.
We started with a very early version of the Codex CLI paired with the Codex Mini model. It was obviously far less capable than the models we have today, but that was actually a good constraint. The experience was very direct: you ask the model to build a product feature for you, and it just can't put all the pieces together.
And that frustration pushed us to gradually develop a core approach: whenever the model can't get it done, you have to immediately break the task apart, build smaller foundational components, and then assemble them back up to the larger goal.
To be honest, the process was really painful at first. For the first month and a half, we were moving at about 1/10 the speed I would have had if I were writing the code myself. But because we paid that upfront "tuition", we ended up building a whole set of tools and a build stack that let agents get the entire job done in the end. And once this system was in place, its productivity far outstripped any single human engineer.
Later, we went through model iterations from GPT-5, 5.1, 5.2, 5.3, to 5.4. You can really feel that each generation of model has its own quirks and different working styles. That means when the model upgrades, we have to adjust the entire codebase along with it, "shifting gears" to match.
A really interesting example: with 5.2, the Codex harness didn't have background shell capability, so back then we had to rely on blocking scripts for long-running tasks. But when 5.3 launched with background shell, the model became less patient, it didn't want to just sit there waiting. So we had to rebuild the entire build system from scratch, with the goal of getting build time under one minute.
If this were a normal human team maintaining a codebase, I'd almost say this would be impossible. Because humans have their own preferences, endless debates, and agonize over whether it's worth the effort. But our only goal back then was to maximize agent productivity on a one-week timeline. So we switched from custom Makefile builds, to Bazel, then to Turbo, then to Nx, and stopped with whichever was fastest.
Why We Cap Build Time Firmly At One Minute
swyx: That's really interesting, tell us more about Turbo and Nx, because most people go the opposite direction.
Ryan Lopopolo: To be honest, I don't actually have that much hands-on experience with frontend repository architecture myself.
swyx: You said Jessica built the whole system. I know the Nx team, and I know Jared Palmer and the Turbo team. I find this contrast really interesting.
Ryan Lopopolo: But the mountain we had to climb back then really only had one goal: make it faster.
swyx: Do you have micro frontends here? Or is it just really high React complexity?
Ryan Lopopolo: It's an Electron-based monolith, that's basically the structure.
swyx: And it has to be under one minute? That's an interesting constraint. I'm actually not that familiar with the background shell feature, I think it was mentioned in the 5.3 release.
Ryan Lopopolo: Basically, Codex can spin up commands in the background, and keep working on other things while waiting for them to finish. For example, it can kick off a really time-consuming build, then review code while it waits. That makes for much better time utilization for people using the harness.
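The pattern Ryan describes here can be sketched in plain Python. This is an illustration of the general background-task idea, not Codex's actual (non-public) mechanism; the build command is a placeholder.

```python
import subprocess
import time

# Start a slow "build" without blocking. The shell command is a stand-in
# for a real build invocation.
build = subprocess.Popen(
    ["sh", "-c", "sleep 1 && echo build-ok"],
    stdout=subprocess.PIPE,
    text=True,
)

# Do useful foreground work while the build runs in the background,
# the way the harness reviews code while waiting.
reviewed = []
while build.poll() is None:  # None means the build is still running
    reviewed.append("review one file")  # stand-in for real work
    time.sleep(0.2)

# Join on the finished build and collect its output.
result = build.stdout.read().strip()
print(result, f"(did {len(reviewed)} units of work while waiting)")
```

The point is simply that the waiting time becomes usable: the agent interleaves review work with the build instead of blocking on it.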
swyx: So why one minute, not five?
Ryan Lopopolo: Because we want to get the inner loop as fast as possible. One minute is just a nice round number that's actually achievable.
swyx: If it doesn't finish in one minute, do you just kill it?
Ryan Lopopolo: No. We just treat it as a signal that we need to stop, rework the task, break down the build graph into finer granularity, until we get the complexity below the threshold so the agent can keep running efficiently.
swyx: This feels like a ratchet mechanism. You're forcing yourself to hold the line on build speed in a really uncompromising way. Because if you don't, build times just keep getting longer and longer. You also mentioned that the software you work on personally already has 12-minute build times, which is a terrible experience.
Ryan Lopopolo: That's right. This is basically what I used to see all the time when I was on platform teams: everyone has an "acceptable" range of build times, and it just creeps up until it's over the limit, then you spend two or three weeks pushing it back down under the average.
But now tokens are so cheap, and models can parallelize like crazy, so we can constantly prune this system like tending a garden and keep these core metrics in check. That makes the code and the entire SDLC much less fragmented; we can actually make a lot of things simpler, and rely on more stable invariants when building software.
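The cap-as-signal idea above could be enforced with a small check in the inner loop: time the build, and fail loudly when it blows the cap, treating the failure as the cue to split the build graph. This is a hypothetical sketch, not the team's actual tooling, and the build command passed in is a placeholder.

```python
import subprocess
import time

BUILD_CAP_SECONDS = 60.0  # the "one minute" ratchet from the conversation

def check_build(cmd, cap=BUILD_CAP_SECONDS):
    """Run the build; return elapsed seconds, or exit if the cap is blown."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)  # the build itself must succeed
    elapsed = time.monotonic() - start
    if elapsed > cap:
        # Not a timeout mid-build: the signal fires after the fact, telling
        # the team to break the build graph into finer-grained targets.
        raise SystemExit(
            f"build took {elapsed:.1f}s > {cap:.0f}s cap: split the build graph"
        )
    return elapsed
```

This matches Ryan's framing: a slow build isn't killed, it's a signal that the task decomposition needs rework.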
Humans Have Become The Bottleneck
Vibhu: You have a really striking line in your article: humans have become the bottleneck. You started with just three people, and ended up with a 1 million-line codebase with thousands of PRs. So what's the core thinking here? You talked a lot earlier about "code being disposable", but you still do a lot of review. The article keeps emphasizing that everything needs to be rephrased into prompting: anything an agent can't see is basically garbage. So at a high level, how did you build this system? When humans have become the bottleneck for PR review, what role is left for humans to play?
Ryan Lopopolo: To be honest, we've reached a point where even code review is no longer primarily done by humans. Most human review actually happens after code is merged now.
swyx: Wait, after merging? So there's no review before merging at all? It's just for people's peace of mind?
Ryan Lopopolo: You just can't use the old way of thinking anymore. Models are inherently extremely easy to parallelize. If I'm willing to throw more GPUs and more tokens at it, it can keep working on my codebase nonstop. The truly scarce resource now is "human attention", the time of team members that has to be actively focused on the work.
And honestly, once the machine is running, it's hard to stop yourself from poking it and feeding it more work, but there are only so many hours in a day, we have to eat lunch, I need to sleep. So you have to step back, think systematically, keep asking yourself: where is the agent making mistakes? Where am I actually spending my time? How can I stop spending time there in the future? Then based on that, you gradually build confidence in automation: okay, this part of the SDLC is now fully automated.
What does this usually mean? At the very beginning, we had to watch the code extremely carefully. Because the agent didn't have the right building blocks back then, it couldn't build truly modular, properly decomposed software, and couldn't build a reliable, observable system that could grow a frontend interface incrementally.
So, to avoid us having to stare at the terminal all day, the first thing we did was give observability to the model. That's what that diagram in the article is about.
swyx: Right, let's walk through that diagram. Start with traces. What came first?
Ryan Lopopolo: We started with just the app, then added everything from Vector to logging, metrics, and APIs, and it only took me half an afternoon. We very intentionally chose high-level, fast-to-develop tooling; there are so many options for this now. We used tools like mise, which lets you easily pull the Go-written VictoriaMetrics stack binaries into your local dev environment, then you just write a little bit of Python glue code to spin everything up, and it just runs.
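That "little bit of Python glue" might look something like the sketch below. The binary names, flags, and ports are assumptions based on the open-source VictoriaMetrics and VictoriaLogs projects, not OpenAI's actual setup.

```python
import shutil
import subprocess

# Assumed service commands for a local observability stack; a real setup
# would pin versions and data directories.
STACK = {
    "metrics": ["victoria-metrics", "-httpListenAddr=:8428"],
    "logs": ["victoria-logs", "-httpListenAddr=:9428"],
}

def launch_stack(stack=STACK):
    """Spawn whichever stack binaries are installed (e.g. fetched via mise)."""
    procs = {}
    for name, cmd in stack.items():
        if shutil.which(cmd[0]) is None:
            continue  # binary not on PATH yet; leave this service down
        procs[name] = subprocess.Popen(cmd)
    return procs
```

The agent can call something like `launch_stack()` from a skill or script when it decides it needs the stack, and skip it otherwise.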
There's an interesting point here: we've intentionally flipped the whole process upside down. Normally people set up an environment first, then drop the coding agent into it, but we don't do that. Our entry point is Codex. That means we spin up the coding agent first, then through skills and scripts we give it the ability to spin up the entire stack itself, if it decides it needs that. At the same time, we tell it how to set environment variables, so the app and local dev environment point to this stack that the agent itself decides whether to launch or not.
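The environment-variable wiring Ryan mentions could be as small as the sketch below: the app reads its telemetry endpoints from the environment, so pointing it at whatever stack the agent chose to launch is just setting a few variables before spawning it. The variable names and ports here are illustrative assumptions, not the team's real configuration.

```python
import os

def stack_env(metrics_port=8428, logs_port=9428):
    """Build a child-process environment pointing the app at a local stack."""
    env = dict(os.environ)  # inherit the parent environment
    env["APP_METRICS_URL"] = f"http://127.0.0.1:{metrics_port}"
    env["APP_LOGS_URL"] = f"http://127.0.0.1:{logs_port}"
    return env
```

An agent that launched the stack would then spawn the app with something like `subprocess.Popen(app_cmd, env=stack_env())`; if it decided not to launch the stack, it simply omits the override.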
I think this is one of the most fundamental differences between reasoning models and the older GPT-4 and GPT-4o generation models. Older models can't reason, so you have to put them in a "box" with pre-defined state