
Only 4 people in 28 days, OpenAI reveals the inside story of Sora for the first time: 85% of the code was actually completed by AI.

新智元 · 2025-12-15 14:39
AI iterates on AI for self-evolution

OpenAI's popular Sora app was developed by only four engineers, who built the Android version from scratch in just 28 days. Behind the scenes, AI completed roughly 85% of the coding.

Four people built the Sora app in 28 days, and approximately 85% of the code was written by AI!

In early October, OpenAI officially launched the updated Sora 2 model along with its first AI video application, the Sora app.

In November, the Android version of Sora launched and promptly topped the Google Play charts.

Android users generated over 1 million videos within 24 hours.

Two months later, the OpenAI team revealed the behind-the-scenes story of how the first Android version of this popular app was built.

Surprisingly, this app was completed in just 28 days, and the biggest contributor behind it was the AI agent, Codex.

From October 8th to November 5th, a four-person engineering team collaborated with Codex, consuming approximately 5 billion tokens, and launched Sora Android globally.

Despite the large scale of the application, it achieved a 99.9% crash-free rate.

Moreover, they were using an early version of the GPT-5.1-Codex model.

Just five months after its release, Codex was already responsible for 70% of OpenAI's internal pull requests (PRs) each week.

Embrace Brooks' Law: Stay Lean and Move Fast

When Sora was launched on iOS, the number of users skyrocketed.

In contrast, Android at that time had only a rudimentary internal prototype, while pre-registrations on Google Play kept climbing.

Facing such a high-pressure and urgent release task, the usual reaction is to add more people and processes.

For a production-level application of this scale and quality, it usually takes a large number of engineers several months, and the progress is often slowed down by various coordination tasks.

American computer architect Fred Brooks once said, "Adding manpower to a late software project makes it later."

In other words, when trying to deliver a complex project quickly, adding more people often increases communication costs, task fragmentation, and integration difficulties, which actually reduces efficiency.

Therefore, OpenAI assembled an "elite team" of just four engineers, all equipped with Codex, which greatly amplified each engineer's effectiveness.

With this approach, they released an internal build of Sora Android to employees within 18 days and officially launched it to the public just 10 days later.

AI Iterates AI, Self-Evolution

Inside OpenAI, most engineers use the open-source CLI version of Codex.

Alexander Embiricos, the product lead of Codex, revealed, "It monitors its own training process, processes user feedback, and 'decides' what to do next."

Codex is writing a large number of research test frameworks for its own training runs. OpenAI is even trying to let Codex monitor its own training process.

This "matryoshka doll" style of development allows Codex to self-iterate.

This recursive cycle of using tools to create better tools has a long history in computing.

In the 1960s, engineers manually designed the first integrated circuits on paper and then manufactured physical chips based on the blueprints.

Then, these chips powered the computers running the first electronic design automation (EDA) software, and in turn, this software enabled engineers to design complex circuits that were impossible to draw by hand.

Modern processors contain billions of transistors; layouts of that complexity exist only because software makes them possible.

OpenAI using Codex to create Codex seems to follow the same path: the capabilities created by each generation of tools are fed back into the next generation.

This system can autonomously run many processes, handle feedback, spawn and manage subprocesses, and generate the code that is ultimately released in the actual product.

OpenAI employees call it a "teammate" and use tools such as Linear and Slack to assign tasks to it.

Do the tasks Codex handles really count as "decisions"? That is debatable.

But undeniably, a semi-autonomous feedback loop has formed here:

Codex writes code under human guidance. This code becomes part of Codex, and as a result, the next version of Codex will write different code.

A Newly Hired "Senior Engineer"

To understand how engineers collaborate with Codex, we first need to know its strengths and areas where human guidance is needed.

Considering it as a "newly hired senior engineer" is a good starting point.

This positioning means that engineers can spend more time on directing and reviewing code rather than writing it themselves.

Unlike "vibe coding," having Codex write code belongs to the realm of "vibe engineering."

The former refers to developers accepting AI-generated code without much scrutiny, while the latter, a term coined by AI researcher Simon Willison, means humans remain in the loop.

Generally, they have Codex draft a plan, discuss and iterate on it together, and then carefully review the resulting code. In this way, developers and the model stay in the loop.

Areas Where Codex Needs Guidance

Currently, Codex is not good at inferring things it has never been told:

For example, the team's preferred architectural patterns, product strategy, real user behavior, and internal unwritten rules or shortcuts.

Similarly, Codex cannot see how the app actually runs:

It cannot open Sora on a real device, feel if the scroll bar is not smooth, or notice if an interaction process is awkward.

These experiential tasks can only be done by the OpenAI team themselves.

Each instance requires "onboarding training." Providing context, clarifying goals, constraints, and clear rules is crucial for Codex to do a good job.

Moreover, Codex can easily go astray in deep architectural judgment: If left unchecked, it may create an unnecessary ViewModel when the team only wants to extend the existing one; or it may force the logic that belongs to the Repository layer into the UI layer.

Its instinct is to make the function work, rather than prioritizing long-term code cleanliness.
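To make the contrast concrete, here is a minimal sketch in plain Kotlin, with hypothetical names that are not Sora's actual code, of the layering the team wants Codex to respect: persistence stays behind a Repository, and new state extends the screen's existing ViewModel instead of spawning a new one.

```kotlin
// Hypothetical sketch of the layering described above; names are illustrative,
// not Sora's actual code.
interface SettingsRepository {
    fun isNotificationsEnabled(): Boolean
    fun setNotificationsEnabled(enabled: Boolean)
}

class InMemorySettingsRepository : SettingsRepository {
    private var enabled = true
    override fun isNotificationsEnabled(): Boolean = enabled
    override fun setNotificationsEnabled(enabled: Boolean) { this.enabled = enabled }
}

// Extend the screen's existing ViewModel with new state rather than creating a
// parallel one, and keep persistence behind the Repository, not in the UI.
class SettingsViewModel(private val repo: SettingsRepository) {
    var notificationsEnabled: Boolean = repo.isNotificationsEnabled()
        private set

    fun toggleNotifications() {
        notificationsEnabled = !notificationsEnabled
        repo.setNotificationsEnabled(notificationsEnabled)
    }
}
```

Left unchecked, the failure mode described above would be Codex adding a second `SettingsViewModel`-like class, or calling storage APIs directly from the UI layer.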

OpenAI found it very useful to place numerous AGENTS.md files throughout the codebase.

This allows engineers to easily reuse the same guidance and best practices in different sessions.

For example, to ensure that Codex writes code according to the style guide, the OpenAI team added the following section to the top-level AGENTS.md:

```markdown
## Formatting and static checks
- **Always run** `./gradlew detektFix` (or for the affected modules) **before committing**. CI will fail if formatting or detekt issues are present.
```
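Beyond formatting, such files can also encode the architectural conventions discussed earlier. The following is a hypothetical sketch, not OpenAI's actual file, of what a module-level AGENTS.md section might contain (the `feature/settings` path is an illustrative placeholder):

```markdown
## Architecture conventions (hypothetical example)
- Extend a screen's existing ViewModel; do not create a second ViewModel for it.
- All network and persistence logic lives in the Repository layer, never in the UI.
- Use the patterns in `feature/settings` as the reference implementation for new screens.
```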

Areas Where Codex Excels

Next, let's look at what Codex does best.

Understands Large Codebases Instantly: Codex is proficient in all mainstream programming languages and can easily carry the same concepts across platforms without complex abstractions.

Test Coverage: Codex has a singular enthusiasm for writing unit tests and covers a wide range of edge cases. Not every test is deep, but this breadth is particularly useful for preventing regressions.

Responds to Feedback: Codex is also very receptive to direction. When CI fails, you can paste the log directly into the prompt and ask it for a fix.

Large-Scale Parallelism and Disposability: Most people have never pushed the limit on parallel sessions. Developers can test several ideas in parallel and treat the code as disposable: if an approach doesn't work, just discard it.

Provides New Perspectives: In design discussions, the team treats Codex as a generative tool for surfacing potential failure points or new ways to solve problems. For example, when designing memory optimizations for the video player, Codex searched through multiple SDKs and proposed options the team hadn't had time to explore in detail. These insights were invaluable for minimizing the final app's memory footprint.

Frees Up Time for High-Leverage Work: In fact, the team spends more time reviewing and directing code than writing it. Codex is also very good at code review and often catches bugs before a merge, improving reliability.

Once the team understands Codex's capabilities, the work mode becomes straightforward.

In areas with clear patterns and well-defined scopes, let Codex do the heavy lifting, while the team focuses on architecture, user experience, systematic changes, and overall quality control.

Set Rules and Manually Lay the Foundation

To make good use of Codex and ensure stable and maintainable output, the key is for developers to personally control the system design and key trade-offs.

This includes determining the app's architecture, modularization, dependency injection, and navigation; even authentication and the basic networking flows were written by hand.

For a project where an estimated 85% of the code is written by Codex, a well-planned foundation avoids costly rework and refactoring.

The OpenAI team said, "This is definitely one of the best decisions we've ever made."

The team had to internalize this mindset:

It's not about quickly creating something that "works," but something that "follows the rules."

There are many "correct" ways to write code:

You don't need to tell Codex exactly how to do each step;

But you need to show Codex what is "correct."

Once the starting point and the team's preferred building method are determined, Codex can start working.

To see what would happen, the OpenAI team actually tried giving a direct prompt:

> Build the Sora Android app based on the iOS code. Start working.

As a result, it quickly failed.

Although the code written by Codex technically worked, the product experience was far from satisfactory.

Moreover, if you don't understand endpoints, data, and user flows, the code written by Codex in a "zero-shot" manner is simply unreliable. Even without AI, merging thousands of lines of code at once is a risky move.

OpenAI hypothesized that if Codex is given a sandbox full of good examples, it will thrive. As it turns out, they were right.

Just asking Codex to "create a settings page" is basically unreliable.

But if you ask it to "create a settings page referring to the architecture and pattern of the page you just saw," the result will be much better.

Humans make structural decisions and set strict rules; Codex is responsible for filling in a large amount of code within this framework.

Plan First, Then Code

To maximize Codex's potential, the team next had to figure out how to let it work unsupervised for long stretches.

For this reason, the four-person team changed their workflow.

For any slightly complex change, they first let Codex clarify how the system and code work.

For example, they have it read a set of related files and summarize how a feature works, such as how data flows from the API through the Repository layer and ViewModel to the UI, and then humans correct or refine its understanding.
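The flow Codex is asked to summarize can be sketched in plain Kotlin as follows; all names here are hypothetical illustrations, not Sora's actual code:

```kotlin
// Illustrative sketch of the flow: API -> Repository -> ViewModel -> UI.
data class VideoDto(val id: String, val title: String)

interface FeedApi {                        // network layer returning raw DTOs
    fun fetchFeed(): List<VideoDto>
}

class FakeFeedApi : FeedApi {
    override fun fetchFeed() = listOf(VideoDto("1", "First clip"))
}

class FeedRepository(private val api: FeedApi) {
    // The Repository maps network DTOs into what the UI actually needs.
    fun feedTitles(): List<String> = api.fetchFeed().map { it.title }
}

class FeedViewModel(private val repo: FeedRepository) {
    var state: List<String> = emptyList()  // UI state, rendered by the screen
        private set
    fun refresh() { state = repo.feedTitles() }
}

fun main() {
    val vm = FeedViewModel(FeedRepository(FakeFeedApi()))
    vm.refresh()
    println(vm.state)                      // the UI layer would render this state
}
```

Once Codex can restate this chain accurately, correcting its summary is far cheaper than correcting misdirected code.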

This is like guiding a highly capable new teammate. The team will work with Codex to formulate a solid implementation plan.

This plan is usually like a mini design document, indicating which files need to be modified, what new