Revealed: How did OpenAI develop its reasoning model?
ChatGPT, the product that brought OpenAI into the limelight, may have been little more than a "beautiful accident." Inside the company, a grand plan codenamed "Strawberry," which grew out of mathematics, quietly sparked a "reasoning" revolution. Its ultimate goal is a general AI agent capable of handling complex tasks autonomously. "Ultimately, you just tell the computer what you need, and it will complete all those tasks for you," said CEO Sam Altman.
While the world was celebrating ChatGPT's sudden rise, few realized that it came as an unexpected surprise even to OpenAI. A recent in-depth article from the tech outlet TechCrunch revealed OpenAI's grand vision of going from math competitions to "general AI agents." Behind it lies a carefully planned strategy spanning several years and a sustained exploration of AI's "reasoning" ability.
01
An Unexpected Starting Point: Mathematics
Many people think OpenAI's success story began with ChatGPT, but the real disruptive force came from a place seemingly far removed from mass-market applications: mathematics.
In 2022, when researcher Hunter Lightman joined OpenAI, his colleagues were busy launching ChatGPT, which went on to become a global phenomenon and a consumer-app sensation. Meanwhile, Lightman was on an unassuming team called "MathGen," quietly teaching AI models to solve high-school math competition problems.
"We were trying to make the model better at mathematical reasoning at that time," Lightman recalled. This seemingly tangential exploration was precisely the starting point for OpenAI's reasoning models.
Why mathematics? Because mathematics is a touchstone for pure logic and reasoning. If a model can truly understand and solve complex math problems, that suggests it has begun to develop rudimentary reasoning abilities.
In hindsight, ChatGPT's success looks more like a "beautiful accident": internally, it was a low-key research preview that unexpectedly took the consumer market by storm.
But OpenAI CEO Sam Altman had his eyes set on a much more distant future. At OpenAI's first developer conference in 2023, he laid out that vision explicitly:
Ultimately, you just tell the computer what you need, and it will complete all those tasks for you. These abilities are commonly referred to as agents in the AI field. The benefits they bring will be enormous.
That low-key math work eventually yielded remarkable results. Recently, an OpenAI model achieved gold-medal-level performance at the International Mathematical Olympiad (IMO), the intellectual arena for the world's top high-school students.
OpenAI firmly believes that the reasoning ability honed in mathematics can transfer to other fields and ultimately power the general AI agents it has long envisioned.
02
The "Strawberry" Project: The Key Breakthrough Triggering the Reasoning Revolution
Early GPT models were good at handling text but often "confused" when faced with basic mathematics.
How did OpenAI bridge the gap from basic language processing to complex logical reasoning? The turning point came in 2023 when OpenAI achieved a leap in reasoning ability through an innovative method. This breakthrough was initially codenamed "Q*" internally and later called "Strawberry."
Its core lies in an unprecedented combination of three technologies:
Large Language Model (LLM): Provides a vast knowledge base and language capabilities.
Reinforcement Learning (RL): Trains the model, in a simulated environment, to make better choices through a reward-and-penalty mechanism (i.e., feedback on whether its answer is correct). This is the same family of techniques that helped AlphaGo defeat Lee Sedol.
Test-time computation: Gives the model extra time and computing power to "think," repeatedly planning, verifying, and checking its steps before giving a final answer.
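The reinforcement learning idea above (reward the model when its answer checks out, penalize it otherwise) can be illustrated with a deliberately tiny toy. This is not how OpenAI trains an LLM: instead of a language model, a "policy" picks among three hypothetical answer strategies for a + b, and a standard gradient-bandit update (softmax preferences with a running-average baseline) shifts probability toward whichever strategy earns the correctness reward. The strategy names, step count, and learning rate are all illustrative assumptions.

```python
import math
import random

# Toy analogy only: three hypothetical "answer strategies" for a + b.
# A real reasoning model learns over token sequences, not a fixed menu.
STRATEGIES = {
    "add": lambda a, b: a + b,               # always correct
    "add_plus_one": lambda a, b: a + b + 1,  # always wrong
    "multiply": lambda a, b: a * b,          # right only when a*b == a+b (e.g. a = b = 2)
}


def softmax(prefs):
    """Turn preference scores into a probability distribution."""
    m = max(prefs.values())
    exps = {k: math.exp(v - m) for k, v in prefs.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def train(steps=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {name: 0.0 for name in STRATEGIES}  # start with no preference
    baseline = 0.0                              # running mean of rewards seen so far
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        names = list(probs)
        choice = rng.choices(names, weights=[probs[n] for n in names])[0]
        a, b = rng.randint(1, 9), rng.randint(1, 9)
        # Reward signal: 1 if the answer is verifiably correct, else 0.
        reward = 1.0 if STRATEGIES[choice](a, b) == a + b else 0.0
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        # Gradient-bandit update: boost the chosen strategy when it beats
        # the baseline, and shift the others in the opposite direction.
        for n in names:
            indicator = 1.0 if n == choice else 0.0
            prefs[n] += alpha * advantage * (indicator - probs[n])
    return softmax(prefs)


if __name__ == "__main__":
    for name, p in train().items():
        print(f"{name}: {p:.3f}")
```

After training, nearly all of the probability mass sits on the always-correct strategy. The only signal the learner ever receives is whether the final answer was right, which is why checkable domains like math make such convenient training grounds.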
This combination gave rise to a new approach built on "chain-of-thought" (CoT) reasoning: instead of answering directly, the model works through the problem step by step, laying out its full train of thought much as a human would. Researcher El Kishky recalled the moment with evident excitement:
I could see the model starting to reason. It would notice mistakes and backtrack; it would get frustrated. It was really like reading someone's mind.
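Test-time computation, spending more attempts and checking each one before committing to an answer, can be sketched as a generate-and-verify loop. In the minimal sketch below, the toy equation, the random `propose` function standing in for the model's draft, and the budget values are all hypothetical; the point is only that when checking an answer is cheap, a larger "thinking" budget raises the chance of landing on a verified solution, and a failed check simply triggers another try.

```python
import random
from typing import Optional


def f(x: int) -> int:
    """Hypothetical toy problem: find an integer root of x^2 - 5x + 6."""
    return x * x - 5 * x + 6


def propose(rng: random.Random) -> int:
    # Stand-in for the model drafting a candidate answer (here, a pure guess).
    return rng.randint(-10, 10)


def solve_with_budget(budget: int, seed: int = 0) -> Optional[int]:
    """Spend up to `budget` attempts; return the first candidate that passes the check."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = propose(rng)
        if f(candidate) == 0:  # verification step: checking is cheap
            return candidate
        # A failed check means: discard this attempt and try again.
    return None


def success_rate(budget: int, trials: int = 1000) -> float:
    """Fraction of seeded runs that find a verified answer within the budget."""
    hits = sum(solve_with_budget(budget, seed=s) is not None for s in range(trials))
    return hits / trials


if __name__ == "__main__":
    for budget in (1, 5, 50):
        print(f"budget={budget}: solved {success_rate(budget):.0%} of 1000 seeded runs")
```

Real reasoning models do not guess blindly; they plan and revise within a single chain of thought. But the trade-off is the same: more compute at answer time buys more chances to catch and correct mistakes.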
This breakthrough led directly to the o1 reasoning model in the fall of 2024. o1 stunned the industry and turned the 21 core researchers behind it into some of the most sought-after talent in Silicon Valley. Meta CEO Mark Zuckerberg offered compensation packages reportedly worth hundreds of millions of dollars, poaching five of them to form a new unit focused on superintelligence.
03
Exploring the Essence of AI "Reasoning"
Is AI really "reasoning," or is it just a more advanced form of imitation?
OpenAI's researchers take a pragmatic view of this question. El Kishky explained it from a computer-science perspective: "We're teaching the model how to effectively use computing power to get answers. If that's the definition, then it's reasoning."
Lightman, for his part, focused more on results: "If the model can complete difficult tasks, then it's going through a necessary process similar to reasoning. We can call it reasoning, but it's just a way to create powerful and useful tools."
Nathan Lambert, a researcher at the nonprofit AI2, offered an apt analogy: AI reasoning is to human thinking what an airplane is to a bird's flight. An airplane doesn't fly by flapping its wings like a bird, yet it still conquers the sky. AI's "reasoning" mechanism differs from the human brain's, but that doesn't stop it from achieving similar or even more powerful results.
This focus on outcomes rather than form is central to OpenAI's culture. According to former employees, "all research in the company is bottom-up." As long as a team can show that its idea is a genuine breakthrough, the company will allocate precious GPUs and people to it. It is this dedication to the AGI (Artificial General Intelligence) mission, rather than the pursuit of short-term product gains, that allowed OpenAI to invest so heavily in reasoning models and ultimately gain the upper hand.
04
The Next Frontier: From Objective Coding to Subjective Tasks
Today, AI agents have proven themselves in some well-defined, verifiable fields, such as helping programmers with coding tasks. But when people ask them to handle more complex and subjective tasks, like "find me the most cost-effective long-term parking space" or "plan a perfect family trip for me," they often make basic mistakes or take too long.
What's the core bottleneck behind this? Lightman pointed out sharply: "Like many problems in machine learning, this is a data problem."
How to train models on tasks that lack standard answers and lean toward the subjective is the current research frontier. OpenAI researcher Noam Brown said the company has developed new general-purpose reinforcement learning techniques that let it train models on skills that are not easy to verify. The IMO gold-medal model was built on these techniques. It can spawn multiple "agent clones" that explore different solution paths simultaneously and then select the best one.
This points to where AI is heading: from single models to multi-agent collaboration, and from handling objective facts to understanding subjective intentions.
OpenAI's ultimate blueprint is a super-intelligent agent that can do anything on the Internet for you and understand your preferences. That is very different from today's ChatGPT, but all of the company's research points firmly in this direction.
OpenAI was once the undisputed leader of the AI industry, but it now faces a siege from strong competitors such as Google, Anthropic, xAI, and Meta. The question is no longer whether OpenAI can achieve its "agent future," but whether it can reach the finish line before its rivals overtake it. This race has only just begun.
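The "agent clones" idea, several independent attempts at the same problem followed by a selection step, can be sketched with a thread pool. OpenAI has not said how its model picks the best path, so the selection rule here (a majority vote over final answers, sometimes called self-consistency) is an assumption, as are the toy `attempt` function, its 60% accuracy, and the answer `CORRECT = 42`.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CORRECT = 42  # hypothetical ground-truth answer for a toy problem


def attempt(seed: int, p_correct: float = 0.6) -> int:
    """One hypothetical 'clone': right with probability p_correct, else a random wrong answer."""
    rng = random.Random(seed)
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice([x for x in range(100) if x != CORRECT])


def solve_in_parallel(n_clones: int = 25) -> int:
    # Run independent attempts concurrently, each with its own seed
    # so that the paths explored differ from one another.
    with ThreadPoolExecutor(max_workers=8) as pool:
        answers = list(pool.map(attempt, range(n_clones)))
    # Selection rule (an assumption): keep the most common final answer.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer


if __name__ == "__main__":
    print(solve_in_parallel())
```

Each clone on its own is wrong 40% of the time, but because wrong answers scatter while right ones agree, the vote recovers the correct answer with high probability. For subjective tasks with no single right answer, a vote like this breaks down, which is exactly why the article calls training on hard-to-verify skills the current frontier.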
This article is from the WeChat official account "Hard AI," written by Long Yue, and published by 36Kr with authorization.