Revealed: How did OpenAI develop its reasoning models?
ChatGPT, the product that made OpenAI famous, may have been little more than a "beautiful accident." Inside the company, a grand plan code-named "Strawberry," which began with mathematics, quietly set off a "reasoning" revolution. Its ultimate goal is a general AI agent that can handle complex tasks autonomously. "Ultimately, you just need to tell the computer what you want, and it will complete all these tasks for you," said CEO Sam Altman.
While the world was celebrating ChatGPT's sudden rise, few realized that it was a pleasant surprise OpenAI had never set out to create. A recent in-depth article from the tech outlet TechCrunch reveals OpenAI's grand vision of moving from math competitions to "general AI agents." Behind it lie years of deliberate planning and a sustained push to expand AI's "reasoning" ability.
01
An Unexpected Starting Point: Mathematics
Many people think OpenAI's success story began with ChatGPT, but the truly disruptive force came from a field that seems far removed from mass-market applications: mathematics.
In 2022, when researcher Hunter Lightman joined OpenAI, his colleagues were busy preparing to release ChatGPT, which went on to become a global phenomenon and a hugely popular consumer app. Meanwhile, Lightman was on an unremarkable team called "MathGen," quietly teaching AI models to solve high-school math competition problems.
"At the time, we were trying to make the model better at mathematical reasoning," Lightman recalled. This seemingly off-track exploration turned out to be the starting point for OpenAI's reasoning models.
Why mathematics? Because mathematics is a touchstone for pure logic and reasoning. If a model can truly understand and solve complex math problems, that suggests it has begun to develop basic reasoning abilities.
In hindsight, ChatGPT's success looks more like a "beautiful accident": internally, it was a low-key research preview that unexpectedly took off in the consumer market.
But OpenAI CEO Sam Altman had his sights set much further ahead. At the company's first developer conference in 2023, he laid out a clear picture of the future:
Ultimately, you just need to tell the computer what you want, and it will complete all these tasks for you. These abilities are commonly referred to as agents in the AI field. The benefits they bring will be huge.
That "low-key" work has since yielded remarkable results. Recently, one of OpenAI's models achieved gold-medal-level performance at the International Mathematical Olympiad (IMO), the intellectual arena for the world's top high-school students.
OpenAI firmly believes that the reasoning ability honed in the field of mathematics can be transferred to other fields and ultimately drive the general AI agent they have been dreaming of.
02
The "Strawberry" Project: The Key Breakthrough Triggering the Reasoning Revolution
Early GPT models were good at handling text but often got "confused" by basic mathematics.
How did OpenAI bridge the gap from basic language processing to complex logical reasoning? The turning point came in 2023, when OpenAI achieved a leap in reasoning ability through a new method. This breakthrough was initially code-named "Q*" internally and later called "Strawberry."
At its core is an unprecedented combination of three technologies:
Large Language Model (LLM): It provides a vast knowledge base and language skills.
Reinforcement Learning (RL): The model is trained in a simulated environment to make better choices through a "reward-punishment" mechanism (i.e., feedback on whether its answer is correct). This is the same family of techniques that AlphaGo used to defeat Lee Sedol.
Test-time computation: The model is given more time and computing power to "think," repeatedly planning, verifying, and checking its steps before giving a final answer.
This combination produced a new way of working built on "chain of thought" (CoT): instead of giving the answer directly, the model lays out its full problem-solving process, much as a person would. Researcher El Kishky couldn't hide his excitement when describing that moment:
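To make the "reward-punishment" and "more time to think" ideas concrete, here is a toy sketch in Python. It is not OpenAI's method. It assumes a task whose answers can be checked automatically, as math answers can, and uses a hypothetical `toy_model` in place of a real LLM. The same binary check (`verifiable_reward`, also a hypothetical name) plays the role of the RL reward during training and, in this simplified toy, the self-check at inference time. `solve` spends a larger attempt budget (test-time compute) to propose, check, and retry.

```python
import random
from typing import Optional, Tuple

Problem = Tuple[int, int]  # toy task (assumption): compute a * b


def verifiable_reward(problem: Problem, answer: int) -> float:
    """Binary 'reward-punishment' signal: 1.0 for a correct answer, 0.0 otherwise."""
    a, b = problem
    return 1.0 if answer == a * b else 0.0


def toy_model(problem: Problem, rng: random.Random) -> int:
    """Hypothetical stand-in for an LLM: right about 40% of the time, otherwise slightly off."""
    a, b = problem
    return a * b + rng.choice([0, 0, 1, -1, 10])


def solve(problem: Problem, budget: int, rng: random.Random) -> Tuple[Optional[int], int]:
    """Test-time compute: propose, check, and retry, up to `budget` attempts.

    Returns (answer, attempts_used); answer is None if no attempt passed the check.
    """
    for attempt in range(1, budget + 1):
        candidate = toy_model(problem, rng)
        if verifiable_reward(problem, candidate) == 1.0:
            return candidate, attempt
    return None, budget


def success_rate(budget: int, trials: int = 500, seed: int = 0) -> float:
    """Fraction of random toy problems solved within the given attempt budget."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(trials):
        problem = (rng.randint(2, 99), rng.randint(2, 99))
        answer, _ = solve(problem, budget, rng)
        if answer is not None:
            solved += 1
    return solved / trials


if __name__ == "__main__":
    # A bigger "thinking" budget should solve more problems.
    for budget in (1, 3, 8):
        print(f"budget={budget}: solved {success_rate(budget):.0%}")
```

The design choice to show is the trade-off the article describes: each extra attempt costs compute, but because every candidate can be checked, spending more compute raises the share of problems answered correctly.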
I could see the model starting to reason. It would notice mistakes and backtrack, and it would get frustrated. It was really like reading someone's mind.
This breakthrough led directly to the o1 reasoning model in the fall of 2024. o1's debut stunned the industry, and the 21 core researchers behind it became the most sought-after talent in Silicon Valley. Meta's Mark Zuckerberg offered compensation packages worth hundreds of millions of dollars to poach five of them for a new unit focused on superintelligence.
03
Exploring the Essence of AI "Reasoning"
Is AI really "reasoning," or is it just a more advanced form of imitation?
Facing this question, OpenAI's researchers are quite pragmatic. El Kishky offered a computer-science definition: "We are teaching the model how to effectively consume computing power to get the answer. If defined in this way, then it is reasoning."
Lightman, for his part, focuses more on results: "If the model can complete difficult tasks, then it is going through a necessary process similar to reasoning. We can call it reasoning, but it's just a way to create powerful and useful tools."
Nathan Lambert, a researcher at the nonprofit AI2, offered an apt analogy: AI reasoning relates to human thinking the way an airplane relates to a bird in flight. An airplane doesn't fly by flapping its wings like a bird, yet it still conquers the sky. AI's "reasoning" mechanism differs from the human brain's, but that doesn't stop it from achieving similar or even more powerful results.
This focus on the goal rather than the form is at the core of OpenAI's culture. According to former employees, "all research at the company is bottom-up." As long as a team can show that its idea is a breakthrough, the company allocates scarce GPUs and staff to it. It is this commitment to the mission of AGI (artificial general intelligence), rather than to short-term product gains, that allowed OpenAI to invest so heavily in reasoning models and ultimately gain an edge.
04
The Next Frontier: From Objective Coding to Subjective Tasks
Today, AI agents have proven capable in well-defined, verifiable domains, such as helping programmers with coding tasks. However, when asked to handle more complex and subjective tasks, such as "find me the most cost-effective long-term parking spot" or "plan the perfect family trip for me," they often make elementary mistakes or take too long.
What is the core bottleneck? Lightman put it bluntly: "Like many problems in machine learning, this is a data problem."
How to train models on subjective tasks that have no standard answers is the current frontier of research. OpenAI researcher Noam Brown revealed that the team has developed new general-purpose reinforcement learning techniques that can teach models skills that are hard to verify, and the IMO gold-medal model was built on them. This model can spawn multiple "agent clones" that explore different solution paths simultaneously and then select the best one.
This points to where AI is heading: from single models to multi-agent collaboration, and from handling objective facts to understanding subjective intentions.
OpenAI's ultimate blueprint is a superintelligent agent that can do anything on the internet for you and understands your preferences. That is very different from today's ChatGPT, but all of the company's research points firmly in this direction.
There is no doubt that OpenAI was once the undisputed leader of the AI industry, but it now faces pressure from strong competitors such as Google, Anthropic, xAI, and Meta. The question is no longer whether OpenAI can achieve its "agent future," but whether it can reach the finish line before its rivals overtake it. This race for the future has only just begun.
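The article does not say how the IMO model chooses among its "agent clones." The Python sketch below illustrates the general pattern under stated assumptions. Several explorations run concurrently, and the final answer is picked by a simple majority vote, a stand-in technique chosen here because subjective tasks often lack an automatic checker. `explore_path`, `run_clones`, and `select_answer` are hypothetical names, and each clone's "reasoning path" is faked with a seeded random choice in which most paths converge on answer "A," which the toy treats as correct.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def explore_path(question: str, seed: int) -> str:
    """Hypothetical 'agent clone': follows its own randomized path to a final answer.

    Faked here: each path lands on "A" 60% of the time, "B" 25%, "C" 15%.
    """
    rng = random.Random(f"{question}#{seed}")  # deterministic per (question, clone)
    return rng.choices(["A", "B", "C"], weights=[0.60, 0.25, 0.15])[0]


def run_clones(question: str, n_clones: int) -> List[str]:
    """Launch n_clones explorations concurrently and collect their answers in order."""
    with ThreadPoolExecutor(max_workers=n_clones) as pool:
        futures = [pool.submit(explore_path, question, i) for i in range(n_clones)]
        return [f.result() for f in futures]


def select_answer(answers: List[str]) -> str:
    """Selection step, assumed here to be a plurality (majority) vote."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    answers = run_clones("plan a family trip", 15)
    print(answers, "->", select_answer(answers))
```

The threads only model the "simultaneously" part of the description; a real system would send each exploration to a model service, and its selection step could be a learned scorer rather than a vote.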
This article is from the WeChat official account "Hard AI," author: Long Yue, published by 36Kr with authorization.