OpenAI's Achilles' heel will determine the future of large model companies.
If the Scaling Law is the most important yardstick guiding improvements in large model capability, then "computing power cost control" is the cornerstone of the large model industry's development and commercialization.
At the beginning of the year, DeepSeek suddenly became hugely popular in overseas open-source communities, and one important reason is that it cut the training and inference compute costs of models with comparable performance to less than a tenth of the usual level. Likewise, after the release of GPT-4, the MoE architecture gradually replaced the dense architecture and became the default choice for almost all large model developers, and the core reason is the same: MoE effectively reduces the compute cost of inference.
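As a rough illustration of why (the numbers below are hypothetical and not taken from any released model): in an MoE layer, only the experts the router selects actually run for each token, so the parameters active per token, and hence the inference compute, can be a small fraction of the total.

```python
# Illustrative comparison of dense vs. MoE inference compute per token.
# All numbers are hypothetical, not taken from any released model.

dense_params = 70e9          # a dense model with 70B parameters

total_experts = 64           # MoE: 64 experts per layer (hypothetical)
active_experts = 2           # router activates the top-2 experts per token
expert_share = 0.9           # fraction of parameters living in expert layers
moe_total_params = 70e9      # same total parameter count as the dense model

# Parameters actually used for one token: shared layers + the chosen experts
moe_active_params = moe_total_params * (1 - expert_share) \
    + moe_total_params * expert_share * active_experts / total_experts

print(f"dense active params per token: {dense_params / 1e9:.1f}B")
print(f"MoE active params per token:   {moe_active_params / 1e9:.1f}B")
# Inference FLOPs scale roughly with active parameters, so at equal total
# capacity the MoE model does only a fraction of the per-token compute.
```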
The "routing" function that OpenAI introduced to users for the first time with the release of GPT-5 was originally designed to match simple questions to low-consumption models and complex questions to inference models with high capabilities and high computing power consumption on behalf of users, thereby effectively improving user experience and computing power efficiency. However, it has become the most well-known "cost reduction and laughter" incident in the AI circle.
Even nearly a month after GPT-5's release, OpenAI still has not satisfied all users, and people continue to complain that GPT-5 fails at some very simple problems. After OpenAI restored GPT-4o and let users manually switch between the reasoning model and the base model, most users did come around to OpenAI's claim that "GPT-5's performance is significantly stronger than previous models." Even so, Sam Altman himself cannot deny that the GPT-5 launch was riddled with missteps.
The most direct cause of the fiasco is that the heavily promoted routing feature failed to match users' expectations with the right model capabilities.
01
So the question is: why did OpenAI push the routing feature so hard, even at the risk of GPT-5 "flopping on arrival"?
The first and most direct reason is that before GPT-5, OpenAI offered more than five models in parallel and let users pick the appropriate one for their needs. As the number of models grew, not just ordinary users but even heavy ChatGPT users sometimes struggled to decide which model best fit the task at hand.
OpenAI, which is committed to turning ChatGPT into the super app of the AI era, cannot let that situation continue. Especially for the large number of ordinary users who have never touched a large model before, choosing the appropriate model for each task on their behalf is something OpenAI had to do at some point.
The deeper reason lies in compute cost: ever since reasoning models appeared, every query to a large model involves a choice between reasoning mode and non-reasoning mode. How efficiently this "deep thinking" capacity is allocated determines how efficiently a large model product uses its compute.
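A minimal sketch of what that per-query allocation decision looks like in code; the heuristic and model names below are placeholders for illustration, not OpenAI's actual router:

```python
# Minimal sketch of a reasoning/non-reasoning dispatcher.
# The heuristic and the model names are placeholders for illustration only.

REASONING_HINTS = ("prove", "step by step", "debug", "plan", "why does", "optimize")

def needs_deep_thinking(prompt: str) -> bool:
    """Crude stand-in for a learned router: flag prompts that look like
    multi-step reasoning tasks; everything else goes to the cheap model."""
    long_prompt = len(prompt.split()) > 200
    has_hint = any(h in prompt.lower() for h in REASONING_HINTS)
    return long_prompt or has_hint

def route(prompt: str) -> str:
    return "reasoning-model" if needs_deep_thinking(prompt) else "fast-model"

print(route("What's the capital of France?"))                               # fast-model
print(route("Prove that the algorithm runs in O(n log n) and debug it."))   # reasoning-model
```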
Academic studies comparing reasoning and non-reasoning models find that the compute gap between them is large, possibly a factor of 5 to 6. For complex problems, the number of reasoning tokens consumed internally through techniques such as chain-of-thought can run into the tens of thousands.
The latency gap between the reasoning and non-reasoning paths is even larger. According to OpenAI's own figures, answering a complex question with the reasoning model can take more than 60 times as long as with the non-reasoning model.
Yet even for many tasks that do require complex reasoning, after burning enormous compute and a lot of time, the gain in answer quality and accuracy is often only around 5%. How much compute is it worth spending for that 5% improvement?
Let's do some simple arithmetic. If OpenAI defaults to the reasoning model for all tasks, and the router can identify the 10% of questions that a simple non-reasoning model can handle, compute costs drop by about 8% (assuming the reasoning-to-non-reasoning compute ratio is 5:1).
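The arithmetic behind that figure, using the article's own 5:1 cost assumption, works out as follows:

```python
# Reproduce the article's estimate: with a 5:1 reasoning/non-reasoning cost
# ratio, routing 10% of traffic to the cheap model saves about 8% of compute.

reasoning_cost, fast_cost = 5.0, 1.0

def relative_cost(fraction_routed_to_fast: float) -> float:
    baseline = reasoning_cost  # everything answered by the reasoning model
    mixed = (1 - fraction_routed_to_fast) * reasoning_cost \
        + fraction_routed_to_fast * fast_cost
    return mixed / baseline

for share in (0.1, 0.3, 0.5):
    saving = 1 - relative_cost(share)
    print(f"route {share:.0%} to the fast model -> save {saving:.0%}")
# route 10% -> save 8%, route 30% -> save 24%, route 50% -> save 40%
```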
If that share rises further, the savings become far more significant. For a company like OpenAI, which must serve hundreds of millions of users while compute supply remains extremely tight, whether routing works well bears directly on the sustainability of its business model.
At the industry level, third-party platforms such as OpenRouter have turned "automatic routing and fallback" into an infrastructure capability: when the primary model is congested, rate-limited, or refuses a request, traffic automatically switches to the next-best model according to policy, keeping the user experience stable. AI cloud providers such as Microsoft Azure likewise market cross-model routing as a major selling point of AI cloud computing.
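A sketch of what such fallback routing looks like from the caller's side; the model names and the `call_model` stub are simplified placeholders rather than OpenRouter's or Azure's actual API:

```python
# Sketch of client-side fallback routing: try the preferred model first and
# fall back down the chain when it is congested, rate-limited, or refuses.
# Model names and the call_model stub are placeholders for illustration.

FALLBACK_CHAIN = ["primary-large-model", "secondary-model", "small-cheap-model"]

class ModelUnavailable(Exception):
    """Raised by call_model on rate limits, timeouts, or content refusals."""

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # provider-specific API call goes here

def answer_with_fallback(prompt: str) -> str:
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except ModelUnavailable as err:
            last_error = err      # remember why this tier failed, try the next
    raise RuntimeError(f"all models in the chain failed: {last_error}")
```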
Perhaps the most important thing for OpenAI after the GPT-5 launch is finding, for every request, the optimal point in the "quality-latency-cost" triangle. GPT-5's current official positioning and the "built-in thinking" narrative effectively make "routing + reasoning intensity" a default capability, while the Auto/Fast/Thinking switch on the ChatGPT side gives users some visibility and control.
02
How hard is it to build an efficient routing function for large models?
In one report, foreign media put this question to an assistant professor of computer science at UIUC, whose answer was that "it may be a problem on the scale of Amazon's recommendation system, and it will take a large team of experts several years of hard work to get a satisfactory result." Routing at the model-system level is essentially a "multiple objectives + strong constraints" engineering problem: it is not only about accuracy, but about optimizing in real time across quality, latency, cost, quota and peak capacity, and success rate.
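One way to make "multiple objectives + strong constraints" concrete is a per-request scoring function over candidate models with hard quota checks; the candidate table and weights below are invented purely for illustration:

```python
# Toy multi-objective router: score each candidate model on predicted
# quality, latency, and cost, subject to a hard quota constraint.
# All numbers and weights are illustrative, not from any real system.

CANDIDATES = {
    # model: (predicted quality 0-1, latency in seconds, cost per request)
    "reasoning-model": (0.95, 30.0, 5.0),
    "standard-model":  (0.90, 2.0, 1.0),
    "mini-model":      (0.80, 0.5, 0.2),
}

def pick_model(quota_left: dict, w_quality=1.0, w_latency=0.01, w_cost=0.1):
    best, best_score = None, float("-inf")
    for model, (quality, latency, cost) in CANDIDATES.items():
        if quota_left.get(model, 0) <= 0:   # hard constraint: no quota, skip
            continue
        score = w_quality * quality - w_latency * latency - w_cost * cost
        if score > best_score:
            best, best_score = model, score
    return best

print(pick_model({"reasoning-model": 100, "standard-model": 100, "mini-model": 100}))
```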
Moreover, routing at the semantic level (choosing between separate models based on the prompt) is, in theory, far from the most efficient solution to this problem. DeepSeek-V3.1, released last week, tries to merge reasoning and non-reasoning models into a single hybrid, building routing into the model at a deeper level and fundamentally improving the efficiency of the "reasoning vs. non-reasoning" choice.
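Operationally, a hybrid model turns "to think or not to think" into a per-request switch rather than a choice between two separately deployed systems. A rough sketch through an OpenAI-compatible client; the endpoint and model names follow DeepSeek's public documentation at the time of writing but should be treated as assumptions that may change:

```python
# Rough sketch: with a hybrid model, thinking becomes a per-request switch
# instead of a choice between two separately deployed models. Endpoint and
# model names follow DeepSeek's public docs but are assumptions that may change.

from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # replace YOUR_KEY

def ask(prompt: str, think: bool) -> str:
    # Both names reportedly point to the same V3.1 weights: one runs the
    # chat template in thinking mode, the other in non-thinking mode.
    model = "deepseek-reasoner" if think else "deepseek-chat"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("What is 2 + 2?", think=False))                  # cheap, fast path
print(ask("Plan a 3-city rail itinerary.", think=True))    # deep-thinking path
```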
According to early user reports, the new hybrid reasoning model thinks faster than its predecessor: compared with DeepSeek-R1-0528, DeepSeek-V3.1-Think reaches answers in less time.
With similar answer quality, output length has also dropped markedly: on simple questions, the new model's reasoning traces are more than 10% shorter, and the final output portion is greatly streamlined, averaging only about 1,000 words against R1-0528's average of 2,100, roughly twice as efficient.
On the other hand, the new hybrid reasoning model has also shown some stability problems: for example, outputs are intermittently afflicted by an inexplicable "extremely" bug, with the word "extremely" appearing repeatedly in answers where it is completely irrelevant.
Moreover, the Chinese-English code-mixing that already existed in R1 seems to have gotten worse, making the model feel like a returnee from overseas who is out of place on many Chinese-language tasks.
Even a top domestic team like DeepSeek runs into stability problems when folding the "reasoning vs. non-reasoning" choice into the model itself. The fact that both OpenAI and DeepSeek stumbled, to different degrees, in their first attempts at efficiently allocating "deep thinking" is itself a measure of how hard the problem is.
03
On the other side of the efficiency push, OpenAI remains in a state of "extreme thirst" for computing power.
The launch of DeepSeek-V3 and R1 at the start of the year made the world worry about the prospects of compute suppliers such as NVIDIA. Within just a few months, that worry evolved into the "AI cost paradox": the unit price of tokens keeps falling while model capability keeps rising, so tasks once considered uneconomical to hand to a model can now be given to large models. As the tasks models can handle grow more varied and complex, total token demand is pushed even higher.
OpenAI is pushing an infrastructure expansion plan code-named Stargate: in July 2025, OpenAI and Oracle announced an additional 4.5 GW of data center capacity in the United States.
Yesterday, foreign media also reported that OpenAI is seeking local partners in India and plans to open an office in New Delhi, linking user growth in India (its second-largest user market) with local compute capacity and building a data center of at least 1 GW there.
The "AI cost paradox" on the one hand continuously boosts the performance of NVIDIA and AI cloud service providers, and on the other hand, it also puts forward higher requirements for the "routing" function that can effectively reduce the computing power demand of models.
Sam Altman has repeatedly stated the goal of "bringing more than one million GPUs online by the end of 2025" and a long-term vision of "hundreds of millions of GPUs." Such statements are another sign that even as the unit price of inference falls, more complex tasks and higher call volumes will not automatically shrink large models' "total bill." Routing has to "save the expensive reasoning time for those who need it more."
Starting from the first principles of large models, the ultimate standard every large model company pursues is to keep improving the efficiency of "converting computing power into intelligence." The ability to allocate "deep thinking" efficiently determines, to a large extent, whether a large model company can lead the industry in system and business efficiency as well as user experience in the era of reasoning models.
This article is from the WeChat public account "Facing AI". Author: Hu Run, Editor: Wang Jing. Republished by 36Kr with permission.