Is your AI getting dumber? Or has it just learned to treat different questions differently?
Does anyone else feel that today's AIs seem to be getting dumber?
Here's what happened. The other day, I bit the bullet and subscribed to OpenAI's $200-a-month plan, thinking I'd see just how powerful ChatGPT has become.
So I gave it a math problem: solve 5.9 = x + 5.11 (the answer is simply x = 0.79). It completely botched the calculation...
Isn't this a kindergarten-level problem?
An AI that costs $200, and it's not even as good as my $20 calculator?
But I remember that when GPT-4 first came out, I used it to solve advanced math problems. Could a model upgrade actually make it dumber? So I gave it a calculus problem.
This time it managed fine: it used substitution, worked through the steps, and the answer looks correct. College students in the comments can verify it.
Both problems went to the same GPT-5. So why does it seem to adjust its effort depending on the question?
I thought OpenAI was getting complacent. But when I checked online, I found that this isn't just a problem with GPT. It seems to be a trend in the industry.
The other day, Meituan released an open-source model called LongCat, which mentioned using a router to improve efficiency.
When DeepSeek released V3.1, it likewise advertised that a single model supports two thinking modes.
Google's Gemini, another giant in the field, did the same: when Gemini 2.5 Flash came out, it introduced a mode that lets the model decide on its own how much to think.
In general, everyone is making their models "think only when necessary" and be lazy when they can.
The motivation is easy to understand: it saves money. According to figures released by OpenAI, letting the model decide whether to think saves a significant number of tokens; GPT-5's output tokens dropped by 50%-80%.
The chart released by DeepSeek likewise shows the new model's token consumption falling by about 20%-50%.
What does saving half the tokens mean? An ordinary user may not notice, but for a company the size of OpenAI, it adds up to an enormous cost.
Last year, CCTV reported that ChatGPT consumes more than 500,000 kWh of electricity per day. A rough check: half of that is 250,000 kWh, and with a typical household drawing on the order of 10 kWh a day, the savings really could power a small town of tens of thousands of households for a day.
No wonder Altman has said that all the "please"s and "thank you"s from users cost the company tens of millions of dollars. With the earlier heavyweight models, even a simple "thank you" could trigger minutes of "thinking". What a waste.
So, how did AI develop the ability to adjust its performance based on the problem? OpenAI hasn't disclosed the specific principle, but in 2023, there was a paper called "Tryage: Real-time, Intelligent Routing of User Prompts to Large Language Models" that specifically analyzed this issue.
Back when GPT-3.5 came out, large models couldn't adjust how much effort to spend on their own: every question got the same treatment, so it was easy for the AI to overthink even trivial ones.
To improve efficiency, the researchers proposed a module called a "perceptive router". In essence, it's a small language model embedded in a hybrid system of larger models.
During training, the router works like a student grinding through practice questions: for each prompt, it predicts which model would handle it best.
There are, of course, "correct answers": which model actually suits deep analysis and which suits quick replies. The system compares the router's predicted scores with those answers, measures the error, and then fine-tunes the router's internal parameters to shrink it.
After solving millions of problems, it gradually learns how to assign the appropriate model to your prompt.
When a new prompt comes in, the small routing model inside the AI takes a quick look and assesses whether the problem is worth thinking about. Since the router is relatively lightweight, this assessment process is almost instantaneous.
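To make the idea concrete, here is a minimal sketch of this kind of router, not OpenAI's or the paper's actual code: a tiny network is trained to predict a score for each candidate model, then routes each prompt to the highest-scoring one. All names, dimensions, and data below are made up for illustration.

```python
# A toy "perceptive router": it predicts how well each downstream model would
# handle a prompt, learns by shrinking the error against observed performance,
# and at inference time routes the prompt to the best-scoring model.
# Everything here (model names, dimensions, random data) is illustrative only.
import torch
import torch.nn as nn

MODELS = ["fast-small", "balanced", "deep-reasoner"]  # hypothetical model library

class PerceptiveRouter(nn.Module):
    def __init__(self, embed_dim: int = 64, num_models: int = len(MODELS)):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_models),  # one predicted score per candidate model
        )

    def forward(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(prompt_embedding)

router = PerceptiveRouter()
optimizer = torch.optim.Adam(router.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Training: compare predicted scores with the "correct answers"
# (how each model actually performed on that prompt) and reduce the error.
for _ in range(1000):
    prompt_embedding = torch.randn(32, 64)     # stand-in for embedded prompts
    true_scores = torch.rand(32, len(MODELS))  # stand-in for measured performance
    loss = loss_fn(router(prompt_embedding), true_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference: the lightweight router glances at a new prompt and picks a model.
new_prompt = torch.randn(1, 64)
print("route to:", MODELS[router(new_prompt).argmax(dim=-1).item()])
```

In a real system the "correct answers" come from actually running the candidate models on those prompts, but the loop has the same shape: predict, compare, adjust.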
In addition to OpenAI's method, there's another way for AI to be "lazy", which is to direct different tokens to different neural networks.
Meituan's LongCat uses this method. According to the report, they adopted a mechanism called "zero-computation experts".
Normally, after you input a prompt, the prompt is split into tokens and sent to the neural network inside the model for processing.
But before processing, LongCat first sends each token to a small router called the "Top-k Router". It works like a dispatcher on an assembly line: when a token arrives, it judges whether that token needs heavy processing or only a light touch.
Inside the model there are many neural networks with different specialties, which are called experts.
Some of these experts like to solve difficult problems, some like to solve simple ones, and of course, there are also "slacker" experts.
For example, in the sentence "Please write a quick sort algorithm in Python", "Python" and "quick sort" are the key parts, while "Please" and "a" are less important.
We can send these unimportant tokens to the "slacker" experts, which essentially pass them through without doing any real work. Now you know where the name "zero-computation experts" comes from.
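Here is a rough sketch of how a mixture-of-experts layer with "zero-computation" experts can work. This is my own illustration of the idea, not Meituan's implementation; the shapes, expert counts, and routing details are all assumptions.

```python
# A toy mixture-of-experts layer where some "experts" do no work at all:
# the router scores every expert, and tokens sent to a zero-compute expert
# are simply passed through unchanged. Illustration only, not LongCat's code.
import torch
import torch.nn as nn

class MoEWithZeroComputeExperts(nn.Module):
    def __init__(self, dim: int = 64, num_real: int = 4, num_zero: int = 2, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.num_real = num_real
        # Real experts: small feed-forward networks that do the heavy lifting.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_real)
        ])
        # The router ("dispatcher") scores every expert, including the slacker ones.
        self.router = nn.Linear(dim, num_real + num_zero)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_tokens, dim)
        scores = self.router(tokens).softmax(dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # pick top-k experts per token
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e in range(self.router.out_features):
                mask = chosen[:, slot] == e
                if not mask.any():
                    continue
                if e < self.num_real:
                    expert_out = self.experts[e](tokens[mask])  # real computation
                else:
                    expert_out = tokens[mask]                   # zero computation: pass through
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert_out
        return out

layer = MoEWithZeroComputeExperts()
sentence = torch.randn(8, 64)  # stand-in for the tokens of one short prompt
print(layer(sentence).shape)   # "filler" tokens may end up with the pass-through experts
```

In the real model the pass-through route costs essentially nothing, which is where the speedup on filler tokens like "Please" and "a" comes from.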
This also explains why everyone is praising this model for being "so fast".
In general, this design is good for model manufacturers. It not only saves money but also improves training efficiency.
From the user's perspective, the model is faster and cheaper. But I think it's a double-edged sword. If not used properly, it can really affect the user experience...
Remember when GPT-5 first launched and the router malfunctioned? Users found they couldn't get it into thinking mode no matter what they tried. It was too lazy to think anything through and just nodded along as if it were slacking off. It couldn't even count how many "b"s there are in "blueberry".
Moreover, this also deprives users of the right to choose. OpenAI removed GPT-4o, and many netizens complained online that they lost a friend.
So Altman temporarily restored GPT-4o for Plus users and allowed Pro users to continue accessing other old models.
Doesn't that walk-back indirectly admit that the routing model wasn't well tuned at the time of release?
Now, let's talk about LongCat. It's genuinely fast, but its ceiling for deep reasoning can't match the other big models. For example, I asked both LongCat and DeepSeek the same question: What does "Dante is really not Chinese, but Dante is really Chinese" mean? (The joke only works in the original Chinese, where "但丁真" can be parsed either as "Dante really..." or as "but Ding Zhen...", a Chinese internet celebrity, so each half of the sentence is true under one of the two readings.)
LongCat quickly gave an answer, but it failed to interpret the humor in the sentence. DeepSeek was a bit slower, but it clearly explained the punchline.
[Screenshots: LongCat's answer vs. DeepSeek's answer]
It's like asking you what 114 × 514 is and getting an instant "58,596". Impressively fast, but I wasn't after the arithmetic; I just wanted you to catch the meme and play along.
Of course, there are workarounds for when the router misbehaves: add phrases like "think deeply" or "ultra think" to your prompt. When the router sees these, it tends to hand the question to a more powerful model.
However, this is only a stopgap. Use it a few times too often and the trick may simply stop working...
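For what it's worth, the "trick" is nothing more than a prefix on the prompt, something like the sketch below. Whether the router actually upgrades you to the heavier model is entirely up to the provider, so treat it as a hint, not a switch.

```python
# Purely illustrative: nudging the router toward the "thinking" path by
# prepending a trigger phrase. There is no guarantee the keyword is honored.
question = "How many times does the letter 'b' appear in 'blueberry'?"
prompt = "Think deeply and reason step by step before answering.\n\n" + question
print(prompt)
```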