
The era of cost-conscious AI has arrived

深流研究所, 2026-07-02 11:39
Frontier AI and everyday AI are diverging from each other.

Last week, Doubao officially started charging. It launched a professional version with three price tiers, and the top-tier package costs 5,088 yuan a year.

Soon afterwards, DeepSeek, long known as the "price butcher", said it is changing its pricing as well. Under its new peak-valley pricing model, 9 a.m. to 12 p.m. and 2 p.m. to 6 p.m. every day count as peak hours, during which the cost per call doubles.
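The peak-valley rule described above can be sketched in a few lines. This is only an illustration based on the windows quoted in this article: the function names, the half-open treatment of the window boundaries, and the base cost are assumptions, not DeepSeek's actual billing logic.

```python
from datetime import time

# Peak windows as quoted in the article: 09:00-12:00 and 14:00-18:00.
# Treating each window as half-open (start included, end excluded) is an assumption.
PEAK_WINDOWS = [(time(9, 0), time(12, 0)), (time(14, 0), time(18, 0))]
PEAK_MULTIPLIER = 2.0  # "the cost per call doubles" during peak hours


def is_peak(t: time) -> bool:
    """Return True if the time of day falls inside any peak window."""
    return any(start <= t < end for start, end in PEAK_WINDOWS)


def call_cost(base_cost: float, t: time) -> float:
    """Cost of one call at time t, given a hypothetical off-peak base cost."""
    return base_cost * (PEAK_MULTIPLIER if is_peak(t) else 1.0)


if __name__ == "__main__":
    for hour in (8, 10, 13, 15, 20):
        print(hour, call_cost(1.0, time(hour, 0)))
```

With a base cost of 1.0, a call at 10:00 or 15:00 costs 2.0, while a call at 08:00, 13:00, or 20:00 stays at 1.0.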

Image source: DeepSeek Open Platform

Every gift that seems free carries a hidden price tag. ChatGPT began inserting ads into free users' conversations in early February this year. Last week it also went to France for a large-scale business promotion, and the density of its ad pushes rose sharply.

The era of getting AI for free seems to be coming to an end soon. At this time last year, companies were still engaged in a price war. Now, everyone is thinking about how to make the AI business stop being a bottomless pit with only costs and no profits.

1. Double obstacles in commercialization

There are basically two ways for an AI company to make money. Either charge more or spend less. But now, both ways are blocked.

For the past few years, everyone focused on growing the user base. The Internet business model was deeply ingrained: burn money early to attract users, accept the losses, and let scale dilute the cost once the user base is large enough. That model works because the marginal cost of an Internet service is almost zero. When one more user visits, the server cost hardly increases.

AI products follow the logic of manufacturing instead. As the user base grows, the cost of computing power grows with it, because the cost of AI is rigid: every time one more user asks a question, the model has to run inference in real time and consume computing power. The more users and conversations there are, the more computing power is consumed.

OpenAI, with a monthly active user base of 900 million, had a net loss of 38.5 billion US dollars last year. In the first quarter of this year, the situation did not improve. For every 1 US dollar the company earned, it lost 1.22 US dollars. On Doubao's side, although the daily token call volume has reached 180 trillion, the daily revenue is less than 1 million yuan.

An ad appeared at the bottom of the answer when asking "How to learn AI" on ChatGPT

Meanwhile, computing power itself is a scarce resource, which keeps its price high and prevents the total cost from coming down.

At present, the supply of computing power is constrained by a hard physical wall that is not easy to break through. The first constraint is electricity. Gartner predicts that global data center electricity consumption will exceed 1,200 TWh in 2030, by which time the power grid will not be able to meet the demand. The second is chips. Almost all advanced packaging for the world's high-end AI chips depends on TSMC. Yet even if TSMC expands capacity as fast as it can, NVIDIA alone can consume more than 60% of it. Dozens of companies scramble for the remaining share of less than 40%, and even with money in hand you may not get a place in line.

Moreover, AI is changing form, from the question-and-answer chatbot to the Agent that needs to run continuously. An Agent has to turn a few lines of human instructions into hundreds or thousands of background inference steps, tool calls, and memory accesses. This shift requires an order-of-magnitude increase in computing power.

So, the computing power cost is facing a double squeeze. The cost increases as the call volume increases, and the scale cannot dilute the cost. Also, the supply cannot meet the continuously rising demand, so the cost cannot be reduced.

Then, why not just raise the price and charge more?

In the To B productivity scenario, raising prices is workable. Customers are buying the ability to solve complex professional problems, so the ceiling of intelligence and real-world capability is the primary constraint, and enterprises are willing to pay a high price for it. The soaring ARR of Anthropic, the sky-high stock price of Zhipu, and the wide acclaim for workbuddy all illustrate this point.

However, in the To C scenario, the situation is completely different. Among the 900 million weekly active users of ChatGPT in 2025, about 50 million are individual subscribers, accounting for only about 5%.

The willingness to pay is even lower in China. After being immersed in the "free + ads" Internet model for a long time, domestic users have not developed the habit of paying for independent software. When Doubao tested the subscription model in early May, the topic "Doubao is stupid and still charges" trended on the hot search.

To put it bluntly, ordinary users currently have no loyalty to To C AI products. They will use the one that is convenient and easy to use. Not to mention raising the price, even changing from free to paid will drive away a large number of users.

So, the only way left for enterprises is: can AI consume less computing power resources when completing the same task?

This is what the entire industry is doing now: giving priority to efficiency.

2. Make every bit of computing power count

From the inside out, every layer of the industry is now working along the efficiency-oriented path.

At the lowest layer, hardware, even NVIDIA thinks that relying solely on GPUs is not enough. At this year's GTC conference, NVIDIA launched a new chip called the LPU. It is based on technology from Groq, which NVIDIA licensed last year, and is optimized specifically for AI inference.

How should we understand this? GPUs are good at highly concurrent, large-scale computing, like a ten-thousand-man phalanx charging together. At present they are mostly used for pre-training large models to raise the ceiling of intelligence. The LPU is more like an elite squad that is good at finishing tasks quickly. Everyday inference for ordinary users does not actually need a huge army attacking at once. Fast responses at low cost offer the best value.

Above the chips is the model architecture. MoE (Mixture of Experts) has become mainstream over the past two years. Its advantage is that the model's total parameter count can be pushed to the trillion level to ensure a large enough "brain capacity", while only a small fraction of the parameters is activated for each piece of work, making the model both powerful and economical. Think of a company that, each time it receives a task, picks the most suitable experts from among all its employees to work on it.
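The "pick the most suitable experts" idea can be illustrated with a toy top-k router. This is a minimal sketch under simplifying assumptions: the experts are plain Python functions, the gate scores are supplied by hand rather than learned, and the names `top_k_route` and `moe_forward` are hypothetical. Production MoE layers route per token inside a neural network and are considerably more involved.

```python
import math


def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and combine their outputs by weight."""
    routed = top_k_route(gate_scores, k)
    return sum(weight * experts[i](x) for i, weight in routed)


if __name__ == "__main__":
    # Four toy "experts"; only two of them run for any given input.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    gate_scores = [0.0, 5.0, 0.0, 5.0]  # hand-picked scores for illustration
    print(top_k_route(gate_scores, k=2))
    print(moe_forward(1.0, experts, gate_scores, k=2))
```

The point of the sketch is that the unselected experts are never called, so compute per request scales with k rather than with the total number of experts.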

Keeping the activation rate low is difficult, and selecting the right "experts" is even harder: if the parameters that should be activated are not, answer quality collapses. DeepSeek V4 Pro, for example, has 1.6 trillion parameters in total and activates only 49 billion at a time, equivalent to putting only about 3% of its staff on any given task. Even so, its coding ability is close to that of the top closed-source models, while its output price is only one-eighth that of GPT-5.5.

The hy3 preview recently open-sourced by Tencent follows the same approach. It has 295B parameters, of which only 21B are activated, giving it capability close to a 300B-level model at roughly the cost of a 20B one. After it launched on OpenRouter, developers flocked to it. Being free helps, but the cost-performance ratio at this scale is also genuinely high.

This direction has evidently proven feasible. Xiaowei, the AI assistant currently in grayscale internal testing at Tencent, uses the same idea. The model behind Xiaowei is called WeLM. It has 80 billion parameters in total but activates only 3 billion at a time, an activation rate of just 3.75%, even lower than that of DeepSeek-V4-Flash (4.6%), currently the benchmark for extreme cost-performance in China.
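The activation rates quoted above are simply active parameters divided by total parameters. The short check below recomputes them from the figures given in this article; the dictionary layout and function name are illustrative only.

```python
# (total parameters, activated parameters), both in billions, as quoted in the article.
MODELS = {
    "DeepSeek V4 Pro": (1600, 49),
    "hy3 preview": (295, 21),
    "WeLM": (80, 3),
}


def activation_rate(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per inference step."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (total_b, active_b) in MODELS.items():
        print(f"{name}: {activation_rate(total_b, active_b):.2%}")
```

This gives roughly 3.06% for DeepSeek V4 Pro, 7.12% for hy3 preview, and 3.75% for WeLM, consistent with the "about 3%" and "3.75%" figures in the text.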

Why keep the activation rate so low? WeChat has 1.4 billion monthly active users, so once "Xiaowei" is fully launched, the daily inference volume will be astronomical. If the model's cost-performance ratio is not high enough, the electricity bill alone will eat up all the profits. So most of Xiaowei's daily requests are handled by the fast, cheap WeLM, while partner models take on the truly difficult tasks.

While a model is running, engineering techniques can squeeze out still more computing power. One example is KV cache reuse, used by DeepSeek and others: when you talk to an AI about the same topic repeatedly, repeated content such as system prompts and common prefixes does not have to be computed from scratch each time, because the earlier results can be reused directly. It is like commuting: once you know the route, you do not need to open the navigation app every time.
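The effect of reusing a shared prefix can be shown with a toy counter. This sketch only tracks which token prefixes have been seen and counts how many tokens would need fresh computation. A real KV cache stores per-layer key and value tensors and is managed by the inference engine. The class name `PrefixCache` and its interface are hypothetical.

```python
class PrefixCache:
    """Toy model of prefix reuse: counts tokens that need fresh computation."""

    def __init__(self):
        # Set of token prefixes (as tuples) whose state is assumed to be stored.
        self._cached = set()

    def process(self, tokens):
        """Return how many tokens of this request must be computed from scratch."""
        tokens = tuple(tokens)
        hit = 0
        # Find the longest prefix of this request that is already cached.
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._cached:
                hit = n
                break
        # Remember every new, longer prefix produced by this request.
        for n in range(hit + 1, len(tokens) + 1):
            self._cached.add(tokens[:n])
        return len(tokens) - hit


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = ["You", "are", "a", "helpful", "assistant"]
    print(cache.process(system_prompt + ["Hello"]))    # first request: all 6 tokens
    print(cache.process(system_prompt + ["Goodbye"]))  # shared prefix reused: 1 token
```

The second request shares the five-token system prompt with the first, so only its final token counts as new work.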

In addition to engineering methods, DeepSeek also offers a new approach, which is to use price leverage to optimize computing power scheduling.

Under DeepSeek's new pricing, the price stays the same during off-peak hours, and cache hits remain close to free. In effect, price signals shift part of the load from daytime to the nighttime off-peak period, putting previously idle computing power to use. The same batch of GPUs reaches a higher overall utilization across 24 hours, so the unit cost naturally falls.
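The link between utilization and unit cost is simple arithmetic: a GPU fleet costs roughly the same per day whether busy or idle, so the cost per token served is the fixed cost divided by the tokens actually served. The numbers below are made up purely for illustration, and the function name is hypothetical.

```python
def unit_cost(daily_fixed_cost: float, capacity_tokens_per_day: float,
              utilization: float) -> float:
    """Cost per token served, given the fraction of daily capacity actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return daily_fixed_cost / (capacity_tokens_per_day * utilization)


if __name__ == "__main__":
    # Hypothetical fleet: fixed cost of 1000 per day, capacity of 1,000,000 tokens per day.
    before = unit_cost(1000, 1_000_000, 0.5)  # half the capacity sits idle
    after = unit_cost(1000, 1_000_000, 0.8)   # some load shifted into idle hours
    print(before, after)
```

In this hypothetical case, raising utilization from 50% to 80% lowers the cost per token from 0.002 to 0.00125 without adding any hardware.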

As mentioned before, in the Agent era, the problem of computing power is more tricky. When an Agent is working, a large number of tokens are actually spent on repeatedly moving information, rather than actually producing new things. When multiple Agents collaborate, it is even more exaggerated. They gather together like an inefficient meeting, repeatedly confirming the background that has already been discussed. The longer the task, the more serious the idle running.

Google's A2A protocol and Anthropic's MCP protocol are aimed at solving this problem. Simply put, MCP allows a single Agent to reuse the context internally, without starting from scratch every time. A2A enables multiple Agents to share existing results, avoiding repeated work. One addresses internal consumption, and the other addresses repetition. Working together, they can reduce the ineffective inferences when Agents collaborate.

Priority on efficiency is not just the wishful thinking of enterprises. The needs of users are also differentiating.

There is an indicator called the LLM Token Expenditure Index, which measures the market's willingness to pay for AI, and it has been declining continuously of late. Behind the decline, users are leaving expensive, large-parameter cutting-edge models at an accelerating pace and turning to cost-effective, specially optimized lightweight and MoE models.

Image source: Citadel Securities' report "Tokennomics"

In response, Citadel Securities recently made an incisive judgment: there are signs of differentiation between the use of cutting-edge artificial intelligence and "everyday" artificial intelligence. In other words, cutting-edge AI pursues the upper limit of intelligence, while everyday AI pursues extreme efficiency. The two routes can no longer be measured by the same standard.

This does not mean that cutting-edge models no longer matter. Leading large models will keep pushing the upper limit of intelligence, and there is demand for that. But people have realized that only a few professional or complex scenarios justify an expensive model. In most scenarios, it is perfectly feasible to downgrade deliberately to a more cost-effective one.

After all, a company will not have its chief analyst answer the front-desk phone. The same applies to models. Using a cannon to shoot a mosquito only wastes resources.

If efficiency is successfully prioritized, both enterprises and users benefit. For enterprises, once the cost of a single inference falls, profit comes into view. Lower costs on the enterprise side in turn open up room for price cuts. When prices fall, users who were previously priced out can enter the market, and the paying user base can grow healthily, forming a positive cycle.

3. Make AI accessible to everyone

Recently, in addition to the price increase of To C AI products, tech giants have also been reducing the token usage of their internal employees.

Microsoft has started to cancel the internal Claude Code license and asked employees to switch to its own cheaper Copilot CLI. Amazon has clearly required employees not to use AI just for the sake of using it. Meta has also removed the internal token consumption leaderboard.

As a result, people are forced to learn how to maximize the use of tokens. A knowledgeable engineer can indeed keep the AI bill very low. He knows how to streamline prompts, control the length of the context, and avoid letting the model read the same material repeatedly. For him, these are things that can be easily learned.

Recent technical posts on the CSDN community about saving tokens

But how many ordinary users can understand these technical posts about saving tokens and consciously control token usage every time? They are more likely to be paying for far more computing power than they actually need, and they don't know how to solve this problem.

This gap should not be left for users to fill. The burden of using AI cost-effectively should shift from users to the mechanism level. Ideally, users would not need to know how many models are running in the background. The system would decide that a simple task goes to a cheap small model and a complex task calls a more expensive one, just as you do not need to know how many servers are responding when you use a search engine.
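A routing layer of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the keyword list, the length-based score, the threshold, and the model labels are placeholders, and real routers typically rely on trained classifiers or the models' own confidence rather than hand-written heuristics.

```python
# Hypothetical hints that a request needs deeper reasoning.
COMPLEX_HINTS = ("prove", "refactor", "analyze", "debug")


def estimate_complexity(prompt: str) -> float:
    """Crude complexity score: long prompts and certain keywords score higher."""
    score = len(prompt.split()) / 50
    score += sum(1 for hint in COMPLEX_HINTS if hint in prompt.lower())
    return score


def route(prompt: str, threshold: float = 1.0) -> str:
    """Send simple requests to a cheap model and complex ones to an expensive one."""
    if estimate_complexity(prompt) >= threshold:
        return "large-model"
    return "small-model"


if __name__ == "__main__":
    print(route("What's the weather today?"))
    print(route("Please debug this race condition in my scheduler"))
```

The user only submits a prompt. Which model answers is decided behind the interface, which is the shift of responsibility the paragraph above argues for.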

Only in this way can more ordinary people like you and me who use AI benefit from this new technology.

After all, the value of technology never lies in how extreme it can be, but in how many people it can reach. If AI capabilities cannot be used by everyone, it is just a carnival for the elite.

Electricity was a privilege of factories before it entered every household, and the information gap persisted until the Internet spread to every county. The same goes for AI. Priority on efficiency is not just a business proposition; it is also a matter of technological equality.

Transforming from a tool for a few people to an infrastructure for everyone is a critical moment in every technological revolution. The popularization of AI does not depend on what the most powerful model can theoretically do, but on how low the cost of running AI on a large scale can be reduced. Now, AI is standing at the door of this moment, and priority on efficiency is the pair of hands that can push open this door.