Everyone is talking about a shortage of computing power, yet 90% of AI chips are "wasted"?
You're in the zone with vibe coding and having a great time, or a running project suddenly grinds to a halt; you open the CLI tool and see the words "Your quota has been used up". It's hard not to lose your cool.
Don't bother asking why. AI companies will just say it's "a lack of computing power".
But is that really the case?
Analysts at Epoch AI estimate that by the end of 2025, OpenAI will have the equivalent of about 1.7 million H100 GPUs in computing power. This number was 100,000 in 2023 and 400,000 in 2024 - a 17-fold increase in just two years. NVIDIA's market value has exceeded $3 trillion, and global tech giants are spending tens of billions of dollars every quarter to snap up chips. Everything seems to point to the same narrative: computing power is the oil of AI, and whoever hoards more wins.
Every AI company is spending huge amounts of money to hoard GPUs and computing power. How could they possibly be short of my little quota?!
In an episode of the podcast Latent Space, Anjney Midha, the founder of AI infrastructure company AMP, said: "For cutting-edge labs like xAI, GPU utilization may be less than 10% - this is just the tip of the iceberg of the real problem."
I did a simple calculation. MFU (Model FLOPs Utilization) measures the share of a GPU's theoretical peak computing power that actually goes into model calculations. If you spend $500 million on a GPU cluster and the MFU is only 10%, the effective computing power you get is equivalent to spending only $50 million. The remaining $450 million worth of computing power is just idling.
The strange thing is that these are the smartest engineering teams in the world, spending the largest budgets to build the most advanced computing clusters - and then leaving 90% of the computing power idle.
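The arithmetic above can be written out as a minimal sketch. The function name is illustrative, and treating effective compute as simply proportional to MFU is a simplifying assumption.

```python
def effective_spend(cluster_cost_usd: float, mfu: float) -> tuple[float, float]:
    """Split a cluster budget into the part doing model math and the idle part."""
    if not 0.0 <= mfu <= 1.0:
        raise ValueError("mfu must be between 0 and 1")
    useful = cluster_cost_usd * mfu
    return useful, cluster_cost_usd - useful


# The article's example: a $500 million cluster running at 10% MFU.
useful, idle = effective_spend(500e6, 0.10)
print(f"useful: ${useful / 1e6:.0f}M, idle: ${idle / 1e6:.0f}M")
```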
This is not a management mistake of a small company. It's a structural industry secret.
Massive Purchases, Massive Waste
Let me break down this contrast more specifically.
Josh You, an analyst at Epoch AI, wrote in a widely cited report: "Cutting-edge labs have not used most of their AI computing power." He tracked the computing power growth curves of major labs and found a disturbing pattern - the speed of computing power purchases far exceeds the speed of computing power consumption. A large share of computing resources sits in a "reserve" or "idle" state, more like hoarded strategic supplies than burning fuel.
This is not a problem unique to cutting-edge labs.
Fujitsu cited a more eye-catching set of data in its "State of AI Infrastructure Report" released in 2024: More than 75% of enterprises have a GPU utilization rate of less than 70% even at peak load. Note, this is the "peak" - which means that even at their busiest, three-quarters of enterprises can't even use 70% of their computing power.
VentureBeat made a more radical judgment based on similar data: "95% of AI infrastructure spending is wasted."
I tried to convert these numbers into specific monetary losses. A cloud instance of an H100 GPU costs $30 to $50 per hour. Suppose an enterprise runs a small cluster of 20 GPUs around the clock at a utilization rate of only 20%, which is already quite good in the industry. Then 80% of its 175,200 annual GPU-hours, about 140,000, are idle, and at those prices the computing cost wasted each year is roughly $4.2 million to $7 million. For cutting-edge labs with tens of thousands of GPUs, this number needs to be multiplied by several orders of magnitude.
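The size of this estimate depends heavily on how its inputs are read, so it helps to make them explicit. A minimal sketch follows. It assumes the quoted hourly price is a per-GPU rate and that the cluster is rented around the clock; neither assumption comes from the sources above. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the cluster is reserved year-round


def annual_idle_cost(num_gpus: int, price_per_gpu_hour: float,
                     utilization: float, hours: float = HOURS_PER_YEAR) -> float:
    """Yearly cost of GPU-hours that are paid for but not used."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    idle_gpu_hours = num_gpus * hours * (1.0 - utilization)
    return idle_gpu_hours * price_per_gpu_hour


# The article's scenario: 20 GPUs at 20% utilization, $30 to $50 per hour.
for price in (30.0, 50.0):
    cost = annual_idle_cost(num_gpus=20, price_per_gpu_hour=price, utilization=0.20)
    print(f"${price:.0f}/GPU-hour -> ${cost:,.0f} idle per year")
```

The result scales linearly with every input, so a lower per-GPU price or fewer rented hours shrinks the figure proportionally.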
This reminds me of a forgotten piece of history.
In the late 1990s, the US telecommunications industry went through a crazy fiber-optic cable laying boom. Companies like WorldCom, Global Crossing, and Level 3 buried millions of miles of fiber-optic cables underground, investing more than $100 billion. But when the bubble burst in 2001, the industry discovered a shocking fact: more than 95% of the laid fiber-optic cables were so-called "dark fiber" - they were never lit up and never carried any data. They lay quietly underground, like a buried ambition of an era.
Is today's AI industry, buying GPUs and leaving them idle, just another version of the same story?
But there is a key difference here. The problem with dark fiber was mainly on the demand side - there simply wasn't that much data to transmit at that time. The problem of idle GPUs is more complex because the demand for computing power is clearly there. Every lab is complaining about a lack of computing power, and every researcher is queuing up for GPUs.
Both supply and demand exist. So where exactly is the bottleneck?
GPUs Wait 65% of the Time
I used to naively think that the low GPU utilization rate was because of insufficient workload. Later, when I read some technical analyses at the infrastructure level, I realized that the problem was completely different.
A GPU is not a beast that will just work as long as it's fed. It's more like a picky Michelin-starred chef - if there's a problem with the quality of the ingredients, the serving rhythm, or the kitchen layout, it will stop and wait.
A study by aixenergy revealed a surprising number: During the AI training process, GPUs are idle 30% to 65% of the time. It's not because there are no tasks assigned to them, but because the data isn't ready yet.
This is the so-called "data starvation" problem.
Training a large model requires a huge amount of data. This data needs to go through a series of preprocessing steps such as cleaning, labeling, tokenization, and packaging, and then be loaded from the storage system into the GPU's video memory. The computing speed of a GPU is measured in trillions of floating-point operations per second (TFLOPS), but the IO speed of the storage system can't keep up with this pace. The result resembles a highway where the throughput of the toll booth determines the actual traffic flow: no matter how many lanes you build, if the toll booth can only handle two cars at a time, traffic backs up behind it.
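The toll-booth effect can be illustrated with a toy model of one training step. This is a sketch under simplifying assumptions, not a measurement: each step has a fixed load time and a fixed compute time, the timings are made up, and the function name is hypothetical.

```python
def gpu_busy_fraction(load_s: float, compute_s: float, overlapped: bool) -> float:
    """Fraction of each training step the GPU spends computing.

    load_s     -- seconds to fetch and preprocess one batch (hypothetical)
    compute_s  -- seconds of GPU compute for one batch (hypothetical)
    overlapped -- True if the next batch is prefetched while the GPU computes
    """
    if load_s < 0 or compute_s <= 0:
        raise ValueError("load_s must be >= 0 and compute_s must be > 0")
    # Serial: the GPU waits for the whole load, then computes.
    # Overlapped: the slower of the two stages sets the pace of the pipeline.
    step_s = max(load_s, compute_s) if overlapped else load_s + compute_s
    return compute_s / step_s


# Illustrative numbers only: loading is slower than computing.
for overlapped in (False, True):
    busy = gpu_busy_fraction(load_s=0.13, compute_s=0.07, overlapped=overlapped)
    print(f"overlapped={overlapped}: GPU busy {busy:.0%}, idle {1 - busy:.0%}")
```

In this model, prefetching helps but the GPU still waits as long as storage is the slower stage. The busy fraction reaches 100% only once `load_s` drops below `compute_s`.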
But the story doesn't end there. I found a paper on arXiv about GPU energy consumption, which revealed a more hidden problem: Even when a GPU enters the so-called "deep idle" state, it still consumes a large amount of electricity. Epoch AI's data shows that about 40% of the total power consumption of a GPU data center comes directly from the GPUs themselves. This means that those GPUs waiting for data are not only not working but also burning a significant amount of electricity.
This is like a Ferrari stuck in the morning rush hour on a ring road: the engine is idling, the fuel is burning, but the car isn't moving. And you're still paying $50 per hour for the rent of this car.
There's also a more subtle trap. The arXiv paper points out that the monitoring indicator currently used across the industry, "cluster-level SM utilization", can't effectively reflect real energy efficiency. The SM (Streaming Multiprocessor) is the computing unit inside a GPU. Even if the monitoring panel shows that the SM utilization rate looks normal, many computing cycles are in fact doing "fake work" - data transfer, memory synchronization, waiting for communication - rather than real model calculations.
This explains a phenomenon that puzzled me before: why some teams report a "GPU utilization rate of 70%", but the training speed is far lower than expected. Because in that 70%, maybe only half is doing effective calculations, and the rest is doing support work. The peak load utilization rate is like a company's "best quarterly revenue" - it's real, but it doesn't represent the norm. Using it to measure efficiency is like using your fastest 100-meter running time to evaluate your daily commuting speed.
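One way to see the gap between a dashboard's utilization figure and useful work is to estimate MFU directly from training throughput. The sketch below uses the common approximation of 6 FLOPs per parameter per token for dense transformer training, which is not taken from the article. The model size, throughput, GPU count, and peak FLOPS are all hypothetical inputs.

```python
def model_flops_utilization(params: float, tokens_per_s: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate MFU for dense transformer training.

    Uses the common approximation of 6 FLOPs per parameter per token
    (forward plus backward pass), which ignores attention FLOPs.
    """
    achieved_flops = 6.0 * params * tokens_per_s
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


# All inputs below are hypothetical, chosen only to illustrate the gap.
mfu = model_flops_utilization(
    params=70e9,                # 70B-parameter model
    tokens_per_s=850_000,       # measured training throughput
    num_gpus=1024,
    peak_flops_per_gpu=989e12,  # assumed dense BF16 peak; use your hardware's figure
)
print(f"MFU: {mfu:.1%}")
```

With these made-up inputs the estimate comes out at roughly half of a reported 70% utilization, which is the kind of gap the paragraph above describes.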
When the problem lies in the structure rather than the scale, adding more hardware not only fails to solve the problem but also amplifies the waste proportionally.
When "Making Good Use of Computing Power" Becomes a New Discipline
If the problem is structural, then the solution must also be structural. This is exactly the core proposition put forward by Anjney Midha in that podcast episode. He used a term: "outputmaxxing" - maximizing output.
This term sounds like another trendy Silicon Valley buzzword at first, but the baseline that Midha provided made me realize that it points to a serious engineering problem. He said: "I think the MFU of the current best practitioners is probably between 60% and 70%."
60% to 70%. This is the upper limit that the world's top teams, the most optimized code, and the most carefully tuned infrastructure can achieve. And the industry average is only a fraction of this number.
What does this gap mean? It means that for most AI companies, if they can increase the utilization rate from 10% to 60%, it's equivalent to expanding the effective computing power by 6 times without spending an extra cent. There's no need to rush to buy more GPUs, build more data centers, or engage in a price war with NVIDIA - you just need to really make use of what you've already bought.
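The six-fold figure is simply the ratio of the two utilization rates. A minimal sketch follows; the fleet size is a hypothetical number added for illustration, and the function name is made up.

```python
def effective_compute_gain(old_util: float, new_util: float) -> float:
    """How many times more useful compute the same hardware delivers."""
    if not (0.0 < old_util <= 1.0 and 0.0 < new_util <= 1.0):
        raise ValueError("utilization rates must be in (0, 1]")
    return new_util / old_util


gain = effective_compute_gain(old_util=0.10, new_util=0.60)
fleet = 10_000  # hypothetical fleet size, not a figure from the article
extra_gpus = fleet * (gain - 1.0)  # purchases needed to match the gain at the old rate
print(f"gain: {gain:.1f}x")
print(f"equivalent purchase at the old utilization: {extra_gpus:,.0f} extra GPUs")
```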
This is almost the same path that the cloud computing industry has taken. In the early 2000s, the average utilization rate of enterprises' physical servers was only 10% to 15%. Each server ran one application, and the remaining computing power was all idle. Then VMware brought virtualization technology, which stuffed multiple virtual machines into the same physical server. Later, Docker brought containerization, which further reduced resource overhead.
Today, the utilization rate of cloud servers generally reaches 60% to 70%.
From 10-15% to 60-70%. This leap took about 15 years, gave birth to a trillion-dollar cloud computing industry, and completely changed the way software is built and deployed. The current situation of AI computing power seems to be where the server market was in 2005 - we know where the problem is, but the systematic solution is still in the making.
The change in the business model is accelerating this transformation. In the early days of the AI infrastructure market, the "fixed-fee license" and "bundled token" models were popular - enterprises prepaid a large sum of money to buy a certain amount of computing power quota, and they couldn't get a refund if they didn't use it up. This model naturally encourages waste because the marginal cost is zero, and no one has the incentive to optimize the utilization rate.
VentureBeat's analysis points out that as the industry gradually shifts to pay-per-use, the cost pressure of idle infrastructure is changing from "ignored background noise" to an "urgent matter in the production stage".
When every idle GPU cycle directly corresponds to a number on the bill, "maximizing output" is no longer just a technical ideal but a financial imperative.
At the same time, environmental costs are also forcing an efficiency revolution.
Analysis from Towards Data Science points out that the idleness of most GPUs means that a significant portion of global AI computing carbon emissions is "ineffective emissions" - they don't produce any intelligence, they just turn electricity into heat. 40% of the power consumption of GPU data centers comes from the GPUs themselves, and a large part of this is consumed in idle and deep idle states. This is not just a matter of money, but also a matter of resources and the environment.
Fujitsu released a technical white paper in 2024 with a straightforward title: "Maximizing GPU Utilization". A number of infrastructure companies such as DevZero, Prodia, and Mirantis have also published articles discussing "why 80% of GPUs are idle" and their respective optimization strategies. This collective anxiety across the industry is itself a signal - the problem has become so big that no one can continue to pretend not to see it.
People have overlooked an important thing. In the narrative of the AI competition, "scale" has always been the only protagonist. Who has the most GPUs, who trains the largest models, who spends the most money - these are the materials for headline news. But efficiency has never been a headline. No one will write a news story about "a company increasing its GPU utilization rate from 15% to 50%", even though in terms of actual output, this may be more valuable than buying 100,000 more GPUs.
The reason why Midha's "maximizing output" is worth taking seriously is that it implies a paradigm shift:
The moat in the AI competition is shifting from "who can buy more computing power" to "who can extract more intelligence from the same amount of computing power". The former is a capital-consuming battle, while the latter is a precise engineering battle. The upper limit of the former depends on your bank account and NVIDIA's production capacity, while the upper limit of the latter depends on your in-depth understanding of computational physics, distributed systems, and data engineering.
This is not a problem of incremental optimization. It's the birth of a new discipline.
Every infrastructure revolution seems to follow the same script: first, there's a crazy construction phase, then it's found that most of the production capacity is being wasted, and then a group of new companies and technologies emerge to specifically solve the problem of "how to make good use of what has already been built". This was the case in the railway era, the electricity era, the Internet era, and the cloud computing era. AI computing power has reached the turning point of this arc.
But this time there's an interesting difference. In previous efficiency revolutions, the objects of optimization were relatively "dumb" resources - steam, electricity, bandwidth, server cycles. And this time, the resource we're trying to optimize is being used to create a form of intelligence. When you "wake up" a GPU from an idle state and let it really participate in model training, you're not just increasing the utilization percentage - you're increasing the number of silicon-based brains that are thinking in the world.
Maybe the most important question in the AI era has never been "how much computing power can we produce", but "how much of the computing power we already have is really thinking".
This article is from the WeChat official account "GeekPark" (ID: geekpark), written by Yuhangyuan, edited by Jingyu, and published by 36Kr with authorization.