Jensen Huang answers Google with $20 billion of "money power": joining hands with Groq to shore up the weak spot in inference.
Jay from Aofeisi | QbitAI (WeChat official account: QbitAI)
Jensen Huang acted quickly and decisively. As soon as Google's TPU posed a threat, he responded with a huge financial investment.
He put down $20 billion without hesitation just to win over a red-hot new "shovel factory": Groq.
This undoubtedly marks a significant strategic move by this chip giant in the new AI era. But to some extent, it also reflects Jensen Huang's concerns about a series of new chip paradigms, including the TPU.
So, what can Groq bring to NVIDIA?
In response to this question, well-known tech investor Gavin Baker shared his views.
His series of technical analyses all point to the weakest area of NVIDIA's empire — inference.
In terms of inference, the speed of Groq's LPU far exceeds that of GPUs, TPUs, and any ASICs currently available.
Gavin Baker
This view has been widely praised by netizens:
The GPU architecture simply cannot meet the low-latency requirements of the inference market. The speed of off-chip HBM memory is just too slow.
Netizens' views
However, some netizens pointed out that the SRAM used in the LPU may not be suitable for long-context decoding.
In this regard, Gavin believes that NVIDIA can solve this problem through product "mix-and-match".
Gavin Baker
Let's take a closer look below —
Groq: A vaccine that NVIDIA spent $20 billion on
Gavin believes that the fundamental reason why GPUs are struggling in the new era is that the two stages of the inference process, prefill and decode, have very different requirements for chip capabilities.
First, let's look at prefill:
In simple terms, this step lets the model "read the question", holding the key information the user provided in its "mind" for later use.
During this "question-reading" stage, the model ingests the entire context the user supplies in one go, so all of the input tokens can be processed in parallel.
This is exactly where GPUs shine. They were designed for graphics processing and can calculate thousands of pixels at once, making them naturally suitable for parallel tasks.
In this preparatory stage, the model doesn't need to respond to the user's question immediately. Even if there is a delay, the model can cover up the waiting time by showing "Thinking...".
So rather than raw speed, what prefill demands from a chip is a larger context capacity.
However, when it comes to decode, this logic no longer applies.
Decode is a serial task, and tokens must be calculated one by one. More importantly, users will witness the process of each token being "typed" out. In this case, latency is fatal to the user experience.
However, a GPU keeps most of its data in HBM rather than in on-chip storage next to the compute cores. That means for every token it generates, the GPU has to fetch the model's data from off-chip memory all over again.
At this point the GPU's problem becomes obvious: most of its compute sits idle, its FLOPs cannot be fully utilized, and the chip spends much of its time waiting for memory transfers. The actual amount of arithmetic per step is far less than in the prefill stage.
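To make the contrast concrete, here is a minimal, hypothetical numpy sketch: a single weight matrix stands in for a real transformer, and attention and the KV cache are omitted. Prefill is one batched matrix multiply over the whole prompt, while decode is a serial loop that touches the same weights once per generated token.

```python
import numpy as np

# Toy sketch of the two phases (illustrative names, not vendor code).
rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000
W_hidden = rng.standard_normal((d_model, d_model)).astype(np.float32)   # "model weights"
W_output = rng.standard_normal((d_model, vocab_size)).astype(np.float32)

def prefill(prompt_embeddings):
    # Prefill: every prompt token goes through the weights in one batched
    # matrix multiply, so a massively parallel chip can keep its ALUs busy.
    return (prompt_embeddings @ W_hidden) / np.sqrt(d_model)   # (prompt_len, d_model)

def decode_step(last_hidden):
    # Decode: a single token's vector passes through the same weights. The
    # arithmetic here is tiny compared with the prefill pass, yet on a real GPU
    # the full weights would still be streamed from HBM for this one row,
    # which is why decode is memory-bound rather than compute-bound.
    hidden = (last_hidden @ W_hidden) / np.sqrt(d_model)        # (1, d_model)
    logits = hidden @ W_output                                  # (1, vocab_size)
    return hidden, int(logits.argmax())

prompt = rng.standard_normal((128, d_model)).astype(np.float32)  # a 128-token prompt
hidden_states = prefill(prompt)                                  # one parallel pass

last = hidden_states[-1:, :]
generated = []
for _ in range(16):                      # serial loop: one token per step
    last, token_id = decode_step(last)
    generated.append(token_id)
print(generated)
```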
In contrast, Groq has a better solution — LPU.
Instead of HBM, the LPU uses SRAM integrated directly into the chip's silicon. Because data never has to make the round trip to off-chip memory, this on-chip design offers memory access roughly 100 times faster than a GPU's; even when serving a single user, it can generate 300-500 tokens per second while keeping the chip fully loaded.
In practice, the LPU's speed has proven nearly unrivaled, not only against GPUs but also against TPUs and most ASICs on the market.
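A back-of-envelope calculation shows where the speed gap comes from. The numbers below are assumptions for illustration (8-bit weights, weight traffic only, perfect bandwidth utilization, batch size 1); the ~4.8 TB/s figure is the published HBM3e bandwidth of an H200-class GPU, and ~80 TB/s is the on-chip SRAM bandwidth commonly cited for Groq's chip.

```python
# Decode ceiling: tokens/s ≈ memory bandwidth / bytes streamed per token.
model_bytes = 70e9                    # a 70B-parameter model at 1 byte per parameter

# GPU: every generated token re-streams all weights from off-chip HBM.
hbm_bandwidth = 4.8e12                # ~4.8 TB/s HBM3e on an H200-class GPU
print(f"single-GPU decode ceiling ≈ {hbm_bandwidth / model_bytes:.0f} tokens/s")   # ≈ 69

# LPU: weights are sharded across chips, each holding at most 230 MB in on-chip
# SRAM at roughly 80 TB/s, so per-chip weight traffic per token takes microseconds.
sram_per_chip = 230e6
sram_bandwidth = 80e12
print(f"per-LPU weight read per token ≈ {sram_per_chip / sram_bandwidth * 1e6:.1f} µs")
```

Seen through this lens, the 300-500 tokens per second cited above is less about raw FLOPs than about removing the off-chip memory round trip.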
However, this doesn't come without a cost.
Compared to a GPU, the LPU has far less memory capacity: a single Groq LPU chip carries only 230MB of on-chip SRAM.
By contrast, a single NVIDIA H200 GPU comes with a whopping 141GB of HBM3e memory.
As a result, you have to connect hundreds or thousands of LPU chips together to run a model.
Take Llama-3 70B as an example. With NVIDIA GPUs you only need two to four cards, which fit in a small server chassis. To serve the same model, hundreds of LPUs are required, and the data-center floor space ends up far larger than a GPU deployment.
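The footprint gap follows directly from the capacity numbers. A rough sketch, under illustrative assumptions (16-bit weights, weight storage only, no KV cache or redundancy):

```python
import math

params = 70e9                          # Llama-3 70B
weights_gb = params * 2 / 1e9          # ≈ 140 GB at 2 bytes per parameter

h200_hbm_gb = 141                      # HBM3e capacity per H200
lpu_sram_gb = 0.230                    # 230 MB on-chip SRAM per Groq LPU

print(f"H200s needed for the weights alone: {math.ceil(weights_gb / h200_hbm_gb)}")   # 1
print(f"LPUs needed for the weights alone:  {math.ceil(weights_gb / lpu_sram_gb)}")   # 609
```

That is roughly the gap described above: one card (two to four in practice, once the KV cache and serving headroom are counted) versus several hundred chips.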
This means that even though the price of a single LPU is lower, the overall hardware investment will still be very large.
Therefore, when AI companies consider the LPU, the most important question is —
Are users willing to pay for "speed"?
A year ago, the market couldn't answer this question. But judging from Groq's current performance, it's very clear: "Speed" is a real and huge demand that is still growing rapidly.
For NVIDIA, this is not only a new business segment but also a high-risk arena full of potential disruptors. If NVIDIA misses this shift, newcomers could upend its position in the AI era, just as NVIDIA itself once used the gaming business to disrupt its rivals.
To keep these competitors from eroding its moat, NVIDIA chose to inject itself with the "vaccine" called Groq, hoping that the talent acquisition brings in fresh blood, fills the gap in low-latency inference scenarios, and helps the giant ship that is NVIDIA steer clear of the innovator's dilemma.
The "shovel" enters a new era
The rise of the TPU has cracked NVIDIA's aura of invincibility.
By developing its own chips, Google has successfully reduced its dependence on NVIDIA's sky-high-priced GPUs. This has cut Google's training and inference costs significantly, allowing it to stay in fairly healthy financial shape while serving an enormous base of free users.
Google's comeback with Gemini 3 Pro has proven that GPUs are not the only solution in the AI era. Against the backdrop of the rapid iteration of the technology cycle, chips, as the "heart" of AI, also need to be adjusted according to different development stages.
As the progress of foundation models slows down, the focus of AI competition is shifting from the training layer to the application layer. In the AI application market, "speed" is crucial for the user experience.
The Groq talent acquisition is an implicit admission of NVIDIA's shortcomings in inference, but it also marks another expansion of the NVIDIA empire.
Having dominated pre-training, NVIDIA is now riding the Groq wave onto the "Inference Continent", where new competitors keep emerging.
In this new market, NVIDIA may not be as dominant as it is now.
As Groq's CEO has said, inference chips are a high-volume, low-margin business. That is very different from GPUs, which remain in high demand even at sky-high prices and carry gross margins of 70-80%.
Reference links:
[1] https://x.com/gavinsbaker/status/2004562536918598000
[2] https://www.uncoveralpha.com/p/the-20-billion-admission-why-nvidia
This article is from the WeChat official account QbitAI ("focusing on cutting-edge technology") and is republished by 36Kr with permission.