
Does the algorithm behind Musk's new model come from NVIDIA?

量子位 | 2025-09-26 11:15
Inference efficiency soars by 53 times

Grok-4-fast has recently shown striking gains in cost reduction and efficiency, even outperforming GPT-5, which holds the so-called "router" advantage.

Faced with such impressive inference efficiency, many people's first reaction is that brute-force scaling of computing power by stacking GPUs has once again worked wonders.

And indeed, there is a trace of NVIDIA behind Grok.

However, what may have contributed to this success is not Jensen Huang's graphics cards so much as an algorithm.

Yes, Grok-4-fast's secret weapon has been linked to an NVIDIA algorithm paper.

The Rocket Engine That Makes LLMs 53 Times Faster

As Grok-4-fast demonstrates, this paper tackles the industry's long-standing problem of inference cost.

Blindly scaling hardware only makes model vendors' bills longer, while users' patience wears thin over long inference runs.

To address this, the NVIDIA research team has introduced a brand-new "hybrid architecture" model: Jet-Nemotron.

Across a comprehensive set of benchmarks, Jet-Nemotron-2B performs on par with top-tier open-source models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2, while generating roughly 53 times faster.

For example, on MMLU-Pro, Jet-Nemotron-2B not only achieves higher accuracy than Qwen3-1.7B-Base but also generates 47 times faster.

Jet-Nemotron-2B also holds its own against larger models: its accuracy on MMLU and MMLU-Pro even exceeds that of DeepSeek-V3-Small and Moonlight (15B total parameters, 2.2B active).

The key to all of this is a new framework called PostNAS.

Unlike previous approaches, PostNAS does not train from scratch. Instead, it starts from a pre-trained full-attention model, freezes its MLP weights, and explores improvements only to the attention mechanism.

This not only cuts training cost by several orders of magnitude, it also frees up resources for a more thorough exploration of the model architecture.
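To make the idea concrete, here is a minimal, hypothetical sketch (in PyTorch style) of the PostNAS starting point: the pre-trained model's MLP weights are frozen, and only the attention modules are swapped out and left trainable. The module and attribute names are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of the PostNAS starting point (not the official implementation).
# Assumes a pre-trained transformer whose blocks expose `.mlp` and `.attn` submodules.
import torch.nn as nn

def prepare_for_postnas(model: nn.Module, linear_attn_factory) -> nn.Module:
    for block in model.blocks:
        # Freeze the MLP (FFN) weights inherited from the pre-trained model.
        for p in block.mlp.parameters():
            p.requires_grad = False
        # Replace full attention with a trainable linear-attention candidate.
        block.attn = linear_attn_factory(block.attn.hidden_size)
    return model
```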

Its process includes four core parts: full-attention layer placement, selection of the optimal linear attention module, design of a better linear attention module, and hardware-aware architecture search.

Full-Attention Layer Placement

Most teams uniformly use full attention in all layers of the model, but this wastes computing resources.

Therefore, the NVIDIA team wants to retain a small number of key full-attention layers to preserve accuracy on complex tasks, while removing redundant layers to improve efficiency.

PostNAS's approach is as follows: first, build a super-network that contains both full-attention and linear-attention options; then, train sub-networks via feature distillation; finally, use beam search to find the optimal placement of the full-attention layers.
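The beam search over placements can be pictured with a toy sketch like the one below; `score_fn` stands in for evaluating a distilled sub-network on validation data, and the beam width and budget values are made up for illustration.

```python
# Toy illustration of searching which layers keep full attention.
# `score_fn(placement)` is assumed to evaluate the corresponding sub-network
# (e.g., extracted from the distilled super-network) and return an accuracy.
def beam_search_placement(num_layers, budget, score_fn, beam_width=4):
    beam = [()]                                    # start with no full-attention layers
    for _ in range(budget):                        # add one full-attention layer per round
        candidates = set()
        for placement in beam:
            for layer in range(num_layers):
                if layer not in placement:
                    candidates.add(tuple(sorted(placement + (layer,))))
        # Keep only the top-scoring placements for the next round.
        beam = sorted(candidates, key=score_fn, reverse=True)[:beam_width]
    return beam[0]                                 # best placement found
```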

It turns out that not all attention layers are equally important: different tasks rely on different layers, and a small number of key layers can cover the needs of most tasks.

The experimental results show that PostNAS beats uniform placement: with only two full-attention layers, PostNAS reaches about 49% accuracy, versus about 40% for the uniform placement strategy.

Selecting the Optimal Linear Attention Module

After fixing the full-attention layers, the NVIDIA team turned to the attention modules themselves, aiming to find the best existing linear attention module.

The paper evaluates six state-of-the-art linear attention modules: RWKV7, RetNet, Mamba2, GLA, DeltaNet, and Gated DeltaNet.

Among these six, Gated DeltaNet achieved the highest accuracy, mainly thanks to two factors (sketched in code after the list):

1. Data-dependent gating mechanism: think of it as a router. Based on the input, the model decides whether to pay more attention to new information or to the previous historical state, striking a balance across different tasks.

2. Delta rule: instead of overwriting the entire memory at every step, it updates only the parts that changed. This reduces redundant storage, saves memory, and preserves the continuity of information.
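A schematic way to read these two ideas together is a recurrent memory update of roughly the following form; Gated DeltaNet's exact formulation differs, so treat this as an illustration rather than the paper's equations.

```python
import torch

def gated_delta_step(S, k, v, alpha, beta):
    """Schematic single-step update combining the two ideas above (not the paper's math).
    S: (d_k, d_v) memory state; k: (d_k,) key; v: (d_v,) value;
    alpha: data-dependent gate on the old state; beta: write strength."""
    v_old = S.T @ k                      # what the memory currently returns for this key
    delta = v - v_old                    # only the part that changed gets written back
    S = alpha * S + beta * torch.outer(k, delta)
    return S
```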

A Better Solution: JetBlock

However, NVIDIA didn't stop at Gated DeltaNet; it designed an even stronger linear attention module, JetBlock.

Convolution is crucial to the accuracy of linear attention modules, but most previous methods use static convolution kernels, which cannot adapt how they extract features to the input.

In contrast, JetBlock uses dynamic convolution. By introducing a convolution kernel generator module into linear attention, JetBlock can dynamically generate convolution kernels based on input features.
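As a rough illustration of what "dynamically generating kernels from input features" means, consider the sketch of a dynamic depthwise convolution below; the shapes, pooling choice, and layer names are assumptions, not JetBlock's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDepthwiseConv(nn.Module):
    """Sketch of a dynamic convolution: a small generator predicts a depthwise
    kernel from the input features instead of using a single static, learned kernel.
    Illustrative only; not JetBlock's exact architecture."""
    def __init__(self, channels, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.generator = nn.Linear(channels, channels * kernel_size)

    def forward(self, x):                          # x: (batch, seq_len, channels)
        b, t, c = x.shape
        # Generate one depthwise kernel per sequence from pooled input features.
        kernels = self.generator(x.mean(dim=1))    # (b, c * kernel_size)
        kernels = kernels.view(b * c, 1, self.kernel_size)
        # Causal depthwise convolution using the generated kernels.
        x = x.transpose(1, 2).reshape(1, b * c, t)
        x = F.pad(x, (self.kernel_size - 1, 0))
        out = F.conv1d(x, kernels, groups=b * c)   # (1, b*c, t)
        return out.view(b, c, t).transpose(1, 2)   # back to (batch, seq_len, channels)
```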

The results show that JetBlock outperforms Gated DeltaNet in accuracy on mathematical reasoning and retrieval tasks while still maintaining good generation efficiency.

Compared with Mamba2, which has the worst performance, the advantages of JetBlock are even more obvious.

Hardware-Aware Architecture Search

After fixing the macro-architecture and the linear attention module, the NVIDIA team conducted a hardware-aware architecture search to optimize the core hyperparameters (key/value dimensions, number of attention heads, and so on).

In the past, parameter count was usually the main indicator of model efficiency and the main guide for architecture design.

However, the NVIDIA team believes that this method is not ideal because the number of parameters does not directly reflect the efficiency on real hardware.

Their improvement: select hyperparameters with generation throughput as the direct optimization target.

The NVIDIA team found that, compared with parameter count, KV cache size is the most critical factor affecting long-context generation throughput: with KV cache size fixed, models of different parameter scales deliver similar generation throughput.

Based on this, the team kept the KV cache size consistent with the original design and ran a small-scale grid search over the key dimension, value dimension, and number of attention heads.
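The search could look roughly like the sketch below: the KV cache size of the full-attention layers is treated as a hard constraint, and head counts and key/value dimensions are grid-searched within it. All numbers, helper names, and the `accuracy_fn` proxy are hypothetical.

```python
def kv_cache_bytes(num_full_attn_layers, heads, head_dim_k, head_dim_v,
                   seq_len, dtype_bytes=2):
    """Approximate KV cache size for the full-attention layers only
    (linear-attention layers keep constant-size state). Illustrative formula."""
    per_token = heads * (head_dim_k + head_dim_v) * dtype_bytes
    return num_full_attn_layers * seq_len * per_token

def grid_search(target_bytes, accuracy_fn, seq_len=65536, layers=2, tol=0.05):
    """Keep only configurations whose KV cache roughly matches the original design,
    then pick the one with the best accuracy. `accuracy_fn` is assumed to
    train/evaluate a small proxy model for each configuration."""
    best, best_acc = None, -1.0
    for heads in (4, 8, 12, 16):
        for dk in (64, 96, 128, 192):
            for dv in (64, 96, 128, 192):
                size = kv_cache_bytes(layers, heads, dk, dv, seq_len)
                if abs(size - target_bytes) / target_bytes > tol:
                    continue                       # KV cache budget is the hard constraint
                acc = accuracy_fn(heads, dk, dv)
                if acc > best_acc:
                    best, best_acc = (heads, dk, dv), acc
    return best, best_acc
```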

Experiments show that the optimized version maintains the same throughput while its parameter count grows (184 million vs. 170 million) and its math accuracy improves (34.8% vs. 32.8%).

In summary, PostNAS is likely to affect the current AI industry in three ways.

1. GPU time in the inference stage is cut by a factor of 47, letting LLMs complete high-quality tasks faster.

2. Smaller memory requirements, which make deployment on cheaper hardware possible.

3. Higher throughput, which means that model manufacturers can serve more users with the existing infrastructure scale.

Moreover, PostNAS offers a low-cost, high-efficiency way to explore architectures, and it applies to any pre-trained Transformer.

So basically, any vendor can apply PostNAS without retraining the model, and serving cost can drop significantly while accuracy is barely affected.

In addition, Jet-Nemotron is set to be open-sourced.

The corresponding author, Han Cai, said on GitHub that Jet-Nemotron's code and pre-trained models will be released once the legal review is complete.

Interested readers can find the link at the end of the article~

Is NVIDIA Behind Grok-4-fast?

Seeing Grok-4-fast and Jet-Nemotron put up equally impressive and strikingly similar numbers, it's hard not to suspect that Musk and Huang have joined hands this time.

On Reddit, some netizens speculated that Grok-4-fast was built on Jet-Nemotron.

Jet-Nemotron can significantly reduce the computation required for inference without sacrificing model performance, which closely matches what Grok-4-fast has demonstrated.

The pricing data lends some support: Grok-4-fast's price cut is consistent with NVIDIA's prediction for models of this architecture (the paper estimates 20 to 50 times cheaper).

More importantly, if Jet-Nemotron can be applied to Grok, it can also be deployed by companies such as OpenAI, Anthropic, and Google.

Some netizens disagree, arguing that Grok's price cut may simply be a marketing strategy, and that no adoption of new technology can be inferred from it.

"They may just be burning money to gain market share. I don't think you can infer the adoption of a specific architecture from this."

However, even if Grok-4-fast doesn't use NVIDIA's technology, the paper is still very valuable, because Jet-Nemotron could still be used to cut costs further. Moreover, it's unlikely that xAI developed another technique as effective as Jet-Nemotron in such a short time.

Of course, it may also be a breakthrough in some other algorithm. If so, it would still be extremely significant, since Jet-Nemotron could then be used to cut costs even further. But honestly, what are the odds that xAI has really found an algorithmic improvement that cuts the price by more than 20 times?

That said, all of the above is speculation; none of it has been confirmed by xAI...

Another Masterpiece by Chinese Scholars

We don't know whether Grok-4-fast really adopted this technology. What is clear is that behind this breakthrough stands another concentrated effort by Chinese scholars: all of the paper's authors are Chinese.

The first author of the paper is Gu Yuxian, a fourth-year doctoral student in the Conversational AI (CoAI) research group of the Department of Computer Science and Technology at Tsinghua University, advised by Professor Huang Minlie.