AMD's New Paper Challenges Conventional Wisdom: FP4 Training Is Unstable, but Not Because of Insufficient Randomness

An end-to-end training speedup of 9-10% on native FP4 hardware.

As is well known, the cost of training large models is extremely high.

It is equally well known that reducing training precision can significantly lower that cost. DeepSeek-V3 was trained in FP8, which brought its reported training cost down to $5.6 million and caught the attention of the entire industry.

After the success of FP8, the industry has kept probing the limits of low precision: if precision drops from FP8 to FP4, how much further can the training cost fall?

In theory, FP4 can deliver twice the computational throughput of FP8. Both the NVIDIA Blackwell and AMD MI350 series natively support FP4 operations at the hardware level. NVIDIA claims that FP4 computing power on the B200 can reach 4500 TOPS (sparse). The hardware is ready, but on the software and algorithm side there has been a persistent problem:

Training large models from scratch with FP4 is very unstable.

Over the past two years, works such as LLM-FP4 and NVFP4 pre-training have attempted this approach one after another, but few solutions can complete full pre-training at 4-bit precision while keeping convergence quality close to that of FP8.

Trickier still, the cause of the collapse has never been clear. The prevailing analysis held that the instability of FP4 training is likely due to insufficient randomness.

Recently, however, AMD, in collaboration with Pennsylvania State University, published a paper that overturns this view and provides a new, clear diagnosis for native FP4 training.

Paper title: Pretraining large language models with MXFP4 on Native FP4 Hardware
Paper link: https://arxiv.org/abs/2605.09825

The paper completed full pre-training of Llama 3.1-8B in MXFP4 format on the AMD Instinct MI355X GPU. End-to-end training is 9-10% faster than the FP8 baseline, at the cost of only 8-9% more training tokens to reach the same target. This is currently the first complete experiment of pre-training a large model on native FP4 hardware rather than in software simulation.

More importantly, the paper reveals the core issue: the instability of FP4 training does not stem from insufficient randomness, but from the accumulation and amplification of structural micro-scaling errors along sensitive gradient paths.

What is MXFP4

Before dissecting the paper, it is necessary to understand the MXFP4 data format.

Traditional integer quantization usually uses a single scaling factor for the entire tensor. The core design of MXFP4 is called "micro-scaling": a tensor is divided into small blocks (for example, groups of 32 elements), and each block is assigned a shared exponent stored in E8M0 format. Each element within the block is represented by a 4-bit floating-point number. The reconstruction formula can be written as:

x̂_i = 2^(E_shared) · Q_FP4(x_i / 2^(E_shared))

Here, E_shared is the shared exponent, set from the maximum exponent within the block, and Q_FP4 rounds its argument to the nearest value representable in 4-bit floating point.

The advantage of micro-scaling is that each small block has its own dynamic range and is not dominated by outliers elsewhere in the tensor. This makes 4-bit floating-point representation much more accurate than simple global quantization.

However, even with micro-scaling, FP4 training is still unstable.
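The block-wise scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's kernel: it assumes the E2M1 element grid and the shared-exponent rule from the OCP Microscaling specification (largest exponent in the block minus the largest FP4 exponent), and the function names are made up for this example. Ties are rounded toward the smaller magnitude for simplicity, which differs from hardware round-to-nearest-even.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2  # exponent of the largest FP4 binade (4.0 == 2**2)


def round_to_fp4(v):
    """Round v to the nearest FP4 magnitude, saturating at +/-6.

    Simplification: ties go to the smaller magnitude, not to even.
    """
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(v)))
    return math.copysign(mag, v)


def quantize_block(block):
    """Quantize one block; return (shared_exponent, dequantized_values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale: places the block maximum in FP4's top binade.
    e_shared = math.floor(math.log2(amax)) - FP4_EMAX
    scale = 2.0 ** e_shared
    return e_shared, [scale * round_to_fp4(v / scale) for v in block]


# A block of small values gets a small scale and keeps its resolution.
e_small, q_small = quantize_block([0.1] * 32)

# An outlier raises the scale for its own block only; the small values
# that share that block are flushed to zero.
e_mixed, q_mixed = quantize_block([0.1] * 31 + [8.0])

print(e_small, q_small[0])
print(e_mixed, q_mixed[0], q_mixed[-1])
```

The second call shows the flip side of micro-scaling: an outlier no longer hurts the whole tensor, but it still determines the scale of the 32 elements it shares a block with.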

Troubleshooting experiment: The root cause of instability

The research team first designed a controlled experiment to isolate the problem step by step.

A complete Transformer linear layer calculation involves three general matrix multiplication operations:

Fprop (Forward propagation): Calculate Y = XW^T to produce activation values

Dgrad (Activation gradient): Calculate ∇X = ∇Y · W to back-propagate the gradient to the input

Wgrad (Weight gradient): Calculate ∇W = (∇Y)^T · X to produce the gradient for updating weights
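The shapes involved in these three products can be made explicit with a small sketch using nested lists. It assumes X is (tokens × in_features), W is (out_features × in_features), and ∇Y is (tokens × out_features); the helper names are illustrative and do not come from the paper.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def linear_gemms(x, w, grad_y):
    """Return the outputs of the three GEMMs of one linear layer."""
    y = matmul(x, transpose(w))            # Fprop: Y  = X W^T
    grad_x = matmul(grad_y, w)             # Dgrad: dX = dY W
    grad_w = matmul(transpose(grad_y), x)  # Wgrad: dW = dY^T X
    return y, grad_x, grad_w


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]               # 3 tokens, 2 inputs
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # 4 outputs, 2 inputs
grad_y = [[1.0] * 4 for _ in range(3)]                 # 3 tokens, 4 outputs

y, grad_x, grad_w = linear_gemms(x, w, grad_y)
print(len(y), len(y[0]))            # same shape as grad_y
print(len(grad_x), len(grad_x[0]))  # same shape as x
print(len(grad_w), len(grad_w[0]))  # same shape as w
```

Note that Fprop and Dgrad contract over a feature dimension, while Wgrad contracts over the token dimension, so every entry of ∇W is a sum over all tokens in the batch.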

Keeping all other factors unchanged, the team switched these three operations from FP8 to MXFP4 one at a time and observed the impact of each step on convergence. All experiments ran on the AMD Instinct MI355X using the native FP4 tensor core, without relying on software simulation.

The training task is the standard MLPerf setting: pre-training Llama 3.1-8B on the C4 dataset, with a convergence target of 3.3 validation log perplexity.

The first two steps added only a moderate token overhead, but once Wgrad was also switched to MXFP4, the overhead jumped to 26-27%.

Wgrad is the bottleneck of FP4 training. Forward propagation and activation gradients have a considerable tolerance for FP4 quantization, but once the weight gradient is quantized to 4 bits, the convergence quality shows a significant degradation.

Until now, the mainstream intuition in the industry was that FP4 quantization error is essentially a noise problem, so randomness can be injected to "smooth" the error distribution. Two common strategies are:

Stochastic Rounding: Introduce randomness during quantization to make the expected value of the rounding error zero

Randomized Hadamard: Use a Hadamard transform with random sign flips to disperse the data distribution before quantization
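As a reference point for the first strategy, stochastic rounding onto the FP4 grid can be sketched as follows. This is a generic illustration of the technique, assuming the E2M1 magnitude grid; it is not the implementation evaluated in the paper, and `stochastic_round` is a name chosen for this example.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(v, rng):
    """Round v to one of its two neighbouring grid points at random.

    The upper neighbour is picked with probability proportional to how
    close v is to it, so the expected result equals v (within range).
    """
    a = min(abs(v), FP4_GRID[-1])            # saturate at the largest value
    lo = max(g for g in FP4_GRID if g <= a)  # neighbour below
    hi = min(g for g in FP4_GRID if g >= a)  # neighbour above
    if hi == lo:                             # already representable
        return math.copysign(lo, v)
    p_up = (a - lo) / (hi - lo)
    mag = hi if rng.random() < p_up else lo
    return math.copysign(mag, v)


rng = random.Random(0)
samples = [stochastic_round(2.4, rng) for _ in range(20000)]
# Individual results are 2.0 or 3.0; their average approaches 2.4.
print(sum(samples) / len(samples))
```

Each individual result is still off by up to one grid step, and the error differs from call to call; only the average is unbiased.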

Once Wgrad is quantized, both randomness strategies not only fail to stabilize training but lead directly to non-convergence. Instead of helping, randomness increases the effective quantization error on the critical gradient path.

In contrast, deterministic Hadamard rotation brings the token overhead of full MXFP4 training from 26-27% back down to 8-9%, and the training trajectory closely follows the FP8 baseline.

This result has real diagnostic value. Both random and deterministic Hadamard rotations are orthogonal transforms that spread the energy of outliers across many elements. In theory, their effects on quantization error should be similar. Yet in the Wgrad scenario they behave in opposite ways, which reveals the essence of the problem:

The instability of FP4 training is driven by the structural errors that MXFP4 micro-scaling generates on sensitive gradient paths. Randomness strategies fail because they introduce a different error pattern at each step, and these changing patterns accumulate along the gradient path and amplify the instability. Deterministic rotation works precisely because it applies the same transform at every step, which keeps the error pattern consistent and avoids error accumulation.
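The difference between the two rotations can be sketched with a 16-point Walsh-Hadamard transform. The size 16 is an assumption taken from the "H16" label in the figure caption; which operands the paper rotates, and along which dimension, is not specified in this article, so the code below only illustrates the transforms themselves.

```python
import math
import random


def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform (length: a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * s for v in out]


def randomized_hadamard(vec, rng):
    """Flip each sign at random, then rotate; differs on every call."""
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return hadamard([s * v for s, v in zip(signs, vec)])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# One outlier of 16.0 is spread into sixteen entries of magnitude 4.0.
spike = [16.0] + [0.0] * 15
print(hadamard(spike))

# Rotating both operands leaves their inner product unchanged (up to
# floating-point rounding), which is why a GEMM can be rotated along
# its contraction dimension.
a = [float(i) for i in range(16)]
b = [float((7 * i) % 5) for i in range(16)]
print(dot(a, b), dot(hadamard(a), hadamard(b)))

# The deterministic rotation is identical on every call; the randomized
# one draws fresh signs each time, so its output pattern changes.
rng = random.Random(0)
print(randomized_hadamard(a, rng) == randomized_hadamard(a, rng))
```

In exact arithmetic both variants are orthogonal and cancel out of the product; they only differ in how the quantization error introduced after the rotation is distributed from one step to the next.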

End-to-end efficiency: Training step throughput +20%, comprehensive acceleration 9-10%

After adding deterministic Hadamard rotation to full MXFP4 training, the efficiency data is as follows:

Training step throughput increases by 20%. After deducting the additional 8-9% token overhead, the end-to-end comprehensive acceleration is still 9-10%.

Considering that the precision is cut directly from 8 bits to 4 bits, both the convergence quality and the size of the speedup are quite remarkable.
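The way the two figures combine can be checked with a short calculation. It assumes that "+20%" means tokens processed per second and that the 8-9% overhead means extra tokens needed to reach the same target; both numbers are rounded in the article, so the result is approximate.

```python
def time_to_target_ratio(throughput_gain, token_overhead):
    """Wall-clock time to reach the target, relative to the baseline.

    More tokens are needed (numerator), but each token is processed
    faster (denominator).
    """
    return (1.0 + token_overhead) / (1.0 + throughput_gain)


for overhead in (0.08, 0.09):
    ratio = time_to_target_ratio(0.20, overhead)
    print(f"overhead {overhead:.0%}: time saved {1.0 - ratio:.1%}")
```

Under these assumptions the time to reach the target falls by roughly 9-10%, which matches the reported end-to-end figure when it is read as a reduction in training time.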

Left figure: Validation log perplexity of Llama 3.1-8B as a function of the number of training tokens during MLPerf pre-training on the C4 dataset. MXFP4 with deterministic Hadamard tracks FP8 very closely, while full MXFP4 without stabilization converges more slowly and trains less stably. Right figure: A magnified view of the later stage of training. The MLPerf target log perplexity is 3.3. Compared with the non-stabilized MXFP4 run, deterministic Hadamard (H16) stays closer to the FP8 baseline.

It is worth noting that the authors explicitly state an important limitation in the paper: the effectiveness of this FP4 training scheme has been verified for one setting (the MLPerf C4 dataset with Llama 3.1-8B), and it cannot be assumed to transfer seamlessly to all models, datasets, and training methods. The behavior of FP4 training may be highly setting-dependent, and specific stabilization strategies need to be re-verified for each scenario.

Conclusion

Putting this paper into a broader industrial context, it has at least three layers of significance.

First layer: It answers a fundamental "why". Most previous FP4 training work focused on how to keep training from crashing. This paper gives a clear causal diagnosis for the first time: the crash is due to structural micro-scaling errors on the Wgrad path, not insufficient randomness. The diagnosis itself has methodological value. It tells subsequent researchers that when low-precision training becomes unstable, they should first investigate the source of structural errors rather than blindly adding randomness.

Second layer: It pushes FP4 from "inference-only" to "training-capable". Previously, the industry consensus was that FP4 was only suitable for inference quantization and that training required at least FP8. NVIDIA's emphasis on FP4 inference rather than training on Blackwell also reflects this judgment. This paper completed full pre-training on native FP4 hardware, which means that the FP4 computing power built into MI355X and Blackwell for inference can, in theory, also be used for training. If FP4 training proves feasible for larger models and more scenarios, the usable training compute of existing hardware would roughly double.

Third layer: It uses the OCP open standard. MXFP4 is part of the OCP Microscaling format standard, which is jointly supported by seven companies: AMD, NVIDIA, Intel, Meta, Microsoft, Arm, and Qualcomm. Because it builds on an open standard, the method is portable across hardware from different manufacturers and is not locked into a single ecosystem.

From FP16 to FP8, DeepSeek-V3 has proven that halving the precision can significantly reduce the training cost. From FP8 to FP4, this paper takes a crucial first step. Every reduction in precision changes the economics of large-model training.

This article is from the WeChat official account “Almost Human” (ID: almosthuman2014), edited by Leng Mao. It is published by 36Kr with authorization.

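The views expressed in this article are solely those of the author. The 36Kr platform only provides information storage services.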
