NVIDIA's most powerful B200 GPU wastes 60% of its computing power. A Princeton-led team steps in, raising utilization to 71%.
Are all users of NVIDIA's Blackwell B200 wasting their money?
A joint team including Princeton University found that, because existing software is poorly matched to the hardware, this GPU leaves a staggering 60% of its computing resources idle.
What to do about the wasted computing power? FlashAttention-4 has the answer.
This attention algorithm, specifically designed for GPUs with the Blackwell architecture, has boosted the utilization rate from the industry's typical 20%-30% to 71%.
FlashAttention-4 was developed by a team led by Tri Dao, in collaboration with Meta, Together AI, and other teams.
Well, NVIDIA itself is also involved...
The Blackwell B200 Can't Reach Its Full Potential
As a new-generation data center GPU, the NVIDIA Blackwell B200's tensor core computing power reaches 2.25 PFLOPS, twice that of the previous-generation Hopper H100.
Theoretically, it should enable a significant leap in the speed of attention calculations.
But reality is a bit different...
This GPU has a serious imbalance in performance.
While the core computing power has increased significantly, the key supporting computing units have remained stagnant.
Specifically, the throughput of the MUFU unit, which handles exponential operations, is exactly the same as on the Hopper architecture, with no improvement;
shared-memory bandwidth has likewise stayed flat, never upgraded in step with the tensor cores.
This asymmetry in hardware design has directly led to a reversal of the performance bottleneck.
In the attention workloads at the heart of large models, matrix multiplication, the original bottleneck, now takes far less time than the auxiliary steps: shared-memory reads and writes plus exponential operations take 25%-60% longer than the matrix multiplications themselves.
The Tensor Core, with doubled computing power, is often in a waiting state, and a large amount of computing resources are left idle.
As a result, the many developers who have spent a fortune deploying B200 GPUs are leaving over 60% of that compute idle because the core compute units and their supporting units are out of step.
Doubled computing power?
No! It's more like being unable to use its full strength...
FlashAttention-4 Solves the Bottleneck with Three Strategies
To address the imbalance issue of the Blackwell GPU, FlashAttention-4 has developed three optimization strategies.
The first strategy is to tackle the problems of exponential operations and memory read-write through multiple approaches.
On the one hand, the team emulates the exponential function in software: using a polynomial approximation, the high-throughput FMA units take over the exponential operations originally handled by the MUFU unit, significantly raising exponential throughput;
At the same time, by combining hardware computing and software simulation, the calculation accuracy is ensured while increasing the speed.
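The polynomial-approximation idea can be sketched in plain Python (a toy illustration with assumed degree-5 Taylor coefficients, not FA4's actual CuTe-DSL kernel): reduce the input range, then evaluate a low-degree polynomial with a chain of multiply-adds, which on a GPU maps onto the abundant FMA pipes instead of the single MUFU unit.

```python
import math

LN2 = math.log(2.0)
# Degree-5 Taylor coefficients of e^r, highest degree first;
# accurate to roughly 1e-6 on the reduced range |r| <= ln(2)/2
COEFFS = [1.0 / 120, 1.0 / 24, 1.0 / 6, 0.5, 1.0, 1.0]

def exp_fma(x: float) -> float:
    # Range reduction: x = n*ln(2) + r, so e^x = 2^n * e^r
    n = round(x / LN2)
    r = x - n * LN2
    # Horner's scheme: every step is one multiply-add, so the whole
    # polynomial runs on FMA hardware rather than the MUFU unit
    acc = COEFFS[0]
    for c in COEFFS[1:]:
        acc = acc * r + c
    return math.ldexp(acc, n)  # scale the result by 2^n
```

A real kernel would tune the polynomial degree against the accuracy target, which is the hardware/software accuracy trade-off the paragraph above describes.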
On the other hand, the team has introduced the conditional softmax rescaling strategy. It only performs the softmax scaling operation when necessary, directly skipping a large number of useless calculation steps and reducing the amount of non-matrix multiplication operations.
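The rescaling trick can be illustrated with a scalar version of online softmax (a simplified sketch for one attention row; the real kernel operates on tiles): the running accumulators only need rescaling when a new block raises the running maximum, and otherwise the scale step is skipped entirely.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted sum, rescaling accumulators only
    when the running max changes (the conditional-rescaling idea)."""
    m = float("-inf")  # running max
    d = 0.0            # running softmax denominator
    acc = 0.0          # running weighted sum
    for s, v in zip(scores, values):
        if s > m:
            # New max: rescale what we've accumulated so far
            scale = math.exp(m - s)  # exp(-inf) == 0.0 on the first block
            d *= scale
            acc *= scale
            m = s
        # else: skip the rescale entirely -- no exp, no extra multiplies
        p = math.exp(s - m)
        d += p
        acc += p * v
    return acc / d
```

The result matches a standard two-pass softmax-weighted sum, but blocks that do not raise the maximum avoid the rescaling work entirely.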
In addition, the team makes full use of the Blackwell architecture's 2-CTA MMA mode, in which two thread blocks (CTAs) cooperate on one matrix operation, each loading only half of the operand data.
This directly cuts the read-write volume of shared memory in half and also reduces subsequent atomic operations, fundamentally alleviating the bandwidth pressure on the shared memory.
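A back-of-the-envelope model shows why splitting operand loads halves per-CTA shared-memory traffic (the tile sizes below are illustrative assumptions, not FA4's actual configuration):

```python
def smem_operand_bytes_per_cta(tile_m, tile_n, head_dim,
                               bytes_per_elem=2, ctas=1):
    """Bytes of Q and K operand data one CTA must stage for a single
    Q*K^T tile when the load is split across `ctas` cooperating
    thread blocks (2-CTA MMA corresponds to ctas=2)."""
    total = (tile_m + tile_n) * head_dim * bytes_per_elem
    return total // ctas

# Assumed 128x128 tiles, head_dim=128, FP16 (2-byte) operands:
solo = smem_operand_bytes_per_cta(128, 128, 128, ctas=1)  # 65536 bytes
pair = smem_operand_bytes_per_cta(128, 128, 128, ctas=2)  # 32768 bytes
```

Whatever the exact tile shape, the per-CTA traffic scales as 1/ctas, which is the halving the text describes.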
The second strategy is to restructure the computing pipeline to maximize parallel computing power.
FlashAttention-4 is deeply adapted to the Blackwell architecture's fully asynchronous MMA operations and its newly added tensor memory (TMEM), and redesigns the forward and backward pipelines for attention computation.
This allows the two core steps of softmax calculation and matrix multiplication to be fully overlapped in computation.
When the hardware's tensor core is processing one matrix block, another part of the hardware resources can simultaneously perform softmax calculations on another data block, avoiding idle hardware computing power.
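The schedule can be modeled sequentially in Python (with hypothetical `mma` and `softmax` callables standing in for the two hardware stages; the real kernel issues these asynchronously so the stages genuinely run at the same time):

```python
def pipelined_attention_blocks(blocks, mma, softmax):
    """Two-stage software pipeline: while the tensor-core stage (mma)
    works on block i, the softmax stage consumes the scores produced
    for block i-1, so neither stage waits for the other."""
    scores_prev = None
    outputs = []
    for blk in blocks:
        scores = mma(blk)  # tensor cores: scores for block i
        if scores_prev is not None:
            # Overlapped in hardware: softmax on the previous block
            outputs.append(softmax(scores_prev))
        scores_prev = scores
    if scores_prev is not None:
        outputs.append(softmax(scores_prev))  # drain the pipeline
    return outputs
```

The staggered loop is the classic software-pipelining pattern: the dependency between a block's MMA and its softmax is preserved, but adjacent blocks' stages overlap.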
The third strategy is to consider hardware iteration and reserve optimization space for the next generation of GPUs.
The R&D team has also taken the Blackwell architecture's upgrade path into account: on the B300/GB300 GPUs, the throughput of the exponential-operation unit has already doubled to 32 ops/clock/SM.
In response to this change, the team has clearly stated that the current software simulation exponential operation scheme of FlashAttention-4 will be re-evaluated based on the actual performance on the next-generation hardware to ensure that the algorithm can continue to adapt to the iterative upgrade of the hardware.
Say Goodbye to C++, Compilation Speed Soars 30 Times
In addition to in-depth optimization at the algorithm level, FlashAttention-4 has also brought changes at the development level.
Unlike FlashAttention-3, which was built on C++ templates, FlashAttention-4 is written entirely in CuTe-DSL, a Python-based domain-specific language framework, with zero C++ code.
This design has led to a significant leap in compilation efficiency.
The compilation time of the forward-pass kernel has dropped from 55 seconds in FlashAttention-3 to 2.5 seconds, a 22-fold speedup;
the backward pass has dropped from 45 seconds to 1.4 seconds, a roughly 32-fold speedup, with the overall compilation speedup reaching roughly 30 times.
Actual tests on the B200 GPU show forward-pass throughput reaching up to 1613 TFLOPS, or 71% of the theoretical peak.
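The reported ratios check out arithmetically (2.25 PFLOPS = 2,250 TFLOPS):

```python
fwd_speedup = 55 / 2.5     # forward compile: 55 s -> 2.5 s, 22x
bwd_speedup = 45 / 1.4     # backward compile: 45 s -> 1.4 s, ~32x
utilization = 1613 / 2250  # achieved TFLOPS / peak TFLOPS, ~0.71
```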
Compared with mainstream computing frameworks, the advantages of FlashAttention-4 are also quite obvious.
It runs 1.1x-1.3x as fast as NVIDIA's official cuDNN 9.13 and 2.1x-2.7x as fast as the widely used Triton framework.
Moreover, in core scenarios such as long sequences and causal masking in large model training and inference, the performance advantages are even more prominent.
One More Thing
The paper also points out that, starting with version 9.13, cuDNN has begun absorbing FA4's core techniques in the other direction.
It seems that NVIDIA itself can't help but copy the homework (doge).
Paper link: https://arxiv.org/abs/2603.05451
Reference link: https://x.com/alex_prompter/status/2033885345935462853?s=20
This article is from the WeChat official account “QbitAI”. Author: Wen Le. Republished by 36Kr with permission.