
DeepSeek V3.1 is released. What's even more intriguing is UE8M0 FP8.

Silicon Star People Pro, 2025-08-22 11:17
Computing power is power.

DeepSeek has launched version V3.1. A quick rundown of the highlights:

Hybrid inference architecture: one model supports both a thinking mode and a non-thinking mode.

Higher thinking efficiency: compared with DeepSeek-R1-0528, DeepSeek-V3.1-Think reaches answers in less time.

Stronger agent capability: through post-training optimization, the new model performs significantly better on tool use and agent tasks.

What's even more intriguing is that DeepSeek also emphasized in the top-pinned message: UE8M0 FP8 is designed for the upcoming next-generation domestic chips.

In the current context, this statement is thought-provoking. After all, not long ago the relevant authorities summoned NVIDIA to explain the security risks of the H20 chip.

That's why a few technical terms suddenly deserve close attention: What exactly is parameter precision? And why does the chip dictate its format?

Behind these changes may be a sign that the domestic AI industry is entering a new stage of software-hardware collaboration.

1

The Invisible Decimal Point Determines the Fate of the Large Model

In deep learning, parameters are the "weights" on the connections between the model's neurons, much like synapses in a brain. During training they must be continuously updated, stored, and computed. Precision is the number of binary digits used to record each of these parameters.

Before introducing FP8, we need to go back to the most basic question in computers: How does a machine store numbers?

The simplest way is called an integer (int). Like beads on an abacus, integers represent exact values such as 1, 2, 3, 4. But they can't represent numbers like pi (3.14), and they struggle with the extremely large or small values common in scientific computing.

So scientists invented floating-point numbers (the FP in FP8). As the name suggests, the position of the decimal point "floats": the same scheme can write an everyday number like 3.14159 or an astronomical one like 6.02×10²³. Essentially, a floating-point number splits a value into three parts: a sign bit, an exponent, and a mantissa. The sign bit records whether the number is positive or negative, the exponent fixes where the decimal point sits, and the mantissa carries the precision.
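As a concrete illustration, here is a minimal Python sketch that pulls those three fields out of a 32-bit float. The 1/8/23 bit split is the standard IEEE 754 single-precision layout; the helper function name is just for this example.

```python
# A minimal sketch of how a 32-bit float is stored: 1 sign bit,
# 8 exponent bits, and 23 mantissa (fraction) bits.
import struct

def fp32_fields(x: float):
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF         # 23 bits of fraction
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(3.14159)
print(sign, exp - 127, hex(man))   # sign 0, unbiased exponent 1, fraction bits
```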

Floating-point numbers can represent virtually any value. The cost is the bit budget: given the same memory, more bits per number means a finer representation, fewer bits means a coarser one.

For a long time, FP32 (32-bit floating point) was the gold standard: high precision, wide range, and the near-universal choice for scientific computing, image processing, and AI. But once the parameter count of large models reaches hundreds of billions or even trillions, FP32 becomes unwieldy. Every weight takes 32 bits to store, GPU memory simply runs out, and training times stretch.
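The arithmetic behind "the memory is simply not enough" is easy to check. The sketch below counts only the bytes needed to store the weights themselves, with 100 billion parameters as an illustrative size; in real training, optimizer states, gradients, and activations multiply the requirement several times over.

```python
# Memory needed just to store the weights of a 100-billion-parameter model
# at different precisions (weights only; optimizer states, gradients and
# activations are ignored and would add several times more in practice).
params = 100e9
for name, bits in [("FP32", 32), ("FP16", 16), ("FP8", 8)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: {gigabytes:.0f} GB")
# FP32: 400 GB, FP16: 200 GB, FP8: 100 GB
```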

So the industry began lowering the precision, first to FP16 (16-bit floating point) and then to FP8 (8-bit floating point). To use a rough analogy, it's like compressing a 4K photo into a 480p thumbnail: some detail is inevitably lost, but you can store more pictures and move them around faster.
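To see what that "compression" does to an individual number, here is a rough sketch. The FP8 rounding below is a hand-rolled approximation of E4M3 behaviour for normal values (no subnormals, NaNs, or hardware-accurate rounding modes), not an actual FP8 code path.

```python
# The same number stored at lower precision drifts away from its FP32 value.
import math
import numpy as np

def quantize_e4m3(x: float) -> float:
    # Keep 1 implicit + 3 explicit mantissa bits, clamp to E4M3's max (+/-448).
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16             # round the mantissa to 3 explicit bits
    return max(-448.0, min(448.0, math.ldexp(m, e)))

pi = 3.14159265
print("FP32:", np.float32(pi))                       # 3.1415927 is the nearest FP32 value
print("FP16:", np.float16(pi))                       # 3.140625 is the nearest FP16 value
print("FP8 (E4M3, simulated):", quantize_e4m3(pi))   # 3.25
```

Pi survives FP16 almost intact but lands on 3.25 in simulated E4M3: the coarser the format, the bigger the gap.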

A chart in NVIDIA's technical blog makes the point intuitively: on the H100, FP8 is far faster than FP16.

When training large models, the biggest bottleneck is not the algorithm but compute and memory. NVIDIA's official blog states that FP8 can double throughput and halve memory usage without significantly sacrificing model quality, a very attractive trade when training GPT-scale models.

In other words, in a large-model field that prizes scale over precision, FP8 has become an almost inevitable choice.

NVIDIA Technical Blog: https://developer.nvidia.com/zh-cn/blog/fp8-precision-performance/

2

Who Sets the Rules, Who Controls the Computing Power

So, what is FP8? And what is the "UE8M0 FP8" mentioned by DeepSeek? Why does it need to be adapted to domestic chips?

First of all, FP8 itself is not an entirely neutral international standard. On the surface, NVIDIA promoted FP8 standardization together with Intel and Arm, introducing two formats, E4M3 and E5M2, which favor precision and numerical range respectively. It looks like an open industry standardization effort.
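The difference between the two formats is easy to state in numbers. The maxima below follow the published E4M3/E5M2 definitions; the relative step size is simply two to the power of minus the mantissa width.

```python
# E4M3 vs E5M2: the same 8 bits split differently between exponent (range)
# and mantissa (precision). E4M3 tops out around 448 with finer steps;
# E5M2 reaches about 57344 but with coarser steps.
formats = {
    # name: (exponent_bits, mantissa_bits, approx_max_normal)
    "E4M3": (4, 3, 448.0),
    "E5M2": (5, 2, 57344.0),
}
for name, (e_bits, m_bits, max_val) in formats.items():
    rel_step = 2 ** -m_bits   # rough spacing between neighbouring values, relative to magnitude
    print(f"{name}: {e_bits} exponent bits, {m_bits} mantissa bits, "
          f"max ~{max_val:g}, relative step ~{rel_step:.2f}")
```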

In actual implementation, however, NVIDIA layered many "optimizations" onto its own GPUs. Dynamic scaling strategies such as per-tensor and per-block scaling work around FP8's narrow dynamic range and its tendency to overflow. The Tensor Cores ship with instruction-level FP8 support, letting FP8 fully exploit the H100's compute. These details are not written into the shared standard; they are deeply integrated into NVIDIA's hardware and software stack.
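A stripped-down version of the per-tensor scaling idea looks like this. The function names are this example's own, and the cast to FP8 is faked with float16; the point is only the bookkeeping: pick one scale so the whole tensor fits the format's range, and keep the scale around to undo it later.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # approximate largest normal value in E4M3

def quantize_per_tensor(x: np.ndarray):
    # One scale for the whole tensor: map its largest magnitude onto FP8's max.
    scale = FP8_E4M3_MAX / np.max(np.abs(x))
    x_scaled = x * scale                         # a real kernel would cast this to FP8
    return x_scaled.astype(np.float16), scale    # float16 stands in for the FP8 cast

def dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    return x_q.astype(np.float32) / scale

weights = (np.random.randn(4, 4) * 0.02).astype(np.float32)  # typical small weights
q, s = quantize_per_tensor(weights)
print("max abs error:", np.max(np.abs(dequantize(q, s) - weights)))
```

Per-block scaling applies the same trick to smaller slices of the tensor, which is where the "microscaling" formats discussed next come from.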

NVIDIA's latest Blackwell architecture natively supports a new family of "Microscaling" (MX) formats, including MXFP8 (8-bit floating point), MXFP6 (6-bit), and MXFP4 (4-bit). Researchers have run large-scale validation on high-quality datasets: for an 8-billion-parameter model, training with the MXFP8-E4M3 format and a carefully designed numerical conversion recipe produces results nearly on par with traditional BF16 (bfloat16). Simply put, on Blackwell, pre-training with MXFP8 works best.
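The microscaling idea can be sketched in a few lines: every block of 32 values shares one scale, and that scale is restricted to a power of two, which is exactly the kind of value an exponent-only format such as E8M0/UE8M0 can store. The block size of 32 follows the public MX descriptions; the rounding choice (ceil) is this sketch's own simplification to keep every scaled value inside the FP8 range.

```python
import numpy as np

BLOCK = 32               # MX formats share one scale per 32-element block
FP8_E4M3_MAX = 448.0

def mx_power_of_two_scales(x: np.ndarray) -> np.ndarray:
    # One power-of-two scale per block, chosen so each block's largest
    # magnitude fits under the FP8 maximum after division by the scale.
    blocks = x.reshape(-1, BLOCK)
    amax = np.max(np.abs(blocks), axis=1)
    exponents = np.ceil(np.log2(amax / FP8_E4M3_MAX))
    return 2.0 ** exponents

x = np.random.randn(4 * BLOCK).astype(np.float32)
scales = mx_power_of_two_scales(x)
scaled = x.reshape(-1, BLOCK) / scales[:, None]   # these values would be cast to FP8
print(scales)                                     # four power-of-two scales, one per block
print(np.max(np.abs(scaled), axis=1))             # every block now fits within +/-448
```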

Reference paper: Recipes for Pre-training LLMs with MXFP8, https://arxiv.org/pdf/2506.08027

Back to the UE8M0 FP8 that DeepSeek highlighted in the comments under its official WeChat post: it is not NVIDIA's official FP8 standard but a variant. As the name indicates, it is unsigned, devotes all eight bits to the exponent, and keeps zero mantissa bits, an extreme range-first strategy that all but gives up fractional precision.

It's like using a tape measure with only coarse markings but long enough to stretch from a room to a playground: you can't see millimeter-level detail, but the measurement never runs off the end of the tape.
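What that trade-off means in code terms: a UE8M0 byte can only express powers of two. The bias of 127 and the reserved top code below mirror the E8M0 scale format described in the public Microscaling material, so treat the exact convention as an assumption of this sketch.

```python
def decode_ue8m0(byte: int) -> float:
    # All 8 bits are exponent, no sign, no mantissa: only powers of two exist.
    assert 0 <= byte <= 254, "the top code (255) is typically reserved"
    return 2.0 ** (byte - 127)

print(decode_ue8m0(0))     # 2**-127, vanishingly small
print(decode_ue8m0(127))   # 1.0
print(decode_ue8m0(254))   # 2**127, astronomically large
# Neighbouring codes sit a factor of two apart: 1.0 and 2.0 are representable,
# 1.5 is not. Enormous range, essentially no fractional precision.
```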

Why make such a trade-off? Because domestic GPUs are not fully compatible with NVIDIA's FP8 scheme at the level of circuit and instruction-set design. As noted above, NVIDIA has its own "optimizations," and domestic GPUs lack them. Copy the recipe directly and the usual result is numerical instability, exploding gradients, and training that fails to converge.

Add the recent reports that DeepSeek R2's release was postponed because domestic chips underperformed, and it makes sense for DeepSeek to say something now. On the model side, DeepSeek is compromising: adopting the range-first UE8M0 format to match the hardware logic of domestic chips, a middle-ground solution that lets them run stably.

This is a kind of mutual accommodation between software and hardware. Model makers are willing to trade some fine-grained precision for stable operation on domestic chips, and chip makers can use the collaboration to gradually build an FP8 ecosystem of their own.

3

The FP8 Alliance of Domestic GPUs

Of course, another question that arises is, on which domestic chips is DeepSeek training?

(This is not investment advice. It's just a rumor to fill the word count.)

For example, the Magic Stone Xi Yun C600 chip officially launched in 2025. The company states plainly that it natively supports FP8 precision and uses a multi-precision mixed computing architecture: it can run traditional FP32/FP16 workloads and also accelerate large-model training efficiently with FP8.

The C600 actually completed tape-out in October 2024 and is currently in the small-batch mass-production stage. Meanwhile, the next-generation C700 series has also been planned and is expected to enter tape-out testing in Q2 2026.

Beyond Magic Stone, Enflame Technology also launched its latest L600 chip in 2025, a chip that took two and a half years to develop. Its biggest highlight is an integrated training-and-inference architecture that can handle large-model training and also be deployed directly for inference. More importantly, the L600 natively supports low-precision FP8, which lines up exactly with DeepSeek's precision strategy.

UE8M0 is just an obscure precision parameter that might merit half a line in a paper. But today it reads like a signal: domestic chip makers and large-model companies are starting to sit at the same table to work out how to move forward together. Large-model companies are no longer simply following NVIDIA's computing-power logic; they are trying to align with domestic hardware, even if the process is not elegant.

This article is from the WeChat public account "Silicon Star People Pro", author: Dong Daoli. Republished by 36Kr with permission.