
ByteDance Tries to Break the "Impossible Triangle" of Seedance 2.0

硅基星芒 · 2026-03-27 12:19
The technological moat is built on a cross-domain reconstruction of the lowest-level architecture.

After Seedance 2.0 rose to the top alongside Sora, the AI video generation industry entered a phase where excitement and anxiety coexist.

Yet however powerful Seedance 2.0 is, it still cannot break the field's "impossible triangle":

Model size, generation length, and inference speed can hardly ever be achieved at the same time.

If you want film-level quality like Seedance 2.0, you need a multimodal model with hundreds of billions of parameters, built by a company the size of ByteDance. The price: a maximum video length of 15 seconds, high cost per generation, and a wait of several minutes.

If you want videos quickly, you must sacrifice parameters and use a small model of roughly 1 billion parameters. The price: blurry images, missing detail, and collapse beyond about 10 seconds.

As long as high-quality long videos cannot be generated in real time, AI video generation will never reach the level of film.

Yet ByteDance, creator of the epoch-making Seedance 2.0, has far greater ambitions.

Helios, a large model jointly developed by Peking University and ByteDance, cuts into the "impossible triangle" like a sharp knife.

Helios is the first 14-billion-parameter large model to run at 19.53 frames per second (FPS) on a single NVIDIA H100 GPU.

That parameter count is hardly lightweight, but next to the flagship models of the big AI labs it is practically a "mini model".

Lean as it looks, it matches the image quality of today's strongest models while continuously generating minutes-long videos in real time.

01 The Nightmarish "Long-Term Drift"

Anyone who has used Jimeng, Keling, or Sora has probably wondered: why can videos only be 10 or 15 seconds long? Even the richest user cannot break this limit.

In fact, this is not merely a compute problem. Even if the maximum generation time were raised, the result would likely disappoint:

AI-generated videos are often stunning for the first few seconds, but image quality degrades rapidly after that: the protagonist's facial features shift, body structure morphs, the background warps, and motion stops obeying physical logic.

This is the phenomenon of "drift".

The process of AI video generation resembles a language model's question answering. A language model predicts its next answer from memory and context; a multimodal model likewise has to "draw the future based on the past".

At a fixed FPS, a longer video means more frames, so the amount of per-frame information the AI must hold in context balloons.

And in this process, even a tiny error in earlier frames is accumulated and magnified across all subsequent generations, eventually leading to total collapse.
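The compounding effect can be made concrete with a toy calculation (the 1% per-frame error and the 24 fps figure below are illustrative assumptions, not numbers from Helios):

```python
def compounded_error(eps=0.01, frames=240):
    # if each generated frame adds a 1% relative error on top of the last,
    # a 10-second clip at 24 fps (240 frames) multiplies the initial error
    # by (1 + eps) ** frames
    return (1 + eps) ** frames

# (1.01) ** 240 is roughly an 11x amplification of the initial error
```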

To solve this, early academic work tried the most intuitive route: during training, let the AI generate long sequences directly so that it learns to suppress error amplification. But this approach not only struggles with under- and over-fitting; its compute cost is also unbearable. Models with hundreds of billions of parameters cannot use it at all, and about 1 billion parameters is already the practical limit.

So the Helios research team realized the fix had to come from inside the generation process itself.

First, they noticed that long-video collapse is usually accompanied by runaway image brightness and color, a problem that almost never appears in the first few seconds of a video.

So the mechanism of "First Frame Anchor" was born.

The team designates the video's first frame as the "master anchor" of the entire generation process. Throughout the long generation that follows, the AI must keep attending to this first frame, which defines the global appearance distribution.

No matter where the prompt takes later frames, the color palette and character identity fixed by the first frame can pull the AI back on track at any moment and prevent a sudden style shift.
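A minimal sketch of how such an anchor could work at the attention level (the function and window size are illustrative assumptions, not Helios's published implementation): when generating frame t, the visible history is always the first frame plus a short recent window.

```python
def anchored_context(t, window=3):
    # indices of history frames visible when generating frame t:
    # frame 0 is the permanent "master anchor", plus the last few frames
    recent = list(range(max(1, t - window), t))
    return [0] + recent
```

Generating frame 50 would then attend to frames [0, 47, 48, 49]: however long the video grows, frame 0 keeps constraining global color and identity.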

But even so, the occurrence of errors is inevitable. So the AI has to learn how to deal with such "imperfections".

Helios uses a special technique during training: Frame Aware Corrupt.

Simply put, random errors are injected into the historical frames the AI conditions on, so that training weans it off absolute dependence on history and teaches it to fall back on general knowledge when the history is unreliable.

After this training, Helios tolerates errors remarkably well, and even long videos no longer collapse so easily.
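The corruption idea can be sketched as follows (a conceptual toy, assuming frames are flat lists of floats; the corruption probability and noise scale are invented for illustration):

```python
import random

def corrupt_history(frames, p=0.3, sigma=0.1, seed=0):
    # randomly perturb some history frames with Gaussian noise so the
    # model cannot rely on the history being perfect
    rng = random.Random(seed)
    out = []
    for frame in frames:
        if rng.random() < p:
            frame = [x + rng.gauss(0.0, sigma) for x in frame]
        out.append(frame)
    return out
```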

The last problems to solve are positional drift and repeated motion.

The position encoding used during video generation is absolute. Once a generated video exceeds the maximum length seen in training, interference in the attention mechanism makes the image jump back to earlier positions.

Helios converts the position encoding into a relative reference: instead of asking "which frame number is this?", it asks "how far is this from the last frames?". This eliminates periodic motion repetition at the root.
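The effect of switching from absolute to relative positions can be sketched like this (an illustrative function; Helios's actual encoding is not published in this article):

```python
def relative_positions(t, history_len=4, max_offset=16):
    # express each visible history frame as "how many frames ago",
    # clipped to the maximum offset seen during training, instead of
    # an absolute frame index that can run out of range
    return [min(t - i, max_offset) for i in range(max(0, t - history_len), t)]
```

Because frame 100 and frame 10000 see the identical offset pattern [4, 3, 2, 1], nothing pulls the model "back to the start" once the video outgrows its training length.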

02 The "Magic" of Computing Power

The image-quality problem was solved at the software level, but the harder challenge sits at the hardware level:

14 billion parameters is no small number. How do you make it run in real time at 19.5 FPS on a single GPU?

At its core, AI video generation is no different from language modeling. The widely used Diffusion Transformer (DiT) architecture likewise relies on self-attention to capture a video's spatial detail (the content of a single frame) and temporal coherence (the motion between frames).

But because a frame occupies far more dimensions in the embedding space than text does, each video frame demands far more compute than a language-model exchange. Extending a video by just a few seconds makes compute and memory requirements balloon, which is why generation is usually spread across a GPU cluster.

The compute-driven anxiety over image quality and video length is plain to see in Sora's shutdown and in the "dumbing down" of Seedance 2.0 after launch: commercially, this path is not viable.

Helios chose a different road. Its lower-level reconstruction scheme, called "Deep Compression Flow", squeezes nearly all the potential out of the GPU, from token reduction and step distillation to graphics memory management, pulling off the feat like a magic trick.

1. Token Perspective: Maximum Compression of the Space-Time Dimensions

The first problem is memory overflow caused by a long video context. Helios's answer is asymmetric compression of the space-time dimensions.

As noted above, AI video generation "draws the future based on the past", so how much "history" to keep at hand is a key question.

For humans, memory works a bit like a stack data structure: what happened a second ago is vivid, while what happened ten minutes ago is already blurry.

Helios adopts this multi-level memory hierarchy wholesale from biology and splits the historical frames the AI looks back at into three tiers: short-term, medium-term, and long-term.

Frames from just moments ago are kept at full resolution; frames from long ago are heavily compressed, retaining only the coarsest global layout.

This simple idea keeps Helios's token usage at a low, constant level even when looking far back into history. The memory footprint of historical information is cut to one-eighth of the original, which eliminates the otherwise unsolvable "out of memory" problem on a single card.
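The three-tier compression can be sketched as follows (frames modeled as 2D grids; the tier sizes and downsampling factors are illustrative assumptions, not the published ratios):

```python
def compress_history(frames, short=4, mid=8):
    # keep recent frames at full resolution, older frames progressively
    # downsampled, so the total token count stays roughly constant
    def downsample(grid, step):
        return [row[::step] for row in grid[::step]]
    out, n = [], len(frames)
    for i, frame in enumerate(frames):
        age = n - i
        if age <= short:
            out.append(frame)                  # short-term: full resolution
        elif age <= short + mid:
            out.append(downsample(frame, 2))   # medium-term: 1/4 the tokens
        else:
            out.append(downsample(frame, 4))   # long-term: ~1/16 the tokens
    return out
```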

Nor does Helios generate each frame at full resolution from the start; it uses a bottom-up, coarse-to-fine strategy.

Like a painter, it first sketches the overall color and layout at low resolution, then progressively upscales and refines details such as edges and textures.

Early denoising steps fix the macroscopic structure; late steps polish the details. This division of labor cuts compute by more than half.
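The painter analogy maps to a resolution schedule something like the following (a sketch with invented numbers; the real schedule is not given in the article):

```python
def denoise_schedule(steps=4, full_res=64, min_res=8):
    # early denoising steps run at low resolution (global layout),
    # later steps at progressively higher resolution (fine detail)
    return [max(full_res >> (steps - 1 - s), min_res) for s in range(steps)]
```

denoise_schedule() yields [8, 16, 32, 64]: three of the four steps operate on far fewer pixels than a full-resolution pass, which is where the compute savings come from.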

2. Step Perspective: Adversarial Layer Distillation

AI video generation is slow because a traditional diffusion model needs roughly 50 rounds of iterative denoising.
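The latency arithmetic is straightforward (the per-step cost below is an invented illustrative figure):

```python
def denoise_latency_ms(steps, cost_per_step_ms=40):
    # each denoising step is a full forward pass through the model,
    # so latency scales linearly with the step count
    return steps * cost_per_step_ms
```

At an assumed 40 ms per pass, 50 steps cost 2000 ms per chunk while 3 steps cost 120 ms — linear in the step count, hence the large payoff from step distillation.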

Previous video generation models were trained via simulated rollout and inference, learning step by step how to keep generating without forgetting historical frames.

After generating a video, the model not only had to be scored by a reward model but also had to roll out several simulated long-video continuations.

Unsurprisingly, this made training extremely slow and blew out graphics memory.

Helios instead uses "Pure Teacher Forcing": the model never simulates future videos; it is simply handed a large corpus of real, continuous video clips as the sole reference.

At each training step the model focuses only on "drawing the next small chunk perfectly" given real historical frames. Dropping the complex simulation boosts training throughput dramatically.
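The teacher-forcing data preparation can be sketched like this (toy code; the chunk size and frame representation are illustrative assumptions):

```python
def teacher_forcing_pairs(video, chunk=2):
    # split one real video into (history, target) training pairs: the
    # model always conditions on ground-truth frames and never has to
    # roll out its own simulated futures
    pairs = []
    for t in range(chunk, len(video), chunk):
        pairs.append((video[:t], video[t:t + chunk]))
    return pairs
```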

The denoising process also admits a distillation mechanism similar to the one used for language models.

But knowledge distillation has a fatal flaw: the student's ceiling cannot exceed the teacher's, while its floor can sink lower. Once those flaws are magnified, generated quality naturally drops.

So Helios adds adversarial training against real videos: if the student's denoised output merely imitates the teacher without genuine physical detail, it is rejected and redone.

This strict regimen manages to compress a denoising process that once needed 50 steps down to fewer than 3, with image fidelity intact.
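The reject-and-redo loop can be sketched generically (the callables and threshold are placeholders, not Helios's actual training loss):

```python
def adversarial_gate(generate, critic, threshold=0.5, max_retries=3):
    # keep regenerating the student's few-step output until the critic
    # (trained on real videos) judges it realistic enough
    for attempt in range(1, max_retries + 1):
        sample = generate()
        if critic(sample) >= threshold:
            return sample, attempt
    return sample, max_retries
```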

3. Graphics Memory Perspective: Reconstructing the Scheduling Mechanism

GPU memory is fixed, yet the model contains several sub-models that must be computed serially.

So the team built a scheduling mechanism that uses a dedicated data channel to keep only the currently executing sub-model on the GPU. The moment a sub-model finishes its computation and goes idle, its parameters are offloaded to the CPU to await the next call.
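The scheduling idea can be sketched as a toy state machine (real systems would overlap transfers with compute; the sub-model names here are illustrative, not from the article):

```python
class OffloadScheduler:
    """Keep only the currently executing sub-model on the GPU; park the
    rest in CPU memory. A conceptual sketch of the scheduling idea."""

    def __init__(self, submodels):
        self.location = {name: "cpu" for name in submodels}

    def run(self, order):
        trace = []
        for name in order:
            # load the next sub-model, evict everything else to the CPU
            for other in self.location:
                self.location[other] = "gpu" if other == name else "cpu"
            trace.append((name, dict(self.location)))
        return trace
```

Running ["encoder", "dit", "decoder"] in sequence keeps peak GPU residency at one sub-model at a time, at the price of transfer latency that the dedicated data channel is meant to hide.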