ByteDance attempts to break the "impossible trinity" behind Seedance 2.0
After Seedance 2.0 dethroned Sora, the AI video generation track entered a stage where frenzy coexisted with anxiety.
Even a powerful tool like Seedance 2.0 still can't break the "impossible trinity" in this field:
It's always difficult to achieve model scale, generation time, and inference speed simultaneously.
If you want movie-grade image quality like Seedance 2.0, you need a multimodal model with tens of billions of parameters built by a tech giant like ByteDance. The cost: a maximum video length of 15 seconds, high per-generation fees, and waits of more than ten minutes.
If you want videos fast, you have to compromise on parameter count and use a small model of about 1B parameters. The price: blurry pictures, lost detail, and video that starts to break down past the 10-second mark.
Without high-quality, real-time, long-form generation, AI video will never reach the level of film.
However, ByteDance, the creator of the masterpiece Seedance 2.0, has far greater ambitions.
The Helios large model, jointly launched by institutions such as Peking University and ByteDance, is trying to cut through this "impossible trinity" like a sharp blade.
Helios is the first 14B-parameter model able to run at 19.53 frames per second (FPS) on a single NVIDIA H100 GPU.
That parameter count is hardly lightweight, yet next to the flagship large language models of the major AI companies it counts as a "mini" model.
Thin as it may seem on parameters, its image quality rivals today's top-tier models, and it can generate continuous, minutes-long video at near-real-time speed.
01 The Nightmare of "Long-Range Drift"
Anyone who has used Dreaming AI, KeLing, or Sora has probably wondered: why is the maximum length of a generated video only 10 or 15 seconds? No amount of money lifts that cap.
In fact, this is not just a computing-power problem. Even if the maximum duration were forcibly raised, the result would rarely meet expectations:
AI-generated videos often open with stunning frames, then degrade rapidly as time passes: the protagonist's face stops staying consistent, limbs start to mutate, the background gradually distorts, and motion stops obeying physical logic.
This is the "drift" phenomenon.
AI video generation actually resembles the question-answering process of a large language model. An LLM produces its next response from memory and context; a multimodal model likewise "draws the future based on the past".
At a fixed FPS, a longer video means more frames, so the context the model must attend to keeps growing, and the cost of self-attention grows roughly quadratically with that context length.
In this process, even a tiny flaw in an earlier frame is accumulated and magnified by later generations, until the picture breaks down completely.
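The compounding of small per-frame errors can be sketched with a toy simulation (illustrative numbers only, not a real generation model):

```python
import random

def simulate_drift(n_frames, per_step_error=0.01, seed=0):
    """Toy model of autoregressive generation: each new frame is predicted
    from the previous one, and every prediction adds a small unavoidable
    error. Because later frames inherit all earlier errors, the deviation
    from the 'true' video only grows over time."""
    rng = random.Random(seed)
    drift = [0.0]  # accumulated deviation from the ideal video
    for _ in range(n_frames - 1):
        drift.append(drift[-1] + abs(rng.gauss(0, per_step_error)))
    return drift

drift = simulate_drift(24 * 60)      # one minute of video at 24 FPS
assert drift[-1] > 10 * drift[23]    # far worse at the end than after 1 s
```

Nothing corrects the errors in this toy loop, so deviation is monotonically increasing, which is exactly the "drift" the article describes.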
To solve this, the early academic community tried the most intuitive fix: make the AI generate long segments in one shot during training so flaws cannot snowball. But this rollout-style reinforcement-learning approach is prone to both under-fitting and over-fitting, and its compute cost is unaffordable: models with tens of billions of parameters simply cannot use it, and roughly 1B parameters is the practical ceiling.
The Helios research team therefore realized they had to look for the root cause inside the video-generation process itself.
They first noticed that the breakdown of long-form videos is often accompanied by an overall loss of control over brightness and color, yet this problem rarely appears in the first few seconds of a video.
Thus, the "First Frame Anchor" mechanism was born.
The team anchors the first frame of the video as a "stabilizing needle" for the whole generation process: throughout the long subsequent generation, the model must keep "an eye" on the first frame to lock in the overall appearance distribution.
However the prompt asks later frames to evolve, the color tone and character identities established in the first frame keep pulling the model back on track, preventing any sudden change of style.
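A minimal sketch of the anchoring idea (the weighting scheme and the use of mean brightness are illustrative assumptions, not ByteDance's actual mechanism):

```python
def anchored_step(prev_frame, anchor_frame, raw_update, anchor_weight=0.1):
    """One generation step with a first-frame anchor: the new frame follows
    the model's raw prediction, but its global statistic (here, mean
    brightness) is softly corrected toward the first frame, so the overall
    tone cannot drift without bound."""
    candidate = [p + u for p, u in zip(prev_frame, raw_update)]
    anchor_mean = sum(anchor_frame) / len(anchor_frame)
    cand_mean = sum(candidate) / len(candidate)
    correction = anchor_weight * (anchor_mean - cand_mean)
    return [c + correction for c in candidate]

# A biased model that brightens every pixel each step would drift to
# 0.5 + 100 * 0.02 = 2.5 without the anchor; with it, brightness stays bounded.
anchor = [0.5] * 4
frame = anchor
for _ in range(100):
    frame = anchored_step(frame, anchor, [0.02] * 4)
assert sum(frame) / len(frame) < 1.0
```

The anchor does not forbid change; it only bounds how far global statistics can wander, which matches the article's description of "pulling the AI back on track".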
However, the appearance of flaws is still inevitable. Therefore, the AI must learn how to handle this "imperfection".
Helios adopts a special method during training: Frame-Aware Corruption.
Put simply, it randomly injects flaws into the historical frames the model conditions on, so that training reduces the model's blind dependence on history and teaches it to repair problems from common sense.
After this kind of training, Helios tolerates errors well, and long-form videos are far less likely to break down.
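The training trick can be sketched as a data-augmentation step (the corruption types, rates, and strengths here are illustrative assumptions):

```python
import random

def corrupt_history(history, drop_prob=0.2, max_noise=0.3, seed=0):
    """Frame-aware corruption, sketched: before the model conditions on its
    history, each past frame is randomly degraded. Some frames are blanked
    out entirely, others receive Gaussian noise of random strength, so the
    model learns to repair flawed context rather than trust it blindly."""
    rng = random.Random(seed)
    corrupted = []
    for frame in history:
        if rng.random() < drop_prob:
            corrupted.append([0.0] * len(frame))  # simulate a lost frame
        else:
            sigma = rng.uniform(0.0, max_noise)   # random noise strength
            corrupted.append([p + rng.gauss(0, sigma) for p in frame])
    return corrupted
```

Because the model never sees a pristine history during training, a flawed history at inference time is familiar territory rather than a failure mode.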
The last problem to solve is position offset and repeated movement.
The model's position encoding during video generation is absolute. Once the generated video exceeds the maximum length seen in training, the attention mechanism malfunctions and the picture snaps back to its initial position.
Helios switches to relative position encoding: attention no longer asks "which numbered frame is this" but "how far does this frame sit from the current one", eliminating the periodic repetition of motion at its root.
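The contrast between the two schemes fits in a few lines (a sketch of relative indexing, not Helios' actual encoding):

```python
def relative_positions(current_idx, window):
    """Label each history frame by its distance from the frame being
    generated instead of its absolute index. A frame one minute into the
    video then gets exactly the same position labels as an early frame,
    so attention never sees indices beyond its training range."""
    start = max(0, current_idx - window)
    return [current_idx - k for k in range(start, current_idx)]

# identical encodings early in the video and a full minute in (24 FPS) ...
assert relative_positions(16, 8) == relative_positions(24 * 60, 8)
# ... and the labels stay bounded no matter how long the video runs
assert max(relative_positions(10 ** 6, 8)) == 8
```

With absolute indices, frame 1,440 would carry a position label the model never saw in training; with relative ones, every frame looks like "a continuation of the past few frames".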
02 The "Magic" of Computing Power
The image-quality breakdown has been solved at the software level, but a harder challenge appears at the hardware level:
14 billion parameters are not a small number. How can it achieve real - time operation at 19.5 FPS with only one graphics card?
At its core, AI video generation is no different from a large language model: the commonly used Diffusion Transformer (DiT) architecture also relies on self-attention to capture a video's spatial detail (single-frame content) and temporal coherence (inter-frame motion).
But because images occupy far higher dimensions in vector space than text, each frame of video costs far more computation than a single LLM question-answer turn. Extending a video by just a few seconds multiplies the attention computation and memory footprint, which is why a GPU cluster is normally needed to share the load.
Trading raw compute for image quality and length clearly fails commercially, as the shutdown of Sora and the perceived post-launch "dumbing-down" of Seedance 2.0 have shown.
Helios chose another path. An underlying redesign called "Deep Compression Flow" squeezes out nearly all of the GPU's remaining potential, from token reduction and step distillation to memory management.
1. Token Perspective: Extreme Compression of the Spatio-Temporal Dimensions
The first problem to solve is the memory explosion caused by an overly long video context. Helios' answer is asymmetric compression across the spatio-temporal dimensions.
As noted earlier, AI video generation means "drawing the future based on the past", so how much "history" to keep on hand is a key question.
Human memory is heavily recency-weighted: what happened a second ago is vivid, while what happened ten minutes ago is already blurry.
Helios borrows this tiered-memory idea and divides the historical frames the model reviews into three types: short-term, medium-term, and long-term.
Frames from just a moment ago keep their full-definition detail; more distant frames are compressed aggressively, retaining only the roughest global layout.
This simple idea keeps the token count for distant history low and nearly constant. The memory footprint of historical information is compressed to one-eighth of the original, eliminating the otherwise insoluble "memory explosion" of single-card operation.
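A sketch of the three-tier compression (tier sizes and downsampling factors here are made-up illustrative values, not Helios' real configuration):

```python
def compress_history(frames, short=4, mid=8, mid_factor=4, long_factor=16):
    """Keep every token of the most recent frames, 1/mid_factor of the
    tokens of mid-range frames, and only 1/long_factor of anything older,
    so the token budget stays nearly flat as the video grows."""
    tokens = []
    n = len(frames)
    for i, frame in enumerate(frames):
        age = n - 1 - i                      # 0 = most recent frame
        if age < short:                      # short-term: full detail
            tokens.extend(frame)
        elif age < short + mid:              # mid-term: moderate downsampling
            tokens.extend(frame[::mid_factor])
        else:                                # long-term: coarse layout only
            tokens.extend(frame[::long_factor])
    return tokens

frames = [[0.0] * 64 for _ in range(1000)]   # 1000 frames, 64 tokens each
assert len(compress_history(frames)) < len(frames) * 64 // 8
```

Growing the video mostly adds long-term frames, each contributing only a handful of tokens, which is why the context stops exploding.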
When generating a frame, Helios also does not start at the highest resolution; it adopts a progressive, coarse-to-fine strategy.
This resembles how a painter works: first quickly block in the overall color and layout at low resolution, then gradually zoom in to refine edges, textures, and other details.
Early denoising steps fix the macro structure; later steps polish the details. This task decomposition cuts the computational load by more than half again.
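A rough cost model makes the saving concrete (purely illustrative numbers; attention cost is approximated as quadratic in token count):

```python
def denoise_cost(total_steps, coarse_steps, low_tokens, full_tokens):
    """Cost of a schedule that runs the first `coarse_steps` denoising steps
    at low resolution (structure) and the remaining steps at full
    resolution (detail), under a quadratic-in-tokens cost model."""
    coarse = coarse_steps * low_tokens ** 2
    fine = (total_steps - coarse_steps) * full_tokens ** 2
    return coarse + fine

baseline = denoise_cost(50, 0, 0, 4096)    # every step at full resolution
staged = denoise_cost(50, 30, 1024, 4096)  # early steps at 1/4 the tokens

assert staged < baseline / 2               # more than half the compute saved
```

Because low-resolution steps are quadratically cheaper, shifting most of the schedule to the coarse phase dominates the savings.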
2. Step Perspective: Adversarial Hierarchical Distillation
AI video generation is slow because a traditional diffusion model needs roughly 50 rounds of iterative denoising.
Previously, when video models learned few-step generation, they had to be trained with "simulated unfolding inference" to avoid "forgetting" historical frames:
after generating a video, the model was not only judged by a reward model but also had to roll out several simulated long-form continuations of the future.
The result, unsurprisingly, was extreme time consumption and memory explosion.
Helios instead adopts a "pure teacher forcing" mode: the model takes large numbers of real, continuous video slices as its only reference and never simulates future videos.
In each training step it focuses solely on "perfectly drawing the next small segment" from the given real history, removing the costly simulation and raising training efficiency dramatically.
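The training-data layout can be sketched as slicing real videos into (history, target) pairs (window sizes below are illustrative, not Helios' actual values):

```python
def teacher_forcing_pairs(real_video, history_len=16, target_len=4):
    """Pure teacher forcing, sketched: every training example is a slice of
    a real video, with `history_len` ground-truth frames as context and the
    next `target_len` frames as the prediction target. The model's own
    outputs are never rolled out during training."""
    pairs = []
    last_start = len(real_video) - history_len - target_len
    for start in range(0, last_start + 1, target_len):
        history = real_video[start:start + history_len]
        target = real_video[start + history_len:start + history_len + target_len]
        pairs.append((history, target))
    return pairs

pairs = teacher_forcing_pairs(list(range(100)))
assert pairs[0] == (list(range(16)), [16, 17, 18, 19])
```

Every context the model trains on is real footage, so there is nothing to simulate and nothing to keep resident in memory beyond one slice.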
The denoising process also uses a distillation mechanism similar to the one in large language models.
However, knowledge distillation always has a fatal flaw: the upper limit of the "student" will not be higher than that of the "teacher", but the lower limit may be lower. Once the flaws are magnified, the quality of the generated video will naturally decline.
For this reason, Helios adds adversarial post-training grounded in real videos: if the "student's" denoised output is a mere imitation of the "teacher" and lacks real physical detail, it is sent back for rework.
This strict regimen compresses the image fidelity that once took 50 steps into just 3.
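The gatekeeping step can be sketched with a toy discriminator (the threshold and the plausibility score are hypothetical stand-ins for a learned network):

```python
def adversarial_filter(student_frames, discriminator, threshold=0.5):
    """After distillation, each student sample must also convince a
    discriminator trained on real video. Samples that merely imitate the
    teacher but score as implausible are sent back for rework."""
    kept, rework = [], []
    for frame in student_frames:
        (kept if discriminator(frame) >= threshold else rework).append(frame)
    return kept, rework

# toy discriminator: treats pixel values in [0, 1] as physically plausible
def plausible(frame):
    return 1.0 if all(0.0 <= p <= 1.0 for p in frame) else 0.0

kept, rework = adversarial_filter([[0.2, 0.8], [1.7, -0.3]], plausible)
assert kept == [[0.2, 0.8]] and rework == [[1.7, -0.3]]
```

The key design point is that the filter is anchored on real data, not on the teacher, so the student's lower bound is no longer capped by imitation quality alone.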
3. Video-Memory Perspective: Reconstructing the Scheduling Mechanism
GPU memory is fixed, yet the model contains multiple sub-models that must be computed serially.
The team therefore designed a scheduling mechanism: over a dedicated data channel, only the sub-model currently computing lives on the GPU; the moment it finishes and goes idle, its parameters are transferred to the CPU to stand by.
In modern AI training frameworks such as PyTorch, intermediate activations are kept in GPU memory during the forward pass for use in back-propagation.
Noticing this, the team reached below the framework's default behavior: as soon as a gradient is computed, a manual trigger releases the corresponding activations within milliseconds, freeing more than twice the idle memory.
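The serial-offload idea can be sketched in plain Python (device placement is simulated with labels; a real implementation would move parameters with framework calls such as `torch.nn.Module.to()`):

```python
class OffloadScheduler:
    """Run sub-models serially, keeping at most one resident on the 'GPU'
    at any moment; a sub-model's parameters are evicted the instant it
    goes idle."""
    def __init__(self, stages):
        self.stages = stages           # list of (name, fn) sub-models
        self.on_gpu = set()            # names currently resident on the GPU
        self.peak_resident = 0         # high-water mark of resident stages

    def run(self, x):
        for name, fn in self.stages:
            self.on_gpu.add(name)      # prefetch parameters onto the GPU
            self.peak_resident = max(self.peak_resident, len(self.on_gpu))
            x = fn(x)                  # compute this stage
            self.on_gpu.discard(name)  # offload to CPU as soon as it is idle
        return x

sched = OffloadScheduler([("encoder", lambda v: v * 2), ("dit", lambda v: v + 1)])
assert sched.run(3) == 7
assert sched.peak_resident == 1        # never more than one stage resident
```

Because the stages run serially anyway, keeping only the active one resident trades a small transfer cost for a much smaller peak-memory footprint.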
Official deep-learning frameworks also hide many data-transfer losses.
To accelerate generation further, the team bypassed PyTorch's default kernels and wrote the core code in Triton, a low-level GPU kernel language, even removing one multiplicative dimension from the memory complexity of the standard attention computation.
It is this series of extreme squeezes, from the algorithmic bottom layer up to memory scheduling, that lets a 14B-parameter model work a miracle on a single H100.
03 Helios: Reconstructing the Business Landscape of AI Video
A breakthrough in underlying technology often triggers an earthquake in the industrial chain, and Helios was precisely born at ByteDance, the inventor of Seedance 2.0.
This model, neither extremely large nor extremely small, combines "high quality + real time + single card + long duration" in an unprecedented way, and it breaks precisely through the barriers to commercializing AI video.
The shutdown of Sora and the perceived "dumbing-down" of Seedance 2.0 soon after launch show that the biggest obstacle to large-scale consumer adoption of AI video is price.
Over the past year, every video-generation model on the market with decent output has burned enormous compute to produce a roughly 10-second clip.
Under subscriptions, existing call volumes only lose AI companies money; and even with APIs open to enterprise customers, the technical gap and the cost of building commercial products on these models deter developers.
Helios, however, lowers the operating threshold of a 14B model to a single H100, with very high throughput.
Consumer-grade graphics cards are still not enough, but this already means the per-stream concurrency cost for cloud providers and SaaS platforms will drop sharply, and the API business model may change qualitatively.
Today's credit systems that charge per generation may give way to token-based billing like that of large language models.
Only when generation is cheap enough can multimodal models turn from "luxury goods" into infrastructure, as large language models have.
The other disruptive prospect Helios opens is that AI video generation is about to shed the "offline rendering" label and move toward a real-time interactive engine.