
ByteDance's own "Luo Fuli" has propped up half of Seedance's sky.

字母AI · 2026-03-24 15:15
As the person in charge of pre-training, she shaped the model's worldview.

With the launch of Xiaomi's new model, "Genius Girl" Luo Fuli has once again become the center of attention. Women remain a minority among AI scientists, but Luo Fuli is by no means the only one: ByteDance has someone just like her.

She is Zeng Yan, head of pre-training for the Seedance 2.0 video generation model.

Generally, when people talk about Seedance 2.0, the names that come to mind are Wu Yonghui at the helm, Zhou Chang leading R&D, and Jiang Lu, the core lead for video generation technology.

Few people know that Zeng Yan's presence is equally indispensable.

Because pre-training is the "cornerstone" of the entire model: it determines the upper limit of the model's capabilities.

Most people regard pre-training as "feeding data," but real experts know that pre-training is "shaping the model's worldview."

How to allocate data, how to design the architecture, how to adjust the training strategy: every one of these decisions determines what the model can see, understand, and generate.

If the pre-training is not done well, no amount of later optimization will bring the model to the current height of Seedance 2.0.

Beyond her technical contributions, Zeng Yan's promotion speed at ByteDance has also been remarkable.

It took her only 5 years from graduating and joining ByteDance to reach the 4-2 rank.

The 4-2 rank corresponds to senior director / principal architect, part of the company's core strategic technical backbone. The annual package (base salary, year-end bonus, and stock) is generally over 5 million yuan.

What on earth did she do to achieve such success? Let's start with her academic journey.

01 From Xi'an Jiaotong University to ByteDance

To be honest, when I first saw Zeng Yan's resume, I wasn't particularly impressed.

She was born in 1997, graduated from Xi'an Jiaotong University with a bachelor's degree, and obtained a master's degree in computer science from the University of Montreal in Canada. This path is very common in today's AI circle.

But what happened next was not so "standard".

In September 2021, Zeng Yan joined ByteDance AI Lab as a campus recruit, starting at the rank of algorithm engineer.

Just two months after joining, Zeng Yan published a paper on arXiv as first author, "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts," which later became known as the X-VLM model.

In plain terms, the problem solved by this paper is: how to make AI understand both the "big picture" and notice the "small details".

Traditional vision-language models fall into two extremes. One is the "broad-brushstroke" school, which only looks at the correspondence between the overall image and the text. Show such an AI a photo, and it can only say "this is a beach" and nothing more.

The other is the "microscope" school, which relies on expensive object detectors to pick out each object. Although it can see details, the computational cost is extremely high, and it also depends on a large amount of manually annotated data.

The X-VLM proposed by Zeng Yan takes the best of both worlds.

It learns visual concepts at multiple levels, from whole to part, from scene to object, from coarse to fine, and accurately aligns them with information of matching granularity in the text.

Or, I can use a phrase I just learned recently to describe it: See both the forest and the trees.
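The idea can be sketched with a toy contrastive loss. This is not X-VLM's actual implementation; it is a minimal illustration, assuming we already have coarse embeddings (whole image and full caption) and fine embeddings (region and phrase), of how alignment objectives at two granularities can be combined into one training signal:

```python
import numpy as np

def info_nce(sim, tau=0.07):
    """Symmetric InfoNCE: the diagonal of `sim` holds the matched pairs."""
    logits = sim / tau
    row = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    col = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    n = sim.shape[0]
    return -(np.trace(row) + np.trace(col)) / (2 * n)

def cosine(a, b):
    """Pairwise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def multi_grained_loss(img_global, caption, img_region, phrase, w_fine=0.5):
    """Coarse loss (image <-> caption) plus fine loss (region <-> phrase)."""
    coarse = info_nce(cosine(img_global, caption))
    fine = info_nce(cosine(img_region, phrase))
    return coarse + w_fine * fine
```

Training on this combined objective pushes the model to match at both the scene level and the object level at once, which is the "forest and trees" property in miniature.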

This idea of "multi-granularity alignment" seemed like a purely academic innovation at the time, but it laid the groundwork for Zeng Yan to later head the pre-training of Seedance 2.0.

Because the pre-training of video generation is, at its core, also a multi-granularity modeling problem.

To generate a good-looking video, you need to grasp the overall narrative rhythm so the video has a coherent story line; control the detail quality of each frame so characters' faces do not deform and objects move according to physical laws; and model correlations along the temporal dimension so transitions between frames are natural and smooth.

This is consistent with the underlying logic of X - VLM.

Over the next two years, Zeng Yan was on a roll.

She published eight first-author papers in top international journals and conferences such as TPAMI, ICML, CVPR, ACL, and NAACL, and served as a reviewer for TPAMI, ICML, NeurIPS, ICLR, ACL, and EMNLP.

In 2023, a crucial turning point came.

ByteDance established its large-model research department, Seed, and Zeng Yan and her team transferred into it.

This moment has to be seen in context: at the end of 2022 ChatGPT burst onto the scene, and at the beginning of 2023 major companies went all-in on large models. ByteDance adjusted its technology strategy in this wave.

Zeng Yan's expertise in multi-modal pre-training could be brought to bear on the new battlefield of video generation.

In the Seed department, Zeng Yan led two important projects as first author: CCLM and Lynx.

Let's start with CCLM (Cross-View Language Modeling).

This project teaches the model both "cross-lingual" and "cross-modal" understanding. Through a unified pre-training framework, CCLM allows a model trained on English image-text data to transfer zero-shot to multi-modal tasks in other languages such as Chinese and Japanese.

To put it simply, it lets AI "draw inferences from one instance": understanding learned from English data can be applied directly to material in Chinese, Japanese, or Spanish.
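A toy simulation of why this works, under one big simplifying assumption: if a multilingual text encoder already maps translations of the same concept to nearby vectors, then a matcher fitted only on English image-text pairs also serves queries in other languages. The concept table and encoders below are invented purely for illustration, not taken from CCLM:

```python
import numpy as np

# Hypothetical shared multilingual text space: translations of the same
# concept land on (nearly) the same vector.
CONCEPT = {"dog": 0, "狗": 0, "beach": 1, "海滩": 1, "car": 2, "汽车": 2}

def encode_text(word, rng):
    v = np.zeros(3)
    v[CONCEPT[word]] = 1.0
    return v + 0.05 * rng.normal(size=3)   # small language-specific noise

def encode_image(concept_id, rng):
    v = np.zeros(3)
    v[concept_id] = 1.0
    return v + 0.05 * rng.normal(size=3)

def retrieve(query, images):
    # The "matcher" learned from English pairs only; here just a dot product.
    return int(np.argmax(images @ query))
```

Because 海滩 encodes to almost the same vector as "beach", retrieval works for the Chinese query even though no Chinese pair was ever seen; that shared space is the whole trick.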

Now let's talk about Lynx.

This is a systematic study of how to train a GPT-4-style multi-modal large language model. In 2023, when GPT-4 had just been released, everyone was exploring how to build a large model that could "describe pictures."

Through a series of controlled comparison experiments, Zeng Yan's team identified the key factors: model architecture design, training data allocation, and instruction fine-tuning strategy. The result was the Lynx model, which performed excellently in multi-modal understanding and instruction following.

In plain language: study how to build an AI that understands pictures and converses smoothly, and figure out which factors really matter.

What really made Zeng Yan "stand out" was PixelDance at the end of 2023.

The paper's title is fun: "Make Pixels Dance: High-Dynamic Video Generation." It tackles a long-standing contradiction in video generation: how to balance dynamism and stability.

Think about it: if an AI-generated video has large-amplitude movements and drastic scene changes, it looks vivid and interesting, but it is prone to "paranormal events" such as frame collapse, character deformation, and objects suddenly vanishing.

Conversely, if you pursue stability, keeping characters and scenes consistent with no sudden changes, the video tends to come out stiff, more like a slide-show transition than smooth motion.

The breakthrough of Zeng Yan's team was to impose strict temporal constraints during the pre-training stage.

Traditional video generation models first generate the video and then repair it frame by frame. PixelDance instead teaches the model to generate dynamic content while maintaining consistency.

The core innovation is to condition the diffusion model on two images, the first frame and the last frame, jointly with the text instruction, and to add temporal convolution and temporal attention layers to the network. Anchoring the video's start and end states at the source of generation preserves subject and scene consistency even under large motion.

It's like training a dancer who, from the very beginning, learns to keep her balance while making large-amplitude movements.
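The anchoring idea can be mimicked in a toy sampler. This is nothing like the real diffusion architecture; it is a one-scalar-per-frame sketch in which the first and last "frames" are re-imposed at every denoising step, so the trajectory is forced to connect the two anchors rather than drift:

```python
import numpy as np

def generate_video(first, last, n_frames=8, steps=100, seed=0):
    """Toy anchored sampler: start from noise, repeatedly 'denoise' by
    smoothing, and clamp the first/last frames back to their anchors."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_frames)                           # pure noise
    for t in range(steps, 0, -1):
        x = np.convolve(x, [0.25, 0.5, 0.25], mode="same")  # crude denoiser
        x += rng.normal(size=n_frames) * 0.1 * t / steps    # decaying noise
        x[0], x[-1] = first, last                           # re-impose anchors
    return x
```

Running `generate_video(0.0, 1.0)` yields frames that interpolate between the two anchors: the start and end states constrain everything in between, which is the intuition behind first/last-frame conditioning.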

The success of PixelDance quickly elevated Zeng Yan's status within ByteDance.

In 2024, she was promoted from algorithm engineer to algorithm researcher, becoming one of the youngest researchers on the Seed team. The promotion recognized not just her academic ability but, more importantly, her proven ability to turn research results into actual products.

In large companies, the difference between these two abilities is like the difference between being able to cook and being able to run a restaurant.

02 From PixelDance to Seedance 2.0

Interestingly, PixelDance is the predecessor of Seedance.

"Seed" refers to ByteDance's large-model department, and "dance" retains the core concept of "making pixels dance." The renaming is not just brand strategy; it marks the model's transformation from research prototype to commercial product.

On June 11, 2025, ByteDance officially released Seedance 1.0, with Zeng Yan as the model's core R&D lead.

Although ByteDance did not officially name Zeng Yan head of pre-training for the Seedance 2.0 video model until February 2026, insiders say that as early as the second half of 2025 she was already leading the entire Seedance 2.0 pre-training process as the project's core lead.

Her +2 leader is Zhou Chang, and her +3 leader is Wu Yonghui, the leader of the Seed team.

One of Seedance 2.0's core technological breakthroughs, the dual-branch diffusion transformer architecture, was established by Zeng Yan's team during the pre-training stage.

Traditional video generation models adopt a "draw first, dub later" mode: generate the video frames first, then separately generate or match the audio.

The problem is that separating audio and video leads to poor synchronization: lips out of sync when a character speaks, background music whose rhythm clashes with the mood of the scene, sound effects mistimed against on-screen actions.

Seedance 2.0 achieves native audio-video synergy at the root: video and audio are generated in parallel and share the same understanding encoder.

The key to this architecture design is that the model considers what the corresponding audio should be while generating each frame of the picture, rather than "matching" the audio after the entire picture is generated.
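A minimal sketch of the shared-conditioning idea (all names invented, nothing from the actual Seedance architecture): one "understanding encoder" turns the prompt into a per-timestep conditioning signal, and both branches read the same signal, so their events line up by construction rather than by after-the-fact matching:

```python
import numpy as np

def understanding_encoder(event_steps, n_steps=16):
    """Toy shared encoder: mark which timesteps contain an event."""
    cond = np.zeros(n_steps)
    cond[list(event_steps)] = 1.0
    return cond

def video_branch(cond):
    return 0.2 + 0.8 * cond    # brightness spikes at event timesteps

def audio_branch(cond):
    return 0.1 + 0.9 * cond    # loudness spikes at the same timesteps

cond = understanding_encoder([3, 8, 12])
video, audio = video_branch(cond), audio_branch(cond)
# Both branches peak at exactly the same timesteps: native A/V sync.
```

In a "dub later" pipeline the audio branch would instead have to infer event timing from finished frames, and any inference error shows up as lip-sync drift; sharing the conditioning removes that failure mode at the source.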

As I mentioned at the beginning of the article, pre - training is the cornerstone of the entire model's capabilities.

At this stage, Zeng Yan needed to process a large amount of video data and establish alignment across modalities such as vision, text, and audio.

She introduced a "cross-branch calibration module" that calibrates rhythm, mood, and scene matching between video and audio in real time, ensuring lips sync with the lines, sound effects match the picture, and background music fits the mood and atmosphere.

During pre-training, all of the multi-modal alignment relationships, physical laws, and movement patterns are baked into the model as "defaults"; whenever the model needs them later, it draws immediately on what pre-training put there.

It's not simply making the model remember the training data, but making the model extract general laws from a large amount of data and form a basic understanding of the world.

Seedance 2.0 takes only 60 seconds to generate a 1-minute 2K video, 30% faster than the previous generation, Seedance 1.5 Pro.

Behind the speed improvement is the careful tuning of model architecture, training strategy, and data allocation by Zeng Yan's team during the pre-training stage.

Her team iterated extremely fast, completing multiple rounds of diffusion-model optimization during pre-training.

They optimized the attention mechanism to reduce redundant calculations, improved the noise scheduling strategy to speed up the convergence, and selected high - quality training data to improve the sample efficiency.
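The article doesn't say which schedule change the team made, so as a generic illustration of what "improving the noise scheduling strategy" can mean, here is the standard comparison from the diffusion literature between a linear β schedule and the cosine schedule; the cosine schedule destroys signal more gradually, which tends to improve sample efficiency:

```python
import numpy as np

def alpha_bar_linear(T, beta0=1e-4, betaT=0.02):
    """Cumulative signal fraction under a DDPM-style linear beta schedule."""
    betas = np.linspace(beta0, betaT, T)
    return np.cumprod(1.0 - betas)

def alpha_bar_cosine(T, s=0.008):
    """Cosine schedule (Nichol & Dhariwal): signal decays more gradually."""
    t = np.arange(1, T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    f0 = np.cos(s / (1 + s) * np.pi / 2) ** 2
    return f / f0
```

Halfway through a 1000-step schedule the linear variant has already wiped out most of the signal, while the cosine variant still retains roughly half: the model gets many more training steps at informative noise levels.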

Each optimization point seems insignificant on its own, but accumulated they amount to a qualitative leap. The larger the model, the higher the training cost: every percentage point of efficiency gained means millions of yuan saved and weeks of training time shaved off.

Seedance 2.0 also realizes multi-shot narrative ability. The model can not only generate long-form videos but also understand the professional shot-splitting logic of "wide shot, medium shot, close-up," automatically plan shot transitions, and generate a complete narrative sequence with montage effects.

This ability depends in large part on the huge volume of ByteDance short-video data that Zeng Yan fed the model during pre-training.

Tens of millions of short videos are uploaded to Douyin every day. Although most are shot by ordinary users, many contain excellent shot language and narrative technique.

Zeng Yan's team screened high-quality samples out of this data, letting the model learn the shot language and narrative rhythm of human directors. This "director's intuition" was distilled from the data.

03 Zeng Yan and Luo Fuli

As female AI scientists, both Zeng Yan and Luo Fuli excel at finding the "balance point" in model R&D.

During her DeepSeek period, Luo Fuli worked on DeepSeek-V2. Through the sparse activation of its MoE architecture, the model cut inference cost to one-seventieth of GPT-4 Turbo's while performing close to the top closed-source models.

It's like designing a large library: there are millions of books, but each query only opens a few of them instead of hauling out the whole collection. This "on-demand activation" mechanism sharply reduces the cost of a large model with little loss of performance.

Luo Fuli found such a balance point between performance and cost.
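DeepSeek-V2's actual design (fine-grained and shared experts, plus multi-head latent attention) is considerably more elaborate, but the core of "on-demand activation" can be sketched as plain top-k routing; every name below is illustrative:

```python
import numpy as np

def moe_layer(x, experts, gate, k=2):
    """Sparse MoE: each token is processed by only its top-k experts.
    x: (tokens, dim); experts: (n_experts, dim, dim); gate: (dim, n_experts)."""
    scores = x @ gate                                 # routing scores per token
    out = np.zeros_like(x)
    expert_calls = 0
    for i, token in enumerate(x):
        top = np.argsort(scores[i])[-k:]              # k best experts for this token
        w = np.exp(scores[i, top] - scores[i, top].max())
        w /= w.sum()                                  # softmax over selected experts
        for weight, e in zip(w, top):
            out[i] += weight * (token @ experts[e])   # only these experts run
            expert_calls += 1
    return out, expert_calls
```

With 8 experts and k=2, each token triggers 2 expert multiplications instead of 8: total parameters grow with the number of experts, but per-token compute scales only with k. That gap is exactly the performance-versus-cost balance point described above.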

At Xiaomi, known as the "king of cost-performance," Luo Fuli carried the DeepSeek spirit forward. She led her team, jointly with Peking University, to develop the resource management system ARL-Tangram, cutting the model's computing cost by 71.2%.

Lower cost did not mean lower performance: the trillion-parameter flagship MiMo-V2-Pro built on this technology ranks eighth globally and second in China on Artificial Analysis's comprehensive intelligence leaderboard for large models.

Luo Fuli has proved one thing: cost-performance is not a one-project accident but a methodology that can be replicated across platforms.

Zeng Yan's balance point is the dynamism-versus-stability trade-off described above: a video generation model that tells a good story while retaining visual tension and impact.

The difference between the two lies in their career plans.