
ByteDance's own "Luo Fuli" has propped up half of Seedance.

字母AI · 2026-03-24 15:15
As the head of pre-training, she shaped the model's worldview.

With the launch of Xiaomi's new model, "genius girl" Luo Fuli is once again in the spotlight. Women are still a minority among AI scientists, but Luo Fuli is by no means the only one, and ByteDance has a figure much like her.

She is Zeng Yan, head of pre-training for the Seedance 2.0 video generation model.

When people talk about Seedance 2.0, the names that usually come up are Wu Yonghui, the overall lead; Zhou Chang, the R&D lead; and Jiang Lu, the core lead for video generation technology.

Few people know that Zeng Yan's presence is also indispensable.

Pre-training is the "cornerstone" of the entire model: it sets the ceiling on the model's capabilities.

Most people think of pre-training as "feeding data", but real experts know that pre-training is about "shaping the model's worldview".

How to allocate data, design the architecture, and tune the training strategy: every one of these decisions determines what the model can see, understand, and generate.

If the pre-training is not done well, no amount of later optimization would have carried the model to Seedance 2.0's current heights.

Beyond her technical contributions, Zeng Yan has also risen unusually fast at ByteDance.

It took her only five years from joining as a fresh graduate to reach the 4-2 rank.

The 4-2 rank corresponds to senior director or principal architect, part of the company's core strategic technical backbone. The annual package (base salary, year-end bonus, and stock) is generally over 5 million yuan.

What exactly did she do to achieve such success? Let's start with her academic journey.

01 From Xi'an Jiaotong University to ByteDance

To be honest, when I first saw Zeng Yan's resume, I wasn't particularly impressed.

Born in 1997, she graduated from Xi'an Jiaotong University with a bachelor's degree and obtained a master's degree in computer science from the University of Montreal in Canada. This path is very common in today's AI circle.

But what happened next was not so "standard".

In September 2021, Zeng Yan joined ByteDance AI Lab as a campus recruit, starting at the rank of algorithm engineer.

Just two months after joining, Zeng Yan posted a first-author paper on arXiv titled "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", which introduced what later became the well-known X-VLM model.

In simple terms, the problem this paper solves is how to make AI grasp both the "big picture" and the "small details".

Traditional vision-language models fall into two extremes. One is the "broad-stroke" school, which only matches the overall image to the text: show it a photo and it can say "this is a beach" and little more.

The other is the "microscope" school, which relies on expensive object detectors to pick out every object. It sees the details, but the computational cost is steep, and it depends on large amounts of manually labeled data.

Zeng Yan's X-VLM combines the strengths of both.

It learns visual concepts at multiple levels simultaneously, from whole to part, scene to object, coarse to fine, and aligns each level accurately with text of the matching granularity.

Or, to put it in a phrase I picked up recently: it sees both the forest and the trees.
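The idea can be pictured with a toy sketch: each visual granularity (scene, region, object) is matched against text spans in a shared embedding space. The vectors and phrases below are invented for illustration; the real X-VLM learns such embeddings with contrastive objectives, it does not hand-code them.

```python
# Toy sketch of multi-granularity vision-language alignment (X-VLM style).
# All embeddings here are made up; a real model learns them from data.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align(visual_concepts, text_spans):
    """Match each visual granularity to its best-fitting text span."""
    return {
        level: max(text_spans, key=lambda t: cosine(vec, text_spans[t]))
        for level, vec in visual_concepts.items()
    }

# Toy embeddings: the whole image, one region, and one object,
# aligned against text spans of matching granularity.
visual = {"scene": [1.0, 0.1, 0.0], "region": [0.1, 1.0, 0.1], "object": [0.0, 0.1, 1.0]}
texts = {"a beach at sunset": [0.9, 0.2, 0.1],
         "a group of surfers": [0.2, 0.9, 0.2],
         "a red surfboard": [0.1, 0.2, 0.9]}

print(align(visual, texts))
```

The point of the sketch is the shape of the problem: one image yields several levels of description, and each level needs its own alignment to the text, rather than a single image-to-caption match.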

This idea of "multi-granularity alignment" looked like a purely academic innovation at the time, but it laid the groundwork for Zeng Yan's later role as head of pre-training for Seedance 2.0.

Pre-training for video generation is, at heart, also a multi-granularity modeling problem.

To generate a good video, you need to grasp the overall narrative rhythm so the video tells a coherent story; control per-frame detail so characters' faces stay consistent and objects move according to physical laws; and model correlations along the time axis so transitions between frames are natural and smooth.

This is consistent with the underlying logic of X-VLM.

In the next two years, Zeng Yan seemed to be on a roll.

She published eight first-author papers in top venues such as TPAMI, ICML, CVPR, ACL, and NAACL, and served as a reviewer for venues including TPAMI, ICML, NeurIPS, ICLR, ACL, and EMNLP.

In 2023, a key turning point came.

ByteDance established its large-model research department, Seed, and Zeng Yan and her team transferred into it.

The timing matters. At the end of 2022, ChatGPT burst onto the scene; in early 2023, major companies went all-in on large models, and ByteDance adjusted its technology strategy in the same wave.

Zeng Yan's expertise in multi-modal pre-training could be put to full use on the new battlefield of video generation.

In the Seed department, Zeng Yan led two important projects as first author: CCLM and Lynx.

Let's start with CCLM (Cross-View Language Modeling).

This project lets the model learn to understand across both languages and modalities at once. Through a unified pre-training framework, CCLM allows a model trained on English image-text data to transfer zero-shot to multi-modal tasks in other languages such as Chinese and Japanese.

To put it simply, it teaches AI to "draw inferences from one instance": understanding learned from English data can be applied directly to content in Chinese, Japanese, or Spanish.
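A toy sketch of why zero-shot transfer works, assuming the one property CCLM's training is designed to produce: words from different languages land on the same point in a shared embedding space. The vocabulary, vectors, and matcher below are all invented for the sketch.

```python
# If "dog", "狗", and "perro" share one embedding, an image-text matcher
# fit only on English pairs scores other languages for free.

SHARED_SPACE = {  # word -> language-agnostic concept vector (toy values)
    "dog": (1.0, 0.0), "狗": (1.0, 0.0), "perro": (1.0, 0.0),
    "cat": (0.0, 1.0), "猫": (0.0, 1.0), "gato": (0.0, 1.0),
}

IMAGE_EMBEDDINGS = {"photo_of_dog": (1.0, 0.0), "photo_of_cat": (0.0, 1.0)}

def match_score(image_id, word):
    """Dot product in the shared space; 'trained' (here: defined) on English only."""
    iv, tv = IMAGE_EMBEDDINGS[image_id], SHARED_SPACE[word]
    return sum(a * b for a, b in zip(iv, tv))

# Zero-shot: the matcher never saw Chinese or Spanish text, yet the
# rankings transfer because the embedding space is shared.
assert match_score("photo_of_dog", "狗") > match_score("photo_of_dog", "猫")
assert match_score("photo_of_cat", "gato") > match_score("photo_of_cat", "perro")
print("cross-lingual transfer holds on the toy data")
```

The hard part in the real system is, of course, getting different languages to actually share that space; that is what the unified pre-training framework is for.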

Now let's talk about Lynx.

This is a systematic study of how to train a GPT-4-style multi-modal large language model. In 2023, when GPT-4 had just been released, everyone was exploring how to build a large model that could "describe pictures".

Through a series of controlled experiments, Zeng Yan's team identified the key factors: model architecture design, training data allocation, and instruction fine-tuning strategy. The result was the Lynx model, which performs strongly in both multi-modal understanding and instruction following.

In plain language: study "how to build an AI that can understand pictures and hold a smooth conversation", and figure out which factors actually matter.

What really made Zeng Yan "stand out" was PixelDance at the end of 2023.

The paper's title is memorable: "Make Pixels Dance: High-Dynamic Video Generation". It tackles a long-standing tension in video generation: how to balance dynamism and stability.

Think about it. If an AI-generated video has large, dramatic movements, it may look vivid and interesting, but it is prone to "paranormal events" such as frame collapse, character deformation, and objects suddenly vanishing.

Conversely, if you chase stability, keeping characters and scenes consistent with no sudden changes, the result tends to be stiff, more slideshow than smooth motion.

Zeng Yan's team's breakthrough was to impose strict temporal constraints during the pre-training stage.

Traditional video generation models generate the video first and then repair it frame by frame; PixelDance instead teaches the model to generate dynamic content while maintaining consistency.

The core innovation is to introduce dual-image conditioning on the first and last frames within the diffusion framework, jointly constraining generation together with the text instruction, while adding temporal convolution and temporal attention layers to the network. Anchoring the video's start and end states at the source of generation keeps the subject and scene consistent even under large motions.

It's like training a dancer who, from day one, learns to keep her balance while performing large movements.
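The anchoring idea can be illustrated with a toy generator, which is emphatically not the actual PixelDance model: frames here are single numbers, the "dynamics" term stands in for the diffusion network's output, and a weight that vanishes at both endpoints guarantees the first and last frames stay pinned to the conditioning images while the middle stays free to move.

```python
# Toy illustration of first/last-frame anchoring in video generation.
import random

def generate_clip(first, last, n_frames, wobble=0.5, seed=0):
    """Generate n_frames values: endpoints are pinned, the middle can move."""
    rng = random.Random(seed)
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        anchor = (1 - t) * first + t * last        # path between the two anchors
        dynamics = rng.uniform(-wobble, wobble)    # stand-in for model dynamics
        weight = 4 * t * (1 - t)                   # 0 at endpoints, 1 mid-clip
        frames.append(anchor + weight * dynamics)
    return frames

clip = generate_clip(first=0.0, last=10.0, n_frames=6)
print(clip[0], clip[-1])  # endpoints equal the conditioning "frames" exactly
```

The structural point survives the simplification: because the constraint is built into how every frame is produced, consistency is not something repaired afterwards.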

The success of PixelDance quickly elevated Zeng Yan's status within ByteDance.

In 2024, she was promoted from algorithm engineer to algorithm researcher, becoming one of the youngest researchers on the Seed team. The promotion recognized not only her academic ability but, more importantly, her proven ability to turn research results into real products.

In large companies, the difference between these two abilities is like the difference between being able to cook and being able to run a restaurant.

02 From PixelDance to Seedance 2.0

Interestingly, PixelDance is the predecessor of Seedance.

"Seed" comes from ByteDance's large-model department, and "dance" keeps the core idea of "making pixels dance". The renaming is not just brand strategy; it marks the model's transformation from research prototype to commercial product.

On June 11, 2025, ByteDance officially released Seedance 1.0, with Zeng Yan as the model's core R&D lead.

Although ByteDance did not officially name Zeng Yan head of pre-training for the Seedance 2.0 video model until February 2026, insiders say that as early as the second half of 2025 she was already leading the entire pre-training process and had become the project's core lead.

Her manager two levels up (+2) is Zhou Chang; three levels up (+3) is Wu Yonghui, head of the Seed team.

One of Seedance 2.0's core technological breakthroughs is its dual-branch diffusion transformer architecture, the foundation laid by Zeng Yan's team during pre-training.

Traditional video generation models work in "draw first, dub later" mode: generate the video frames first, then separately generate or match the audio.

The problem with this approach is poor synchronization: lips out of sync when a character speaks, background music off the mood of the scene, sound effects mistimed against on-screen actions.

Seedance 2.0 achieves native audio-video synergy at the root by generating video and audio in parallel from a single shared understanding encoder.

The key to this architecture is that the model considers what the corresponding audio should be while generating each frame, rather than "dubbing" the audio after all the frames exist.
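Structurally, the idea looks something like the sketch below: one shared context per timestep feeds both branches, so each audio chunk is produced alongside its frame. The encoder and both "branches" here are trivial placeholders I invented for the sketch, not Seedance internals.

```python
# Minimal sketch of a dual-branch generator with one shared encoder.

def shared_encoder(prompt, step):
    """One context representation per timestep, shared by both branches."""
    return (hash((prompt, step)) % 100) / 100.0   # placeholder for a real encoder

def video_branch(context):
    return f"frame(ctx={context:.2f})"            # placeholder video decoder

def audio_branch(context):
    return f"audio(ctx={context:.2f})"            # placeholder audio decoder

def generate(prompt, n_steps):
    clip = []
    for step in range(n_steps):
        ctx = shared_encoder(prompt, step)        # same context for both branches
        clip.append((video_branch(ctx), audio_branch(ctx)))
    return clip

clip = generate("a dog barking on a beach", 3)
for frame, audio in clip:
    print(frame, audio)
```

The design point is visible even in the toy: synchronization is free, because frame and audio at each step are two views of the same context rather than two independently generated streams reconciled later.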

As I mentioned at the beginning of the article, pre-training is the cornerstone of the entire model's capabilities.

At this stage, Zeng Yan has to process an enormous volume of video data and establish alignment across modalities: vision, text, and audio.

By introducing a "cross-branch calibration module", the team calibrates the rhythm, mood, and scene match between video and audio in real time, ensuring lip movements match the lines, sound effects fit the picture, and the background music tracks the emotional tone.

During pre-training, all these multi-modal alignments, physical laws, and movement patterns are baked into the model as "defaults"; whenever the model needs them later, it can draw on them immediately.

This is not simply making the model memorize the training data; it is making the model extract general laws from huge amounts of data and form a basic understanding of the world.

Seedance 2.0 takes only 60 seconds to generate a 1-minute 2K video, 30% faster than its predecessor, Seedance 1.5 Pro.

Behind the speed improvement is the fine-tuning of model architecture, training strategy, and data allocation by Zeng Yan's team during pre-training.

Her team iterates extremely fast and completed multiple rounds of optimization of the diffusion model during the pre-training stage.

They optimized the attention mechanism to cut redundant computation, improved the noise schedule to speed up convergence, and curated high-quality training data to raise sample efficiency.

Each optimization looks insignificant on its own, but together they add up to a qualitative leap. The larger the model and the higher the training cost, the more each percentage point of efficiency is worth: millions of yuan saved and weeks shaved off the schedule.
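Of those knobs, the noise schedule is the easiest to show concretely. Below is the standard cosine schedule from the public diffusion literature, as one example of the kind of lever being tuned; Seedance's actual schedule is not public, so this is illustrative only.

```python
# Cosine noise schedule for diffusion training (Nichol & Dhariwal style).
# alpha_bar(t) is the cumulative fraction of signal kept at step t:
# it starts at 1.0 (clean data) and decays smoothly toward 0 (pure noise).
import math

def cosine_alpha_bar(t, total_steps, s=0.008):
    """Cumulative signal fraction at step t of total_steps."""
    f = math.cos(((t / total_steps + s) / (1 + s)) * math.pi / 2) ** 2
    f0 = math.cos((s / (1 + s)) * math.pi / 2) ** 2
    return f / f0  # normalized so alpha_bar(0) == 1.0

schedule = [cosine_alpha_bar(t, 1000) for t in range(0, 1001, 250)]
print([round(a, 3) for a in schedule])
```

Changing this curve changes how much of training is spent on heavily-noised versus lightly-noised examples, which is one of the ways a schedule tweak can speed up convergence.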

Seedance 2.0 also achieves multi-shot narrative. The model can not only generate long videos, but also understands professional storyboarding logic (wide shot, medium shot, close-up), automatically plans shot transitions, and generates a complete narrative sequence with montage effects.

Through high-quality samples, Zeng Yan's team taught the model the shot language and narrative rhythm of human directors, a "director's intuition" extracted from the data.

03 Zeng Yan and Luo Fuli

As female AI scientists, both Zeng Yan and Luo Fuli are good at finding the "balance point" in model R&D.

During her DeepSeek period, Luo Fuli contributed to DeepSeek-V2. Through sparse activation in its MoE architecture, the model cut inference cost to one-seventieth that of GPT-4 Turbo while staying close to the top closed-source models in performance.

It's like designing a large library: there are millions of books, but each query only opens a handful of them rather than hauling out the whole collection. This "on-demand activation" mechanism slashes the large model's cost with little loss of performance.
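The library analogy maps onto top-k expert routing, which fits in a few lines. The gate scores and the stand-in experts below are invented for illustration; a real MoE layer learns both, and production routers add load balancing on top.

```python
# Toy top-k mixture-of-experts routing: the gate scores every expert,
# but only the k best actually run, so compute scales with k, not with
# the total number of experts.

def route(gate_scores, k=2):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and average their outputs."""
    active = route(gate_scores, k)
    return sum(experts[i](x) for i in active) / k, active

experts = [lambda x, i=i: x + i for i in range(8)]   # 8 tiny stand-in experts
out, active = moe_forward(10.0, experts,
                          gate_scores=[0.1, 0.9, 0.2, 0.8, 0.1, 0.0, 0.3, 0.2])
print(out, active)   # only experts 1 and 3 ran: 2 of 8, i.e. 25% of the compute
```

That per-query "open only a few books" behavior is exactly where the cost reduction comes from: the model keeps the capacity of all the experts while paying, per token, for only a few of them.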

Luo Fuli found such a balance point between performance and cost.

At Xiaomi, known as the "king of cost-performance", Luo Fuli carried the DeepSeek spirit through to the end. She led her team, jointly with Peking University, in developing the resource-management system ARL-Tangram, cutting the model's computing cost by 71.2%.

Lower cost did not mean lower performance, though. The trillion-parameter flagship model MiMo-V2-Pro built on this technology ranks eighth globally and second in China on the Artificial Analysis comprehensive intelligence leaderboard for large models.

Luo Fuli has proved one thing: cost-performance is not a fluke of one project, but a methodology that can be replicated across platforms.

Zeng Yan's balance point is the one between dynamism and stability discussed above: making a video generation model that tells a good story while keeping visual tension and impact.

The difference between the two lies in their career plans.

Luo Fuli jumped from Alibaba to High-Flyer Quant and then to DeepSeek, a path "from big company to startup, from engineering application to model research".

Zeng Yan, by contrast, has stayed deep inside ByteDance, rising in five years from campus recruit to the 4-2 rank.