
Three top AI experts made a rare appearance on the same stage to discuss the biggest "Rashomon" in the AI industry.

阿菜cabbage · 2025-05-28 19:57
Last year, the industry held a strong consensus; this year, everyone is looking for non-consensus.

Text by | Zhou Xinyu

Edited by | Su Jianxun

Is pre-training still king? In today's AI industry, this is the biggest "Rashomon".

In 2023, model pre-training was still the first principle. However, from Ilya Sutskever, OpenAI's former chief scientist, publicly declaring that "pre-training has reached its end," to the emergence of DeepSeek R1 with its focus on reinforcement learning, the signs indicate that pre-training is no longer in the limelight.

From being widely pursued to falling out of favor, the shifting fortunes of pre-training epitomize the constant flow between "consensus" and "non-consensus" in the AI industry.

On May 27, 2025, Ant Group organized an exchange on the consensus and non-consensus around AI technology at its "Technology Open Day".

The participants in the round-table dialogue are among the hottest entrepreneurs, technology executives, and scholars of the moment:

Cao Yue, founder of the video model company Sand.AI and former co-founder of Beyond Light Years. After completing two rounds of financing in July 2024, the company was valued at over $100 million;

Lin Junyang, technical lead of Alibaba's Tongyi Qianwen (Qwen). From M6, released in 2021, to Qwen3 in 2025, he has been the de facto head of the models;

Kong Lingpeng, assistant professor at the University of Hong Kong and co-director of its NLP Lab. Dream 7B, the diffusion reasoning model he led, has become the new SOTA among open-source diffusion language models.

△Image source: Taken by the author

In a sense, both Cao Yue and Kong Lingpeng have reaped rewards from exploring AI's "non-consensus". Each tried applying the mainstream training architecture of one domain, language or vision, to the other:

By applying the Diffusion Model, the mainstream multimodal architecture, to language modeling, Dream 7B, which Kong Lingpeng helped develop, used only 7B parameters to outperform the 671B-parameter DeepSeek V3 on multiple tasks.

Conversely, Cao Yue applied the autoregressive route that is mainstream for language models to video model training, enabling generated videos to be extended without a length limit.

Their experiences represent the most attractive aspect of today's AI industry: embracing non-consensus and achieving innovation.

In contrast, Alibaba gives the outside world the impression of embracing consensus. For a long time, Qianwen has only released Dense models, which were once the mainstream. It wasn't until February 2025 that the team launched its first MoE model.

As the person in charge, Lin Junyang often hears criticism from the outside: "Is Qianwen too conservative?" At the round-table, he clarified: "We are not conservative. We just ran a lot of experiments, and they failed. It's really a sad thing."

This is also another aspect of the AI industry: consensus often represents the successful experiences of the majority.

In 2023, when Alibaba was training the Qianwen large model, Lin Junyang described that they had "drastically modified" the Transformer architecture internally many times. However, they finally found that Transformer was still the optimal solution.

Of course, one change that all three of them have felt is this: last year, people still believed in strong consensus, but this year, everyone has started to look for non-consensus.

Lin Junyang offered an analogy for the current industry: everyone is exploring in different directions to see who can hit the jackpot. Kong Lingpeng holds a similar view: "Although everyone seems to be on opposite paths, in fact there is no contradiction."

One example: whether it's Kong Lingpeng doing diffusion on top of language models or Cao Yue doing autoregression on video models, the goal is to balance Model Bias and Data Bias to achieve better results.

Moreover, a new non-consensus on pre-training has recently emerged in the United States: pre-training is not over yet. Lin Junyang currently sides with this new non-consensus. He revealed: "We still have a lot of data that hasn't been put into (Qianwen). Every time we put in new data, there is an improvement."

The following is "Intelligent Emergence's" compilation of the round-table content, edited for clarity:

Qianwen is not conservative. It's just that a lot of experiments have failed

Zhou Jun (nickname: Xiting), head of the Ant Bailing large model: What is the thinking behind using the diffusion model for language generation?

Kong Lingpeng: When you don't understand your data, don't make more assumptions about the data. Let the model take over more things. This is the reason why we use the diffusion model for language model training.

Some data has a left-to-right bias (a prior that content is produced from left to right). For example, 3 + 3 = 6: you can't write the 6 first and then fill in the steps before it. Other data, by contrast, can't be handled purely left to right. For example, I have three meetings: Meeting A is after Meeting B, and Meeting B must be at noon.

Take text diffusion models like Gemini Diffusion as an example: they make fewer assumptions than a left-to-right autoregressive model. They can learn bidirectionally and can also handle tasks in parallel.
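The order freedom Kong describes can be sketched with a toy example. This is not Dream 7B's or Gemini Diffusion's actual algorithm, just a hypothetical illustration of the one liberty a diffusion-style language model gains: masked positions may be committed in any order, while an autoregressive decoder must proceed strictly left to right (`oracle` stands in for the model's prediction at each position).

```python
import random

def any_order_fill(masked_seq, oracle):
    """Fill '_' slots in an arbitrary order, as a discrete-diffusion-style
    LM may; an autoregressive LM is restricted to left-to-right order.
    `oracle` stands in for the model's per-position prediction."""
    seq = list(masked_seq)
    slots = [i for i, tok in enumerate(seq) if tok == "_"]
    random.shuffle(slots)  # any position may be committed first
    for i in slots:
        seq[i] = oracle[i]
    return seq, slots

oracle = ["3", "+", "3", "=", "6"]
filled, order = any_order_fill(["_"] * 5, oracle)
# `filled` matches the target even when `order` is not 0, 1, 2, 3, 4.
```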

Xiting: Please share the technical challenges faced by the mainstream architectures in the multimodal field based on your practice.

Cao Yue: In a sense, language and video are quite similar. They both have a relatively strong causal prior in the time dimension, that is, the causal relationship in time.

Sora, released at the beginning of last year, actually doesn't have this prior. The 5-second video it generates is directly modeled by a single model.

Our own feeling is that the causal prior of video timing is still very strong. People watch videos in a certain order. Why is there an order? Because the storage method of video information is sequential.

If we can design a training method that can encode the sequential relationship in the time dimension, we may be able to extract more information from the video, thereby raising the ceiling of the model.

Xiting: Share your changing perception of the Transformer architecture and how you view the current challenges faced by Transformer.

Lin Junyang: I have a deep feeling about Transformer because Transformer emerged soon after I started working in this field. So personally, I'm quite grateful to Transformer.

Along the way, we've tried to change many things, but ultimately found that Transformer is really useful. In 2022, everyone was making detailed changes, such as modifying Transformer's activation function. The feeling then was that Google is really strong, and PaLM (Google's Transformer-based model) is really effective.

Especially in 2023, when we first started working on Qianwen, we struggled a lot at the beginning. Some of you may have used our early models; there were so many variants. After a long time, we found that the basic model structure can't be changed casually. So I think there's a bit of mystery in it.

There's a criticism of Qianwen, saying that we are relatively conservative. Actually, we are not conservative. We've done a lot of experiments, and they all failed. It's a sad thing.

Another thing worth noting is the MoE model. We started working on MoE in 2021 with the M6 model. At the time, we found that MoE could scale well, but the model was not strong.

MoE is still worth exploring because, to put it bluntly, commercial companies today want both effectiveness and efficiency. For the architectures we are exploring now, there is no solid conclusion yet; we are still running experiments and can see both the advantages and the disadvantages.

But I think it's a good direction, because MoE does have the potential to achieve infinite context. However, on some common long-sequence tasks, such as everyday programming or extraction tasks, it is sometimes not as good as other solutions.

So, currently, while working on Transformer, we also pay attention to MoE.

Of course, we are also paying attention to the direction of Professor Kong, Diffusion LLM (diffusion language model). This is another line. Currently, it seems that the diffusion language model performs really well in mathematics, code, and reasoning tasks.

This is quite unexpected because when we did various autoregressive experiments back then, we failed in related tasks. But now the diffusion model performs well. However, there is still a relatively large room for improvement in its general ability.

I think everyone is exploring in different directions to see who can hit the jackpot.

Now, the cost of each bet is getting higher and higher

Xiting: What kind of model optimization methods is the industry currently focusing on? Which directions do you think have the greatest potential for efficiency optimization?

Lin Junyang: Everyone watches DeepSeek's every move. When we saw that DeepSeek could achieve such a large sparsity ratio (the ratio of activated experts to total experts) for MoE, we were quite surprised.

But actually, we have also achieved a similar sparsity ratio. At the time, we tested the model's efficiency and effectiveness to see whether it could stay efficient as it grew larger. A sparsity ratio of 1:20 generally gave better experimental results, while 1:10 was the more conservative option, so we mainly operate within that range.

But DeepSeek may have done better, with a sparsity ratio of over 1:20.
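The sparsity ratios discussed here can be made concrete with a small sketch. The expert counts below are illustrative assumptions, not Qwen's or DeepSeek's actual configurations; the routing shown is the standard top-k idea, reduced to NumPy.

```python
import numpy as np

def sparsity_ratio(active: int, total: int) -> float:
    """Activated experts over total experts, as defined in the discussion."""
    return active / total

def topk_route(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Pick the k experts with the highest router scores for one token."""
    return np.argsort(router_logits)[-k:]

# Hypothetical config: activating 8 of 160 experts gives a 1:20 ratio,
# so each token touches only 5% of the expert parameters.
ratio = sparsity_ratio(8, 160)
chosen = topk_route(np.random.randn(160), k=8)
```

The sparser the routing, the cheaper each forward pass relative to total parameter count, which is exactly the effectiveness-versus-efficiency trade Lin describes.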

MoE is worth further exploration. The more experts there are and the sparser the model, the worse training stability inevitably becomes; correspondingly, we need optimizations for training stability.

But when it comes to model structure, we have to weigh things more carefully today. An architecture may be very friendly to pre-training but very unfriendly to reinforcement learning, which brings a lot of difficulties. So now, the cost of each bet is getting higher and higher.

At the same time, the model structure itself also needs to account for the long-sequence problem in advance.

So I think for the joint optimization of effectiveness and efficiency, one is to see if the model can become larger and sparser, and the other is whether it can support longer sequences. At the same time, during training, the training of long sequences should not become slower.

Xiting: How can breakthroughs be achieved through architectural innovation in the multimodal field?

Cao Yue: In 2021, we also made some "drastic modifications" to Transformer and did a project called Swin Transformer. At the time, it did quite well in the computer vision field.

But looking deeper, when people "drastically modify" Transformer, they are usually changing the prior. A crucial question in that process is: will your prior lower the ceiling of the model's effectiveness?

One dimension of exploration is how to add appropriate prior sparsity to attention to reduce computational complexity. I think this is a thing with a high ROI (return on investment).

Another dimension is that the multimodal field often involves the fusion of multiple different token types. If appropriate sparsity is applied during attention in this process, the efficiency of cross - modal fusion can be significantly improved.

Another dimension is how to achieve end-to-end optimization from tokenization to joint modeling.
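The first dimension above, adding prior sparsity to attention, can be illustrated with its simplest instance, a sliding-window mask. This is a generic sketch of the idea, not Sand.AI's method; `window` is a hypothetical parameter.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where each position may attend only to positions
    within `window` steps of it, cutting attention cost from O(n^2)
    toward O(n * window)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(seq_len=8, window=2)
# Entries outside the band are dropped (in practice, set to -inf
# before the softmax over attention scores).
```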

Xiting: How to improve the interpretability of Transformer and reduce hallucinations?

Kong Lingpeng: I'd like to reply to Cao Yue first. I think although everyone seems to be on opposite paths, in fact, there is no contradiction.

What we are doing is actually finding the bias that best fits the data. Alternatively, I can trust that my model can do without any bias, but that places higher demands on the model.

Speaking of the interpretability and hallucinations of the model, whether Transformer should take the blame is debatable.

I also want to ask Junyang one thing. There is a saying that the reinforcement learning paradigm may not be good news for the "hallucinations" of the model in the later stage because it may learn some wrong reasoning patterns.

Did you notice such a phenomenon in Qwen 3 and Qwen 2.5?

Lin Junyang: I have to admit our weakness. We really can't control "hallucinations".

So now we need to solve several problems. One is how to reduce "hallucinations", and we try to solve it through reinforcement learning.

Another thing relates to "hallucinations" or interpretability. We are currently researching SAEs (sparse autoencoders) and found that some problems may be closely tied to particular features. So through the SAE, we find those features and suppress them.
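The find-and-suppress loop Lin describes can be sketched as follows. This is a toy: the randomly initialized weights stand in for a trained SAE, and the dimensions are hypothetical (real SAEs use far wider feature dictionaries); only the encode / zero-out / decode pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 64, 512  # hypothetical sizes

# Random weights stand in for a trained sparse autoencoder.
W_enc = rng.normal(0, 0.1, (d_model, d_feat))
W_dec = rng.normal(0, 0.1, (d_feat, d_model))

def encode(x: np.ndarray) -> np.ndarray:
    """Map a residual-stream activation to non-negative sparse features."""
    return np.maximum(x @ W_enc, 0.0)

def decode(f: np.ndarray, suppress=()) -> np.ndarray:
    """Reconstruct the activation, zeroing the suppressed feature
    indices ("find some features and suppress them")."""
    f = f.copy()
    for i in suppress:
        f[i] = 0.0
    return f @ W_dec

x = rng.normal(size=d_model)
feats = encode(x)
x_edit = decode(feats, suppress=[3, 17])  # indices chosen arbitrarily
```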

Even if there is a "hallucination" problem in reinforcement learning, it's not terrible. It depends on how to solve it next.

Kong Lingpeng: First of all, an architecture should be considered together with the hardware. After one architecture, new problems will appear, and with them new architectures. For example, some architectures are not suitable for reinforcement learning.

My feeling is that we shouldn't use a fixed pattern like "GPU + autoregression/Transformer + pre-training/SFT (supervised fine-tuning) + reinforcement learning" to consider all things.

Lin Junyang: Things may change in the future. Mainly, we have no choice but to use GPUs for training.

I asked a friend who knows about hardware. He said that GPUs are not very suitable for training Transformer, and I can't build one myself.

But our company can do it; in fact, China has a real chance at hardware-software integration. So in the future, we can think about the problem more deeply.

Creation is actually a search-level problem

Xiting: It seems that the marginal effect of pre-training has started to decline. How can we break through the current bottleneck through architectural innovation?

Lin Junyang: First, I have doubts about the claim of reaching the bottleneck.

Because last year, it became a consensus that pre-training was coming to an end. But this year, everyone is crazy about non-consensus. Now, a new non-consensus has emerged in the United States, saying that pre-training is not over yet.

I don't know whether to be happy or not. In this line of work, I also know where I'm lacking. Anyway, there is a lot to make up for.

If you think Qianwen is doing okay, then I think pre - training has great potential. Because I can say that we still have a lot of data that hasn't been put in. Every time we put in new data, there is an improvement. If we make a slight change to the model and make it a bit larger, the performance improves again.

Xiting: In the multimodal field, what are the points worth paying attention to in the next-generation architecture?

Cao Yue: I very much agree with what Junyang said. Last year, it was said that pre-training was coming to an end and the language data was almost used up, but there is still great potential in image and video data. This is my initial feeling.

Another dimension is that many components still in common use will carry into the next-generation architecture. After a while, we can take them out and examine whether they remain useful, or whether they actually rely on some prior we usually don't notice.

If we look at the development history of the past decade, it is actually a process of increasing computing power and decreasing Bias in the entire training process.

Now that we have new computing power, when the computing power is more abundant than before, we can take out some technologies that were not very