
A recent interview with Zhuang Liu of Princeton University has attracted 100,000 views: architecture isn't that important; data is king.

QbitAI 2026-04-29 12:14
Memory is the biggest bottleneck for AI, and agents are just a stopgap measure.

Over 100,000 citations. An alumnus of Tsinghua University's Yao Class, an author of papers such as ConvNeXt, ImageBind, and "Transformers Without Normalization" —

Zhuang Liu, an assistant professor at Princeton University, is a rather special figure in academia. Almost every one of his papers questions some taken-for-granted assumption.

Is the architecture really important? Is the dataset truly diverse enough? Is the normalization layer necessary? Do large language models have a world model? Can AI agents replace doctoral students?

On the latest episode of the podcast "The Information Bottleneck", Zhuang Liu spoke with hosts Ravid Shwartz-Ziv and Allen Roush for more than an hour, answering these questions.

Zhuang Liu gave several core judgments (TL;DR version):

1. The choice of architecture is not as important as you think.

As long as you get the four basic elements right (residual connections, self-attention, normalization layers, and linear layers), whether you use a ConvNet or a Transformer, you will ultimately end up on the same performance curve.

What has truly driven the progress of AI in the past decade is, to a large extent, the scale of data and computation, rather than just architectural innovation.

2. The datasets are far less diverse than we thought.

Together with Kaiming He, he ran an experiment: train a neural network to determine which dataset an image came from.

The result: on three so-called "diverse" datasets containing hundreds of millions of images, the classifier's accuracy exceeded 80% —

This shows that these datasets are still clearly distinguishable in the eyes of the model and are far from a "bias-free global distribution".
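For a concrete picture, here is a minimal, illustrative sketch of what such a dataset-origin classifier could look like. The class names, toy model, and settings below are hypothetical stand-ins, not the actual setup used in the experiment.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, ConcatDataset, Dataset

# Assume three image datasets, each yielding (image_tensor, label) pairs.
# We relabel every sample with the index of the dataset it came from.
class OriginLabeled(Dataset):
    def __init__(self, base, origin_id):
        self.base, self.origin_id = base, origin_id
    def __len__(self):
        return len(self.base)
    def __getitem__(self, i):
        image, _ = self.base[i]        # discard the original label
        return image, self.origin_id   # new label = which dataset it came from

def make_origin_loader(datasets, batch_size=256):
    labeled = [OriginLabeled(d, i) for i, d in enumerate(datasets)]
    return DataLoader(ConcatDataset(labeled), batch_size=batch_size, shuffle=True)

# Any standard image backbone works; a tiny ConvNet is enough to illustrate the idea.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 3),                  # 3 classes = 3 candidate source datasets
)
# Training then proceeds as ordinary cross-entropy classification.
```

If the pooled data really came from one indistinguishable distribution, such a classifier would hover near chance (about 33% for three datasets); the 80%-plus accuracy reported above says otherwise.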

3. Large language models have a world model, but only in the language space.

LLMs perform well in high-level event reasoning, but we still don't have a fine-grained world model for the visual space —

The fundamental reason is that the information density of visual data is too high, and the current computing power cannot handle it.

Moreover, for more than half of work scenarios (especially digital white-collar jobs), a visual world model is not needed at all.

4. Memory is the current biggest bottleneck, not ability.

The reasoning ability of existing models is already strong enough. What is really lacking is stable long-term memory.

The reason we need so many agents to collaborate is precisely because a single agent cannot remember everything.

5. Autonomous research is not yet up to par, and AI cannot replace graduate students.

He personally tested whether Claude Code could independently complete a research project in one or two days.

The conclusion is that it can handle low-level tasks, but it cannot propose interesting questions, design experiments, or maintain a sense of direction.

A thread runs through the interview: many things we treat as gospel in AI are actually historical accidents.

What really determines success or failure is often the simpler, more boring factors — data, scale, and memory.

The following is QbitAI's summary of Zhuang Liu's latest interview. For readability, some passages have been trimmed and polished, and editor's notes are added where necessary. Enjoy!

The architecture is not that important, but details determine everything

Editor's note: Around 2020, a "Transformer boom" swept computer vision: Google Brain proposed the Vision Transformer (ViT) in 2020, the entire vision community quickly migrated to it, and traditional convolutional neural networks (ConvNets) were widely considered outdated. In 2022, Zhuang Liu's team published ConvNeXt, step by step "modernizing" the classic ResNet architecture until its performance caught up with the strongest Vision Transformers of the time. The conclusion was surprising: the gap between the two came not from the architecture itself, but from differences in design details and training recipes.
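For readers who want a concrete picture of what "modernizing" a ConvNet block means, here is a simplified sketch of a ConvNeXt-style block in PyTorch, written from the paper's published description; details such as LayerScale and stochastic depth are omitted, so this is an illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Simplified ConvNeXt-style block: a large-kernel depthwise convolution
    followed by an inverted-bottleneck MLP, with a single normalization layer,
    a single activation, and a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # one norm per block, not per conv
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # expand (inverted bottleneck)
        self.act = nn.GELU()                     # GELU instead of ReLU
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return shortcut + x                      # residual connection
```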

Ravid: Today we'll talk about some of your papers. Broadly, we want to explore which components in today's AI truly matter. You have a lot of research results; I think we can start with "which components are the most crucial".

A few years ago, you published the paper "A ConvNet for the 2020s". Can you first introduce this paper, and then we'll break down the various components of current AI systems?

Zhuang Liu: Well, of course. It was a very interesting experience.

We wrote this paper in 2021. At that time, the Transformer had just entered computer vision through the Vision Transformer, and the entire vision community was switching from traditional convolutional networks to Vision Transformers, with performance improving steadily.

In this work, we wanted to study: Has ConvNet really lost its competitiveness?

Could we, by systematically controlling all the design details, verify whether a ConvNet could be modernized to reach the level of the Vision Transformers of that time?

We wanted to figure out whether the apparent performance gap between the Transformer and ConvNet came from fundamental architectural differences — such as using self-attention versus convolution — or from seemingly minor design details.

Ultimately, we found that the answer was the latter.

After a lot of research on the components of ConvNet, we finally made the model reach the level of the then-strongest Vision Transformer on various tasks.

This shows that whether you choose ConvNet or Vision Transformer, as long as you get all the details right, you can achieve the same state-of-the-art performance on vision tasks.

Ravid: Do you still believe this now? Do you still think the architecture is actually not that important?

Zhuang Liu: I wouldn't put it quite that way. Generally I agree, but I wouldn't say the architecture is unimportant.

What I mean is that as long as you get all the details right and explore the design space sufficiently, you will converge to a point similar to the "Pareto front" — achieving the best balance between accuracy and efficiency.

It is very difficult to break beyond this frontier.

I think that in the past few years, beyond the architectures that matured earlier, few architectural innovations have actually been widely adopted.

However, the exploration process itself is very interesting.

Recently, some open-source model companies, such as Kimi and DeepSeek, have kept tinkering with architectures: how to modify residual connections, how to connect different layers. I really respect that kind of work.

In fact, part of the reason architectural research in academia is not very active now is that we can't afford the compute required to verify these effects at a convincing scale.

But I still use the university's resources to experiment. Now, with the help of Claude Code, I can write the code myself and explore, which is a lot of fun.

From a practical point of view, I think what data we use to train the model is more important than which architecture we choose — as long as the input-output interfaces remain the same.

The architecture is essentially how we parameterize a function approximator, which is the most basic role of neural networks and deep learning.

As long as you get a few things right, such as using residual connections, self-attention or other reasonable mechanisms, and placing activation functions and feed-forward layers in the right positions, you can get very close to or even reach the frontier curve of performance and efficiency.
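Those ingredients can be sketched in one generic pre-norm Transformer block, shown below; this is a textbook-style illustration, not the architecture of any particular model he mentions.

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """Generic pre-norm Transformer block: the 'few things to get right' are
    residual connections, an attention (or other mixing) layer, normalization,
    and a feed-forward stack of linear layers with an activation in between."""
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                    # x: (batch, seq_len, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.norm2(x))                     # residual around feed-forward
        return x
```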

From a practical application perspective, I think what's more important is: what data was this model trained on? How does it handle context and memory?

There are indeed some architectural works addressing the issues of context and memory.

I think this is the most urgent problem to be solved to take AI to the next level.

Allen: As I understand it, you gradually modernized ResNet toward a design similar to the Swin Transformer's and ended up with a ConvNet that competes strongly with the Transformer.

In that paper, which ablation experiment most changed your view on "where the advantages of the Transformer come from"?

Editor's note: An ablation study is a common method in deep-learning research: components of the model are removed or changed one at a time, and the resulting change in performance is observed to determine each component's contribution.
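Schematically, such a study can look like the sketch below; train_and_evaluate is a hypothetical placeholder for the real training pipeline, and the listed design choices are illustrative rather than the paper's exact recipe.

```python
from copy import deepcopy

# Flip one design choice at a time (here cumulatively, as in a "modernization"
# roadmap) and record the resulting accuracy for each configuration.
baseline_config = {
    "activation": "relu",
    "norm_layers_per_block": 2,
    "kernel_size": 3,
    "stage_ratio": (3, 4, 6, 3),
}

changes = [
    ("activation", "gelu"),
    ("norm_layers_per_block", 1),
    ("kernel_size", 7),
    ("stage_ratio", (3, 3, 9, 3)),
]

def train_and_evaluate(config):
    # Placeholder: a real study would train the model described by `config`
    # and return its validation accuracy.
    return 0.0

config = deepcopy(baseline_config)
results = {"baseline": train_and_evaluate(config)}
for key, value in changes:           # cumulative: each change stays applied
    config[key] = value
    results[f"+{key}={value}"] = train_and_evaluate(config)
print(results)
```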

Zhuang Liu: Which one? I think it's every one of them.

Look at that figure: no single change significantly boosts the performance. Some changes are more effective than others, but none of them changes everything on its own.

(Figure: Figure 2 of the ConvNeXt paper, showing the complete process of modernizing ResNet and the performance change at each step.)

If I had to pick, the change of activation function and the reduction in the number of normalization layers are the points that interest me most and brought clear performance gains.

But what really works is combining all the changes.

These seemingly minor components, combined, produce a performance difference of the kind usually attributed only to major changes such as replacing convolution with self-attention.

So I think the biggest takeaway from this paper is that the combination of these small details matters more than the seemingly core network components.

Ravid: For me, it feels like we're trying a lot of things, and when some of them work, the model gets better. Then, looking back, we start to really understand which components are crucial.

Do you think we need to have a breakthrough first and then understand the details? Or do we just need to keep trying and making mistakes without a clear direction?

Zhuang Liu: The Transformer is definitely a blessing for the entire community. Introducing the Transformer into computer vision is of great significance.

It was definitely one of the most important breakthroughs in those years.

But another benefit of the Vision Transformer is that it unifies text and image representations.

Using the Transformer was crucial for later developments such as multimodal frameworks like LLaVA — a visual encoder encodes an image into tokens, which are then fed, together with text tokens, into a downstream large language model.

This is the basic framework of many current multimodal models.

Editor's note: LLaVA (Large Language and Vision Assistant) is a multimodal large language model framework proposed in 2023, which connects an image encoder (usually CLIP) and a large language model (such as LLaMA), enabling the model to understand both images and text. This framework has become the basic idea for later multimodal models such as GPT-4V and Gemini.
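The token-level interface described above can be sketched as follows; every module here is a toy stand-in (not the actual CLIP encoder or LLaMA weights), just to show how image tokens and text tokens are concatenated before entering the language model.

```python
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    """Toy LLaVA-style interface: a vision encoder turns an image into patch
    features, a small projection maps them into the language model's embedding
    space, and the result is concatenated with the text token embeddings."""
    def __init__(self, vision_dim=768, lm_dim=512, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for a ViT encoder
        self.projector = nn.Linear(vision_dim, lm_dim)           # vision -> LM embedding space
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        self.lm = nn.TransformerEncoder(                         # stand-in for the LLM
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, patch_features, text_ids):
        # patch_features: (batch, num_patches, vision_dim); text_ids: (batch, seq_len)
        image_tokens = self.projector(self.vision_encoder(patch_features))
        text_tokens = self.text_embed(text_ids)
        tokens = torch.cat([image_tokens, text_tokens], dim=1)   # image tokens + text tokens
        return self.lm(tokens)
```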

Back to our research: this in-depth analysis of the details is more like a lesson. It changed my own perception and that of many others, which makes me even prouder of it.

Of course, people can still use ConvNet, which also has its own advantages, especially in pure vision tasks: it is easy to deploy, relatively easy to understand, and because the operations are local, it has better support for higher resolutions and long sequences.

The two architectures are just good at different things.

Ravid: Okay, the architecture is not that important. You also have a more recent paper proving that the normalization layer is not that important, right?

Basically, with just a few adjustments, you can replace the normalization layer with a hyperbolic tangent activation, and it works just as well.

So what do you think the truly important core components are? And why have good AI models only emerged in the past five years, not ten years ago?

Editor's note: Here it refers to the paper "Transformers Without Normalization", published in 2025 by Zhuang Liu in collaboration with Yann LeCun and others. Normalization layers are nearly ubiquitous in modern neural networks, the most common being LayerNorm, whose role is to stabilize training and accelerate convergence. The paper replaces LayerNorm with an activation function called "dynamic tanh" and still matches or exceeds the standard Transformer across a variety of settings.
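Based on the paper's description, the replacement can be sketched roughly as below: an element-wise tanh whose input is scaled by a learnable scalar, followed by the usual per-channel affine. Consult the paper for the exact formulation and initialization; this is an illustration only.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Rough sketch of a 'dynamic tanh' layer as a drop-in replacement for
    LayerNorm: tanh(alpha * x) with a learnable scalar alpha, then a
    per-channel scale and shift (like LayerNorm's affine parameters)."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))          # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))           # per-channel shift

    def forward(self, x):            # x: (..., dim)
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```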

Zhuang Liu: This is a good question.

First of all, the Transformer was proposed about ten years ago, nine years ago to be precise.

So for a long time after that, we have essentially kept the same basic framework, with only minor changes (activation layers, mixture-of-experts, which is not always used, local attention, sliding-window attention, and so on), but the core framework is basically the same as when the paper came out nine years ago.

So my answer is: data and the scale of computation used during training.

This is like the classic story from GPT-1 to GPT-3 — basically the same model, trained with more computation, more data, more diverse data, and larger-scale Internet data, and we get the powerful capabilities we see now.

So I would attribute this to data, followed by computing power.

I think data is the main factor because most models are currently trained for no more than one epoch.