
A team led by Xie Saining has published a new paper that grew out of an argument on Twitter. Their new method, iREPA, requires only three lines of code.

QbitAI (量子位), 2025-12-16 17:39
An academic paper triggered by a tweet

If you want to talk about real academia, you have to look at Twitter.

Just now, Xie Saining revealed that his team's new work iREPA actually originated from a debate with a netizen more than four months ago.

Although this brief online debate ended with Xie Saining being convinced by the netizen, more than three months later there was an unexpected follow-up:

Multiple teams collaborated on a complete paper along this line of thought, and the core framework requires only three lines of code.

The acknowledgment section also thanked the netizens who participated in the discussion at that time.

An Academic Paper Triggered by a Tweet

Here's what happened.

A netizen said in August:

Stop being obsessed with the classification scores on ImageNet-1K! Self-supervised learning (SSL) models should be trained specifically for dense tasks (such as REPA, VLM, etc.), because these tasks truly rely on the spatial and local information in patch tokens, rather than the global classification performance represented by the [CLS] token.

(Note: dense tasks are computer vision tasks that require the model to make a prediction for every pixel or every local region of an image. They demand precise spatial and local detail, not just a global classification label.)

Regarding the netizen's view, Xie Saining said:

No, using patch tokens doesn't mean you're doing a dense task. The performance of VLM and REPA is highly correlated with their scores on IN1K, and only weakly correlated with patch-level correspondence. This isn't a problem with the [CLS] token, but rather the difference between high-level semantics and low-level pixel similarity.

In response to Xie Saining's rebuttal, the netizen cited the example that SigLIPv2 and PE-core outperform DINOv2 for REPA.

Meanwhile, another netizen joined the fray:

That's a fair question. For a direct comparison, in the absence of an early DINOv3 checkpoint, perhaps REPA could be used to compare PEspatial and PEcore. PEspatial can be understood as aligning the Gram anchor of PEcore to an earlier network layer and combining it with SAM2.1.

In response, Xie Saining said:

Very good! Thanks for the hint. I really like this setup; otherwise there would be too many confounding factors. We already have both checkpoints (G/14, 448 resolution). I hope we can get some results soon.

More than three months later, Xie Saining said that his earlier judgment did not hold up, and that the resulting paper brought a deeper understanding.

He also dropped a considerate hint: netizens should take a look at the acknowledgment section.

One of the netizens who participated in the discussion said it was interesting to be mentioned in the acknowledgment:

Thank you for following up all the way! I'm really flattered to be mentioned in the acknowledgment.

Xie Saining also said that this discussion itself was a small experiment: he wanted to see whether a new kind of "online water-cooler effect" could really happen.

He enjoys this process: first disagreement and argument, then pulling intuition back to verifiable scientific conclusions through real experiments and effort.

It has to be said that such open, immediate, and correctable academic discussions are really worth having more of.

Next, let's take a look at the latest paper spawned by this.

Spatial Structure, Not Global Semantics, Drives the Performance of Target Representations for Generation

Continuing from the above discussion, this latest paper explores a core fundamental question:

When using pre-trained visual encoder representations to guide a generative model, which part of the representation actually determines generation quality?

Is it the global semantic information (classification accuracy on ImageNet-1K) or the spatial structure (i.e., the pairwise cosine similarity between patch tokens)?

The paper's conclusion: better global semantics does not mean better generation. Spatial structure, rather than global semantics, is what drives a representation's generation performance.

The conventional view (one Xie Saining himself held) is that representations with stronger global semantic performance lead to better generation, but the research shows that a larger visual encoder may actually produce worse generation results.

Notably, a visual encoder with a linear-probing accuracy of only about 20% can outperform an encoder with an accuracy above 80%.

Moreover, trying to inject more global semantics into the patch tokens via the CLS token actually lowers generation performance.

Meanwhile, the research also found that representations with better generation performance tend to have stronger spatial structure, which can be measured by a spatial self-similarity metric:

That is, how strongly the tokens in one part of an image attend to the tokens in other regions of the same image.
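The spatial self-similarity described above can be sketched as the pairwise cosine-similarity (Gram) matrix over patch tokens. The function below is an illustrative sketch under that reading; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def spatial_self_similarity(patch_tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between patch tokens.

    patch_tokens: array of shape (N, D) -- N patch tokens, D channels.
    Returns an (N, N) matrix whose entry (i, j) is the cosine
    similarity between patch i and patch j.
    """
    # L2-normalize each token, then take the Gram matrix.
    norms = np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    normalized = patch_tokens / np.clip(norms, 1e-8, None)
    return normalized @ normalized.T

# Toy example: 4 patch tokens with 3 channels each.
tokens = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
sim = spatial_self_similarity(tokens)
print(sim.shape)            # (4, 4)
print(round(sim[0, 3], 2))  # identical tokens -> 1.0
```

Comparing two encoders' Gram matrices over the same image (rather than their raw features) is one way to quantify how similar their spatial structure is, independent of feature dimensionality.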

In terms of methodology, the paper refined and verified this observation through a large-scale quantitative correlation analysis covering 27 different visual encoders (including DINOv2, DINOv3, Perception Encoder, WebSSL, SigLIP, etc.) and 3 model scales (B, L, XL).

Further evaluations underscored the importance of spatial information: even classic hand-crafted spatial features like SIFT and HOG can deliver gains competitive with modern, larger-scale visual encoders such as PE-G.

Having established this conclusion, the paper analyzed and modified the existing representation alignment (REPA) framework and proposed iREPA, with two simple changes:

Projection layer improvement: Replace the standard MLP projection layer in REPA with a simple convolutional layer.

Spatial normalization: Introduce a spatial normalization layer for external representations.

These simple modifications (for example, the implementation under the DeCo framework) aim to retain and strengthen spatial-structure information, and significantly improve performance over the original REPA method.

It's worth mentioning that iREPA can be added to any representation alignment method with just three lines of code, and achieves consistently faster convergence across various training schemes (such as REPA, REPA-E, MeanFlow, and the recently released JiT).
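Of the two changes, the spatial-normalization step can be sketched in a few lines. The function name and the exact formula below are our assumptions (a per-channel standardization over patch positions), not the paper's actual three-line implementation, and the convolutional-projector swap is not shown:

```python
import numpy as np

def spatially_normalize(target: np.ndarray) -> np.ndarray:
    """Standardize each channel of the alignment target across spatial positions.

    target: (N, D) patch-token features from the frozen visual encoder.
    Subtracting the per-channel mean over the N positions and dividing by
    the per-channel std makes the alignment target emphasize *relative*
    spatial structure rather than absolute feature magnitudes.
    """
    mu = target.mean(axis=0, keepdims=True)
    sd = target.std(axis=0, keepdims=True)
    return (target - mu) / (sd + 1e-6)

rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=2.0, size=(16, 8))  # 16 tokens, 8 channels
normed = spatially_normalize(feats)
print(np.allclose(normed.mean(axis=0), 0.0, atol=1e-8))  # True
```

Applied to the encoder features before the alignment loss, a normalization like this leaves the pairwise similarity pattern across positions largely intact while removing channel-wise offsets, which is consistent with the paper's emphasis on spatial structure.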

Reference Links

[1] https://x.com/YouJiacheng/status/1957073253769380258

[2] https://arxiv.org/abs/2512.10794

[3] https://x.com/sainingxie/status/2000709656491286870

[4] https://x.com/1jaskiratsingh/status/2000701128431034736

This article is from the WeChat public account "QbitAI" (量子位), which covers cutting-edge technology. Republished by 36Kr with authorization.