
The unorthodox data path works: pre-training a multimodal large model on text data alone

QbitAI, 2026-03-03 15:58
Not only is it cheaper, it even beats the baseline

Can a multimodal large model be pre-trained without images?

In the research and development of multimodal large language models (MLLMs), the industry has long held an expensive consensus: without image-text pairs, there is no multimodal capability.

To teach a model to understand images, one must spend heavily to collect vast numbers of images and generate a high-quality description for each one. This one-to-one, strongly supervised data has long been regarded as the fuel of multimodal training.

However, ReVision, a new study from the Hong Kong University of Science and Technology (Guangzhou), NUS, and other institutions, reaches a counterintuitive conclusion: in the pre-training stage of multimodal large models, the stage most dependent on large-scale image-text pair data, those expensive pairing relationships are not actually necessary.

Theoretical foundation: Why can "representation alignment" replace "pairing"?

Before diving into the geometric details, it is worth clarifying the conditions under which ReVision holds. The method does not apply to any two independent feature extractors; it rests strictly on the joint representation space established by multimodal contrastive learning.

1. Pre-training has established "semantic topological consistency": after pre-training on massive data, dual-tower models such as CLIP and SigLIP have, via the InfoNCE loss, forced images and texts into the same high-dimensional embedding space. In this space the feature distributions of the two modalities do not fully overlap, but they already share a highly consistent semantic topology. That is, a visual embedding and a text embedding expressing the same semantics sit at some distance from each other in absolute position, yet keep the same relative distances to other semantic concepts.
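As a refresher on how this shared space arises, here is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of L2-normalized embeddings. This is an illustrative implementation of the loss family CLIP-style models use, not the actual training code of CLIP or SigLIP:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of L2-normalized embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # Cosine-similarity logits between every image and every text.
    logits = img_emb @ txt_emb.T / temperature
    # Matched pairs sit on the diagonal; treat them as the correct "class".
    targets = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[targets, targets].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls matched pairs together and pushes mismatched pairs apart, which is what produces the shared topology described above.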

2. The essence of the modality gap is a systematic geometric offset. As the paper points out, this non-overlap is not random chaos but a systematic offset: geometrically, the image distribution and the text distribution differ only by rotation, scaling, and translation.

Conclusion: contrastive learning has already solved the problem of semantic relevance; what remains is a misalignment of geometric distributions. We therefore do not need expensive paired data to relearn the semantic correspondence. It suffices to use the statistics of unpaired data to correct the first moment (mean) and second moment (covariance) of the text representation so that its distribution matches that of the image representation, achieving cross-modal interchangeability.
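The moment-correction idea can be sketched as a whitening-and-recoloring transform. This is a generic illustration of first/second-moment matching, not the paper's exact ReAlign recipe; the function name and the symmetric-square-root construction are our own choices:

```python
import numpy as np

def match_moments(text_feats, img_mean, img_cov):
    """Shift and reshape text features to match image statistics.

    Whitens the text features (removes their own mean and covariance),
    then recolors them with the image mean and covariance estimated
    from unpaired images.
    """
    t_mean = text_feats.mean(axis=0)
    t_cov = np.cov(text_feats, rowvar=False)

    def sqrtm(c):
        # Symmetric positive-definite matrix square root via eigendecomposition.
        w, v = np.linalg.eigh(c)
        return v @ np.diag(np.sqrt(np.clip(w, 1e-12, None))) @ v.T

    whiten = np.linalg.inv(sqrtm(t_cov))  # removes text covariance
    color = sqrtm(img_cov)                # imposes image covariance
    return (text_feats - t_mean) @ whiten @ color + img_mean
```

After the transform, the output's sample mean and covariance equal the supplied image statistics, while the relative arrangement of the text embeddings (their semantic topology) is preserved up to an affine map.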

Digging deeper: What does the modality gap really look like?

Given that only the geometric offset needs to be solved, what does that offset actually look like, and why is paired data unnecessary? Because the researchers found a major geometric misunderstanding in previous views of the modality gap.

To cross the gap, we first need to see its shape.

Past misunderstanding: The isotropic fallacy

Previous methods recognized the distance between images and texts in the shared representation space of contrastive pre-training, but they simply assumed this deviation was uniform: that the noise in the gap was a perfect sphere (isotropic), spreading evenly in all directions.

Alignment built on this assumption typically corrects only the offset of the center point while ignoring differences in internal structure, diluting the fine-grained semantics in the features.

Discovery: Anisotropy within a fixed framework

The ReVision team deconstructed this phenomenon at the micro level through a fixed-frame modality-gap theory. In a frozen reference frame, the gap decomposes into two precise geometric parts:

Stable bias: not just a positional offset, but also a passive, systematic drift caused by subspace rotation.

Anisotropic residuals: the most crucial discovery. The fluctuations inside the gap are not spherical but stretched like an ellipsoid (anisotropic).

In the semantic subspace, these fluctuations are tightly locked to the gradient direction and carry the core semantic information.

In the orthogonal subspace, the noise is distributed perpendicular to the bias. Forcibly simulating it with spherical noise produces a phantom drift, so that features projected onto the hypersphere point in the wrong direction.

Conclusion: in the shared representation space of contrastive pre-training, the modality gap is not a mess but a geometric structure with a specific aspect ratio and orientation. As long as we can accurately reproduce this anisotropic shape, we can faithfully simulate visual features.
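The claim that the gap is ellipsoidal rather than spherical can be checked empirically. The diagnostic below is our own construction, not from the paper: it measures the eigenvalue spread of the covariance of per-pair residual vectors, where a ratio near 1 means isotropic and a large ratio means anisotropic:

```python
import numpy as np

def anisotropy_ratio(img_emb, txt_emb):
    """Ratio of largest to smallest eigenvalue of the gap's covariance.

    img_emb and txt_emb are (n, dim) arrays of paired embeddings; the
    per-pair difference vectors form an empirical sample of the modality
    gap, and the spread of their covariance spectrum measures its shape.
    """
    gap = img_emb - txt_emb                 # per-pair residual vectors
    cov = np.cov(gap, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)       # ascending order
    return eigvals[-1] / max(eigvals[0], 1e-12)
```

Run on a dual-tower model's outputs, a ratio far above 1 would support the ellipsoid picture; injecting spherical noise would instead drive this ratio toward 1 and erase the structure.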

Core breakthrough: Breaking the data shackles of "one-to-one correspondence"

With precise control over the shape of the modality gap, the research team found a shortcut that bypasses expensive paired data during the pre-training stage.

Core logic: train the model on geometrically aligned representations. The team's premise is bold but geometrically intuitive: a large model does not really "look" at images; it looks at the distribution shape of features. If we can extract the geometric signature of image data by mathematical means and imprint those statistics onto pure text data, the text will be disguised as images in feature space.

Prerequisite: statistics replace pairing. If this logic holds, strongly supervised image-text pairs are no longer necessary for pre-training. Only two low-cost ingredients are needed:

1. A large amount of unpaired text: provides rich semantic knowledge.

2. The statistical distribution of unpaired images: provides the geometric mold of the "visual space".

Conclusion: once we know the statistical distribution of images, any text data in the world can be mathematically transformed into visual signals and fed to the model. Cheap text data can thus stand in for the expensive visual training process.

How is it done? Modality replacement that "treats shape with shape"

The research team proposed a strategy called ReAlign, a data-alignment method built on geometric principles:

Step 1: Anchor alignment

First, solve the most basic problem: position. The system computes the centroid of the image data in the embedding space and translates the center of the text data onto it. This eliminates the first-order bias.

Step 2: Trace alignment

This is the key step for anisotropy. Instead of injecting spherical noise as traditional methods do, the text features are stretched and rotated by a linear affine transformation according to the global trace of the image data.

This step ensures that while the text features retain their own semantic structure, they perfectly reproduce the anisotropic residuals of the visual features in terms of geometric scale and shape.

Step 3: Centroid alignment

Finally, to eliminate the phantom drift introduced by projection onto the unit hypersphere, the team applies an explicit second correction, ensuring the features are accurately aligned on the final manifold surface.

Result: after this sequence of operations, the features of pure text closely approximate real image features in their mathematical properties. The whole process requires no real images and no manually annotated paired data.
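Putting the three steps together, here is one plausible reading of the ReAlign pipeline in NumPy. This is a hedged sketch, not the authors' released code: we assume trace alignment is a global variance rescaling, and that the final centroid correction re-centers features after projection onto the unit hypersphere:

```python
import numpy as np

def realign(text_feats, img_feats):
    """Illustrative sketch of the three ReAlign steps described above."""
    # Step 1: anchor alignment - move the text centroid onto the image centroid.
    aligned = text_feats - text_feats.mean(axis=0) + img_feats.mean(axis=0)

    # Step 2: trace alignment - rescale fluctuations around the centroid so
    # their total variance (covariance trace) matches the image features'.
    centroid = aligned.mean(axis=0)
    t_trace = np.trace(np.cov(aligned, rowvar=False))
    i_trace = np.trace(np.cov(img_feats, rowvar=False))
    aligned = centroid + (aligned - centroid) * np.sqrt(i_trace / t_trace)

    # Step 3: centroid alignment - project onto the unit hypersphere, then
    # correct the drift the projection introduces in the mean direction.
    aligned /= np.linalg.norm(aligned, axis=1, keepdims=True)
    img_sphere = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    aligned = aligned - aligned.mean(axis=0) + img_sphere.mean(axis=0)
    return aligned
```

Note that only the mean and trace of the image features are consumed; no individual image needs to be paired with any individual text, which is the whole point of the method.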

Why are "unpaired texts" even stronger?

You may ask: if the goal is to understand images, why take the roundabout route of pure text instead of using image-text pairs directly?

This is ReVision's most subversive finding: at scale, the pairing relationship of data no longer matters; the knowledge density of the data is what counts.

1. Breaking through the crisis of data depletion

High-quality image-text pairs are limited, and cleaning them is extremely expensive. Unpaired text, by contrast, is nearly inexhaustible: every book and every paper on the Internet can now be turned into fuel for training multimodal models through ReVision.

2. A dimensionality-reduction strike in knowledge depth

Traditional image-text pairs often contain limited semantic information.

The unpaired long texts used in this research can be entire paragraphs rich in semantics, free of explicit image constraints. When the model learns visual concepts through such long texts, it absorbs not only the features of images but also the complex world knowledge and reasoning logic behind them.

3. Extreme cost-effectiveness

The experimental numbers are striking: a model pre-trained on 2 million pure texts (after the ReAlign geometric transformation) outperforms a baseline pre-trained on 1 million real image-text pairs.

More importantly, the former's pre-training data cost is only 74% of the latter's.

Conclusion

The emergence of ReVision has opened a new door for the training of multimodal large models.

It proves that we need not be bound by paired data. As long as we understand the geometric shape of the modality gap and make good use of the magic of statistics, the vast pool of pure text resources is the best visual teaching material. No expensive annotation, no one-to-one constraints: as long as there is text, AI can learn to understand the world.

Arxiv: https://arxiv.org/abs/2602.07026

GitHub: https://github.com/Yu-xm/ReVision.git

HuggingFace Daily Paper: https://huggingface.co/papers/2602.07026

For cooperation details: yuxm02@gmail.com

This article is from the WeChat official account "QbitAI", author: ReVision team, published by 36Kr with authorization.