LeCun's last paper at Meta
"LeJEPA: Provable and Scalable Self-Supervised Learning without Heuristics"
"This might be the last paper published by LeCun under the Meta banner."
Indeed, this paper with "Le" in its title introduces a self-supervised learning method. It was submitted to arXiv on November 11th and represents LeCun's latest public achievement.
On the same day, news of his departure from Meta was made public.
If LeCun's joining in 2013 marked the beginning of an era for Meta's AI research, then LeJEPA is his farewell work at Meta.
What kind of "last dance" is LeJEPA?
LeJEPA: A Self-Supervised Learning Method Based on Isotropic Gaussian Embeddings
The core of LeJEPA is a self-supervised learning method built on isotropic Gaussian embeddings. By introducing the SIGReg regularizer, it mitigates the representation collapse problem and significantly improves the model's generalization ability.
In the traditional JEPA framework, prediction tasks often face the problem of representation collapse.
This means that during training the model may map all inputs to a single point or a low-dimensional subspace, making samples indistinguishable in the embedding space and preventing it from capturing the semantic differences between them.
To avoid this, existing methods rely on heuristics such as stop-gradients, asymmetric view generation, and teacher-student networks. However, these are regarded as workarounds rather than principled solutions, because they are not grounded in a fundamental theory of JEPA.
Against this background, the research proposes a new JEPA framework - Latent-Euclidean Joint Embedding Predictive Architecture (LeJEPA), whose core is to make the embedding space follow a specific statistical distribution, thereby improving the model's prediction performance.
The Impact of Embedding Distribution
First, the research analyzes the impact of embedding distribution on bias and variance through ordinary least squares regression (OLS).
The results show that an isotropic Gaussian embedding distribution minimizes both the bias and the variance of downstream prediction.
In particular, for the same total variance, anisotropic distributions lead to higher bias and variance, while the isotropic Gaussian keeps both at their minimum, improving the stability and accuracy of downstream tasks.
Through experiments on nonlinear probing and geometric intuition, the research further verifies the superiority of the isotropic Gaussian distribution.
The experiments show that in both regression and classification tasks, the isotropic Gaussian distribution maintains the minimum error, while non-isotropic distributions exhibit higher variance.
The research shows that the isotropic Gaussian distribution is the optimal distribution for the embedding space. It can ensure the minimization of bias and variance without task information, thereby improving the performance of downstream tasks.
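As a rough numerical illustration of this argument (a hypothetical sketch, not taken from the paper), the snippet below compares a proxy for the variance of an OLS probe under isotropic versus anisotropic embeddings with the same total variance; the probe's coefficient covariance scales with (XᵀX)⁻¹, which blows up when some embedding directions carry almost no variance.

```python
# Hypothetical illustration (not from the paper): with equal total variance, an
# anisotropic embedding covariance inflates the variance of an OLS probe, since
# the coefficient covariance is proportional to (X^T X)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 8

def probe_variance_proxy(per_dim_variance):
    # Sample n embeddings with the given per-dimension variances.
    X = rng.normal(size=(n, d)) * np.sqrt(per_dim_variance)
    # trace((X^T X)^{-1}) tracks the summed variance of the OLS coefficient estimates.
    return np.trace(np.linalg.inv(X.T @ X))

iso = np.full(d, 1.0)                                          # total variance = 8
aniso = np.array([3.5, 2.0, 1.0, 0.5, 0.4, 0.3, 0.2, 0.1])     # also sums to 8
print("isotropic  :", probe_variance_proxy(iso))
print("anisotropic:", probe_variance_proxy(aniso))
```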
SIGReg: Regularization for Achieving Gaussian Distribution
To achieve the above distribution matching, the research proposes Sketched Isotropic Gaussian Regularization (SIGReg), a tractable and provably correct regularization method.
The innovation points of SIGReg are as follows:
- It reformulates distribution matching as a statistical hypothesis test whose null hypothesis is that the embeddings follow the target distribution.
- It provides a test that remains efficient during multi-GPU training and has bounded gradients and curvature.
- It sidesteps the curse of dimensionality in high-dimensional embedding spaces.
SIGReg uses univariate directional tests combined with the Epps-Pulley test to determine the matching degree between the embedding distribution and the target distribution (isotropic Gaussian distribution).
In other words, distribution matching is cast as a test of a null hypothesis against an alternative: a test statistic decides whether the null hypothesis (that the embeddings follow the target distribution) can be rejected, which in turn indicates whether the distributions match.
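To make the idea concrete, here is a minimal sketch of an Epps-Pulley-style statistic applied to one directional slice of a batch of embeddings. The function name, integration grid, and weighting are illustrative assumptions; the paper's exact formulation and any closed-form evaluation may differ.

```python
# Minimal sketch of an Epps-Pulley-style statistic on a single directional slice
# (illustrative assumptions only; the paper's exact formulation may differ).
import torch
import torch.nn.functional as F

def epps_pulley_1d(x, n_points=17, t_max=3.0):
    """Discrepancy between the empirical characteristic function of `x` and that of N(0, 1)."""
    t = torch.linspace(-t_max, t_max, n_points, device=x.device)   # integration grid
    ecf_re = torch.cos(x[:, None] * t[None, :]).mean(0)            # Re E[exp(i t x)]
    ecf_im = torch.sin(x[:, None] * t[None, :]).mean(0)            # Im E[exp(i t x)]
    gauss_cf = torch.exp(-0.5 * t ** 2)                            # characteristic function of N(0, 1)
    sq_err = (ecf_re - gauss_cf) ** 2 + ecf_im ** 2
    weight = torch.exp(-0.5 * t ** 2)                              # down-weight large |t|
    return torch.trapezoid(sq_err * weight, t)

# Usage: project a batch of embeddings onto one random unit direction and test the slice.
emb = torch.randn(256, 128)                                        # (batch, dim)
direction = F.normalize(torch.randn(128), dim=0)
print(epps_pulley_1d(emb @ direction))
```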
Solving High-Dimensional Problems
SIGReg also solves the computational challenges in high-dimensional spaces through two mechanisms:
- Smoothness: the Sobolev smoothness of the embedding function guarantees that only O(K) directional slices are needed for the statistical test to effectively constrain the entire space.
- SGD dynamics: because fresh directions are re-sampled at every training step, their cumulative effect lets the model converge quickly to an isotropic distribution even with few directions per step (e.g., M = 16), outperforming a fixed set of directions.
In terms of implementation, LeJEPA combines SIGReg and prediction loss. It achieves distribution matching through the Epps-Pulley statistic and ensures computational efficiency and stability through mini-batch training. The final total loss is a weighted sum of the SIGReg loss and the prediction loss.
- SIGReg loss: Calculated through the Epps-Pulley statistic, it ensures bounded gradients during training and improves computational efficiency through integral approximation. The bias introduced by mini-batch training has a small impact on training.
- Prediction loss: Similar to the DINO method, it calculates the difference between all view predictions and the global view.
- LeJEPA total loss: a weighted sum of the SIGReg loss and the prediction loss, with a hyperparameter λ balancing the two parts (a schematic sketch follows below).
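The pieces could be wired together roughly as follows. This is a schematic sketch, not the official implementation: it reuses the illustrative epps_pulley_1d helper from the earlier sketch, replaces the DINO-style prediction term with a plain embedding-space distance, and uses an arbitrary default for λ.

```python
# Schematic sketch of the total objective (illustrative only): a simplified
# prediction term plus lambda times SIGReg over freshly sampled directions.
# Reuses the illustrative epps_pulley_1d helper from the previous sketch.
import torch
import torch.nn.functional as F

def sigreg_loss(emb, n_directions=16):
    """Average the 1D test statistic over freshly sampled random unit directions."""
    dirs = F.normalize(torch.randn(n_directions, emb.shape[1], device=emb.device), dim=1)
    slices = emb @ dirs.T                      # (batch, n_directions) directional slices
    return torch.stack([epps_pulley_1d(slices[:, j]) for j in range(n_directions)]).mean()

def lejepa_loss(view_embeddings, global_embedding, lam=0.05):
    """Stand-in prediction term (pull each view toward the global view) plus SIGReg."""
    pred = torch.stack([F.mse_loss(e, global_embedding) for e in view_embeddings]).mean()
    reg = sigreg_loss(torch.cat(view_embeddings + [global_embedding], dim=0))
    return pred + lam * reg                    # lambda value here is an arbitrary placeholder
```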
Experimental Validation and Results
To verify the reliability of LeJEPA, the research conducts experiments on multiple large-scale architectures, including ViT, ConvNeXt, ResNet, MaxViT, and Swin Transformer, with model sizes approaching 1 billion parameters.
The experimental results show that LeJEPA outperforms existing methods on these architectures and maintains simplicity and robustness during training.
In particular, on domain-specific datasets (e.g., Galaxy10, Food101), pre-training LeJEPA directly on the target data outperforms transfer learning from DINOv2.
Overall, LeJEPA continues the previous exploration of JEPA and re-establishes self-supervised learning as a core method in AI research.
LeJEPA provides a simple and theoretically supported framework that makes learning representations from data more efficient and demonstrates superior performance in multiple tasks.
JEPA World Model
Since LeCun first proposed JEPA in "A Path Towards Autonomous Machine Intelligence" in 2022, JEPA-based architectures have been under development for three full years.
JEPA (Joint-Embedding Predictive Architecture) is a self-supervised learning framework that aims to improve the model's expressive and reasoning abilities through a joint prediction method based on the embedding space.
Unlike generative models, JEPA does not directly predict y from x in input space. Instead, it captures the dependency between x and y by predicting the representation of y, without explicitly generating y itself.
In addition, to address the long-term planning problem, JEPA can be further enhanced through a hierarchical architecture (i.e., H-JEPA) to improve its abstraction ability.
In H-JEPA, the lower-level representations handle short-term prediction tasks, while the higher-level representations are used for long-term prediction.
This hierarchical structure allows the model to operate at different abstraction levels during long-term planning, thereby improving predictability and reducing information loss.
It is worth noting that the JEPA architecture is closely related to world models, although it differs from the general concept of a world model.
Traditional world models generally refer to models that can simulate the environment or system. Their main purpose is to achieve long-term planning and decision-making (e.g., in reinforcement learning) by predicting future states.
JEPA, on the other hand, is an architecture that learns state and action transitions through a joint embedding space. It focuses on combining representation learning and self-supervised learning to complete prediction and planning tasks.
In JEPA, the purpose of the world model is to predict the future representation of the world state.
Specifically, JEPA trains the world model by learning state and action transitions. Its core is to infer the representation of the future state from the representation of the current state, which is completed in the joint embedding space. This space learns the relationship between state representations and actions by minimizing the prediction error.
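In code, a JEPA-style world-model step might look like the following sketch, where every module, dimension, and name is an illustrative assumption: encode the current and next states, condition a predictor on the action, and measure the prediction error in embedding space.

```python
# Illustrative sketch of a JEPA-style world-model step (all modules, names, and
# shapes are assumptions): predict the representation of the next state from the
# current state and the action, and measure the error in embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JEPAWorldModel(nn.Module):
    def __init__(self, obs_dim=64, act_dim=4, emb_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
        self.predictor = nn.Sequential(nn.Linear(emb_dim + act_dim, 128), nn.ReLU(),
                                       nn.Linear(128, emb_dim))

    def forward(self, state, action, next_state):
        z_t = self.encoder(state)                              # representation of the current state
        z_next = self.encoder(next_state)                      # representation of the next state
        z_pred = self.predictor(torch.cat([z_t, action], dim=-1))
        return F.mse_loss(z_pred, z_next)                      # prediction error in embedding space

# Usage with random tensors:
wm = JEPAWorldModel()
print(wm(torch.randn(8, 64), torch.randn(8, 4), torch.randn(8, 64)))
```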
Although the original JEPA paper offered a critique of generative AI, described a vision for the future of artificial intelligence, and acknowledged that this vision may take decades to realize, the JEPA architecture has nevertheless made significant progress since its release in the summer of 2022, driven by LeCun's advocacy.
I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Compared with other image SSL methods, I-JEPA fully utilizes the flexibility of the Transformer architecture.
In I-JEPA, the context encoder is a ViT that only processes visible context blocks.
The predictor receives the output of the context encoder and, conditioned on position tokens, predicts the representation of the target block at the specified location.
The target representation corresponds to the output of the target encoder, whose weights are updated at each iteration as an exponential moving average of the context-encoder weights.
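The EMA update described above could be written as in the following generic sketch (not Meta's implementation; the momentum value is a typical choice, not necessarily the paper's).

```python
# Generic sketch of the EMA update described above (not Meta's code; the momentum
# value is a typical choice, not necessarily the paper's).
import copy
import torch

@torch.no_grad()
def ema_update(context_encoder, target_encoder, momentum=0.996):
    for p_ctx, p_tgt in zip(context_encoder.parameters(), target_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)

# The target encoder starts as a copy of the context encoder and is never updated
# by gradients, only by the EMA above (a tiny linear layer stands in for the ViT).
context_encoder = torch.nn.Linear(16, 16)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)
ema_update(context_encoder, target_encoder)
```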
V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA is an extension of I-JEPA to the video domain, which is achieved by treating videos as 3D images.
The training process is based on a video clip containing T frames with a spatial resolution of H×W, which is flattened into a sequence of L tokens.
First, the input of the x-encoder is obtained by removing some tokens from the video clip.
Then, the x-encoder processes the masked video sequence and outputs an embedding vector for each input token.
Next, the output of the x-encoder is concatenated with a set of learnable mask tokens that carry the positional embeddings of the masked spatio-temporal patches.
The prediction network processes the concatenated token sequence and outputs an embedding vector for each mask token.
Finally, the output of the prediction network is regressed to the prediction target through the L1 loss. The prediction target corresponds to the output of the y-encoder.
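A bare-bones sketch of this pipeline is shown below; every module, shape, and masking choice is an illustrative assumption rather than the actual V-JEPA code.

```python
# Bare-bones sketch of the pipeline described above; every module, shape, and
# masking choice is an illustrative assumption, not the actual V-JEPA code.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens, dim = 64, 32                            # flattened sequence length, embedding dim
tokens = torch.randn(1, num_tokens, dim)            # the video clip as a sequence of tokens

def encoder():
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

x_encoder, y_encoder, predictor = encoder(), encoder(), encoder()

keep = torch.rand(num_tokens) > 0.5                 # 1) mask out roughly half of the tokens
n_masked = int((~keep).sum())

ctx = x_encoder(tokens[:, keep])                    # 2) encode only the visible tokens
mask_tokens = torch.randn(1, n_masked, dim) * 0.02  # 3) stand-in for learnable mask tokens
                                                    #    (these carry positional embeddings in practice)
full = predictor(torch.cat([ctx, mask_tokens], dim=1))   # 4) predict embeddings for the masked slots
pred = full[:, ctx.shape[1]:]                             #    keep only the mask-token outputs
target = y_encoder(tokens)[:, ~keep]                # 5) targets come from the y-encoder
loss = F.l1_loss(pred, target)                      # 6) L1 regression in embedding space
print(loss)
```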
In July this year, LeCun's team further released V-JEPA 2.
V-JEPA 2 is based on V-JEPA and further improves the action prediction and world modeling abilities, enabling robots to interact with unfamiliar objects and environments to complete tasks.
MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features
MC-JEPA is an extension of the JEPA idea that learns motion and content features jointly within a shared encoder, applying joint-embedding prediction to both tasks at once.