
Meta breaks the Transformer's eight-year-old rule and rewrites the most fundamental rules of AI. For the first time, a model has developed a "subconscious".

新智元 (New Intelligence Yuan), 2025-10-24 19:42
The Transformer wants to "transform"!

Will the most fundamental rules of AI be rewritten? When a model drafts in its head before speaking, is AI still just a "stochastic parrot"?

The Transformer is the cornerstone of every LLM, but that cornerstone is starting to shake.

Eight years have passed, and the Transformer architecture that has held firm for all that time now seems to have been broken open by Meta.

Meta has launched a new model called the "Free Transformer", sparking heated discussion of AI architecture on social media.

For the first time, it breaks the core rule followed by every GPT-style model since 2017: instead of blindly generating text token by token, it can "think ahead" before generating.

Paper link: https://arxiv.org/pdf/2510.17558

The researchers introduced a latent random variable Z into the decoder, letting the model sample and plan internally before producing output, in effect giving the Transformer a layer of "subconsciousness".

This innovation adds only about 3% computational overhead, yet it significantly improves the model's reasoning and structured generation, surpassing larger models on benchmarks such as GSM8K, MMLU, and HumanEval.

Meta claims that this may be the first Transformer with "intrinsic intent".

Building machine "subconsciousness" with latent random variables

Meta added a latent random variable (Z) to the decoder.

It can be regarded as a "subconscious layer" before text generation. The model samples internal choices to guide the style or structure of the entire sequence.

Technically, this is achieved through a conditional variational autoencoder (VAE) built into the Transformer.

Meta named it Free Transformer.

Figure: how different Transformer architectures handle the random latent state Z.

The first one shown in the figure is the standard Transformer, which only predicts the next token based on the previous tokens.

The second architecture adds a random state Z and uses an additional encoder network during training to infer the hidden state corresponding to each sample.

The third architecture, called the Free Transformer, simplifies this process. It injects the random state directly into the middle layer of the model instead of using a separate full encoder. During training, the encoder is still run once to help the model learn how to pick good latent states, but it operates on only part of the network.

During the inference process, the encoder is skipped, and the random state Z is directly sampled.

This design enables the model to make global decisions early, helping it produce more consistent and stable outputs without too much additional computation.

Therefore, half of the modules act as a shared encoder, and the remaining modules decode based on the latent context.
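To make this structure concrete, here is a minimal, hypothetical PyTorch sketch of the idea as described above, not Meta's implementation: the lower half of the blocks is shared, a small head (standing in for the training-time encoder distribution Q(Z|S)) produces 16 latent bits per token, and the sampled bits are injected mid-stack before the upper half runs. The article says the latent vector is added to the key-value pairs; for brevity this sketch adds it to the hidden states instead, and all names (FreeTransformerSketch, n_bits, z_proj, ...) are invented for illustration.

```python
# Hypothetical sketch only: shapes, names, and the injection point are guesses
# based on the description above, not the paper's actual architecture.
import torch
import torch.nn as nn

class FreeTransformerSketch(nn.Module):
    def __init__(self, vocab, d_model=256, n_heads=4, n_layers=8, n_bits=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)

        def make_block():
            return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

        half = n_layers // 2
        self.lower = nn.ModuleList(make_block() for _ in range(half))          # shared lower half
        self.upper = nn.ModuleList(make_block() for _ in range(n_layers - half))
        self.enc_head = nn.Linear(d_model, n_bits)   # stands in for Q(Z|S), training only
        self.z_proj = nn.Linear(n_bits, d_model)     # maps the 16 bits to a vector
        self.lm_head = nn.Linear(d_model, vocab)
        self.n_bits = n_bits

    def forward(self, tokens, use_encoder=True):
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.embed(tokens)
        for blk in self.lower:                       # shared lower half of the stack
            h = blk(h, src_mask=causal)
        if use_encoder:                              # training: sample Z ~ Q(Z|S)
            q = torch.sigmoid(self.enc_head(h))
            z = torch.bernoulli(q)                   # 16 independent bits per token
            # (a real implementation needs a straight-through or Gumbel estimator
            #  so gradients can reach enc_head; omitted here for brevity)
        else:                                        # inference: uniform prior over Z
            q = None
            z = torch.randint(0, 2, (*tokens.shape, self.n_bits),
                              device=tokens.device).float()
        h = h + self.z_proj(z)                       # inject the latent state mid-stack
        for blk in self.upper:                       # conditioned upper half
            h = blk(h, src_mask=causal)
        return self.lm_head(h), q                    # next-token logits and Q(Z|S)
```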

In a regular setting, if a random hidden state is used, both the encoder and the decoder must be used simultaneously every time text is generated.

This doubles the cost.

The Free Transformer avoids this.

It learns a shared internal structure during the training process and then discards the encoder.

During inference, it directly samples the hidden state and only runs the decoder.

Compared with a standard model, this design adds only about 3-4% FLOPs overhead, keeping the extra computational burden minimal.

It is trained using the classic VAE objective:

Cross-entropy loss + a KL divergence penalty term between the encoder distribution Q(Z|S) and the prior P(Z).

Meta uses a free-bits threshold (κ) to prevent collapse, adding the KL penalty only when the divergence exceeds κ.

This allows Z to encode useful structures (such as topics, emotions, or pattern positions) without overfitting.

The KL divergence penalty combined with the free bit method prevents the latent state from memorizing the entire sequence.
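A minimal sketch of that objective follows, assuming Z consists of 16 independent Bernoulli bits with a uniform prior P(Z) (matching the 65,536-state latent described below). Whether the paper applies the free-bits clamp per bit, per token, or per sequence is not stated in the article, so the per-bit version here is just one plausible reading.

```python
# Sketch of: cross-entropy + free-bits KL(Q(Z|S) || P(Z)), with P(Z) uniform.
import math
import torch
import torch.nn.functional as F

def free_transformer_loss(lm_logits, targets, q_bits, kappa=math.log(2) / 8):
    # lm_logits: (B, T, V) next-token logits; targets: (B, T) token ids
    ce = F.cross_entropy(lm_logits.transpose(1, 2), targets)
    # q_bits: (B, T, 16) Bernoulli probabilities from the encoder Q(Z|S)
    q = q_bits.clamp(1e-6, 1 - 1e-6)
    # KL(Bernoulli(q) || Bernoulli(0.5)) per bit, in nats
    kl = q * torch.log(2 * q) + (1 - q) * torch.log(2 * (1 - q))
    # free bits: the penalty applies only to the divergence above κ
    kl_penalty = torch.clamp(kl - kappa, min=0).sum(-1).mean()
    return ce + kl_penalty
```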

This architecture injects the latent state in the middle of the stacked layers: the learned vector is added to the key-value pairs, and then the decoding process continues normally.

The latent state for each token is selected from 65,536 possibilities, constructed from 16 independent bits (2^16 = 65,536).

The key breakthrough is that it retains the advantages of the conditional variational autoencoder (which helps the model plan better) while eliminating the additional cost that usually makes it impractical.

In this way, you can get a more stable Transformer with global perception ability at almost the same cost as an ordinary Transformer.

It achieves this while adding only about 3% extra computation during training.

Ordinary decoders only select the next token based on the tokens that have already been generated, which causes them to infer global choices relatively late.

The Free Transformer first samples a tiny random state, and then every token is generated conditioned on that state.

During training, the decoder is paired with the encoder through a conditional variational autoencoder, enabling the model to learn to generate useful latent states.

The results are excellent!

At inference time, the encoder is skipped, the state is drawn from a uniform sampler, and generation proceeds normally.

This gives the model early global decision-making and reduces the fragile behavior that follows small token-level errors.
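Continuing the earlier sketch, here is a naive generation loop with the encoder skipped: each step re-runs the whole stack and draws fresh uniform latent bits, which keeps the example short but is not how a cached implementation would behave.

```python
# Usage sketch for FreeTransformerSketch above; greedy argmax and no KV cache,
# purely for illustration (a real decoder would cache and keep each position's Z fixed).
import torch

@torch.no_grad()
def generate(model, prompt_ids, steps=32):
    tokens = prompt_ids.clone()                       # (1, T0) token ids
    for _ in range(steps):
        logits, _ = model(tokens, use_encoder=False)  # Z drawn from the uniform prior
        next_id = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
    return tokens

# model = FreeTransformerSketch(vocab=256)
# out = generate(model, torch.zeros(1, 1, dtype=torch.long))
```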

Meta trained models at 1.5B and 8B parameters.

Performance on reasoning-heavy benchmarks such as GSM8K, HumanEval+, and MMLU improved significantly.

Improvements of the 1.5B model:

  • The score of HumanEval+ increased by 44%
  • The score of the MBPP test increased by 35%
  • The score of the GSM8K math problem set increased by 30%

The above effects are achieved with only a 3-4% increase in computational overhead.

Moreover, the model remains stable without training collapse or abnormal fluctuations.

The Free Transformer adds a random "hidden thinking layer" to the architecture.

It doesn't just predict; it makes decisions before predicting, which may mark the beginning of the post-autoregressive era.

In a nutshell, a tiny encoder adds a beneficial bias, making reasoning and coding more reliable.

The thinking Transformer is no longer just "parroting".

This may be an important turning point where the thinking mode of the Transformer is reshaped, moving from "predicting the next word" to "thinking about how to express".

What exactly does the latent variable Z learn?

The following are the test examples given in the paper.

The synthetic sequence has a fixed length, containing a "target" composed of a random letter repeated 8 times at a random position, independent and identically distributed noise composed of exclamation marks, and a prompt indicating the target letter.

  • Each sample uses "letter + >" as a prompt (e.g., K>).
  • The main body is a line of underscores _ of a fixed length, with a "target" composed of 8 identical uppercase letters (e.g., KKKKKKKK) embedded at a random position.
  • In addition, each character is replaced with a ! with probability 1/16, forming independent and identically distributed noise (a small generator sketch follows this list).
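A small Python generator for this synthetic task as described above; the sequence length (64 here) and the exact handling of noise are guesses, since the article does not give those constants.

```python
# Generates one synthetic sample: "<letter>>" prompt, underscores, the letter
# repeated 8 times at a random position, each body character turned into "!"
# with probability 1/16. Length and details are assumptions, not the paper's values.
import random
import string

def make_sample(length=64, target_len=8, noise_p=1 / 16):
    letter = random.choice(string.ascii_uppercase)
    body = ["_"] * length
    start = random.randrange(length - target_len + 1)
    body[start:start + target_len] = [letter] * target_len           # the "target"
    body = ["!" if random.random() < noise_p else c for c in body]   # i.i.d. noise
    return f"{letter}>" + "".join(body)

print(make_sample())   # e.g. K>__!_____KKKKKKKK____!___...
```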

The following figure shows the generation behavior of the Free Transformer on this synthetic task, and the information carried by the latent variable Z, at different values of κ.

Each model presents two groups of boxes:

  • Blue boxes: A Z is independently sampled for each sequence.
  • Green boxes: The same Z is shared by the entire group of sequences, making it easy to see if Z "locks in" certain global attributes.

As κ increases (allowing Z to carry more information), the behavior is as follows:

  1. κ = log(2)/64 (≈1/64 bit): Almost no useful information is encoded from Z, and it behaves like an ordinary decoder without a latent variable; the difference between the green and blue boxes is very small.
  2. κ = log(2)/8 (≈1/8 bit): Z first learns to only encode the position of the target; in the green boxes, the position of the target remains consistent across multiple samples, but the noise ! is still random.
  3. κ = log(2) (1 bit): Z further encodes both the target position and the noise pattern; therefore, the distribution of ! in multiple samples in the green boxes is also very similar.
  4. κ = 8·log(2) (8 bits): Z carries too much information and almost "puts the entire sequence into Z" - resulting in training/generation degradation (the model relies too much on Z, and the output is incorrect).

Through this group comparison, the figure clearly shows that a larger KL quota lets the model move more "global decisions" into the latent variable: too little is not enough, and too much causes collapse. (Note that the κ values above are in nats; a short conversion to bits follows.)
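The parenthetical bit values in the list follow from 1 bit = ln(2) nats, as this quick check shows.

```python
# Convert the κ thresholds from nats to bits (1 bit = ln(2) nats).
import math

for kappa in (math.log(2) / 64, math.log(2) / 8, math.log(2), 8 * math.log(2)):
    print(f"{kappa:.4f} nats = {kappa / math.log(2):g} bits")
# 0.0108 nats = 0.015625 bits, 0.0866 -> 0.125, 0.6931 -> 1, 5.5452 -> 8
```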

The FAIR lab is still doing real research

Notably, the paper's author, François Fleuret, is from Meta's FAIR lab.

François Fleuret is a research scientist and educator in the field of machine learning.

He currently serves as a research scientist in the "Core Learning & Reasoning" team of Meta Fundamental AI Research (Meta FAIR).

As is well known, FAIR is led by Yann LeCun.

Today also brought a major piece of news: Zuckerberg's superintelligence lab has laid off another 600 people.

Yann LeCun was even forced to issue a statement:

"I am not involved in any Llama project. It has always been handled by other teams. I mainly research the next generation of artificial intelligence beyond LLMs."

Judging from the Free Transformer, Yann LeCun's words are true.

Although he has always opposed LLM technology itself, innovations like this are still expanding the boundaries of AI.

One can only hope that Zuckerberg treats this Turing Award winner well.

References:

https://x.com/rryssf_/status/1980998684801401302

https://arxiv.org/abs/2510.17558

This article is from the WeChat official account "New Intelligence Yuan". Author: Dinghui. Republished by 36Kr with permission.