Is Zuckerberg's bold bet starting to pay off? Meta's new approach speeds up long-context processing in LLMs by 30 times.
After a turbulent period, Mark Zuckerberg's AI investment finally seems to be showing its first results.
Recently, Meta Superintelligence Labs jointly proposed REFRAG, an efficient decoding framework that aims to remove the efficiency bottleneck large language models (LLMs) face when processing long-context inputs, especially in applications such as retrieval-augmented generation (RAG).
Paper title: REFRAG: Rethinking RAG based Decoding
Paper link: https://arxiv.org/abs/2509.01092
Why is long context processing so difficult?
In current AI applications, feeding LLMs long inputs packed with external knowledge is key to improving question answering, dialogue, and agent applications. However, this brings a severe challenge: in traditional LLMs, the computational and memory overhead of the attention mechanism grows quadratically (N²) with the input length.
This means that if the text length doubles, the speed may slow down by a factor of 4, which leads to significant system latency and consumes a large amount of memory for storing the key-value (KV) cache, thereby reducing system throughput. This forces developers to make a painful trade-off between knowledge richness and system efficiency.
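To make this scaling concrete, here is a minimal back-of-the-envelope sketch (not from the paper; the model width, layer count, and 16-bit precision are illustrative assumptions): per-layer attention compute grows quadratically with the number of tokens, while the KV cache grows linearly.

```python
# Back-of-the-envelope scaling, not REFRAG code: d_model, layer count, and
# 16-bit precision are illustrative assumptions.

def attention_flops(n_tokens: int, d_model: int = 4096) -> float:
    """Approximate per-layer self-attention FLOPs (QK^T plus AV): ~2 * N^2 * d."""
    return 2.0 * n_tokens ** 2 * d_model

def kv_cache_bytes(n_tokens: int, d_model: int = 4096,
                   n_layers: int = 32, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) * N * d * layers * bytes."""
    return 2 * n_tokens * d_model * n_layers * bytes_per_value

for n in (4_096, 8_192, 16_384):
    print(f"N={n:>6}  attention FLOPs ~ {attention_flops(n):.2e}  "
          f"KV cache ~ {kv_cache_bytes(n) / 1e9:.1f} GB")
```

Doubling N in this sketch roughly quadruples the attention FLOPs while the KV cache doubles, which is exactly the trade-off described above.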
Meta's research points out that in RAG applications, the context an LLM processes consists of many passages retrieved from external knowledge bases and concatenated together, yet only a small portion of them is closely related to the user's query. The irrelevant passages waste computational resources. REFRAG's core idea builds on this observation: optimize decoding by identifying and skipping the useless computation spent on irrelevant context.
How does REFRAG solve the problem?
The REFRAG framework achieves significant performance improvement through an elaborate four-step process that leverages the sparse structure of attention. Its key difference from traditional RAG is that it avoids having the LLM directly process the lengthy original text.
- Compression: First, a lightweight encoder reads the retrieved documents and compresses every 16 tokens into a "chunk vector" that condenses their semantic essence (a toy sketch of this step follows the list below).
- Shortening: Next, the main model directly processes these chunk vectors instead of the original tokens. As a result, the length of the input sequence is immediately reduced by a factor of 16.
- Acceleration: Since the input becomes so much shorter, the computational overhead of the attention mechanism drops sharply, and the KV cache, the main consumer of GPU memory, also shrinks. This is the fundamental reason REFRAG can achieve such a dramatic speedup.
- Selection: To prevent the loss of key information during the compression process, the framework introduces a reinforcement learning (RL)-based strategy as a "quality inspector". It can intelligently select the key segments with the highest information density and the most relevance to the task, ensuring that they are not compressed and thus retaining the core information.
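Here is a minimal sketch of the chunk-compression step, assuming a toy mean-pooling encoder and made-up dimensions (the paper's actual encoder, projection, and sizes may differ): every 16 retrieved tokens are pooled into one embedding and projected into the decoder's hidden space, so the decoder sees a sequence 16 times shorter.

```python
import torch
import torch.nn as nn

# Toy stand-in for REFRAG's chunk compression; the mean-pooling encoder,
# vocabulary size, and widths below are illustrative assumptions, not the
# paper's actual architecture.
CHUNK_SIZE = 16     # tokens per chunk, as described in the article
VOCAB = 32_000      # assumed vocabulary size
ENC_DIM = 768       # assumed lightweight-encoder width
DEC_DIM = 4096      # assumed decoder hidden size

class ChunkCompressor(nn.Module):
    """Compress every CHUNK_SIZE retrieved tokens into one vector in the decoder's space."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, ENC_DIM)
        self.proj = nn.Linear(ENC_DIM, DEC_DIM)   # projection into the decoder's hidden space

    def forward(self, retrieved_ids: torch.Tensor) -> torch.Tensor:
        b, n = retrieved_ids.shape
        assert n % CHUNK_SIZE == 0, "pad the retrieved context to a multiple of the chunk size"
        tok = self.embed(retrieved_ids)                            # (b, n, ENC_DIM)
        chunks = tok.view(b, n // CHUNK_SIZE, CHUNK_SIZE, ENC_DIM) # group tokens into chunks
        pooled = chunks.mean(dim=2)                                # one vector per chunk
        return self.proj(pooled)                                   # (b, n // CHUNK_SIZE, DEC_DIM)

compressor = ChunkCompressor()
retrieved = torch.randint(0, VOCAB, (1, 4096))   # 4,096 retrieved context tokens
chunk_embeds = compressor(retrieved)
print(chunk_embeds.shape)                        # torch.Size([1, 256, 4096]): 16x shorter sequence
```

The decoder then attends over these 256 vectors (plus the user query) instead of 4,096 raw tokens, which is where the attention and KV-cache savings come from.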
Meta states that the effectiveness of this framework has been verified in various long context tasks, including RAG, multi-round dialogue, and long document summarization, achieving breakthrough results:
- Speed improvement: It accelerates the time to first token (TTFT) by up to 30.8 times. At 16k tokens, it achieves more than a 16x TTFT speedup over baseline methods such as CEPE. As the performance chart shows, the longer the input, the more pronounced REFRAG's advantage: its speedup grows exponentially with context size, while the baselines improve only linearly.
- Context expansion: It can expand the effective context size of existing LLMs by 16 times, enabling them to handle a larger amount of information.
- Accuracy improvement: While delivering these speedups and the expanded context, it keeps model accuracy from degrading. More importantly, on the GSM8K benchmark, REFRAG not only handles contexts 8 times longer (80 chunks vs. 10 chunks) but also runs twice as fast, and its score nearly doubles, from 6.71 to 12.08.
In short, REFRAG has turned the concept of "large context RAG" into a reality.
Although the results sound very promising, commenters point out that its real value still needs to be proven across a wider range of real-world application scenarios.
Some people also question the RL strategy in this research.
Method
To achieve effective alignment between the encoder and the decoder, this research follows the work of Yen et al. (2024) and adopts a continuous pre-training method based on the "next paragraph prediction" task.
During training, each data point contains a total of s + o = T tokens. Through this pre-training process, the model can learn how to use chunk embeddings to efficiently perform downstream tasks.
To further improve the model's performance, this method also introduces a selective compression mechanism implemented through RL. After completing the continuous pre-training (CPT) alignment, the model undergoes supervised fine-tuning to adapt to specific downstream application scenarios, such as RAG and multi-round dialogue.
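The selection mechanism itself is trained with RL, which is beyond a short sketch; the toy code below only illustrates the interface one might expect (the per-chunk scores and the expand fraction are hypothetical inputs, not the paper's policy): the highest-scoring chunks are left uncompressed, while the rest remain as chunk embeddings.

```python
import torch

# Toy illustration only: REFRAG's real selection policy is trained with RL;
# here the chunk scores and expand_fraction are hypothetical.
def select_chunks_to_expand(chunk_scores: torch.Tensor,
                            expand_fraction: float = 0.25) -> torch.Tensor:
    """Return indices of the highest-scoring chunks, which are kept as raw tokens
    instead of being replaced by compressed chunk embeddings."""
    n_chunks = chunk_scores.shape[-1]
    n_expand = max(1, int(n_chunks * expand_fraction))
    return torch.topk(chunk_scores, k=n_expand, dim=-1).indices

scores = torch.rand(1, 256)                  # e.g., 256 chunks scored by a policy network
expand_idx = select_chunks_to_expand(scores)
print(expand_idx.shape)                      # torch.Size([1, 64]): chunks left uncompressed
```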
In the core task of CPT, the model's workflow is as follows: the encoder first processes the first s tokens x_1, …, x_s, and the compressed information it outputs assists the decoder in predicting the next o tokens x_{s+1}, …, x_{s+o}.
This task aims to train the model to make efficient predictions using context information, laying the foundation for its performance in real-world applications. The ultimate goal is to enable any combination of encoder and decoder to work together, ensuring that the content generated by the decoder based on the compressed context is highly similar to that generated when it has access to the complete, uncompressed context.
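A minimal sketch of this next-paragraph-prediction data layout, with illustrative values for s and o (the real sizes and vocabulary come from the paper's training setup): the first s tokens go through the encoder and are compressed, and the decoder is trained to predict the remaining o tokens.

```python
import torch

# Illustrative data layout for next-paragraph prediction; s, o, and the
# vocabulary size are assumptions, not the paper's settings.
s, o = 4_096, 2_048                 # s context tokens + o continuation tokens = T
T = s + o

token_ids = torch.randint(0, 32_000, (T,))
context_tokens = token_ids[:s]      # x_1, ..., x_s: compressed by the encoder into s/16 chunk embeddings
target_tokens = token_ids[s:]       # x_{s+1}, ..., x_{s+o}: predicted by the decoder

# The decoder's loss is ordinary next-token prediction over the o target tokens,
# but it conditions on the projected chunk embeddings rather than the raw s tokens.
print(context_tokens.shape, target_tokens.shape)   # torch.Size([4096]) torch.Size([2048])
```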
Continuous pre-training scheme
To ensure the success of the CPT phase, the researchers proposed a training scheme that includes a reconstruction task and a curriculum learning method. Ablation studies show that this scheme is crucial for achieving excellent CPT performance.
Reconstruction task. The goal of this task is to train the encoder to compress text with minimal information loss. Specifically, the first s tokens x_1, …, x_s are fed into the encoder, and the decoder is trained to reconstruct exactly those same tokens. During this process, the decoder itself remains "frozen" (i.e., its parameters are not updated), and the training focus is entirely on the encoder and the projection layer that connects the two.
This task mainly achieves two goals:
- Efficient compression: Train the encoder to compress k tokens into a chunk embedding while retaining the original information to the greatest extent possible.
- Spatial mapping: Train the projection layer to effectively map the chunk embeddings output by the encoder to the token space of the decoder, enabling the decoder to "understand" and accurately reconstruct the original information.
One specific aim of the reconstruction task is to push the model to rely on its contextual memory (information drawn from the input) rather than its parametric memory (knowledge already stored in its weights) during training. Once the encoder and decoder are initially aligned through this task, the decoder is unfrozen, and CPT officially begins.
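A toy sketch of this parameter setup, using stand-in modules with made-up sizes rather than the actual models: during the reconstruction phase the decoder is frozen and only the lightweight encoder and the projection layer receive gradients; later the decoder is unfrozen for CPT.

```python
import torch
import torch.nn as nn

# Stand-in modules with made-up sizes; the point is only which parameters train.
ENC_DIM, DEC_DIM, VOCAB = 768, 1024, 32_000
encoder = nn.Sequential(nn.Embedding(VOCAB, ENC_DIM), nn.Linear(ENC_DIM, ENC_DIM))
projection = nn.Linear(ENC_DIM, DEC_DIM)      # maps chunk embeddings into the decoder's space
decoder = nn.Linear(DEC_DIM, VOCAB)           # placeholder for the pretrained LLM decoder

# Reconstruction phase: freeze the decoder so the reconstruction loss only
# shapes the encoder and the projection layer.
for p in decoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(projection.parameters()), lr=1e-4
)

# Once encoder and decoder are aligned, the decoder is unfrozen and joins training for CPT.
for p in decoder.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": list(decoder.parameters())})
```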
Curriculum learning. Although the above training tasks are conceptually clear, they are extremely challenging in practice. The difficulty is that as the chunk length k increases, the number of possible token combinations grows exponentially, at a rate of V^k (where V is the vocabulary size). Effectively compressing such enormous diversity into a fixed-length embedding is a major technical challenge. In addition, reconstructing L × k tokens from L chunk embeddings further compounds the complexity of the task.
Contrary to intuition, directly continuing to pre-train the decoder to use the encoder's output, even on the reconstruction task, fails to reduce perplexity. To address this optimization challenge, the researchers suggest using curriculum learning for both tasks. Curriculum learning lets the model gradually and effectively master complex skills by progressively increasing the task difficulty. For the reconstruction task, training starts with reconstructing a single chunk: the encoder produces a chunk embedding for the first k tokens x_1, …, x_k, and the decoder uses the projected chunk embedding e^cnk_1 to reconstruct those k tokens. The model then reconstructs x_1, …, x_2k from e^cnk_1 and e^cnk_2, and so on. To continuously adjust the task difficulty, the researchers change the data mixing ratio over time, starting with samples dominated by simpler tasks (e.g., single chunk embeddings) and gradually shifting toward samples dominated by harder tasks (i.e., L chunk embeddings). Figure 6 provides a visual representation of the data mixing during curriculum learning.
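A toy sketch of such a data-mixing schedule (the schedule shape and numbers are assumptions, not the paper's recipe): early in training most samples use a single chunk embedding, and the mix gradually shifts toward samples that use all L chunk embeddings.

```python
import random

# Toy curriculum schedule; L and the Gaussian mixing rule are illustrative
# assumptions, not the paper's actual data-mixing recipe.
L = 8  # maximum number of chunk embeddings per training sample

def sample_num_chunks(progress: float) -> int:
    """progress in [0, 1]: 0 = start of training, 1 = end.
    Returns how many chunk embeddings the sampled training example uses."""
    mean_chunks = 1 + progress * (L - 1)              # drift from 1 chunk toward L chunks
    sampled = round(random.gauss(mean_chunks, 1.0))   # keep easy and hard samples mixed
    return max(1, min(L, sampled))

for progress in (0.0, 0.5, 1.0):
    counts = [sample_num_chunks(progress) for _ in range(10_000)]
    print(f"progress={progress:.1f}  avg chunks per sample ~ {sum(counts) / len(counts):.1f}")
```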