Lu Baotong, Microsoft Research: Reshaping Model Attention with Vector Retrieval (Attention Series)
In large language models, inference over ultra-long contexts is one of the main bottlenecks limiting performance.
This is due to the quadratic complexity of self-attention and to the KV cache, which must be held in GPU memory and grows with context length. For example, when an 8B model runs inference over a 1M-token context, the KV cache alone can easily exceed 100GB of GPU memory, far more than an ordinary GPU can hold.
Given this problem, a new interview article in the Attention series focuses on the mechanism proposed in the paper "RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval": a training-free, dynamic sparse attention scheme for ultra-long-context inference.
The following is a summary of an interview between Oasis and Dr. Baotong Lu, one of the core authors of the paper. It takes about 20 minutes to read the full text.
Enjoy
The core idea of Retrieval Attention is that attention is naturally sparse: each Query only needs to interact strongly with a small fraction of the Keys, and the remaining attention computation is largely redundant.
The research team's core approach is therefore to move most of the KV vectors from the GPU to the CPU. At each decoding step, the system uses approximate nearest neighbor search (ANN search) to find the small set of Keys most relevant to the current Query, such as only the top 1%, while the GPU computes attention in parallel over a small number of "predictable" KVs; the two results are then merged.
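To make the sparsity intuition concrete, here is a minimal NumPy sketch (illustrative only, not the paper's code) that keeps just the top 1% highest-scoring keys for a single query and compares the result against full attention. A few high-affinity keys are planted so the score distribution is peaked, mimicking the sparsity observed in real attention; all names and sizes are invented for the example.

```python
# Compare full attention with "top 1% of keys only" attention for one query.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
K[:5] = 2.0 * q + 0.1 * rng.standard_normal((5, d))   # planted high-affinity keys

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def full_attention(q, K, V):
    return softmax(q @ K.T / np.sqrt(d)) @ V

def topk_attention(q, K, V, k):
    scores = q @ K.T / np.sqrt(d)
    idx = np.argpartition(scores, -k)[-k:]            # keep only the k highest-scoring keys
    return softmax(scores[idx]) @ V[idx]

full = full_attention(q, K, V)
approx = topk_attention(q, K, V, k=n // 100)          # top 1% of keys
print("relative error:", np.linalg.norm(full - approx) / np.linalg.norm(full))
```

On data like this, where a handful of keys carry almost all of the attention mass, the top-1% approximation is nearly indistinguishable from the full computation.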
In actual tests on an RTX 4090 (24GB), an 8B-scale model generates stably under a 128K context at roughly 0.188s per token, with accuracy almost identical to full attention on benchmarks such as ∞-Bench and RULER.
In subsequent work, RetroInfer further extended this performance advantage: on an A100 GPU with a 1M-token context, it achieved 4.5 times the decoding throughput of full attention and 10.5 times that of other GPU-CPU sparse attention systems.
The core innovation of Retrieval Attention is that it introduces a vector retrieval mechanism into the attention calculation path and achieves true dynamic sparsification at the computational level.
Specifically, the authors constructed an attention-aware vector index: by learning the exact mapping between Queries and Keys during the pre-filling stage and projecting this relationship into a Key–Key graph structure, the retrieval process achieves a high recall rate while scanning only a very small fraction of the Keys (about 1–3%). This design lets each Query dynamically locate the most critical Keys at minimal computational cost, avoiding the inefficiency that arises when traditional ANN indexes, which assume queries and data follow the same distribution, are applied to attention, where Queries and Keys are distributed differently.
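As a toy illustration of the "project Query–Key relationships into a Key–Key graph" idea, the sketch below links keys that appear together in some prefill query's top-k neighbour set and then runs a simple best-first walk for a new query. The function names (`build_query_guided_graph`, `best_first_search`) and all parameters are invented for this example; the paper's index construction and search are more sophisticated, and random toy vectors will not reproduce the high recall at 1–3% scan rates reported on real attention data.

```python
# Build a key-key graph guided by prefill queries, then search it for a new query.
import heapq
from collections import defaultdict
import numpy as np

def build_query_guided_graph(prefill_Q, K, k=16):
    """Link keys that co-occur in some prefill query's top-k neighbour set."""
    graph = defaultdict(set)
    for q in prefill_Q:
        top = np.argpartition(K @ q, -k)[-k:]
        for i in top:
            graph[i].update(int(j) for j in top if j != i)
    return graph

def best_first_search(graph, K, q, entry, budget=150):
    """Best-first walk over the key-key graph, scoring at most `budget` keys."""
    visited = {entry}
    best_score, best = float(K[entry] @ q), entry
    frontier = [(-best_score, entry)]
    while frontier and len(visited) < budget:
        _, node = heapq.heappop(frontier)
        for j in graph[node]:
            if j in visited:
                continue
            visited.add(j)
            s = float(K[j] @ q)
            if s > best_score:
                best_score, best = s, j
            heapq.heappush(frontier, (-s, j))
    return best, len(visited)

rng = np.random.default_rng(1)
n, d = 5_000, 64
K = rng.standard_normal((n, d))
prefill_Q = rng.standard_normal((2_000, d))

graph = build_query_guided_graph(prefill_Q, K)
q_new = rng.standard_normal(d)
entry = next(iter(graph))                     # any indexed key as the entry point
hit, scanned = best_first_search(graph, K, q_new, entry)
print(f"scanned {scanned}/{n} keys; best score found {K[hit] @ q_new:.2f} "
      f"vs brute-force max {float((K @ q_new).max()):.2f}")
```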
In terms of system architecture, Retrieval Attention further proposes a dual-path attention mechanism with CPU–GPU collaboration: the GPU retains a small amount of "predictable" local KV cache, while the CPU serves the large-scale KV store through retrieval. The two paths are computed independently and in parallel, and their results are fused through a numerically stable re-calibration formula, reducing both GPU memory usage and inference latency.
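The re-calibration step can be sketched with the standard log-sum-exp trick for combining softmax blocks. The snippet below is a minimal sketch, assuming each path returns its partial output together with its log-sum-exp (the paper's exact implementation may differ); it merges a "local" path and a "retrieved" path and checks the result against full attention over all keys.

```python
# Merge two partial attention results with a numerically stable recalibration.
import numpy as np

def partial_attention(q, K, V):
    """Return (output, log-sum-exp) for attention over one subset of keys."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merge(o1, lse1, o2, lse2):
    """Recalibrate two partial outputs so the result equals attention over the union."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

rng = np.random.default_rng(0)
n, d = 1_000, 64
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Split the keys into a "local" part and a "retrieved" part and merge the two paths.
o_local, lse_local = partial_attention(q, K[:100], V[:100])
o_ret,   lse_ret   = partial_attention(q, K[100:], V[100:])
merged = merge(o_local, lse_local, o_ret, lse_ret)

# Reference: full attention over all keys.
full, _ = partial_attention(q, K, V)
print("max abs difference:", np.abs(merged - full).max())
```

Because each path carries its own normalization term, the merge reproduces exact attention over the union of the two key sets, which is why splitting the computation costs essentially no accuracy.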
It is worth noting that the entire mechanism requires no re-training of the model. Retrieval Attention can be plugged into an existing Transformer as a modular component: by modifying only the forward logic of the attention layer, it significantly accelerates long-context inference without sacrificing accuracy. This idea of "training-free dynamic sparse attention" provides a practical engineering path toward scalable long-context language models.
The following is part of the conversation between Oasis and Dr. Lu:
Oasis: Please introduce your research background first, especially your research on Database Management Systems. What were your thoughts at that time that led to this design idea?
Dr. Lu: My main research direction is database management systems, which focuses on how to more efficiently store, retrieve, and manage massive amounts of data. From a disciplinary perspective, database systems and machine learning seem to be two relatively independent fields, one focusing on systems and structures, and the other on models and optimization. However, they actually have some common underlying problems, such as how to organize information more efficiently under limited resources.
During my doctoral studies, I mainly focused on researching how new hardware architectures, such as persistent memory and disaggregated memory, would bring new challenges to database design. This specifically included how to design new data structures and database indexes for these hardware, as well as optimizing the query performance of storage systems. It can be said that the core of this direction is to combine the changes in underlying hardware with the performance issues of data management.
Oasis: You just mentioned that the core of your research is to enable the system to achieve more efficient data management under different hardware architectures. In this process, how did you gradually extend from traditional database optimization problems to the current research direction related to model inference?
Dr. Lu: The evolution of traditional database systems is actually a series of continuous improvements in technologies to support more efficient data retrieval and access. The core goal is very simple: how to make queries faster and more stable. For example, the optimization of index structures and the improvement of caching mechanisms are all aimed at enabling the system to maintain query performance in the face of continuously growing data volumes.
However, with the changes in hardware architectures, such as persistent memory, disaggregated memory, or hierarchical storage systems, there are actually many incompatibilities between these new hardware and traditional models. Therefore, we need to rethink: how should the database system be designed under this new architecture to maximize the potential of the hardware?
My research during my doctoral studies mainly revolved around this point.
On the one hand, I focused on optimizing the performance of the system under new hardware, such as how to reduce access latency and improve index efficiency; on the other hand, I also studied how to make the system more intelligent when processing large-scale tasks. These experiences later influenced my current research direction.
Now, our team is more concerned with how to transfer the "retrieval" logic in traditional database systems to the model level. For example, when a model faces long-context reasoning, the bottleneck it encounters is very similar to that of a database, which is how to quickly locate key information in a large amount of storage. This made us think that perhaps we can combine the retrieval idea of databases with the attention mechanism of models.
Our approach is to try to transplant the mature vector retrieval methods in databases into the inference process of language models, so that the model only accesses the "most relevant" information during generation. In other words, we do not re-train the model, but hope to enable the model to more efficiently utilize its existing memory through system-level design. This is actually the starting point of our current paper.
Oasis: Can you specifically talk about how this research idea of combining with hardware was formed? How did you think about and combine it at the hardware or system level at that time?
Dr. Lu: This idea actually originated from a very specific problem.
We found that the attention mechanism is essentially a process of vector retrieval. When the model is generating text, it needs to continuously "search" for relevant information in the existing context. In this process, memory, specifically GPU memory, becomes the biggest limitation.
For example, the memory of current GPUs is very limited. We found that during long-context inference the model continuously accumulates a large amount of KV cache, and the way these caches are managed is actually very similar to the "index + cache" logic in traditional database systems. So we thought: since databases can efficiently manage massive amounts of data through hierarchical storage and retrieval mechanisms, can the attention mechanism of a model borrow a similar idea?
Therefore, we began to try to offload part of the cache from the GPU to a larger storage layer, such as CPU memory, and then dynamically call the required part through retrieval. This "hierarchical + retrieval" structure enables the system to remain efficient under long contexts without incurring additional training costs.
Oasis: In fact, you transferred the "storage - index - retrieval" thinking in database systems to the attention calculation of the model?
Dr. Lu: Yes, the thinking is very similar. The goal of databases is to make data access faster, and what we are concerned about is to make the process of the model "finding information" faster during generation. The difference is that we are dealing with unstructured and dynamically changing context information, so the retrieval and caching strategies need to be redesigned.
It can be said that we reapplied the experience of traditional database systems in performance optimization to the attention mechanism of models, making it closer to the real-world "access" process in terms of computation.
This is also the original starting point of Retrieval Attention.
Oasis: Can we start from more basic concepts to help readers understand the unfamiliar terms? What is the core logic of the attention mechanism when viewed as a "data management mechanism"? Can the model itself be regarded as a kind of large database?
Dr. Lu: I can briefly explain it first. In the context of machine learning or cognitive science, the original starting point of the "Attention Mechanism" is actually a simulation of human attention. That is, when processing a series of information, how the model determines which content is more important and which can be ignored.
For example, when we give the model a question, it will not treat all the input information equally, but will automatically assign "attention weights" and focus on the parts that are more relevant to this question. This is the core idea of the attention mechanism: In a sequence, according to the task goal, focus the computation on the small number of truly important elements.
Early sequence models, such as RNN or Seq2Seq, actually already had this "selective focusing" characteristic, but at that time it was more dependent on the sequential structure.
The emergence of Attention enabled the model to establish connections between any two positions: it is no longer limited by temporal order or position, but can directly capture global dependencies. This is why the Transformer became a milestone architecture.
From an implementation perspective, the attention mechanism can be understood as a "query - match - weight" process. The model decomposes the input into three types of vectors: Query, Key, and Value.
Query represents the information that needs to be processed currently, that is, "what I am looking for now";
Key represents all possible clues to be matched;
Value is the content corresponding to these clues.
The model calculates the similarity between the Query and each Key to obtain a set of weights, and then uses these weights to perform a weighted sum of all the Values. The result is the part that the model considers "most worthy of attention", which is the final attention output.
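The "query - match - weight" process just described can be written out in a few lines. The following is a minimal NumPy sketch with made-up shapes, not any particular model's implementation:

```python
# Scaled dot-product attention, following the "query - match - weight" description.
import numpy as np

def attention(Q, K, V):
    # 1. Match: similarity between every Query and every Key (scaled dot product).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # 2. Weight: turn similarities into attention weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Summarize: the weighted sum of the Values is the attention output.
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))    # 4 queries  ("what I am looking for now")
K = rng.standard_normal((128, 64))  # 128 keys   ("clues to be matched")
V = rng.standard_normal((128, 64))  # 128 values ("the content behind each clue")
print(attention(Q, K, V).shape)     # (4, 64): one summary vector per query
```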
If we place the attention mechanism within the framework of a database system, it can actually be understood as a dynamic information retrieval system. Every time the model generates a new token, it needs to "query" the most relevant information in the existing semantic space, which is very similar to the process of a database executing a query request.
In this analogy, the attention weights represent the "matching degree" of a query. The higher the weight, the more the currently generated token should "pay attention" to that part of the input, that is, extract more information from the corresponding Value. In other words, every time the model generates a word, it is actually performing a "search and summarize" operation.
If we further map it to the semantics of a database, the Query is the query statement issued by the user, the Key is equivalent to the field in the index table, and the Value is the specific stored content. The model calculates the similarity between the Query and the Key to find the most matching Value, and then performs a weighted sum to obtain the output. Each step of generation is like performing a dynamic, multi-field fuzzy query in a database.
From a system perspective, this mechanism is very similar to an "implicit retriever" that completes the indexing and matching of information inside the model without an explicit structuring process. Our research starts from here, hoping to make this implicit and invisible retrieval logic explicit and optimize it in a systematic way.
To put it simply, we want to make the attention mechanism of the model more like a "controllable database":
During each generation, the model no longer passively traverses the entire context, but can actively query, filter, and call the information it really needs.
Oasis: "Retrieval Attention" is the core proposition of your paper. When you initiated the project, what specific problem did you most want to solve? Did you encounter or discover any unexpected gains during the research process?
Dr. Lu: What we wanted to solve was actually the inevitable performance bottleneck in long-context inference.
In the traditional attention mechanism, as the context grows longer, the computational complexity increases quadratically with sequence length, and a large number of Key–Value (KV) vectors must be stored, resulting in extremely high GPU memory consumption. Taking Llama-3 8B as an example, at a context of 1 million tokens the KV cache alone requires more than 125GB of GPU memory, far exceeding the capacity of a single GPU.
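A quick back-of-the-envelope check (assuming a Llama-3-8B-style configuration of 32 layers, 8 KV heads of dimension 128, and FP16 storage; the exact number depends on these assumptions) lands in the same range as the figure above:

```python
# Rough KV-cache size estimate for a Llama-3-8B-style model at 1M tokens.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                       # FP16
tokens = 1_000_000                       # 1M-token context

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
total_gib = per_token * tokens / 1024**3
print(f"{per_token / 1024:.0f} KiB per token, ~{total_gib:.0f} GiB for 1M tokens")
```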
To reduce GPU memory usage, many systems move part of the KV cache to the CPU, but this raises a new problem: when generating each token, which pieces of critical information should the model fetch from this huge pool of cached KVs? These "important tokens" change dynamically and cannot be predicted in advance.
Retrieval Attention proposes a solution to this problem. We store all the KVs in the larger CPU memory, and the GPU retains only a small amount of "active" cache. Whenever a new token is generated, the system uses vector retrieval to find the most relevant KVs in CPU memory and dynamically brings them back to the GPU for computation. In this way, GPU memory usage drops to about one tenth of the original, with almost no loss of accuracy.
The real challenge lies in deciding which KVs are the most valuable. To address this, we designed a dynamic retrieval mechanism based on approximate nearest neighbor (ANN) search and built a CPU–GPU collaborative scheduling system so that retrieval and computation run in parallel. As a result, during generation the model no longer passively traverses the entire context but actively "queries" the information it needs, achieving efficient inference over long contexts.
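Structurally, one decode step under this scheme looks roughly like the sketch below. It is a simplification: exact top-k search stands in for the ANN index, plain NumPy arrays stand in for CPU and GPU placement, and the two attention paths are folded into a single softmax rather than computed in parallel and merged as in the real system.

```python
# One decode step: retrieve the most relevant offloaded KVs, combine with the
# small resident window, and attend only over that small selected set.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(q, K_off, V_off, K_res, V_res, k=64):
    # 1. Retrieval path: find the k offloaded keys most relevant to this query
    #    (the real system uses an approximate, attention-aware index instead).
    idx = np.argpartition(K_off @ q, -k)[-k:]
    # 2. "Load back": gather the selected KVs and combine with the resident window.
    K_sel = np.concatenate([K_off[idx], K_res])
    V_sel = np.concatenate([V_off[idx], V_res])
    # 3. Attention over the small selected set only.
    w = softmax(q @ K_sel.T / np.sqrt(q.shape[-1]))
    return w @ V_sel

rng = np.random.default_rng(0)
d = 64
K_off, V_off = rng.standard_normal((100_000, d)), rng.standard_normal((100_000, d))
K_res, V_res = rng.standard_normal((512, d)), rng.standard_normal((512, d))
q = rng.standard_normal(d)
print(decode_step(q, K_off, V_off, K_res, V_res).shape)   # (64,)
```

The point of the sketch is the data flow: only the k retrieved KVs plus a small resident window ever participate in the attention computation for the new token.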
Oasis: Is this also one of the main contributions of this work?
Dr. Lu: Yes, I think the main contributions of this work can be divided into two levels.
The first level is theoretical. We proposed a new perspective: The attention mechanism can essentially be regarded as a retrieval system. Queries in traditional vector databases usually assume that the queries and data are in the same distribution, while in the attention mechanism, the queries (query) and data (key) are naturally in different distributions. Therefore, we redesigned the "retrieval structure" of attention to enable the model to more accurately find the truly important context information. In our subsequent work, RetroInfer, we further observed that the sparsity levels (the number of important data)