
Memory cut by 50x without loss of accuracy: MIT proposes Attention Matching. Can it end the GPU memory crisis of large models?

机器之心 (Synced), 2026-06-01 10:48
OpenClaw may really be able to ingest the knowledge of the entire world with a single machine.

Imagine this scenario: You're staring at the screen, watching your autonomous AI agent (such as OpenClaw) operating frantically.

It is autonomously reviewing an epic open-source project with hundreds of thousands of lines of code, moving through countless files, API documents, and debugging logs. It behaves like an indefatigable super-programmer, but beneath this "all-powerful" appearance lies a hardware nightmare that could blow up at any moment. As the context grows longer, the model's "working memory" balloons, devouring the expensive pool of GPU memory like a bottomless pit!

The GPU-memory killer that strikes fear into the hearts of all enterprise-level AI developers is the KV cache.

But now a solution has emerged from a research team at the Massachusetts Institute of Technology (MIT), including Adam Zweiger and Xinghong Fu. They have developed a new latent-space compression technique called "Attention Matching".

Paper title: Fast KV Compaction via Attention Matching

Paper address: https://arxiv.org/pdf/2602.16284

Code address: https://github.com/adamzweiger/compaction

It can compress the context memory of large language models by up to 50 times in just a few seconds, with almost no loss of accuracy!

This means that tasks such as ultra-long conversations or giant document analysis, which previously needed an entire H100 GPU array just to run at all, can now run concurrently on a single graphics card. An efficiency revolution in AI infrastructure seems to have quietly begun.

The Expensive Working Memory: The Achilles' Heel of Large Models

To understand how incredible this technology is, we must first face the weakness of large models.

LLMs are autoregressive: they generate responses token by token. To avoid recomputing the entire chat history of tens of thousands of words when predicting each new token, the model caches the "mathematical essence" of each previously processed token. These cached multi-dimensional vectors are the "Key" and "Value" pairs, i.e., the KV cache.

As the context expands, this layer of working memory will irreversibly inflate.

In modern enterprise-level applications, such as analyzing hundreds of pages of legal contracts, maintaining the memory of a private AI companion for months, or running an autonomous coding agent like OpenClaw, the KV cache of a single user's request can quickly grow to dozens of gigabytes.
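To make the scale concrete, here is a back-of-the-envelope sketch of the KV cache size for one sequence. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values, roughly the shape of a Llama-3.1-8B-class model) and the 128,000-token context are assumptions chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for a single sequence."""
    # Factor 2: one Key vector and one Value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed, illustrative configuration (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)
full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=128_000)
compacted = full / 50  # the 50x ratio quoted in the article

print(f"per token : {per_token / 1024:.0f} KiB")
print(f"128k ctx  : {full / 2**30:.3f} GiB")
print(f"after 50x : {compacted / 2**30:.4f} GiB")
```

Under these assumptions a single long request costs about 15.6 GiB before any model weights are counted, which is why batch size collapses as contexts grow.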

As Adam Zweiger, the first author of the paper, said: "In ultra-long context services, KV Cache is the biggest physical bottleneck. It not only locks down the concurrency, forces you to reduce the batch size, but also forces the system to perform frequent offloading that severely affects performance."

Facing this money-guzzling beast, researchers have tried many solutions:

Token Discarding and Merging (such as H2O, SnapKV, PyramidKV, etc.): These methods attempt to discard the tokens that the model considers "unimportant". They work for mild compression, but once the compression ratio is pushed higher (e.g., beyond 10 times), the model's capability falls off a cliff.

Text Summarization: This is currently the industry's reluctant default. When the memory runs out, the system pauses, lets the model write a summary of the context, and then clears the original memory. This method is extremely "lossy" and can completely erase small but crucial details (such as a rare indicator in a medical record).

Latent Space Compression (such as Cartridges): This is a recent cutting-edge line of work, which shows that high-ratio compression is not only feasible but can also maintain high accuracy. However, its cost is extremely high: it requires slow end-to-end gradient descent to train these compressed memories. It takes hours to compress a piece of context, even with an expensive GPU! That is simply impractical for real-time enterprise applications that require "instant responses".

What is needed is a method with both the accuracy of Cartridges and the speed of the traditional approaches. MIT's "Attention Matching" is designed for exactly this purpose.

The Counter-Intuitive Mathematical Magic: The Underlying Logic of "Attention Matching"

The researchers at MIT didn't focus on slow machine learning training but came up with a brilliant mathematical shortcut. They took a step back and asked a very fundamental question: What does the model really care about when we compress the memory?

The answer is: The model doesn't care how many Keys and Values you store. It only cares about what results this pile of memories can return when it throws out a query (Query, i.e., q)!

To make the model believe that "the compressed memory is exactly the same as the original large memory", the compressed key-value pairs (C_k, C_v) must match two core mathematical properties of the original memory:

Attention Output: This is the actual information vector extracted by the AI.

Attention Mass: This is a crucial point. When new tokens or other blocks of memory are concatenated, the influence of a piece of memory depends on its "mass", the total unnormalized attention weight it receives.

If you naively compress 1000 tokens into 20, the total mass of these 20 tokens will be far smaller than that of the original 1000, so the model will severely underweight the compressed memory during subsequent inference. To resolve this, the research team introduced a small but ingenious variable: a per-token scalar bias β.

This bias β acts like a counterweight on a lever. It is added to a retained Key's attention logit, which multiplies that Key's weight by e^β once the exponential is applied, so a single retained Key can carry the mass of 50 removed Keys!
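The effect of the bias can be checked with a few lines of arithmetic. This is a toy sketch under a simplifying assumption: 50 original Keys that all produce the same logit for a given query (the value 1.3 is arbitrary), replaced by a single retained Key.

```python
import math

def attention_mass(logits, betas=None):
    """Total unnormalized attention weight: sum of exp(logit + bias)."""
    if betas is None:
        betas = [0.0] * len(logits)
    return sum(math.exp(s + b) for s, b in zip(logits, betas))

LOGIT = 1.3  # arbitrary illustrative query-key score

original = attention_mass([LOGIT] * 50)       # 50 original Keys
naive = attention_mass([LOGIT])               # 1 retained Key, no bias
beta = math.log(50)                           # adding log(50) multiplies the weight by 50
corrected = attention_mass([LOGIT], [beta])   # 1 retained Key with bias

print(original, naive, corrected)
```

Without the bias, the retained Key carries one fiftieth of the original mass; with β = log 50 it carries all of it. Real Keys do not share identical logits, which is why β has to be fitted rather than set by hand.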

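Expressed in mathematical terms (Formulas 1 and 2 in the paper), the goal is to find (C_k, β, C_v) such that, for all relevant queries q, the attention output matches. Writing K and V for the original keys and values, and absorbing the usual 1/√d scaling into q, this reads:

$$\mathrm{softmax}\left(q K^\top\right) V \;\approx\; \mathrm{softmax}\left(q C_k^\top + \beta\right) C_v$$

And match the total mass:

$$\sum_{i} \exp\left(q K_i^\top\right) \;\approx\; \sum_{j} \exp\left(q\,(C_k)_j^\top + \beta_j\right)$$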

Even better, thanks to this careful formulation, the seemingly complex non-linear optimization problem decomposes naturally! The researchers dispensed entirely with computationally expensive backpropagation and gradient-based optimization.

First, with C_k fixed, the mass-matching problem reduces to a non-negative least squares (NNLS) problem, and the bias β can be computed almost instantly.

Subsequently, the attention output matching problem directly becomes a standard ordinary least squares (OLS) problem, and the compressed value C_v can be obtained in the blink of an eye through simple algebraic matrix operations!
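The two closed-form steps can be sketched for a single attention head as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the kept indices are assumed to be given, scores are shifted per query for numerical stability (which rescales each row of the mass equation), and zero NNLS weights are clipped so that β = log w stays finite.

```python
import numpy as np
from scipy.optimize import nnls

def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def full_attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[1])) @ V

def compact_attention(Q, C_k, beta, C_v):
    # beta is added to the logits of the retained keys.
    return softmax_rows(Q @ C_k.T / np.sqrt(C_k.shape[1]) + beta) @ C_v

def fit_beta_and_values(Q_ref, K, V, keep_idx):
    """With the kept keys fixed, fit beta by NNLS and C_v by OLS."""
    scores = Q_ref @ K.T / np.sqrt(K.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # per-query shift
    E = np.exp(scores)                                   # unnormalized weights
    C_k = K[keep_idx]

    # Step 1 (NNLS): find w >= 0 with E[:, keep] @ w ~= total mass per query.
    w, _ = nnls(E[:, keep_idx], E.sum(axis=1))
    beta = np.log(np.maximum(w, 1e-12))                  # w = exp(beta)

    # Step 2 (OLS): with C_k and beta fixed, the output is linear in C_v.
    A = softmax_rows(Q_ref @ C_k.T / np.sqrt(K.shape[1]) + beta)
    C_v, *_ = np.linalg.lstsq(A, full_attention(Q_ref, K, V), rcond=None)
    return C_k, beta, C_v

# Toy usage with random data: 64 tokens compacted to 16.
rng = np.random.default_rng(0)
Q_ref = rng.normal(size=(32, 8))
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 8))
keep_idx = np.arange(0, 64, 4)
C_k, beta, C_v = fit_beta_and_values(Q_ref, K, V, keep_idx)
```

Because step 2 is a plain least-squares fit, the fitted C_v can never do worse on the reference queries than simply reusing the original Values of the kept tokens.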

This is an overwhelming advantage. What originally took hours of training has been reduced to the order of seconds by linear algebra.

(Image: from VentureBeat, generated by AI)

Anticipating Your Anticipation: How to Extract "Reference Queries" and Select "Golden Keys"?

With the mathematical weapon, the subsequent engineering implementation is equally amazing. To let the compression algorithm know what to retain, the system needs a batch of "reference queries" (Q_ref) as "substitutes" for the questions the model may ask in the future.

The research team designed an extremely clever "rehearsal" mechanism:

Repeat-prefill: Quietly append a hidden instruction to the end of the document, "Repeat the previous context", and then capture the internal Query vectors the model generates as it tries to repeat it.

Self-study: Let the model perform a quick synthesis task on the document, such as "extracting all core facts" or "structuring dates into JSON", to probe what kind of Queries the model generates during in-depth reasoning.

Armed with these highly representative Query probes, the system starts to select "golden keys" (C_k) from the vast sea of original Keys. The paper provides two methods:

Highest Attention Keys: This is a very fast heuristic that directly selects the Keys that receive the highest attention from the reference queries. It is extremely fast and cost-effective.

Orthogonal Matching Pursuit (OMP): This is a more sophisticated greedy algorithm. It works like stacking building blocks: at each step, it selects the Key that best fills the residual "mass error", and then recalibrates the weights with NNLS. Although it takes a little more time (still just a few minutes), it pushes the compaction quality to its peak (AM-OMP).
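The two selection strategies can be sketched for a single head as below. Several details are assumptions made for illustration rather than the paper's exact procedure: attention is averaged over the reference queries to score each Key, OMP correlations are normalized by column norm, and all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def unnormalized_weights(Q_ref, K):
    scores = Q_ref @ K.T / np.sqrt(K.shape[1])
    return np.exp(scores - scores.max(axis=1, keepdims=True))

def select_highest_attention(Q_ref, K, budget):
    """Keep the keys with the highest mean attention over the reference queries."""
    E = unnormalized_weights(Q_ref, K)
    attn = E / E.sum(axis=1, keepdims=True)
    mean_attn = attn.mean(axis=0)
    return np.sort(np.argsort(-mean_attn)[:budget])

def select_omp(Q_ref, K, budget):
    """Greedy OMP on the mass-matching problem, refitting weights with NNLS."""
    E = unnormalized_weights(Q_ref, K)
    target = E.sum(axis=1)                  # total mass per reference query
    col_norm = np.linalg.norm(E, axis=0)
    selected, residual, w = [], target.copy(), np.zeros(0)
    for _ in range(budget):
        # Pick the key whose column best explains the remaining mass error.
        corr = (E.T @ residual) / col_norm
        if selected:
            corr[selected] = -np.inf        # never pick the same key twice
        selected.append(int(np.argmax(corr)))
        # Recalibrate non-negative weights on all keys chosen so far.
        w, _ = nnls(E[:, selected], target)
        residual = target - E[:, selected] @ w
    return np.array(selected), w, residual

# Toy usage: choose 8 of 64 keys.
rng = np.random.default_rng(0)
Q_ref = rng.normal(size=(32, 8))
K = rng.normal(size=(64, 8))
top_idx = select_highest_attention(Q_ref, K, 8)
omp_idx, omp_w, omp_res = select_omp(Q_ref, K, 8)
```

The heuristic needs a single pass over the attention weights, while OMP solves one NNLS problem per selected Key, which matches the speed and quality trade-off described above.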

Not All "Attentions" Are Created Equal: Non-uniform Compression Strategy

And that is not all. When delving into the model architecture, the researchers found an interesting phenomenon: in the multi-head attention mechanism, not all "heads" are workaholics.

Some heads are extremely greedy and require a large KV capacity to maintain performance (such as the heads responsible for long-range dependencies), while others are extremely laid-back. Even if you cut off 90% of their memory, they still operate perfectly (such as the heads that only focus on local lexical structures).

Based on this insight, the team developed a Non-uniform Compaction strategy: they pre-compute a "sensitivity curve" for each model, like giving each attention head a medical examination. During actual compression, the system no longer uses a one-size-fits-all approach. Instead, it allocates the precious GPU memory budget to the "core heads" that are most sensitive to information loss. This strategy brings a marked improvement in the performance of the compressed model!

Even on hybrid-architecture models like Gemma-3-12B that make extensive use of sliding window attention, Attention Matching still shows strong adaptability and robustness.
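One way to picture such an allocation is a greedy pass over per-head sensitivity curves. This is a hypothetical sketch and not the paper's procedure: it assumes each curve lists an estimated loss for 0, 1, 2, ... budget units, and it always gives the next unit to the head that benefits most. The two curves are invented for illustration.

```python
import heapq

def allocate_budget(sensitivity, total_units):
    """Greedily assign budget units to the heads with the largest marginal gain.

    sensitivity[h][u] is the (assumed, precomputed) loss of head h when it
    keeps u budget units; curves are expected to be non-increasing in u.
    """
    alloc = [0] * len(sensitivity)
    heap = []  # max-heap via negated gains: (-gain, head)
    for h, curve in enumerate(sensitivity):
        if len(curve) > 1:
            heapq.heappush(heap, (-(curve[0] - curve[1]), h))
    for _ in range(total_units):
        if not heap:
            break  # every head is already at its maximum budget
        _, h = heapq.heappop(heap)
        alloc[h] += 1
        u, curve = alloc[h], sensitivity[h]
        if u + 1 < len(curve):
            heapq.heappush(heap, (-(curve[u] - curve[u + 1]), h))
    return alloc

# Invented curves: a "greedy" long-range head and a "laid-back" local head.
long_range_head = [1.0, 0.6, 0.3, 0.1, 0.0]
local_head = [0.2, 0.05, 0.04, 0.04, 0.04]
alloc = allocate_budget([long_range_head, local_head], total_units=4)
print(alloc)  # [3, 1]
```

With four units to spend, the sensitive head receives three and the local head one. Greedy allocation of this kind is optimal only when each curve has diminishing returns, which is an additional assumption here.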

Stress Test: The Moment to Witness the Miracle

To verify whether this technology can really survive the real-world grinder, the researchers selected Qwen3-4B, Llama3.1-8B, and Gemma3-12B and put them into two completely different test fields.

1. QuALITY Benchmark Test: Crushing the Competition

In this standard reading comprehension test, with passages of 5000 to 8000 words, Attention Matching at an extreme 50-fold compression ratio took only a few seconds to a minute (depending on whether the OMP algorithm was used) and outperformed all predecessors based on token pruning, such as H2O+, SnapKV, and KVzip. Its accuracy curve closely followed that of Cartridges, which took hours, demonstrating what "fast, accurate, and ruthless" means.

2. LongHealth Medical Records: The Graveyard of Traditional Solutions

This is a dataset representing a real enterprise-level challenge. A total of 60,000 tokens are filled with complex medical records, test reports, and medication records of multiple patients, with extremely high information density.

In this test, the "text summarization" most commonly used in the industry fails completely. Its accuracy drops to the same floor as "providing no context", which means that the model might as well not have read the summary.

Attention Matching, on the other hand, far surpasses all the traditional stop-gap measures.

Of course, Zweiger also frankly gave engineering advice: "For tasks with extremely high information density, if you want to retain all details, it is recommended to adjust the compression ratio more moderately (such as 10 or 20 times) in exchange for absolute accuracy."

3. AIME 2025 Online Dynamic Compression: Changing the Engine in Flight

What is most exciting is the proof-of-concept for online compression. Facing the top-level mathematical reasoning questions of AIME, the researchers fixed a hard ceiling on physical memory. The model is effectively performing extremely demanding calculations inside a small cage.

Whenever the memory is full, the system will instantly press