
NVIDIA and MIT Take Action: Chinese Team Makes a Major Open-Source Release, Slashing Inference Memory of Large Models by 10 Times

New Intelligence Yuan (新智元), 2026-05-14 20:53
Can an ordinary 24GB consumer-grade graphics card enable a 32B large model to read six long documents in one go and automatically write a weekly report? Researchers from NVIDIA, MIT, and Zhejiang University have jointly come up with a new method that reduces memory consumption by a factor of 10, without sacrificing intelligence or causing GPU memory overflow, completely breaking through the hardware ceiling.

An RTX 4090 with 24GB of video memory is used to run a 32B-parameter large model on agent tasks.

Without any KV compression, the video memory runs out immediately, and the model can't even start running.

After switching to TriAttention, the model runs stably. It successfully reads 6 documents and automatically generates a complete weekly report.

This is not a modification by a community expert but a joint paper from MIT, NVIDIA, and Zhejiang University.

https://arxiv.org/pdf/2604.04921

The core idea is to estimate the importance of each KV token using the triangular concentration of Q/K in the pre-RoPE space, and then only keep the truly important ones.

By way of analogy: other KV cache compression methods are like stuffing all your luggage into compression bags, flattening everything regardless of whether it is a down jacket or a brick.

TriAttention is like rummaging through the suitcase first, throwing away the bricks, and only packing the down jackets.
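To make the selection idea concrete, here is a minimal sketch of importance-based KV retention: each cached token gets a score derived from pre-RoPE query/key interactions, and only the top-scoring fraction is kept. The exact triangular-concentration score used by TriAttention is defined in the paper; the scoring function below is only a schematic stand-in.

```python
import torch

def select_kv_tokens(q_pre_rope, k_pre_rope, v_cache, keep_ratio=0.1):
    """Schematic selective KV retention (NOT the exact TriAttention scoring).

    q_pre_rope: (num_queries, head_dim)  recent queries, before RoPE is applied
    k_pre_rope: (seq_len, head_dim)      cached keys, before RoPE is applied
    v_cache:    (seq_len, head_dim)      cached values
    """
    # Stand-in importance score: how strongly each cached key responds to the
    # recent queries in the pre-RoPE space, summed over the queries.
    scores = (q_pre_rope @ k_pre_rope.T).sum(dim=0)        # (seq_len,)

    # Keep only the top keep_ratio fraction of tokens (the "down jackets").
    num_keep = max(1, int(k_pre_rope.shape[0] * keep_ratio))
    keep_idx = torch.topk(scores, num_keep).indices.sort().values

    return k_pre_rope[keep_idx], v_cache[keep_idx], keep_idx

# Toy usage: a 4096-token cache trimmed to roughly 10% of its original size.
q = torch.randn(8, 128)
k = torch.randn(4096, 128)
v = torch.randn(4096, 128)
k_small, v_small, kept = select_kv_tokens(q, k, v, keep_ratio=0.1)
print(k_small.shape, v_small.shape)
```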

The TriAttention demo shows the complete process of Qwen3-32B completing the OpenClaw agent task on a single RTX 4090.

One of the authors, Yukang Chen, posted this set of comparisons on X. On the left, without compression, the run fails immediately with an out-of-memory error; on the right, with TriAttention enabled, the agent reads through all 6 documents and outputs a complete weekly report.

2.5x Throughput, 10.7x Memory Reduction

How effective is it? Let the numbers speak.

In the AIME25 mathematical reasoning task, while matching Full Attention's accuracy (40.8%), TriAttention delivers 2.5 times the throughput.

On the memory side, the KV cache footprint shrinks by a factor of 10.7.

Performance trade-off on AIME25 (Qwen3-8B). (A) At the same accuracy (40.8%), TriAttention's throughput is 2.5 times that of Full Attention. (B) TriAttention reduces KV cache memory by 10.7 times while maintaining the same accuracy as Full Attention.

Note that here we are talking about KV cache memory, not the entire video memory of the machine, nor the total memory occupied by the model parameters.

But even the KV cache alone is often the straw that breaks the video memory in long-sequence inference scenarios.

Cutting this part down is the dividing line between a model that runs and one that does not.
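For a sense of scale, KV cache size follows a simple formula: 2 (keys plus values) x layers x KV heads x head dimension x sequence length x bytes per element. The sketch below plugs in commonly cited GQA configurations for Qwen3-8B and Qwen3-32B; treat the layer and head counts as assumptions to check against the model cards, not figures from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV cache size: keys + values across all layers, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 1024 ** 3

# Assumed GQA configs (layers, kv_heads, head_dim) -- verify against the model cards.
qwen3_8b  = dict(layers=36, kv_heads=8, head_dim=128)
qwen3_32b = dict(layers=64, kv_heads=8, head_dim=128)

# 32K tokens, roughly the generation length used in the AIME experiments.
print(f"Qwen3-8B  @ 32K tokens: {kv_cache_bytes(seq_len=32_768, **qwen3_8b) / GIB:.1f} GiB")

# On the 32B model, this cache sits on top of the INT4 weights (~16GB for 32B
# parameters), which is how a 24GB card runs out of room in long agent sessions.
print(f"Qwen3-32B @ 32K tokens: {kv_cache_bytes(seq_len=32_768, **qwen3_32b) / GIB:.1f} GiB")
```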

The main experiment was conducted on Qwen3-8B, covering tasks such as AIME24, AIME25, and MATH500.

With a 32K-token generation length, TriAttention sacrifices almost no accuracy while lifting inference efficiency to a new level.

Run a 32B Large Model on a Single 4090

A real-world deployment case is mentioned in the appendix of the paper.

The scenario is OpenClaw, a multi-round agent workflow. The task is to read 6 markdown documents and generate a weekly report.

The model is Qwen3-32B, using AWQ INT4 quantization, running on a single RTX 4090 (24GB).

Running this task without compressing the KV cache? The video memory runs out immediately.

With a long system prompt and multi-round document reading, the KV cache expands beyond what the video memory can hold.

After TriAttention takes over, the agent successfully reads all the documents and generates a complete report.

To be clear, the model used is the AWQ INT4 quantized version of Qwen3-32B, not the original full-precision FP16 version, and it runs the OpenClaw agent workflow, not a general long-text benchmark.

But it does demonstrate that "a complete, practically valuable agent task can be run on consumer-grade hardware."

vLLM Plugin Ready, Experimental MLX Support Under Way

TriAttention exists as more than just a paper.

The authors have provided vLLM integration in the GitHub repository. The README clearly states that TriAttention includes a vLLM plugin and provides instructions for the OpenAI-compatible API server mode, Python API, and OpenClaw access.

Beyond the experimental results reported in the paper, this is an engineering extension delivered at the repository level.

This means that you don't need to change the model architecture or retrain the model. You just need to attach this plugin to get the KV compression benefits on the existing vLLM inference pipeline.
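Because the server mode is OpenAI-compatible, the client side keeps the standard OpenAI call pattern. The sketch below assumes a TriAttention-enabled vLLM server is already running locally on port 8000; how to launch it with the plugin is described in the repository README and is not reproduced here, and the model name is just a placeholder.

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally running vLLM server.
# Starting vLLM with the TriAttention plugin is covered in the repo README.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B-AWQ",   # placeholder: use the model the server was started with
    messages=[
        {"role": "user", "content": "Summarize these six documents into a weekly report: ..."},
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```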

On the Apple Silicon front, the official repository includes a separate docs/mlx.md covering chips from M1 to M4, running on the MLX framework and mlx-lm, with sample code and hardware benchmarks attached.

The official TriAttention repository provides experimental support documentation for MLX, covering M1-M4 chips: https://github.com/WeianMao/triattention/blob/main/docs/mlx.md

However, the document's own title marks this as experimental support: MLX is being tested at an early stage, and a mature local Mac deployment is still some way off.
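For orientation, the MLX path goes through mlx-lm's usual load/generate interface. The snippet below is a generic mlx-lm example rather than TriAttention's own hooks (those live in docs/mlx.md), and the model id is a placeholder.

```python
# Generic mlx-lm usage on Apple Silicon; TriAttention's experimental MLX hooks
# are documented in docs/mlx.md and are not shown here.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")   # placeholder model id
prompt = "Summarize the following document in three bullet points: ..."
text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(text)
```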

Two Routes in the KV Compression Track

There are two routes in the KV cache compression track.

One is the quantization school.

Google Research released TurboQuant on March 24. The official blog positions it as a solution for "achieving extreme compression with zero precision loss", aiming to push the bit width of the KV cache and vector search down to extremely low levels.

A LongBench benchmark chart from the Google Research blog: compared with various compression methods, TurboQuant shows robust KV cache compression performance on the Llama-3.1-8B-Instruct model.

Someone in the community has already run Gemma 4 31B on Apple Silicon using TurboQuant.

The other is the selective retention school.

TriAttention is the new representative of this route. It doesn't compress the bits but directly determines which KV tokens are worth keeping and which can be discarded.

The end goal of the two routes is actually the same: to make large models run on consumer - grade hardware without running out of video memory and losing accuracy.

But the methodologies are completely different.

Quantization flattens each piece of luggage, while selective retention directly reduces the number of pieces.

Theoretically, the two can even be used in combination.
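A back-of-the-envelope sketch of why they compose: quantization shrinks the bytes per cached token, selective retention shrinks the number of cached tokens, and the savings multiply. The factors below are illustrative round numbers, not measured results.

```python
# Illustrative only: composing the two routes multiplies their savings.
fp16_bytes_per_elem = 2.0      # baseline element size
int4_bytes_per_elem = 0.5      # quantization school: shrink each "suitcase"
keep_ratio = 1 / 10            # selective retention: keep roughly 1 in 10 tokens

quant_saving = fp16_bytes_per_elem / int4_bytes_per_elem   # 4x fewer bytes per token
select_saving = 1 / keep_ratio                              # 10x fewer tokens
print(f"combined reduction: ~{quant_saving * select_saving:.0f}x")   # ~40x
```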

Currently, there is no strict head-to-head comparison on the same model, the same hardware, and the same task, so we can't yet say which one crushes the other.

But it is certain that these two routes are accelerating the progress towards consumer - grade deployment.

A year ago, "running large models locally" was still a niche craft within geek circles, and getting even a 7B model to run took real effort.

Now, a 32B model can complete an agent task on a single consumer-grade graphics card, the MLX ecosystem on Apple Silicon gains new repositories every week, and the vLLM plugin turns KV compression into a one-click, attach-and-go feature.

The KV cache compression track is evolving from an ablation experiment in papers to an engineering reality that every developer can touch.

About the Authors

Weian Mao

Weian Mao is currently a postdoctoral researcher at MIT CSAIL. He received his doctorate from AIML at the University of Adelaide under the supervision of Professor Chunhua Shen. His current research focuses on large language models, especially inference efficiency and KV cache compression in long-context reasoning. He has also worked in areas such as computer vision and protein design.

Xi Lin

Xi Lin is a senior undergraduate student majoring in Computer Science and Technology at Zhejiang University. His research interests focus on algorithm-system co-design for efficient AI, especially hardware-friendly sparse and quantization module design and efficient inference strategies. His work is closely related to high-performance computing and machine learning systems.

Wei Huang

Wei Huang is currently a doctoral student at the University of Hong Kong. His research focuses on Efficient AI and large-scale vision/language models.

Currently, he is interning at NVIDIA Research, collaborating with researchers such as Yukang Chen, and conducting relevant research under the guidance of Song Han. He has participated in projects such as QeRL and LongLive.

Reference materials:

https://arxiv.org/abs/2604.04921

https://x.com/yukangchen_/status/2041366586423165152

https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/ 

This article is from the WeChat official account "New Intelligence Yuan", edited by Yuanyu. Republished by 36Kr with authorization.