
YANG Yuqing from Microsoft Research: Agent's Attention System | Attention

Oasis Capital | 2025-09-05 11:41

The ability to understand long contexts is a crucial path for large models to tackle complex tasks and system scheduling.

When it comes to solving the computational bottleneck in the Prefill phase, TriangleMix is currently one of the few underlying structural optimization methods that can balance performance and accuracy.

This method was proposed by Dr. Yang Yuqing, chief R&D manager at Microsoft Research, and collaborators in the paper "TriangleMix: A Lossless and Efficient Attention Pattern for Long-Context Prefilling": a training-free combination of Attention patterns for ultra-long inputs. Through a structural design of dense shallow layers and triangle-sparse deep layers, the method significantly reduces prefill latency while preserving the model's output quality.

TriangleMix is a layer-wise structural scheme for Attention. At input lengths of 32K-128K, it reduces Time to First Token (TTFT) by 12%-32% and speeds up the Attention kernel by 3.7×-15.3×.

The underlying logic: gradient-sensitivity analysis shows that the Middle Q-K blocks contribute almost nothing in deep layers, so they are cut away, retaining only the Streaming area and the end aggregation (Last Q-K) area.

This optimization is training-free and can be combined with dynamic sparsity methods (such as MInference and FlexPrefill) to cut end-to-end cost and latency without changing the model architecture.

For Dr. Yang's team, TriangleMix is not an isolated piece of work but part of a broader line of thinking about the attention mechanism, information organization, context orchestration, and even agent-native systems.

However, what exactly has changed in the understanding of Attention behind TriangleMix? Why can it "almost losslessly" eliminate a large amount of computation? And can this method be extended to Memory, Retrieval, and larger agent system architectures? Based on these questions, we had an in-depth conversation with Dr. Yang.

Before delving into the conversation, let's quickly understand the technical motivation and core structure of TriangleMix.

When dealing with long-context tasks, the Attention of large models faces a sharp increase in computation during the prefill phase: its complexity grows as O(N²) with input length. Especially at input scales of 32K-128K, this brings heavy GPU memory pressure and long TTFT, becoming the main bottleneck for performance in actual deployment.

TriangleMix proposes a layer-wise sparse Attention architecture to address this: by analyzing the gradient sensitivity of each layer's Attention to the final output, the authors found that in the deep layers the model depends very little on the Middle Q-K area. They therefore keep standard dense attention in the shallow layers and switch to a Triangle-shaped mask in the deep layers - skipping the middle part and retaining only the front section (Streaming area) and the end (Last Q-K area) - which cuts the complexity of deep Attention from O(N²) to O(N).
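
To make the gradient-sensitivity idea concrete, here is a minimal PyTorch sketch - not the paper's implementation - that attaches a gate scalar to the Middle Q-K region of one layer's attention logits and reads the gate's gradient magnitude as a proxy for how much the final objective depends on that region. The function name, shapes, and choice of objective are illustrative assumptions.

```python
import torch

def middle_block_sensitivity(scores, middle_mask, values, objective):
    """Illustrative proxy for gradient sensitivity of one Q-K region.

    scores:      (heads, N, N) pre-softmax attention logits of a single layer
    middle_mask: (N, N) boolean tensor selecting the Middle Q-K region
    values:      (heads, N, d) value vectors
    objective:   callable mapping the attention output to a scalar loss
    """
    gate = torch.ones((), requires_grad=True)             # gates only the middle region
    gated = torch.where(middle_mask, scores * gate, scores)
    probs = torch.softmax(gated, dim=-1)
    out = probs @ values
    objective(out).backward()
    # A near-zero gradient suggests the region can be dropped with little effect.
    return gate.grad.abs().item()
```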

In practical applications, TriangleMix adopts a layered Attention split: the first 16 layers use standard full attention, and the last 16 layers switch to Triangle attention, which activates only part of the causal (lower-triangular) Attention matrix - the Streaming area and the Last Q-K block - while the Middle Q-K region is skipped (each Q still attends only to the Ks before it).
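
As a rough illustration of this layered pattern, the sketch below builds a boolean attention mask per layer: full causal attention for shallow layers, and for deep layers a triangle pattern that keeps only the Streaming area (initial sink tokens plus a local band) and the Last Q-K block while skipping the Middle Q-K region. The layer threshold and block sizes are illustrative assumptions, and real kernels operate on blocks rather than a dense boolean mask.

```python
import torch

def layer_attention_mask(n, layer_idx, tri_start=16,
                         sink=128, local=1024, last_q=2048):
    """Boolean (n, n) mask: dense causal below tri_start, triangle pattern above."""
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    if layer_idx < tri_start:
        return causal                                    # shallow layers: full attention
    rows = torch.arange(n).unsqueeze(1)                  # query positions
    cols = torch.arange(n).unsqueeze(0)                  # key positions
    streaming = (cols < sink) | (rows - cols < local)    # sink tokens + local band
    last_block = rows >= n - last_q                      # last queries see all keys
    return causal & (streaming | last_block)             # Middle Q-K region dropped
```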

This structure can be combined with existing dynamic sparse methods (such as MInference and FlexPrefill) to build a Hybrid mode; it is also a training-free structural optimization that can be deployed directly on mainstream large models such as Llama-3.1 and Qwen2.5 without retraining.

Paper experiments show that when Triangle attention is applied to the last 62.5% of layers (i.e., L_tri_start = 12) in Llama-3.1-8B-Instruct and Llama-3-8B-262K, the model still retains 99.7% of its original performance.

This means that TriangleMix can use the O(N) attention structure in most deep layers without significantly losing expressive power, thus achieving significant inference acceleration.

The actual measurement results in the paper also show that TriangleMix can significantly reduce Latency and Memory consumption with almost no loss in accuracy.

Measured results

In the Llama-3.1-8B-Instruct model, Triangle attention compresses the per-layer kernel latency from 750ms (128K context) to 49ms, an acceleration of 15.3×, and TTFT decreases by 12%-32%.
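
A back-of-the-envelope count of query-key pairs shows why the saving lands in this range. The block sizes below are illustrative assumptions, not the paper's configuration; they are only meant to show that dropping the Middle Q-K region makes the per-layer cost roughly linear in the sequence length.

```python
# Rough cost model for one deep layer at 128K context; block sizes are
# illustrative assumptions, not the paper's exact configuration.
N = 128 * 1024                        # sequence length
dense_pairs = N * N / 2               # causal dense attention: ~N^2/2 Q-K pairs
sink, local, last_q = 128, 2048, 2048
triangle_pairs = N * (sink + local) + last_q * N    # streaming band + Last Q-K block
print(dense_pairs / triangle_pairs)   # ~15.5x fewer pairs, the same ballpark as
                                      # the measured 15.3x kernel speedup
```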

In multiple benchmark tasks such as RULER and LongBench, TriangleMix shows almost the same accuracy as Dense attention, verifying its "training-free + almost lossless" structural advantage.

We have organized the in-depth interview with Dr. Yang below, focusing on the research insights behind the paper and the broader system evolution path that TriangleMix connects, from structural design to deployment efficiency.

This article is the edited interview, with a reading time of about 15 minutes.

Enjoy

"So I think instead of talking about Attention alone, we should view it from a higher perspective - placing it in larger topics such as agent systems, training mechanisms, Context expression, and task structures.

- Dr. Yang

Oasis: Hello, Dr. Yang. You work at Microsoft Research, a place that combines academia and industry. Could you introduce how to view and think about the TriangleMix research from the perspective of the interaction between academia and industry?

Dr. Yang: Let me first introduce our current overall research direction and framework.

Our team (the Shanghai Machine Learning Systems Team at Microsoft Research Asia) mainly conducts collaborative innovation research on systems and algorithms for large models and agent systems. Our work mainly focuses on two aspects:

The first part is about the efficient computation of large models, especially in long-context scenarios. We particularly focus on the research and acceleration of the sparse attention mechanism.

In this regard, in addition to TriangleMix, which we are discussing today, the team's main achievements also include:

MInference (NeurIPS 24) and MMInference (ICML 25) introduce sparse computation into Attention, mainly reducing the computational load and latency (Time-to-First-Token, TTFT) of the Prefilling phase;

RetrievalAttention and the follow-up work RetroInfer introduce vector-index retrieval into Attention computation and KV Cache organization, achieving high inference throughput with low GPU memory;

SCBench (ICLR 25) systematically classifies and compares the performance of various sparsification methods from the perspective of KV Cache Sharing;

LeanK (EMNLP 25) explores how current mainstream position-encoding techniques affect the frequency-domain distribution (across dimensions) of the KV Cache, and reduces storage and computation by removing redundancy in the frequency domain;

The other part is what we call "agent-native systems": systematic research on the development, optimization, and efficient deployment of agent systems. In this work, we treat the agent as a first-class citizen of systems research, rather than focusing only on the model.

We realized early on that when the system serves an agent rather than a single large-model call, agent-native systems have much more room for performance and efficiency gains - making agents more efficient and cost-effective, and in turn improving their work quality so they solve problems better and create real value.

Take the Parrot (OSDI 24) system we proposed in 2024 as an example. Its starting point is that in an agent system, the computational graph gives the inference system extra optimization space. Traditional large-model inference systems mainly optimize single requests, but in reality no agent completes a task with a single call, and a system that only considers single calls is usually not optimal at the agent level.

Oasis: Please elaborate on the point that an agent cannot complete a task with a single call.

Dr. Yang: An agent is essentially a set of software programs that involve multiple model calls. There are specific dependencies between these calls (for example, the output of the previous model call becomes the input of the next one), and it also involves tool use or database queries. This requires system-level optimization to consider the "entire task chain" rather than single inference.

In actual deployment, we observed two notable points:

First, if you conduct system-level optimization from the global perspective of the agent, you can basically obtain additional performance improvements. In some scenarios, we even saw more than a 10-fold improvement compared to traditional methods. This is because the optimization goal has changed - it's not about "making a single request fast" but about "making a whole set of tasks more coordinated".

Second, there is another very interesting change: more and more large-model traffic is initiated by programs rather than by humans. These programmatic call chains behave more like a new kind of system than the traditional "user input + model output".
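
To make the idea of a task chain concrete, here is a purely schematic sketch of an agent step: several model calls and a tool call with data dependencies between them. The names llm() and search_db() are hypothetical stand-ins, not a real API; a system that sees this whole graph can schedule and batch it very differently from one that only sees isolated requests.

```python
# Hypothetical agent task chain; llm() and search_db() stand in for a model
# endpoint and a tool - they are not a real API.
def answer_with_research(question: str, llm, search_db) -> str:
    plan = llm(f"Break this question into search queries:\n{question}")
    docs = [search_db(q) for q in plan.splitlines() if q.strip()]        # tool calls
    notes = [llm(f"Summarize for '{question}':\n{d}") for d in docs]     # dependent calls
    # The final call depends on every earlier output, so latency and scheduling
    # should be optimized over the whole chain, not per request.
    return llm(f"Question: {question}\nNotes:\n" + "\n".join(notes))
```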

Oasis: Will this change also affect the architectural design of training services?

Dr. Yang: Yes. We also launched a new project called Agent Lightning, focused on agent training and optimization. The question we are exploring there is how to build a standardized training service for the many different forms agent implementations take. The key is that this service must be non-intrusive. Many current optimization methods assume you use a particular framework, but many real projects have no unified framework, and some developers even find the framework itself a burden.

So we are building an "agent optimization middleware", which opens new possibilities for improving the base model's capabilities. The unified data interface provided by Agent Lightning lets the continuously generated, meaningful agent interaction data flow into the base model in a standardized way, further strengthening the base model.

On the other hand, it also sketches a blueprint for future AI application development. By delivering model optimization as a service that can seamlessly empower any agent, Agent Lightning significantly lowers the barrier to developing, iterating, and deploying high-performance adaptive agents.
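
As a loose illustration (not Agent Lightning's actual interface), the kind of record such a unified data path might carry could look like a plain trace of one agent step: the prompt, the model output, any tool results, and a reward or quality signal that a downstream trainer can consume. All field names here are hypothetical.

```python
# Hypothetical example of a standardized agent-interaction record; the field
# names are illustrative assumptions, not Agent Lightning's real schema.
interaction_record = {
    "task_id": "demo-0001",
    "step": 3,
    "prompt": "Summarize the retrieved documents for the user's question.",
    "model_output": "The three documents agree that ...",
    "tool_calls": [{"name": "search_db", "args": {"query": "..."}, "result": "..."}],
    "reward": 0.82,          # quality signal usable by a downstream trainer
}
```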

We also focus on training-free optimization methods, such as prompt optimization and context optimization.

Oasis: So, does the structural expression of Context also become a key issue?

Dr. Yang: Yes. In current agent applications, developers often need to put different data objects into the context, such as tables, files, and code repositories.

However, these different objects need to be converted (Rendering) into tokens that the model can process. Due to the differences in data distribution during model training, these specific conversion requirements often vary from model to model. Inappropriate conversion can easily affect the final performance.

For example, for a table, whether you expand it as text or encode it as structured tokens will lead to completely different results. And these decisions should not be made by agent developers, just as front-end engineers don't need to manage the DOM tree manually.

So we developed a framework called POML, which is a bit like HTML and front-end frameworks in web development. It lets developers simply declare the object type, and the system automatically converts it into an underlying structure and then maps it to an appropriate token representation. This work is not only a convenient tool for developers; it also gives us many insights into context, such as at what granularity to understand, manage, and optimize the model's context.
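
As a toy illustration of the rendering decision this takes off developers' hands (the code is illustrative and not POML itself), the same small table can be turned into two very different token streams:

```python
import json

table = {"columns": ["city", "population"],
         "rows": [["Shanghai", 24_870_000], ["Beijing", 21_540_000]]}

def render_markdown(t):
    """Expand the table as plain markdown text."""
    header = "| " + " | ".join(t["columns"]) + " |"
    sep = "| " + " | ".join("---" for _ in t["columns"]) + " |"
    rows = ["| " + " | ".join(str(c) for c in r) + " |" for r in t["rows"]]
    return "\n".join([header, sep] + rows)

def render_json(t):
    """Encode the table as a structured JSON string."""
    return json.dumps([dict(zip(t["columns"], r)) for r in t["rows"]])

# Same data, two very different token sequences; which one a model reads
# better depends on its training distribution.
print(render_markdown(table))
print(render_json(table))
```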

These insights have brought a lot of interesting thinking to our Attention + Context architecture.

Oasis: So has your current focus shifted from single-module optimization to the construction of the entire system logic?

Dr. Yang: Absolutely. Our team is currently focusing on "agent-native systems", including:

Agent optimization middleware: decouples agent developers from the system layer and supports scheduling different optimization strategies such as model training, prompt optimization, and context optimization.

Multimodal structure fusion: For example, Video RAG and Memory components. We integrate semantic Memory with Knowledge graphs to serve different task requirements.

Systems for special user groups: includes optimizing code interaction for visually impaired people and developing agents for the daily training of people with cognitive disabilities, using agents to help special groups.

The underlying optimization methods are still Attention + Memory + Retrieval, but now we are more concerned about how to turn these into a service-oriented, deployable, and interpretable system solution.

Since our work is from a system perspective, we don't think of Attention as a single module. I think instead of talking about Attention alone, we should view it from a higher perspective - placing it in larger topics such as agent systems, training mechanisms, context expression, and task structures.
