
Teaching large models to read "highlighted annotations": edit the Key vectors before attention and use spectral decomposition to make the model follow your instructions

QbitAI 2026-03-31 15:30
Make large models easily grasp the key points

It's not that easy to make large models focus on a specific sentence in the prompt.

In the field of NLP, Attention Steering is one of the core technologies for controlling the focusing behavior of large language models (LLMs). Among them, Prompt Highlighting, which enables the model to prioritize the key text specified by the user, is a crucial strategy.

However, existing methods require explicitly storing the full attention matrix, which makes them incompatible with efficient implementations such as FlashAttention and creates significant latency and GPU-memory bottlenecks.

To overcome this challenge, Weixian (Waylon) Li from the University of Edinburgh, in collaboration with researchers from Huawei UK Research Institute, Queen Mary University of London, and RayNeo, proposed SEKA (Spectral Editing Key Amplification) and its adaptive variant AdaSEKA.

This method takes a different approach. It directly edits the Key vectors before attention calculation, learns the "relevance subspace" through spectral decomposition to guide attention allocation, and is naturally compatible with FlashAttention, with almost zero latency overhead. Currently, this work has been accepted by the top artificial intelligence conference ICLR 2026.

Core Method: Rewrite Key Vectors Before Attention Calculation

This paper proposes SEKA (Spectral Editing Key Amplification), whose core idea is very intuitive: Instead of modifying the attention matrix after attention calculation, directly edit the Key vectors before calculation to guide attention allocation from the source.

SEKA learns the relevance subspace through spectral decomposition and edits the Key vectors before attention calculation; AdaSEKA further uses the Query vectors to dynamically combine multiple expert projections.

Specifically, SEKA consists of two stages, offline learning and online inference:

Offline Stage: By constructing contrastive prompt pairs (positive/negative/neutral), extract Key embeddings under different conditions, and use singular value decomposition (SVD) to learn a "relevance subspace". This subspace captures the most significant change directions in the Key vectors when certain tokens are relevant to the question.
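The offline stage can be sketched as follows. This is a minimal, hypothetical illustration (function name, array shapes, and the rank parameter are my assumptions, not the paper's code): it takes Key embeddings collected under contrastive conditions, applies SVD to their differences, and keeps the top singular directions as the relevance subspace.

```python
import numpy as np

def learn_relevance_subspace(pos_keys, neg_keys, rank=8):
    """Hypothetical sketch of SEKA's offline stage.

    pos_keys / neg_keys: (n, d) arrays of Key embeddings extracted from
    contrastive prompt pairs (token relevant vs. irrelevant to the question).
    Returns a rank-r projection matrix P onto the directions along which
    the Keys change most when a token becomes relevant.
    """
    diffs = pos_keys - neg_keys                  # (n, d) contrastive differences
    # SVD of the difference matrix: the top right singular vectors span
    # the directions of largest variation between the two conditions.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[:rank].T                              # (d, r) top-r directions
    return v @ v.T                               # (d, d) projection matrix P
```

Because the columns of `v` are orthonormal, the returned `P` is a symmetric, idempotent projection (`P @ P == P`), which is what the online stage then applies to highlighted Keys.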

Online Inference Stage: For the tokens to be highlighted, project and amplify their Key vectors along the learned relevance subspace. The formula is simple and elegant: k′ = k + g·P·k, where P is the projection matrix onto the relevance subspace and g is a gain coefficient.

This operation is mathematically equivalent to adding a low-rank bias term to the attention scores. However, since it acts entirely at the Key embedding level, it is naturally compatible with efficient implementations such as FlashAttention and does not require access to or storage of the attention matrix.
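The online edit and its low-rank-bias equivalence can be sketched in a few lines (again a hypothetical illustration; the function name and mask convention are my assumptions). Because the edit touches only the Key tensor, an attention kernel like FlashAttention can consume the edited Keys unchanged:

```python
import numpy as np

def seka_edit(k, P, g, highlight_mask):
    """Hypothetical sketch of SEKA's online step: k' = k + g * P @ k,
    applied only to highlighted positions.

    k: (seq, d) Key vectors for one head; P: (d, d) projection matrix;
    g: scalar gain; highlight_mask: (seq,) boolean mask of tokens to amplify.
    """
    k = k.copy()
    # Row-wise k_j + g * P @ k_j for every highlighted position j.
    k[highlight_mask] = k[highlight_mask] + g * (k[highlight_mask] @ P.T)
    return k
```

For a query q and a highlighted key k_j, the new score is q·k′_j = q·k_j + g·qᵀP k_j, i.e. the original score plus an additive low-rank bias, without ever materializing the attention matrix.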

Selective Guidance: Not All Attention Heads Are Worth Intervening

A key design of SEKA is that it does not apply guidance to all KV heads but only selects those heads that are sensitive to "relevance".

The green areas are concentrated in the middle and later layers, indicating that "retrieval" is mainly distributed in these layers, which is also the basis for SEKA to selectively apply guidance.

The above figure shows the relevance sensitivity of all layers and KV heads of Qwen3-8B. The green areas (high ℓ₂ distance) are concentrated on specific heads in the middle and later layers, which aligns closely with the distribution of "retrieval heads" reported in recent mechanistic analyses. SEKA exploits this observation and applies guidance only to these sensitive KV heads, avoiding interference with heads serving other functions. Ablation experiments confirm that removing this screening mechanism leads to a significant drop in performance.

Advanced Method: AdaSEKA Makes Guidance "Task-Specific"

The projection matrix of standard SEKA is fixed, and manual parameter tuning may be required for different types of tasks. Therefore, this paper further proposes AdaSEKA (Adaptive SEKA), which introduces a multi-expert routing mechanism:

Learn multiple sets of "expert projections" for different tasks (such as fact correction, instruction following, etc.) respectively.

During inference, use the alignment degree between the Query vector and each expert subspace to automatically calculate dynamic weights and combine the guidance operator most suitable for the current prompt in real-time.

This mechanism does not require any additional training, has extremely low computational cost, and significantly reduces the burden of hyperparameter tuning. New experts can be added modularly at any time without recalculating existing experts.
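The routing described above can be sketched as follows (a hypothetical illustration; the alignment measure and softmax temperature are my assumptions): each expert is weighted by how much of the Query lies in its subspace, and the weighted sum gives the steering operator for the current prompt.

```python
import numpy as np

def adaseka_weights(q, expert_projs, temperature=1.0):
    """Hypothetical sketch of AdaSEKA's query-based routing.

    q: (d,) a summary Query vector for the prompt;
    expert_projs: list of (d, d) expert projection matrices.
    Each expert is scored by the norm of q's projection into its
    subspace, then the scores are softmax-normalized into weights.
    """
    align = np.array([np.linalg.norm(P @ q) for P in expert_projs])
    w = np.exp(align / temperature)
    return w / w.sum()

def adaseka_operator(q, expert_projs):
    """Combine expert projections into one steering operator on the fly."""
    w = adaseka_weights(q, expert_projs)
    return sum(wi * P for wi, P in zip(w, expert_projs))
```

Adding a new expert only means appending another projection matrix to the list; no retraining of the existing experts is needed, which matches the modularity claim above.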

Experimental Results

This paper conducted comprehensive experiments on standard benchmarks such as CounterFact (knowledge conflict), Bias in Bios (occupation extraction), and Pronoun Changing (instruction following) using Qwen3 (4B/8B/14B) and Gemma3 (4B/12B).

The following table shows the performance of each method on different models:

SEKA and AdaSEKA rank in the top two in most settings, raising accuracy on CounterFact from 30-50% to nearly 99%.

The efficiency comparison is also impressive:

SEKA adds only about 0.03 seconds of latency and 0.03 GB of GPU memory per sample, is dozens of times more efficient than PASTA, and remains fully compatible with FlashAttention.

The significance of SEKA lies not only in a more efficient attention guidance method but also in revealing an important discovery: There is a structured "relevance subspace" in the Key embeddings of large models, which can be discovered and utilized through simple spectral decomposition.

This discovery provides a new perspective for understanding and controlling the attention mechanism of Transformers and opens up new ideas for building more controllable and efficient large language model systems. In today's increasingly popular long-context applications, an efficient and effective attention guidance framework has important practical value.

Paper Title: Spectral Attention Steering for Prompt Highlighting

Paper Link: https://arxiv.org/abs/2603.01281

Code: https://github.com/waylonli/SEKA

This article is from the WeChat official account "QbitAI", written by the SEKA team and published by 36Kr with permission.