Huawei's new architecture operates on the Transformer's main artery, and any model's reasoning ability soars on the spot.
It's time to operate on the Transformer's main artery.
Even though it is the cornerstone of today's AI world, its problems are quite obvious:
The moment it hits a complex math problem or needs multi-step logical reasoning, it starts confidently talking nonsense...
Where exactly does the problem lie?
The answer lies in the Transformer's core mechanism: Attention.
In essence, traditional Attention is a pairwise comparison: each word relates directly to one other word to produce an attention weight.
Although this architecture is good at capturing long-distance dependencies, it struggles to model complex multi-hop, multi-point logical relationships.
For example, it can easily understand "A knows B", but when it has to understand "Zhang San got to know Wang Wu through Li Si", a complex, indirect, multi-hop and multi-point relationship, its thinking just isn't deep enough, and it quickly hits the ceiling of its reasoning ability.
Now, Huawei's Noah's Ark Lab has broken through this ceiling!
Recently, the team introduced a brand-new architecture called Nexus, a Higher-Order Attention mechanism.
It targets the Attention mechanism's core pain point head-on: higher-order attention can effectively model complex multi-hop, multi-point associations.
And judging from the experimental results, the effect is quite striking.
Simply adopting the new Nexus architecture significantly improves a model's performance on complex reasoning tasks such as math and science, with no increase in parameters at all.
Marvelous, truly marvelous.
Next, let's take a deep dive into the ingenious improvements of Nexus.
The Ingenious Improvements of the Higher-Order Attention Mechanism
To understand why "higher-order" matters, we must first review the fundamental flaw of the traditional self-attention mechanism.
Standard self-attention essentially generates Query (Q), Key (K), and Value (V) by passing the input sequence X through three linear transformations W_Q, W_K, and W_V respectively, and then computes the attention weights through softmax:
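In the usual notation, with d_k the key dimension, this is the familiar scaled dot-product form:

$$
Q = XW_Q,\quad K = XW_K,\quad V = XW_V,\qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$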
But here comes a key problem: both Q and K are static linear projections that are independent of the context.
That is to say, a token's Query vector is determined only by that token itself and cannot perceive the other tokens, so the attention weights can only reflect direct pairwise relationships.
The First Ingenious Improvement: Rethinking Q and K
Huawei Noah's Ark Lab's first improvement targets exactly this point: Nexus makes the generation of Q and K itself an attention operation.
In other words, before computing the final Q and K, each token first performs a round of "pre-reasoning"; this process is itself a nested self-attention operation.
Through this inner loop, each token first aggregates information from the global context to form a more refined, context-aware representation, and then uses that representation to compute the final Q and K.
It's as if, before asking its question (computing attention between Q and K), each token has already thought things through internally and fully absorbed the information in the entire sequence.
Q and K generated this way shed the rigidity of a static linear projection and gain the dynamism needed to capture complex relationships.
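To make the idea concrete, here is a minimal PyTorch sketch of a second-order attention layer in this spirit. Everything about it is our own simplification: the single head, the missing mask, the class name HigherOrderAttention, and the choice to keep V tied to the raw input are assumptions for illustration, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HigherOrderAttention(nn.Module):
    """Toy sketch of an m-th order ("Nexus"-style) attention layer.

    Assumptions: single head, no causal mask, and the inner and outer
    attention share one set of W_Q / W_K / W_V projections.
    """

    def __init__(self, d_model: int, order: int = 2):
        super().__init__()
        assert order >= 1
        self.order = order
        self.d_model = d_model
        # One shared set of projections, reused at every recursion level.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def _attend(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        """Plain scaled dot-product attention using the shared projections."""
        q, k, v = self.w_q(x_q), self.w_k(x_kv), self.w_v(x_kv)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # order == 1 is standard attention; for order >= 2, the tokens used to
        # build Q and K are first refined by (order - 1) inner attention passes.
        ctx = x
        for _ in range(self.order - 1):
            ctx = self._attend(ctx, x)      # "pre-reasoning": contextualize tokens
        q = self.w_q(ctx)                   # context-aware Query
        k = self.w_k(ctx)                   # context-aware Key
        v = self.w_v(x)                     # V stays tied to the raw tokens (our choice)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        return F.softmax(scores, dim=-1) @ v

# Quick smoke test on a batch of 2 sequences of length 8:
# HigherOrderAttention(d_model=64, order=2)(torch.randn(2, 8, 64)).shape -> (2, 8, 64)
```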
The Second Ingenious Improvement: Clever Use of the Recursive Framework
The most ingenious part of the Nexus architecture lies in its Recursive Framework.
This internal attention loop can be recursively nested.
If we regard one layer of attention as modeling a first-order relationship ("A knows B"), then feeding the output of attention back in as the input of another attention layer builds a second-order relationship ("Zhang San got to know Wang Wu through Li Si"), and so on to even higher orders.
In Nexus, this recursive nesting is cleverly folded into a single-layer structure to form a hierarchical reasoning chain.
The paper makes this process recursive and defines m-th order attention as follows:
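A plausible way to write that recursion, in notation consistent with the description below (this reconstruction is ours; the paper's exact formulation may differ):

$$
H^{(0)} = X, \qquad
H^{(m)} = \mathrm{softmax}\!\left(\frac{\bigl(H^{(m-1)}W_Q\bigr)\bigl(H^{(m-1)}W_K\bigr)^{\top}}{\sqrt{d_k}}\right) XW_V
$$

so that $H^{(1)}$ is standard attention, and for $m \ge 2$ the Q and K of m-th order attention are built from the output $H^{(m-1)}$ of the (m-1)-th order.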
Here, m = 1 is standard attention; m = 2 means Q and K are generated by a single inner attention pass; m = 3 means Q and K are generated by second-order attention, which amounts to "attention of attention of attention".
This structure naturally supports multi-hop reasoning chains, much like a person solving a math problem: first understand the key variables in the question (the first layer), then work out the formula relationships between them (the second layer), and finally check whether the overall logic is self-consistent (the third layer).
The Third Ingenious Improvement: No Increase in Parameters
Complex architectures usually mean higher compute costs and more parameters, but Nexus sidesteps both through an ingenious design choice: a weight-sharing strategy.
Specifically, the inner and outer attention modules reuse the same set of projection weights W_Q, W_K, and W_V.
This means that although the computational path is more complex, the number of model parameters is exactly the same as that of the original Transformer.
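As a quick sanity check of that claim, continuing the toy HigherOrderAttention sketch from above (our code, not the paper's), a second-order module has exactly as many parameters as a first-order one, because the inner pass reuses the same projections:

```python
# Parameters live only in w_q / w_k / w_v, which are shared across orders,
# so raising the order adds compute but not parameters.
standard = HigherOrderAttention(d_model=256, order=1)
nexus_like = HigherOrderAttention(d_model=256, order=2)

n_std = sum(p.numel() for p in standard.parameters())
n_nex = sum(p.numel() for p in nexus_like.parameters())
print(n_std, n_nex)   # both 3 * 256 * 256 = 196,608
assert n_std == n_nex
```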
There is a key assumption behind this design: at every level of the recursion, the semantic transformation that projects tokens into Queries and Keys is essentially the same.
The team verified experimentally that this assumption holds.
In the Pythia-70M ablation, the weight-sharing variant, Nexus-QK-Shared, still beats the baseline by nearly 1 percentage point in average accuracy, with no increase in parameter count.
This makes Nexus an extremely efficient booster of expressive density: stronger reasoning ability from the same parameters.
Swap in Nexus, and the Reasoning Gains Are Immediate
So, how effective is Nexus?
The paper verifies this along two dimensions: training small models from scratch, and retrofitting existing large models.
Small Models Lead Across the Board
The research team trained Nexus from scratch on the Pythia series (from 70M to 1B) and evaluated it on six standard reasoning datasets: ARC-C, ARC-E, HellaSwag, LogiQA, PiQA, and SciQ.
The results are very consistent: Nexus outperforms the original Transformer at all scales.
The improvement is particularly significant on tasks that require multi-step reasoning or scientific common sense. For example:
On SciQ (science Q&A), the accuracy of the 70M model increased from 61.5% to 68.5%, a 7-percentage-point gain;
On PiQA (physical common-sense reasoning), the accuracy of the 1B model increased from 62.5% to 63.6%.
This suggests that Nexus is particularly good at problems that surface pattern matching can't solve, i.e., it is genuinely doing reasoning.
Ready-to-Use for Large Models
For larger models, Nexus also proves to be plug-and-play.
The team directly replaced the standard attention layers of the 1.5B and 7B versions of Qwen2.5 with the Nexus structure and trained only in the SFT (supervised fine-tuning) stage, without changing the pre-trained weights.
The results show that on three difficult reasoning benchmarks (MATH-500, AIME24, GPQA-Diamond), Nexus brings consistent improvements:
The accuracy of Qwen2.5-1.5B on MATH-500 increased from 78.6% to 80.1%;
The accuracy of Qwen2.5-7B on AIME24 increased from 45.2% to 47.5%.
The AIME24 result is especially worth noting, because these problems require strict multi-step logical deduction, where a single wrong step means total failure. Nexus's gain here suggests it really does build a more coherent reasoning chain internally.
From this perspective, Nexus is not just a new training paradigm but an architecture upgrade kit: you don't need to retrain a model with hundreds of billions of parameters, you only need to swap in the new attention layer at the fine-tuning stage to unlock stronger reasoning ability, along the lines of the sketch below.
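As a rough, purely illustrative picture of what such a swap could look like, here is a generic PyTorch module-replacement loop. The predicate, the builder function make_nexus_layer, and the model name in the comment are hypothetical placeholders; the paper's actual integration details are not described here.

```python
import torch.nn as nn

def swap_attention(model: nn.Module, is_attn, build_replacement) -> nn.Module:
    """Replace every attention sub-module of `model` in place.

    is_attn(module) -> bool picks the modules to swap;
    build_replacement(old) -> nn.Module builds the new layer (e.g. a
    higher-order module that reuses old's projection weights).
    """
    for parent in list(model.modules()):
        for child_name, child in list(parent.named_children()):
            if is_attn(child):
                setattr(parent, child_name, build_replacement(child))
    return model

# Hypothetical usage before supervised fine-tuning:
#   model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
#   swap_attention(
#       model,
#       is_attn=lambda m: m.__class__.__name__.endswith("Attention"),
#       build_replacement=make_nexus_layer,  # reuse the old W_Q / W_K / W_V
#   )
#   # ...then run standard SFT on the modified model.
```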
Reasoning Ability Can Be Inherent in the Architecture
Although Nexus currently focuses on language models, its idea is universal.
Modeling higher-order relationships is just as crucial in vision, graph neural networks, and multi-modal tasks; in video understanding, for example, "A saw B hitting C" is a typical ternary relationship that traditional Attention has trouble capturing directly.
The Huawei Noah team says it will next explore applying Nexus to vision Transformers and multi-modal large models, and will optimize its computational efficiency.
The Transformer's IQ ceiling may never have been about parameter count, but about the expressive power of its attention mechanism. Huawei Noah's Nexus injects higher-order reasoning ability into this core module in an elegant, efficient way.
It doesn't rely on throwing more resources at the problem or on prompt engineering; it rebuilds the model's way of thinking from the bottom of the architecture.
So Nexus is also a reminder that, sometimes, a smart architecture matters more than sheer scale.
Paper Address:
https://arxiv.org/abs/2512.03377
This article is from the WeChat official account “QbitAI”, author: Jin Lei. It is published by 36Kr with authorization.