A 2-bit complex-valued model rivals full precision: Peking University's universal framework lets large models run smoothly even on mobile phones.
Without retraining, the compressed model achieves performance comparable to FP16 at only 2 bits.
Recently, a team from Peking University proposed a universal framework called Fairy2i, which builds directly on existing pre-trained models and enables extremely low-bit quantization.
The framework losslessly converts a real-valued model into complex form via a widely-linear representation, then combines phase-sensitive quantization with recursive residual quantization to achieve a breakthrough: performance at only 2 bits approaches that of the full-precision model.
More details below.
Research Highlights: Reuse of Real Weights and Recursive Residual Quantization
It is well known that large models are often difficult to deploy efficiently for inference on edge devices such as mobile phones and cars, due to their large parameter storage and compute requirements.
Traditional quantization methods typically suffer a severe drop in performance when a model is compressed to an extremely low bit-width (e.g., 1-2 bits). Especially when reusing pre-trained models directly, it is hard to strike a balance between compression and accuracy.
Fairy2i is designed to solve exactly this problem, through the following:
1. Widely-Linear Representation: Cost-Effective, Lossless Adoption Bridging Real and Complex Numbers
During the "construction" phase, Fairy2i significantly reduces the costs required for training by solving the problem of how to "transform" a real - valued model into a complex - valued model.
In contrast to methods such as iFairy, which require heavy compute for full pre-training from scratch, Fairy2i opts for a more efficient form of "adoption".
The team proved a mathematical equivalence: any linear layer with even dimensions can be losslessly re-parameterized into an equivalent widely-linear complex form.
This means that the pre-trained weights of models such as LLaMA can be loaded directly and converted to complex form without changing the original parameter count.
This strategy not only avoids the heavy compute needed to build a complex-valued model from scratch, it also leaves the model's pre-quantization inference results exactly unchanged, providing an ideal starting point for the subsequent extremely low-bit quantization.
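To make the idea concrete, here is a minimal numpy sketch of re-parameterizing a real linear layer as a widely-linear complex map. The block splitting and the formulas for A and B are standard identities used here for illustration; they are our own construction, not code from the paper.

```python
import numpy as np

# Sketch: a real linear layer y = W x with even dimensions is rewritten as a
# widely-linear complex map w(z) = A z + B conj(z), without changing the output.
rng = np.random.default_rng(0)
m, n = 4, 6                                 # half the output / input dimensions
W = rng.standard_normal((2 * m, 2 * n))     # "pre-trained" real weight (2m x 2n)
x = rng.standard_normal(2 * n)              # real input

# Split the real matrix into blocks and pack the input into a complex vector.
W11, W12 = W[:m, :n], W[:m, n:]
W21, W22 = W[m:, :n], W[m:, n:]
z = x[:n] + 1j * x[n:]

# Widely-linear parameters (standard identity; assumed to match the paper's form).
A = 0.5 * ((W11 + W22) + 1j * (W21 - W12))
B = 0.5 * ((W11 - W22) + 1j * (W21 + W12))

# The complex-form output equals the original real output, repacked: lossless.
w = A @ z + B @ np.conj(z)
y = W @ x
assert np.allclose(w, y[:m] + 1j * y[m:])
```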
2. Phase-Sensitive Quantization: Efficient Coding with {±1, ±i}
During the "quantization" phase, Fairy2i inherits the core advantages of iFairy.
It uses the four fourth roots of unity on the unit circle, {+1, -1, +i, -i}, as the codebook. Compared with binary ({+1, -1}) or ternary ({+1, 0, -1}) quantization in the real domain, these four points in the complex domain fully utilize the 2-bit coding space, offering higher information density and better symmetry.
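A minimal sketch of phase-based 2-bit quantization, assuming nearest-phase rounding and a single per-matrix magnitude scale (the paper's exact scaling scheme may differ):

```python
import numpy as np

def phase_quantize(W):
    """Map each complex weight to the nearest fourth root of unity by phase."""
    codebook = np.array([1, 1j, -1, -1j])
    # Round the phase to the nearest multiple of pi/2 -> a 2-bit code index.
    idx = np.round(np.angle(W) / (np.pi / 2)).astype(int) % 4
    Q = codebook[idx]
    scale = np.mean(np.abs(W))          # simple per-matrix scale (assumption)
    return scale, Q, idx

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
scale, Q, idx = phase_quantize(W)
W_hat = scale * Q                        # dequantized approximation of W
```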
3. Recursive Residual Quantization: Eliminate Errors with Minimal Effort
To better approach the performance of a full-precision model, the team proposed a mechanism called recursive residual quantization.
Since a single quantization pass inevitably leaves errors, those errors are themselves quantized again: Fairy2i represents the weights as a sum of several low-bit terms.
Experiments show that with only T = 2 recursion steps (i.e., equivalent to 2 bits), the quantization noise can be significantly reduced.
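A minimal sketch of the recursion, reusing the hypothetical nearest-phase quantizer from above: each round quantizes whatever error the previous round left behind, so the reconstructed weight is a sum of T scaled code matrices.

```python
import numpy as np

def quantize_step(W):
    # Same illustrative nearest-phase quantizer as in the previous sketch.
    codebook = np.array([1, 1j, -1, -1j])
    Q = codebook[np.round(np.angle(W) / (np.pi / 2)).astype(int) % 4]
    return np.mean(np.abs(W)), Q

def residual_quantize(W, T=2):
    terms, residual = [], W.astype(complex)
    for _ in range(T):
        scale, Q = quantize_step(residual)
        terms.append((scale, Q))
        residual = residual - scale * Q   # pass the remaining error to the next round
    return terms

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
for T in (1, 2):
    W_hat = sum(s * Q for s, Q in residual_quantize(W, T))
    print(T, np.linalg.norm(W - W_hat) / np.linalg.norm(W))  # error shrinks with T
```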
Moreover, like iFairy, Fairy2i is "multiplication-free" during inference.
Since the weights are quantized to combinations of {±1, ±i}, the matrix multiplications at inference time reduce to simple additions, subtractions, and data-swap operations.
Even better, the recursive residual computation in Fairy2i is data-independent, so the multiple steps can be computed in parallel, improving accuracy without noticeably increasing inference latency.
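An illustrative sketch (not the paper's kernel) of why a {±1, ±i} codebook removes multiplications: applying a code to a complex activation is just sign flips and a swap of real and imaginary parts.

```python
import numpy as np

def apply_code(x, y, code):
    """Multiply the activation a = x + iy by the 2-bit code without any multiply."""
    if code == 0:   return  x,  y       # +1
    if code == 1:   return -y,  x       # +i : i*(x + iy) = -y + ix
    if code == 2:   return -x, -y       # -1
    return  y, -x                       # -i : -i*(x + iy) =  y - ix

# Check against ordinary complex multiplication.
a = 1.5 + 2.0j
for k, w in enumerate([1, 1j, -1, -1j]):
    re, im = apply_code(a.real, a.imag, k)
    assert np.isclose(re + 1j * im, w * a)
```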
Performance: Strong Results, Approaching FP16
Experimental results show that Fairy2i achieves impressive results on the LLaMA-2 7B model.
In language modeling ability (PPL on the C4 dataset), the 2-bit Fairy2i achieves an extremely low perplexity of 7.85.
This is not only significantly better than existing 2-bit quantization methods; it even surpasses some 3-bit quantized models and comes very close to the full-precision FP16 model (6.63).
On downstream tasks (zero-shot accuracy), Fairy2i also performs strongly, with an average accuracy of 62.00%.
This result shows that Fairy2i has almost closed the performance gap introduced by extremely low-bit quantization, leaving only a small step to the full-precision model (64.72%): a performance leap achieved with an extremely low bit budget.
Fairy2i not only addresses the difficulty of efficiently quantizing pre-trained real-valued large models, it also unlocks the potential of extremely low-bit quantization through complex-domain techniques, making it possible for large models to run smoothly on edge devices.
Note that, due to limited compute, Fairy2i has so far only been trained on 30 billion (30B) tokens.
The team is convinced that the complex representation has untapped capacity. With further training on larger datasets, Fairy2i may not only match the original full-precision base model but even surpass it in accuracy.
The corresponding article has already been published. This technology could be the key to the widespread use of large models on edge devices.
Special thanks: this research received strong support from Nine Chapters Cloud (www.alayanew.com) and the University of the Great Bay Area.
Article link: https://arxiv.org/abs/2512.02901
HuggingFace: https://huggingface.co/PKU-DS-LAB/Fairy2i-W2
GitHub: https://github.com/PKULab1806/Fairy2i-W2
ModelScope: https://modelscope.cn/models/PKULab1806/Fairy2i-W2
This article is from the WeChat account "Quantum Bit". Author: the Fairy2i team. Published by 36Kr with permission.