A 2-bit complex-valued model rivals full precision: a general framework from Peking University paves the way for large models to run smoothly on mobile phones.
Without training from scratch, model compression reaches 2-bit performance comparable to FP16.
Recently, a team from Peking University proposed Fairy2i, a general framework for extremely low-bit quantization built directly on existing pre-trained models.
The framework losslessly converts real-valued models into complex-valued form via a widely linear representation, then combines phase-aware quantization with recursive residual quantization, reaching performance close to the full-precision model at only 2 bits.
The details follow.
Research Core: Reusing Real-Valued Weights and Recursive Residual Quantization
As is well known, large models are difficult to deploy efficiently for inference on edge devices such as mobile phones and vehicles because of their heavy parameter storage and compute requirements.
Traditional quantization methods suffer severe performance degradation when models are compressed to extremely low bit widths (e.g., 1-2 bits); when directly reusing pre-trained models, it is especially hard to balance compression against accuracy.
Fairy2i addresses this pain point specifically, as follows:
1. Widely Linear Representation: Low-Cost Lossless Inheritance, Bridging Real and Complex Numbers
In terms of "architecture", Fairy2i significantly reduces the training cost by solving the problem of how to transform real-valued models into complex models.
Unlike methods such as iFairy, which require heavy compute for pre-training from scratch, Fairy2i takes a more efficient "inheritance" path.
The team proved a mathematical equivalence: any even-dimensional real-valued linear layer can be losslessly reparameterized as a "widely-linear complex form".
This means the pre-trained weights of models such as LLaMA can be loaded directly and converted into complex form without changing the original parameter count.
This strategy avoids the huge compute cost of building a complex-valued model from scratch and leaves the model's inference results completely unchanged before quantization, providing an ideal starting point for the subsequent ultra-low-bit quantization.
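To make this concrete, here is a minimal numerical sketch of the widely-linear identity for a toy layer: a real weight matrix acting on a stacked input [x1; x2] is reproduced exactly by y = A z + B conj(z) with z = x1 + i x2. The block partitioning and the names W, A, B, z below are illustrative assumptions; Fairy2i's exact parameterization may differ.

```python
import numpy as np

# Toy check of the widely-linear identity: a real linear layer on R^(2n)
# equals y = A z + B conj(z) on C^n, with z = x1 + 1j * x2.
rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((2 * n, 2 * n))      # stand-in for a pre-trained real weight
W11, W12 = W[:n, :n], W[:n, n:]
W21, W22 = W[n:, :n], W[n:, n:]

# Complex reparameterization of the same linear map (no parameters added or lost).
A = 0.5 * ((W11 + W22) + 1j * (W21 - W12))
B = 0.5 * ((W11 - W22) + 1j * (W21 + W12))

x = rng.standard_normal(2 * n)
x1, x2 = x[:n], x[n:]
z = x1 + 1j * x2

y_real = W @ x                               # original real-valued layer
y_cplx = A @ z + B @ np.conj(z)              # widely-linear complex form

assert np.allclose(y_real[:n], y_cplx.real)  # outputs match (up to float error)
assert np.allclose(y_real[n:], y_cplx.imag)
```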
2. Phase-Aware Quantization: Efficient Encoding Using {±1, ±i}
In terms of "quantization", Fairy2i inherits the core advantages of iFairy.
It uses the four fourth roots of unity {+1, -1, +i, -i} on the unit circle as the codebook. Compared with binary (+1, -1) or ternary (+1, 0, -1) quantization in the real domain, these four points in the complex domain make full use of the 2-bit encoding space, with higher information density and better symmetry.
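As a rough illustration of how such a codebook can be used, the sketch below snaps each complex weight to whichever of the four codes is closest in phase and keeps a single real scale; the function name phase_quantize and the per-tensor mean-magnitude scale are assumptions for illustration, not Fairy2i's actual scheme.

```python
import numpy as np

def phase_quantize(w: np.ndarray):
    """Snap complex weights to the nearest code in {+1, -1, +i, -i}.

    Returns (codes, scale) so that w is approximated by scale * codes.
    The per-tensor scale (mean magnitude) is an illustrative choice.
    """
    codebook = np.array([1, -1, 1j, -1j], dtype=np.complex64)
    # The code closest in phase maximizes Re(w * conj(code)).
    idx = np.argmax(np.real(w[..., None] * np.conj(codebook)), axis=-1)
    codes = codebook[idx]
    scale = np.mean(np.abs(w))
    return codes, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(8) + 1j * rng.standard_normal(8)
codes, scale = phase_quantize(w)   # codes fit in 2 bits per complex weight
w_hat = scale * codes              # quantized reconstruction of w
```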
3. Recursive Residual Quantization: Eliminating Errors at Extremely Low Cost
To further approach full-precision performance, the team proposed the Recursive Residual Quantization mechanism.
Since a single pass of quantization inevitably leaves errors, Fairy2i quantizes those errors again, representing each weight as the sum of several low-bit terms.
Experiments show that only T = 2 recursive stages are needed to largely eliminate quantization noise; because each complex weight corresponds to two real parameters of the original layer, two 2-bit complex stages amount to an overall 2-bit weight budget.
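To make the recursion concrete, here is a minimal sketch that reuses the illustrative phase_quantize() helper from the previous snippet: each stage quantizes whatever the earlier stages failed to capture, and the per-stage reconstructions are summed at the end. The stage count and scaling here follow the T = 2 example in the text, not necessarily the paper's exact algorithm.

```python
import numpy as np

def recursive_residual_quantize(w, T=2):
    """Quantize w, then quantize the remaining error, T times in total.

    Returns a list of (codes_t, scale_t) with w approximated by
    sum_t scale_t * codes_t. Relies on phase_quantize() defined above.
    """
    stages = []
    residual = np.array(w, dtype=complex)
    for _ in range(T):
        codes, scale = phase_quantize(residual)   # stage t: 2-bit complex codes
        stages.append((codes, scale))
        residual = residual - scale * codes       # error left for the next stage
    return stages

rng = np.random.default_rng(0)
w = rng.standard_normal(256) + 1j * rng.standard_normal(256)
stages = recursive_residual_quantize(w, T=2)
w_hat = sum(s * c for c, s in stages)
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))  # relative error shrinks as T grows
```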
In addition, like iFairy, Fairy2i also has the characteristic of "no multiplication" during inference.
Since the weights are quantized into combinations of {±1, ±i}, the matrix multiplication during inference is converted into simple addition, subtraction, and data swapping operations.
More ingeniously, Fairy2i's recursive residual computation is data-independent, so the stages can be processed in parallel: accuracy improves while inference latency barely increases.
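A minimal scalar illustration of why no multiplications are needed: multiplying an activation a + bi by any code in {+1, -1, +i, -i} only negates and/or swaps its real and imaginary parts, so a full matrix product reduces to accumulating such terms with additions (with the real scales applied once per output). This is a conceptual sketch, not Fairy2i's inference kernel.

```python
def apply_code(a: float, b: float, code: str):
    """Multiply the activation (a + b*i) by a 2-bit code without any multiplication."""
    if code == "+1":   # (a + bi) *  1 =  a + bi
        return a, b
    if code == "-1":   # (a + bi) * -1 = -a - bi
        return -a, -b
    if code == "+i":   # (a + bi) *  i = -b + ai
        return -b, a
    if code == "-i":   # (a + bi) * -i =  b - ai
        return b, -a
    raise ValueError(f"unknown code: {code}")
```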
Performance: Strong Results, Approaching FP16
Experiments show that Fairy2i achieves remarkable results on the LLaMA-2 7B model.
In terms of language modeling ability (PPL on the C4 dataset), Fairy2i (2-bit) achieved an extremely low perplexity of 7.85.
This performance is not only significantly better than existing 2-bit quantization methods but also surpasses some 3-bit quantization models, approaching the level of full-precision FP16 (6.63).
In the evaluation of downstream tasks (Zero-shot Accuracy), Fairy2i also performed strongly, with an average accuracy of 62.00%.
This result shows that Fairy2i nearly closes the performance gap caused by ultra-low-bit quantization: it trails the full-precision model (64.72%) by only a small margin, a leap in performance under an extremely low bit budget.
Fairy2i not only solves the long-standing difficulty of efficiently quantizing pre-trained real-valued large models but also exploits the potential of ultra-low-bit quantization through complex-domain techniques, making it feasible for large models to run smoothly on edge devices.
It is worth noting that, due to limited computational resources, the current Fairy2i has been trained on only 30 billion (30B) tokens.
The team firmly believes that the complex representation has capacity that has not yet been fully exploited: with continued training on larger-scale datasets, Fairy2i is expected not only to match but to surpass the original full-precision base model in accuracy.
Currently, the relevant paper has been made public, and this technology may become a key driving force for the popularization of large models on edge devices.
The team would especially like to thank AlayaNew (www.alayanew.com) and the Greater Bay Area University for their strong support of this research.
Paper link: https://arxiv.org/abs/2512.02901
HuggingFace: https://huggingface.co/PKU-DS-LAB/Fairy2i-W2
GitHub: https://github.com/PKULab1806/Fairy2i-W2
ModelScope: https://modelscope.cn/models/PKULab1806/Fairy2i-W2
This article is from the WeChat official account "QbitAI", author: Fairy2i team. Republished by 36Kr with permission.