Lossless compression beyond ZIP is here. The University of Washington turns large models into lossless text compressors.
As large language models generate massive amounts of data, storing that data becomes a challenge of its own.
In response, researchers from the SyFI Lab at the University of Washington (UW) have proposed an innovative solution: LLMc, an engine that uses large language models themselves as lossless text compressors.
Benchmark results show that on datasets such as Wikipedia, novel texts, and scientific abstracts, LLMc achieves better compression ratios than traditional compression tools such as ZIP and LZMA. Compared with other closed-source LLM-based compression systems, LLMc also performs on par or better.
It's worth mentioning that the project has been open-sourced. The lead author is Yi Pan, an undergraduate from the ACM Class at Shanghai Jiao Tong University who is currently interning at the University of Washington.
The Compression Mechanism of LLMc
The inspiration for LLMc came from an internal lab discussion a year ago. At the time, the researchers faced a core challenge: the kernel operations involved in LLM inference are highly non-deterministic, which makes precise, reproducible compression and decompression difficult.
However, with the industry's breakthrough in deterministic LLM inference, this obstacle has been removed, paving the way for the new engine. The team quickly built an LLMc prototype and demonstrated the feasibility of using LLMs for efficient compression.
The connection between LLM and data compression is rooted in the basic principles of information theory.
Shannon's source coding theorem states that the optimal code length for a symbol equals its negative log probability: a symbol with probability p(x) ideally takes about -log2 p(x) bits. In short, the more probable an event, the less information is needed to encode it.
Since the core task of an LLM is to predict the next token, a good LLM assigns a very high probability to the next token of a real text sequence.
This means an LLM is essentially a powerful probability-prediction engine, and that is the key to efficient compression. LLMc exploits this principle to turn the high-dimensional distribution of natural language into structured probability information, achieving compression ratios that traditional tools cannot reach.
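To make the arithmetic concrete, here is a minimal sketch (illustrative only, not code from LLMc) that computes the ideal code length, in bits, for tokens predicted at different probabilities:

```python
import math

def ideal_code_length_bits(p: float) -> float:
    """Shannon's ideal code length, in bits, for a symbol with probability p."""
    return -math.log2(p)

# A confidently predicted token costs very little to encode,
# while an unlikely one costs far more.
for p in (0.9, 0.5, 0.01, 0.0001):
    print(f"p = {p:<7} -> {ideal_code_length_bits(p):6.2f} bits")
```

A token predicted with 90% probability costs a fraction of a bit to encode, while a one-in-ten-thousand surprise costs more than 13 bits.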
The core idea of LLMc is a clever method called "rank-based encoding".
During compression, the LLM predicts the next token from the current context and produces a complete probability distribution over candidate tokens. In most cases, the token that actually appears sits among the top few entries of this prediction list.
Instead of storing the token itself (e.g., its ID), LLMc stores the token's "rank" in the probability-sorted list. These ranks are usually very small integers, so they take up very little storage space.
During decompression, the system uses the exact same LLM and the same context to reproduce the probability distribution, then reads the stored "rank" to pick the corresponding token from the list, losslessly restoring the original text.
In this process, the LLM itself acts like a large-capacity "codebook" or reference system shared between the compressor and the decompressor.
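As a rough illustration of this round trip, the following toy sketch (not the actual LLMc implementation; the ranked_candidates function below is a stand-in for a deterministic LLM) compresses a token sequence into ranks and reconstructs it exactly:

```python
from typing import List

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def ranked_candidates(context: List[str]) -> List[str]:
    """Toy stand-in for an LLM: deterministically rank the vocabulary given
    the context. A real system would sort tokens by the model's predicted
    probabilities; here a fixed, context-dependent rotation keeps the
    example self-contained and reproducible."""
    shift = len(context) % len(VOCAB)
    return VOCAB[shift:] + VOCAB[:shift]

def compress(tokens: List[str]) -> List[int]:
    """Store each token's rank in the model's prediction list."""
    ranks, context = [], []
    for tok in tokens:
        ranks.append(ranked_candidates(context).index(tok))
        context.append(tok)
    return ranks

def decompress(ranks: List[int]) -> List[str]:
    """Replay the same model and context to turn ranks back into tokens."""
    context: List[str] = []
    for r in ranks:
        context.append(ranked_candidates(context)[r])
    return context

text = ["the", "cat", "sat", "on", "the", "mat", "."]
ranks = compress(text)
assert decompress(ranks) == text  # lossless round trip
print(ranks)  # mostly small integers, cheap to store
```

Because the real next token usually sits near the top of the model's prediction list, most stored ranks are tiny integers; the crucial requirement is that the compressor and decompressor reproduce bit-for-bit identical distributions, which is why deterministic inference matters.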
Challenges and Limitations
Although LLMc has achieved breakthrough results, the research team also points out some challenges and limitations of the current version.
Efficiency: The computational cost of LLM inference grows quadratically with sequence length, and long-sequence inference is limited by memory bandwidth. To alleviate this, LLMc processes text in blocks, which improves GPU utilization and reduces computational overhead (see the sketch after this list).
Throughput: Because it relies heavily on large-scale model inference, LLMc currently processes data far more slowly than traditional compression algorithms.
Numerical stability: To keep decompression deterministic, the system relies on special kernels (batch_invariant_ops) and encodes token ranks as integers rather than using log probabilities directly.
Application scope: The current implementation targets natural language. Extending it to other modalities such as images, video, or binary data is a direction worth exploring in the future.
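For intuition about the block-wise strategy mentioned in the efficiency item above, here is a minimal sketch, assuming a hypothetical compress_block callable that applies the rank-based encoder to one chunk of tokens; the actual chunking logic in LLMc may differ:

```python
from typing import Callable, List

def compress_in_blocks(
    tokens: List[int],
    compress_block: Callable[[List[int]], bytes],
    block_size: int = 1024,
) -> List[bytes]:
    """Split the token stream into fixed-size blocks and compress each one
    independently. Shorter contexts keep the quadratic attention cost
    bounded, and independent blocks can be batched to keep the GPU busy."""
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    return [compress_block(block) for block in blocks]
```

Decompression mirrors this: each block is decoded with the same model and the results are concatenated.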
Reference link:
https://syfi.cs.washington.edu/blog/2025-10-03-llmc-compression/
GitHub:
https://github.com/uw-syfi/LLMc
This article is from the WeChat public account "QbitAI", author: Shuofeng. It is published by 36Kr with authorization.