
Zhipu was just a little unlucky. Their research on visual tokens clashed with that of DeepSeek again.

量子位 (QbitAI) | 2025-10-23 09:29
Are pixels the ultimate tokens for AI?

What a coincidence... Zhipu and DeepSeek have clashed again.

The competition is fierce. Less than a day after the release of DeepSeek-OCR, Zhipu open-sourced its own visual-token solution, Glyph.

Since the two are competing on the same stage, it was only natural to invite Karpathy, who has been enthusiastically endorsing DeepSeek these days, to take a look:

Perhaps you'll also be interested in our work.

It's just a paper release, and they're already vying for attention. (doge)

Netizens joked: It's like there's a domineering CEO romance in the AI world.

Zhipu Also Does Visual Compression

Yes, just like DeepSeek-OCR, Zhipu's paper aims to solve the long-context problem of today's LLMs through visual means.

The Surging Context

As LLM capabilities skyrocket, the demand from users and vendors for long contexts is becoming more and more urgent.

After all, whether it's long-document analysis, code review, or multi-turn conversation, a model can't be a goldfish that forgets what it has just seen. To perform tasks reliably, it needs a stable "working memory".

But expanding the context is a rather thankless task.

For example, if you expand the context from 50K to 100K, the computing power consumption will approximately quadruple.

The reason is that more tokens mean more activations, KV caches, and attention weights for the model to keep track of, all of which consume substantial resources during both training and inference.
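
To get a feel for where that roughly 4x figure comes from: self-attention compares every token with every other token, so its cost grows with the square of the sequence length. Below is a back-of-the-envelope sketch; the hidden dimension and the FLOP formula are simplified stand-ins, not any particular model's numbers.

```python
# Back-of-the-envelope cost of self-attention: it scales with the square of sequence length.
def attention_flops(num_tokens: int, hidden_dim: int = 4096) -> float:
    """Rough FLOPs for one attention pass: QK^T scores plus the weighted sum over values."""
    return 2 * (num_tokens ** 2) * hidden_dim

ratio = attention_flops(100_000) / attention_flops(50_000)
print(f"Going from 50K to 100K context costs ~{ratio:.0f}x the attention compute")  # ~4x
```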

If it can truly improve performance, spending more money might be acceptable.

But what's most distressing is that even after investing a lot of money to expand the context, the model may not necessarily become smarter.

IBM's research points out that simply "stuffing more tokens" doesn't guarantee a linear improvement in the model's performance.

On the contrary, when the input is too long and the information is too complex, the model may be overwhelmed by noise and information overload, becoming more confused.

Currently, there are roughly three mainstream solutions to this kind of problem:

The first type is to expand the position encoding.

In the Transformer architecture, the model has no built-in sense of input order, so each token is given a "position encoding" that tells the model where it sits in the sequence.

The approach here is to directly extend the range of the existing position encodings.

For example, "interpolating" the position range from 0-32K to 0-100K lets the model accept longer inputs at run time without retraining.

However, this doesn't solve the problem of inference cost. The model still needs to traverse all the context during the inference phase.

Moreover, even though the model can now keep reading, it has never seen contexts this long during training, so forcing it to read them rarely yields good results.
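
For a concrete picture of what "interpolating" the positions means, here is a minimal sketch using RoPE-style rotary frequencies; the dimension, base, and scaling scheme are illustrative assumptions, not any specific model's settings.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int = 64, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotary position angles; scale < 1 squeezes long positions back into the trained range."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions * scale, inv_freq)

trained_max, target_max = 32_000, 100_000
scale = trained_max / target_max                     # ~0.32: position 100K now "looks like" 32K
angles = rope_angles(np.arange(0, target_max, 1000), scale=scale)
```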

The second type is to modify the attention mechanism.

Since the context has grown longer, make the model "read" faster: techniques such as sparse attention and linear attention improve the efficiency with which each token is processed.

But no matter how fast it is, the total number of tokens remains the same. If the context reaches hundreds of thousands, even high efficiency won't be enough.
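
As a toy illustration of the sparse-attention idea, the mask below lets each token attend only to a fixed window of preceding tokens, so per-token work stops growing with total length; the window size here is an arbitrary choice for demonstration.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 4) -> np.ndarray:
    """Boolean mask: token i may attend only to tokens j with i - window <= j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j >= i - window)

mask = sliding_window_mask(10)
print(mask.sum(axis=1))  # each row attends to at most window + 1 tokens, not the whole sequence
```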

The third type is the Retrieval Augmented Generation (RAG) approach.

It first selects the key points through external retrieval and then feeds them to the model. The input becomes shorter, and the inference becomes easier.

But as we all know, an answer stitched together from retrieved snippets is often not as good as one the model gives after reading the material itself. Moreover, the extra retrieval step slows down the overall response.
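
For reference, the retrieve-then-read pattern looks roughly like the sketch below; the embedding function is a random placeholder standing in for a real embedding model, and the chunking is deliberately trivial.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(128)

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top-k chunks by cosine similarity to the question."""
    q = embed(question)
    chunk_vecs = [embed(c) for c in chunks]
    scores = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in chunk_vecs]
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

# Only the retrieved chunks, not the whole document, go into the prompt.
chunks = ["chapter 1 ...", "chapter 2 ...", "chapter 3 ...", "chapter 4 ..."]
prompt = "Context:\n" + "\n".join(retrieve("Who helped the heroine?", chunks)) + "\n\nQuestion: ..."
```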

It's really a headache to find a solution for the context problem.

Reading from "Pictures"

To solve this problem, the research team proposed a new paradigm - Glyph.

The principle is simple: Since the information density of plain text is not high enough, put it into a picture.

When an ordinary LLM processes text, it splits the sentence into individual tokens and inputs them sequentially, which is very inefficient.

For example, if a sentence can be divided into 1000 tokens, the model has to calculate 1000 vectors and perform attention calculations between them.

In contrast, Glyph doesn't read word by word. Instead, it first renders the entire text into a compact image and then hands this "screenshot" to a Vision-Language Model (VLM), which encodes it as visual tokens.

The reason for doing this is that the information density that an image can carry is much higher than that of plain text. Only one visual token is needed to accommodate the content that originally required several text tokens.

In this way, even a VLM with a fixed context can easily handle extremely long texts that would "overwhelm" an LLM without the need for tools like sparse attention or RAG.
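
The paper's own rendering pipeline is more elaborate, but the basic move can be sketched in a few lines with Pillow; the page size, wrapping width, and font choice here are illustrative assumptions, not Glyph's actual settings.

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_page(text: str, width: int = 1024, height: int = 1024) -> Image.Image:
    """Render a chunk of plain text onto one image 'page' for a VLM to read."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()          # a real pipeline would tune the font and size
    wrapped = textwrap.fill(text, width=120)
    draw.multiline_text((10, 10), wrapped, fill="black", font=font)
    return img

page = render_page("It was a fine autumn morning; ..." * 100)
page.save("page_0.png")  # the VLM's vision encoder then turns each page into a modest number of visual tokens
```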

For example, the novel "Jane Eyre" runs to about 240K text tokens. A traditional LLM with a context window of only 128K can fit barely half of it.

In this case, if you ask some questions that involve a large span of the story, the traditional model will probably not be able to answer.

For example: After the heroine left Thornfield, who helped her when she was in trouble?

But if you use Glyph to render the whole book into a compact image, it only needs about 80K visual tokens.

In this way, a VLM with a 128K context can easily read the entire "Jane Eyre", understand the story line, and answer questions from a broader perspective.

How is such a striking effect achieved?

The training process of Glyph mainly consists of three stages:

Stage 1: Continual Pre-training

The goal of this stage is to enable the model to transfer its long-context understanding ability from the text world to the visual world.

Specifically, the research team renders a large number of long texts into images of different styles, exposing the VLM to all kinds of layouts, fonts, and arrangements so that it learns to "read text from pictures" and generalizes more robustly.

In this process, the model will continuously learn how to align the text information in the image with the semantics of the original text.

Stage 2: LLM-driven Rendering Search

Although diverse rendering methods can improve the model's generalization ability, in practical applications, both efficiency and accuracy need to be considered.

How text is converted into images determines the delicate balance between compression ratio and readability.

Using a large font size and loose layout is not good because the information density is too low, which goes against the original intention of visual tokens.

However, overly pursuing information density is also not advisable.

Small fonts and a tight layout may yield a high compression ratio, but the model may struggle to "see clearly" and misread the content.

Therefore, the research team introduced an LLM-driven genetic search algorithm to allow the model to automatically explore the optimal rendering parameters, such as font size, page layout, and image resolution, to achieve the maximum compression without losing semantics.
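
Here is a schematic sketch of what a genetic search over rendering parameters could look like; the parameter ranges and the fitness function are placeholders, and in Glyph's setup an LLM proposes the mutations and candidates are judged on compression without semantic loss, not by the toy score used below.

```python
import random

# Genes of a candidate rendering configuration (ranges are illustrative).
PARAMS = {"font_size": range(8, 21), "line_spacing": range(2, 11), "dpi": (72, 96, 120, 150)}

def random_config() -> dict:
    return {k: random.choice(list(v)) for k, v in PARAMS.items()}

def fitness(config: dict) -> float:
    """Placeholder score; the real objective trades off compression ratio against task accuracy."""
    compression = 20.0 / config["font_size"]          # smaller fonts compress more
    readability = config["font_size"] / 20.0          # but become harder to read
    return 0.5 * compression + 0.5 * readability

def mutate(config: dict) -> dict:
    """In Glyph's setup an LLM proposes edits; here one parameter is perturbed at random."""
    key = random.choice(list(PARAMS))
    return {**config, key: random.choice(list(PARAMS[key]))}

def evolve(generations: int = 10, population_size: int = 8) -> dict:
    population = [random_config() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]            # keep the fittest half
        children = [mutate(random.choice(parents)) for _ in parents]
        population = parents + children
    return max(population, key=fitness)

best_rendering = evolve()
```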

Stage 3: Post-training

After finding the optimal rendering scheme, the research team did two things: supervised fine-tuning and reinforcement learning to make the model smarter and more stable in "reading text from pictures".

In addition, they also added an auxiliary OCR alignment task during the SFT and RL stages to teach the model how to accurately restore text details from images and truly integrate visual and text abilities.

Finally, Glyph has mastered two powerful skills:

1. Understand long texts and make accurate inferences.

2. Recognize details without getting confused when reading pictures.

With this combination of methods, Glyph can handle heavily compressed visual-context tasks with ease.

Slashing 75% of the Context

Now that we understand the principle, let's see how Glyph performs in practice.

Facts have proven that Glyph indeed helps to significantly reduce the number of tokens.

The experimental results show that Glyph achieves a 3-4x token compression ratio in multiple long-context benchmark tests while still maintaining accuracy comparable to mainstream models (such as Qwen3-8B).

This compression not only reduces the computing power burden but also brings about a nearly 4-fold increase in prefill and decoding speed and about a 2-fold acceleration in SFT training.

What's even more surprising is that, under extreme compression, a VLM with a context window of only 128K can still handle million-token-level text tasks without any problem.

In addition, although Glyph's training data mainly comes from rendered text images, it also performs well in multi-modal tasks, proving its strong generalization potential.

In summary, this paper proposes a long-context modeling framework called Glyph.

The core idea is to "draw" long texts into pictures and then let the VLM read the text from the pictures quickly, thus achieving efficient context expansion.

The Paper's Authors

Who made such amazing achievements?

The first author of the paper is Jiale Cheng, a doctoral student at Tsinghua University. His main research areas include natural language generation, dialogue systems, and related artificial intelligence interaction technologies.

Currently, Jiale has published several papers and has built up a respectable citation record on Google Scholar.

In addition, there are three other main contributors to the paper: Yusen Liu, Xinyu Zhang, and Yulin Fei.

Unfortunately, there isn't much public information about them.

The corresponding author of this paper is Professor Minlie Huang.

Professor Huang earned both his bachelor's and doctoral degrees at Tsinghua University. He is currently a tenured professor in the Department of Computer Science and Technology at Tsinghua, where he also serves as deputy director of the Intelligent Technology and Systems Laboratory.