Cutting Reasoning Cost 11-Fold by Replacing the Chain of Thought with a Few Abstract Symbols

A, B, C, D, E, F...

In 2026, a cost crisis is quietly emerging in the AI industry.

Most developers have the impression that AI models have been getting steadily cheaper over the past few years. Indeed, from 2022 to 2024, the inference cost of cutting-edge models fell by more than a thousandfold. That trend led many teams to believe it was only a matter of time before AI could be deployed in their products.

However, the emergence of reasoning models has shattered this expectation. OpenAI's o-series, Anthropic's Claude with extended thinking, and DeepSeek R1 all do a large amount of internal "thinking" before answering, producing thousands of intermediate reasoning steps before they output the final answer. These intermediate steps have a special name on the bill: reasoning tokens.

The problem is that you have to pay for these thinking processes, even if you can't see them at all.
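
To make the billing effect concrete, here is a minimal sketch of how hidden reasoning tokens inflate the price of a single request. It assumes reasoning tokens are billed at the same rate as visible output tokens; the function name, the prices, and the token counts are all hypothetical and are not taken from any provider's price list.

```python
def request_cost(input_tokens, reasoning_tokens, answer_tokens,
                 input_price_per_m, output_price_per_m):
    """Return the dollar cost of one request.

    Assumption: hidden reasoning tokens are billed at the output-token rate.
    Prices are in dollars per million tokens.
    """
    billed_output = reasoning_tokens + answer_tokens
    return (input_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000


# Hypothetical prices and token counts, for illustration only.
plain = request_cost(2_000, 0, 500, 3.0, 15.0)
reasoning = request_cost(2_000, 10_000, 500, 3.0, 15.0)
print(f"plain: ${plain:.4f}  reasoning: ${reasoning:.4f}")
```

With these made-up numbers the visible answer is identical in length, yet the request that thinks first costs roughly an order of magnitude more, because the invisible tokens dominate the bill.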

According to statistics from industry research institutions at the beginning of 2026, a complex code review task may cost 5 to 10 times as much with a reasoning model as with an ordinary model. In a multi-step planning task, the internal thinking steps sometimes consume more than ten thousand tokens. One team's test found that when Claude Opus 4.6 and Grok-4 were asked the same question, they gave exactly the same answer, but Grok-4 consumed more than twice as many tokens as Claude, and the cost gap was nearly 10 times. All of this is simply because the model thinks too much.

In other words, AI is paying a huge price to "make itself clear".

To some extent, this cost is by design. Today's mainstream reasoning models all rely on a mechanism called Chain-of-Thought (CoT): the model writes out its reasoning step by step in natural language, as a human would, and then gives the answer. The method is effective, but reasoning in natural language is inherently verbose.

Against this background, a team from IBM Research published a paper. They raised a question: What if AI doesn't need to think in human language at all?

Paper title: Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Paper address: https://arxiv.org/pdf/2604.22709

Abstract Chain-of-Thought: A Language Humans Can't Understand

The IBM Research paper calls this method Abstract Chain-of-Thought (Abstract-CoT for short).

The core idea is surprisingly simple: instead of having the model write down its reasoning in natural language, give it a brand-new "symbol vocabulary", let it think with these symbols, and then have it generate the answer directly.

There isn't a single word in this vocabulary that humans can understand. It consists of a set of special placeholder tokens, such as <TOKEN_A>, <TOKEN_B>... all the way to <TOKEN_Z>, and then continues with double-letter combinations. To humans these symbols are as meaningless as a cipher. However, in the paper's experiments, they can replace natural-language reasoning chains that often run to hundreds of steps and compress the reasoning to a few dozen symbols.
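
As a rough illustration of what such a vocabulary might look like, the sketch below enumerates placeholder names: single letters first, then two-letter combinations. The size of 64 matches the symbol count reported in the paper's analysis; the exact spelling of the double-letter tokens (for example <TOKEN_AA>) and the function name abstract_vocab are assumptions made here, not details confirmed by the paper.

```python
from itertools import islice, product
from string import ascii_uppercase


def abstract_vocab(size=64):
    """Return `size` placeholder token names: A..Z first, then AA, AB, ..."""
    singles = list(ascii_uppercase)
    # Two-letter combinations in order: AA, AB, ..., AZ, BA, ...
    doubles = ("".join(pair) for pair in product(ascii_uppercase, repeat=2))
    names = singles + list(islice(doubles, max(0, size - len(singles))))
    return [f"<TOKEN_{name}>" for name in names[:size]]


vocab = abstract_vocab()
print(len(vocab), vocab[:3], vocab[25:28])
```

In a real system each of these strings would be registered as a single new entry in the tokenizer, so that one symbol costs exactly one token.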

A real-life analogy helps: it is a bit like an experienced chef who no longer needs to say every step out loud, but relies on a set of gestures and notations that only he understands to work everything out quickly in his head and then serve the dish. To outsiders the process is opaque, but the result is exactly the same.

In an example shown in the paper, for a math word problem, the standard Chain-of-Thought model needs 8 natural-language steps to reach the answer, while the Abstract-CoT version uses only 14 abstract symbols to reach exactly the same conclusion. Both are correct, but the latter consumes less than one-tenth of the reasoning tokens of the former.

Two Challenges: Cold Start and "Learning a New Language"

This idea sounds simple, but there are two fundamental difficulties in implementation.

The first difficulty is the cold-start problem. These new symbols have never appeared in the model's vocabulary, and their embedding vectors are randomly initialized, so they carry no meaning for the model. You can't expect a child who has never learned a language to suddenly be able to think in it.
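
The cold-start problem can be pictured with a toy embedding table. The sketch below appends randomly initialized rows for the new symbols to a made-up "pretrained" table; the helper name extend_embeddings, the standard deviation of 0.02, and all the numbers are illustrative assumptions rather than the paper's actual setup.

```python
import random


def extend_embeddings(table, num_new, dim, std=0.02, seed=0):
    """Append `num_new` randomly initialized rows to an embedding table.

    The existing rows are kept as they are; the new rows are Gaussian noise,
    so they encode nothing until training gives them meaning.
    """
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, std) for _ in range(dim)]
                for _ in range(num_new)]
    return table + new_rows


# Toy "pretrained" table: 5 existing tokens, 4 dimensions (made-up values).
pretrained = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
extended = extend_embeddings(pretrained, num_new=64, dim=4)
print(len(pretrained), "->", len(extended), "rows")
```

The first five rows still carry whatever the model learned in pretraining, while the 64 new rows are pure noise, which is why the symbols are useless until some training stage teaches the model what to do with them.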

The second difficulty is how to make the model learn to think effectively with these symbols instead of just stacking them at random.

IBM's research team designed a two-stage training plan to address these two problems.

Stage 1: Policy Iteration Warm-up

The core mechanism of this stage is an "information bottleneck" design. Specifically, during training the model sees the problem, the standard natural-language reasoning chain (provided by a teacher model), and a sequence of abstract symbols at the same time. The key is that the generation of the final answer is only allowed to "see" the sequence of abstract symbols and cannot directly "see" the natural-language reasoning chain.

It is like giving a student both the complete worked solution and a summary note, but allowing only the note during the exam. Over time, the student learns to condense the key information into the note, because only a sufficient note will get them through the exam.

After multiple rounds of iteration, the model gradually learns how to compress the key information required for reasoning into those abstract symbols.
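
One way to picture such a bottleneck is as an attention mask. The sketch below is an assumption about how it could be realized, not the paper's confirmed implementation: tokens are laid out as question, natural-language chain, abstract symbols, answer; attention is causal; and answer positions are additionally forbidden from looking at the natural-language chain. Letting the answer still see the question is also an assumption, and the function name bottleneck_mask is hypothetical.

```python
def bottleneck_mask(n_question, n_cot, n_abstract, n_answer):
    """Return mask[i][j] == True if position i may attend to position j."""
    segments = (["question"] * n_question + ["cot"] * n_cot
                + ["abstract"] * n_abstract + ["answer"] * n_answer)
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only the current and earlier positions
            # The bottleneck: answer tokens cannot read the verbal reasoning.
            blocked = segments[i] == "answer" and segments[j] == "cot"
            mask[i][j] = not blocked
    return mask


mask = bottleneck_mask(n_question=2, n_cot=3, n_abstract=2, n_answer=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the abstract symbols can read the natural-language chain but the answer cannot, any information the answer needs has to be squeezed through the symbols, which is the student-and-note situation described above.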

Stage 2: Warm-started RL

After the warm-up stage, the research team introduced reinforcement learning (the GRPO algorithm) to further optimize the policy that generates the abstract symbol sequence. The model is required to generate high-quality answers directly, relying only on the abstract symbols and without any help from a natural-language reasoning chain. A generative reward model scores the output quality, and this feedback signal drives the model to keep improving its "symbol language".
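
The heart of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative advantage step, with made-up reward values standing in for the reward model's scores; it uses the population standard deviation, which is one common choice, and it leaves out the rest of the algorithm (policy-ratio clipping, the KL term, and the gradient update).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward-model scores for four symbol sequences sampled for one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Samples that beat their group's average get a positive advantage and are reinforced; the rest are pushed down, so no separate value network is needed.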

Experimental Results: How Much is Saved and What's the Cost

The paper verified the effectiveness of Abstract-CoT on three major benchmarks: mathematical reasoning (MATH-500), general instruction following (AlpacaEval), and multi-hop question answering (HotpotQA).

The two key results are as follows:

On the MATH-500 mathematical reasoning benchmark, with Qwen3-8B as the base model, the standard Chain-of-Thought approach with supervised fine-tuning and reinforcement learning (SFT + RL) generates an average of 1671 tokens per question at 92.6% accuracy. Abstract-CoT (Warm-up + RL) generates only 144 tokens at 90.8% accuracy. That is a compression ratio of about 11.6 times for a performance gap of only 1.8 percentage points.

On the AlpacaEval general instruction benchmark, Abstract-CoT not only cuts the token count from 496 to 225 (about 2.2 times), but the win rate actually rises from 58.4% to 60.8%: the output is much shorter, yet the quality improves.
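
The ratios quoted above are simple arithmetic on the reported figures, which the short sketch below reproduces. The token counts and scores are the ones cited in this article; the dictionary layout and the helper name summarize are just illustrative.

```python
# Figures as quoted above: average tokens and score (accuracy or win rate, %).
results = {
    "MATH-500": {"baseline_tokens": 1671, "abstract_tokens": 144,
                 "baseline_score": 92.6, "abstract_score": 90.8},
    "AlpacaEval": {"baseline_tokens": 496, "abstract_tokens": 225,
                   "baseline_score": 58.4, "abstract_score": 60.8},
}


def summarize(row):
    """Return (token compression ratio, score change in percentage points)."""
    ratio = row["baseline_tokens"] / row["abstract_tokens"]
    delta = row["abstract_score"] - row["baseline_score"]
    return round(ratio, 1), round(delta, 1)


for name, row in results.items():
    ratio, delta = summarize(row)
    print(f"{name}: {ratio}x fewer tokens, {delta:+} points")
```

Laying the two benchmarks side by side also makes it plain that the headline figure of roughly 11 times comes from MATH-500, while the saving on AlpacaEval is closer to 2 times.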

Harder benchmarks show a similar trend. Results on GPQA-Diamond (graduate-level question answering) and AIME'25 (math competition problems) show that even on difficult reasoning tasks, Abstract-CoT achieves 2.7 to 7.9 times token compression while performing almost as well as full Chain-of-Thought.

One detail is worth noting: "cold-start RL" alone (training the abstract symbols directly with reinforcement learning, without the warm-up stage) performs very poorly, and in most settings it is even worse than the baseline model. This shows that the warm-up stage is indispensable: the model must first learn the basic semantics of this "language" before reinforcement learning can optimize it further.

Unexpected Discovery: Abstract Symbols Spontaneously Formed "Language Rules"

In the experimental analysis, the research team discovered a phenomenon they didn't expect.

After reinforcement learning, the usage frequencies of the 64 abstract symbols spontaneously formed a power-law distribution: a few symbols are used over and over, while most are used very rarely. This distribution closely matches Zipf's law, the basic law governing word frequencies in natural language.

Specifically, one symbol, <TOKEN_F>, is used far more often than any other, making it the high-frequency word of this "language", much like "de" (的) or "shi" (是) in Chinese. The other symbols are like rare characters that appear only in specific situations.
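
A Zipf-like pattern is usually checked by ranking symbols by frequency and fitting a straight line in log-log space; a slope near -1 is the classic signature. The sketch below does that on a tiny made-up trace, built so that the frequency of the symbol at rank r is exactly 12 / r. It is not the paper's data, and apart from <TOKEN_F> the symbol names are arbitrary.

```python
from collections import Counter
from math import log


def zipf_slope(sequences):
    """Rank symbols by frequency and fit log(freq) against log(rank).

    Returns the ranked (symbol, count) list and the least-squares slope.
    Needs at least two distinct symbols.
    """
    counts = Counter(tok for seq in sequences for tok in seq)
    ranked = counts.most_common()
    xs = [log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [log(count) for _, count in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return ranked, slope


# Made-up trace with counts 12, 6, 4, 3: an exact 12 / rank pattern.
toy_traces = [["<TOKEN_F>"] * 12 + ["<TOKEN_B>"] * 6
              + ["<TOKEN_K>"] * 4 + ["<TOKEN_Q>"] * 3]
ranked, slope = zipf_slope(toy_traces)
print(ranked[0], round(slope, 3))
```

On real traces the fit would be noisier, and the interesting question is whether the slope stays close to -1 across tasks.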

What does this mean? Researchers believe that this is a "concept reuse" mechanism spontaneously learned by the model. Frequently appearing symbols may correspond to reasoning steps commonly required across tasks (such as "initializing variables" or "verifying boundary conditions"); rare symbols may correspond to rare reasoning patterns in specific fields.

Of course, there is currently no way to directly "interpret" the specific semantics of these symbols. This language is still opaque to humans.

Limitations and Outlook

Abstract-CoT still has obvious limitations. The most direct one is that the abstract reasoning process is completely unreadable to humans, which limits its applicability in scenarios that require auditability (such as medical, legal, and financial decision support).

In addition, the method relies on existing natural-language Chain-of-Thought data for its warm-up training. This means that Abstract-CoT is still "parasitic" on language reasoning: without the prior knowledge of language reasoning, cold-start training on pure abstract symbols can hardly work. To some extent, this shows that AI must first "learn to speak" before it can "learn to be silent".

The research team also proposed several future directions in the paper, including dynamically adjusting the length of the abstract symbol sequence (allocating "thinking budgets" of different lengths according to problem difficulty) and building a hierarchical symbol structure (letting some symbols represent reusable reasoning subroutines).

Perhaps most noteworthy is that the method opens a new window for monitoring AI reasoning. An abstract symbol sequence is easier to analyze structurally than a natural-language reasoning chain that spans thousands of tokens. The researchers believe this provides new possibilities for research on "Chain-of-Thought monitorability". In the future, it may be possible to judge whether a model is "thinking normally" by analyzing symbol patterns, without understanding their semantics.
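
As a purely hypothetical illustration of what pattern-level monitoring could look like, the sketch below compares the symbol distribution of a new trace against a reference distribution using total variation distance. This is one simple statistic chosen here for illustration; the paper does not prescribe it, and the traces are made up.

```python
from collections import Counter


def symbol_distribution(trace):
    """Return the relative frequency of each symbol in a non-empty trace."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)


# Made-up traces: a reference profile, a similar trace, and an odd one.
reference = symbol_distribution(
    ["<TOKEN_F>"] * 6 + ["<TOKEN_B>"] * 3 + ["<TOKEN_K>"])
typical = symbol_distribution(
    ["<TOKEN_F>"] * 5 + ["<TOKEN_B>"] * 4 + ["<TOKEN_K>"])
unusual = symbol_distribution(["<TOKEN_Q>"] * 9 + ["<TOKEN_F>"])

print(round(total_variation(reference, typical), 2),
      round(total_variation(reference, unusual), 2))
```

A monitor built this way would flag traces whose distance from the reference exceeds some threshold, without anyone having to know what an individual symbol means.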

AI is Learning to "Cut Out the Fluff"

Over the past two years, improvements in AI reasoning ability have largely come from "letting the model say more": longer chains of thought, more intermediate steps, and more detailed self-verification. By 2026, this approach is running into increasingly obvious cost bottlenecks.

The question raised by this IBM Research paper challenges a basic assumption: Does AI have to think in human language?

Their experimental results suggest that the answer may be no. A "sign language" of 64 meaningless symbols can come close to the performance of natural-language reasoning chains across tasks such as mathematical reasoning, general question answering, and multi-hop retrieval, at a fraction of the token cost: roughly one-eleventh on MATH-500 and a little under half on AlpacaEval.

This is not a revolutionary change, nor is it without cost. But it at least shows that on the path of AI reasoning efficiency, there may be a direction that we haven't seriously explored before: let the model learn to "think silently".

This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014), author: Panda. It is published by 36Kr with authorization.