
Foreigners are stunned: even when asked in English, DeepSeek still insists on thinking in Chinese.

机器之心 · 2025-12-03 17:13
Is it true that Chinese uses fewer tokens?

Just the day before yesterday, DeepSeek launched two new models in one go, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale.

Both versions show significantly improved reasoning ability. DeepSeek-V3.2 can go head-to-head with GPT-5, while the Speciale version combines extended thinking with theorem-proving ability and performs comparably to Gemini-3.0-Pro. As one reader commented, "This model shouldn't be called V3.2; it should be called V4."

Overseas researchers couldn't wait to use the new version of DeepSeek. While marveling at the significant improvement in DeepSeek's reasoning speed, they encountered something they couldn't understand:

Even when asked in English, DeepSeek would still switch back to "mysterious oriental characters" during its thinking process.

This genuinely puzzled overseas users: they hadn't asked in Chinese, so why does the model still think in Chinese? Is reasoning in Chinese somehow better or faster?

The comment section offers two competing explanations, but most commenters lean toward "Chinese characters have a higher information density."

Researchers from Amazon also think so:

This conclusion matches everyday intuition: to express the same meaning, Chinese needs noticeably fewer characters. If large language models are essentially doing semantic compression, then Chinese compresses more efficiently than the far more widely used English. Perhaps this is where the saying "Chinese saves tokens" comes from.
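Whether fewer characters actually translates into fewer tokens depends on the tokenizer's vocabulary, so the claim is easy to check empirically. Below is a minimal sketch that counts characters and tokens for the same sentence in English and Chinese; it uses tiktoken's cl100k_base encoding as a stand-in tokenizer (an assumption for illustration, not DeepSeek's actual tokenizer), and the example sentences are our own.

```python
# Minimal sketch: character and token counts for parallel English/Chinese
# sentences. cl100k_base is used as a stand-in tokenizer here; DeepSeek's own
# tokenizer would give different absolute numbers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("Large language models can reason in many different languages.",
     "大语言模型可以用多种不同的语言进行推理。"),
    ("Chinese often expresses the same meaning with fewer characters.",
     "中文常常用更少的字符表达同样的意思。"),
]

for en, zh in pairs:
    en_tok, zh_tok = enc.encode(en), enc.encode(zh)
    print(f"EN: {len(en):3d} chars -> {len(en_tok):3d} tokens")
    print(f"ZH: {len(zh):3d} chars -> {len(zh_tok):3d} tokens")
    print()
```

On a GPT-style vocabulary the Chinese sentence usually has far fewer characters but not necessarily fewer tokens: how many tokens a language costs is a property of the tokenizer as much as of the language itself.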

Multilingual large models can run into efficiency problems if they do all of their thinking in English. And it is not just Chinese: reasoning in other non-English languages can indeed lead to better performance.

A paper from Microsoft, "EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning," found that reasoning in a non-English language not only reduces token consumption but also maintains accuracy. Even when the reasoning trace is translated back into English, the advantage persists, indicating that the gain stems from a genuine shift in reasoning behavior rather than a superficial language effect.

Paper title: EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

Paper link: https://www.arxiv.org/abs/2507.00246

In this paper, the authors evaluated three state-of-the-art open-source reasoning models: DeepSeek R1, Qwen 2.5 (32B), and Qwen 3 (235B-A22B). The questions were presented in English, but the models were explicitly instructed to carry out their reasoning steps in one of seven target languages: Chinese (zh), Russian (ru), Spanish (es), Hindi (hi), Arabic (ar), Korean (ko), and Turkish (tr). The final answer had to be given in English to keep the evaluation consistent.
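A minimal sketch of this setup is shown below. The prompt wording is illustrative only (it is not the paper's actual template), but it captures the protocol: English question, reasoning in the target language, final answer in English.

```python
# Illustrative prompt builder for the cross-lingual reasoning setup.
# The wording is an assumption for this sketch, not the paper's exact template.
TARGET_LANGUAGES = {
    "zh": "Chinese", "ru": "Russian", "es": "Spanish", "hi": "Hindi",
    "ar": "Arabic", "ko": "Korean", "tr": "Turkish",
}

def build_prompt(question_en: str, lang_code: str) -> str:
    """English question, reasoning in the target language, answer in English."""
    language = TARGET_LANGUAGES[lang_code]
    return (
        f"Solve the following problem. Write all of your step-by-step "
        f"reasoning in {language}, then state the final answer in English.\n\n"
        f"Problem: {question_en}"
    )

print(build_prompt("What is the sum of the first 100 positive integers?", "ko"))
```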

Figure (from the paper): token-count ratio versus the number of questions answered correctly at least once in both English and the target language (at least 5 shared cases); the ratio is computed relative to DeepSeek R1's average number of English tokens per question.

Across the evaluated models and datasets, reasoning in a non-English language typically cut token usage by roughly 20-40% relative to English, usually without hurting accuracy. For DeepSeek R1, the reduction ranged from 14.1% (Russian) to 29.9% (Spanish), while Qwen 3 showed even larger savings, up to 73% in Korean. These efficiency gains translate directly into lower inference cost, lower latency, and reduced compute requirements.
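To put those percentages in concrete terms, here is a toy cost calculation (our own illustration, not from the paper), using the 29.9% reduction reported for DeepSeek R1 with Spanish and a hypothetical price per million output tokens.

```python
# Toy calculation: how a token reduction maps to inference cost.
# The token count, price, and question volume below are assumptions.
def reasoning_cost(tokens_per_question: float, questions: int,
                   price_per_million: float) -> float:
    """Cost of generating reasoning tokens at a given price per million tokens."""
    return tokens_per_question * questions * price_per_million / 1_000_000

EN_TOKENS = 2_000       # assumed average English reasoning tokens per question
REDUCTION = 0.299       # 29.9% reduction (DeepSeek R1, Spanish) from the paper
PRICE = 2.0             # hypothetical dollars per million output tokens
QUESTIONS = 10_000

cost_en = reasoning_cost(EN_TOKENS, QUESTIONS, PRICE)
cost_es = reasoning_cost(EN_TOKENS * (1 - REDUCTION), QUESTIONS, PRICE)
print(f"English reasoning: ${cost_en:.2f}")
print(f"Spanish reasoning: ${cost_es:.2f} (saves ${cost_en - cost_es:.2f})")
```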

Judging from these results, reasoning in Chinese does save tokens compared with English, but Chinese is not the most efficient language.

Another paper supports a similar view. "One ruler to measure them all: Benchmarking multilingual long-context language models," from the University of Maryland and Microsoft, proposes OneRuler, a multilingual benchmark covering 26 languages for evaluating the long-context understanding of large language models (LLMs) at lengths up to 128K tokens.

Paper title: One ruler to measure them all: Benchmarking multilingual long-context language models

Paper link: https://www.arxiv.org/abs/2503.01996v3

The researchers constructed OneRuler in two steps: first, they wrote English instructions for each task, and then collaborated with native speakers to translate them into 25 other languages.

Experiments on open-weight and closed-source language models showed that as the context length increased from 8K to 128K tokens, the performance gap between low-resource languages and high-resource languages widened. Surprisingly, English is not the best-performing language in long-context tasks (it ranks 6th among 26 languages), while Polish ranks first. In cross-lingual scenarios where the instruction language and the context language are inconsistent, the performance can fluctuate by up to 20% depending on the instruction language.

Figure 4 (from the paper): NIAH performance of each model and language, grouped by language-resource level, on long-context tasks (64K and 128K). Gemini 1.5 Flash shows the best long-context performance. Surprisingly, neither English nor Chinese makes the top five languages.

Since neither Chinese nor English is the language in which large models perform best, efficiency alone clearly does not determine which language a model thinks in.

So the second view in the comment section, "The training data contains more Chinese content," seems more reasonable.

It is unsurprising that Chinese-developed large models think in Chinese, since they are trained on larger amounts of Chinese-language data. For example, Composer-1, the core model of version 2.0 of the AI programming tool Cursor, was suspected of being a repackaged Chinese model because its entire thinking process was in Chinese.

But the same explanation does not hold for GPT, whose training data is clearly dominated by English.

A similar thing happened back in January this year, when users noticed that OpenAI's o1-pro model would also randomly slip into Chinese in its thinking process.

Perhaps this is the charm of human languages. Different languages have different characteristics, and all sorts of strange things can happen in large models.

Cases of large models thinking in Chinese keep piling up, and Chinese training corpora keep growing richer.

Maybe one day, we can joke about large models like overseas friends do: "I'm not asking you to become Chinese. I'm saying that when the time is right, if you look in the mirror, you'll find that you're already Chinese."

This article is from the WeChat official account "Machine Intelligence" (机器之心, ID: almosthuman2014), which focuses on large models. Editor: Cold Cat. Republished by 36Kr with permission.