From Philosophical Concept to Technological Concept to Economic Concept: The Past and Present of the Token
When mainstream models all bill by the token, enterprises set up dedicated token budgets, and government policy documents mention "token transactions", the token is becoming an undisputed new economic unit.
In March 2026, two seemingly unrelated things happened.
Jensen Huang, the CEO of NVIDIA, predicted at the GTC conference that the company's revenue by 2027 would reach at least $1 trillion.
During his speech, he also redefined the data center in passing, introducing it as "a factory that produces AI tokens".
In the same month, Liu Liehong, the director of China's National Data Bureau, said during his speech at the China Development Forum that "tokens are not only the value anchor in the intelligent era but also the settlement unit connecting technological supply and business demand".
Moreover, he officially fixed the Chinese translation of "token" as "ciyuan" (词元, literally "word element").
One is the helmsman of the world's largest chip company, and the other is the top official in China's data field. They described tokens as an economic unit in almost the same tone.
So, what exactly are tokens, which are now popular worldwide and may even become the currency of the new era?
What are Tokens?
In 1906, the American philosopher Charles Sanders Peirce was pondering a seemingly simple question: If there are 20 instances of the word "the" printed on a page, is it one word or 20 different words?
This was not a whim of Peirce's, nor mere nit-picking.
As a philosopher, he believed that the abstract concept of "the" actually represented a general rule or form.
He called it a "type"; each specific and visible "the" in the book was a specific manifestation of this type, which could be called a "token".
That is to say, the 20 instances of "the" are 20 different "tokens" of the same "type".
He pointed out that "the type itself does not exist, but it determines what specific things can exist."
This seemingly esoteric concept circulated in philosophical circles for a long time, and no one at the time imagined it would ever have anything to do with computers.
It wasn't until 1936 that George Zipf, a linguist at Harvard University, gave tokens a mathematical treatment while studying word frequency.
At that time, when Zipf was counting the word frequency in various languages, he found an interesting phenomenon: the product of a word's rank and its frequency is almost a constant. For example, in Chinese, "de" is the most commonly used character, ranking first, and its character frequency is about 6%.
Multiplying the rank (1) by the frequency (6%) gives approximately 6%.
The second-ranked character is "shi", with a frequency of about 3%, and 2 × 3% is again roughly 6%; the third-ranked character is "yi", with a frequency of about 2%, and 3 × 2% is also roughly 6%.
In each case, the product of rank and frequency is approximately a constant.
Put another way, the frequency of the first-ranked "de" is about twice that of the second-ranked "shi" and three times that of the third-ranked "yi".
This law of "frequency being inversely proportional to rank" was later named "Zipf's law".
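Expressed as a formula, Zipf's law says that the frequency f(r) of the word ranked r satisfies r × f(r) ≈ C for some constant C. A minimal numerical check in Python, using the illustrative character frequencies quoted above (approximate figures, not real corpus measurements):

```python
# Zipf's law: rank x frequency is roughly constant.
# Frequencies below are the approximate figures quoted in the text.
ranked = [("de", 1, 0.06), ("shi", 2, 0.03), ("yi", 3, 0.02)]

for char, rank, freq in ranked:
    print(f"{char}: rank {rank} x freq {freq:.0%} = {rank * freq:.2f}")
# de: rank 1 x freq 6% = 0.06
# shi: rank 2 x freq 3% = 0.06
# yi: rank 3 x freq 2% = 0.06
```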
No one expected that this seemingly boring mathematical theory would become an important theoretical basis for computer language processing thirty years later.
In the 1960s, the concept of "token" was finally applied in the computer world.
For example, when a programmer writes a line of code such as "int x = 5;", an early computer's compiler would act like a meticulous "lexical analyzer", scanning the string of characters from start to finish and taking it apart piece by piece.
In this process, the computer first recognizes "int" as a keyword denoting an integer type, then marks "x" as a variable name, sees "=" as an assignment operator, and finally recognizes "5" as a concrete numerical value.
Each such independent unit that is recognized and labeled with a clear meaning is a token.
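To make this concrete, here is a toy lexer sketch in Python (an illustration of the idea only, not any real compiler's implementation; the token names are invented for this example):

```python
import re

# Each rule pairs a made-up token name with the pattern it recognizes.
TOKEN_SPEC = [
    ("KEYWORD", r"\bint\b"),       # type keyword
    ("NUMBER",  r"\d+"),           # integer literal
    ("IDENT",   r"[A-Za-z_]\w*"),  # variable name
    ("ASSIGN",  r"="),             # assignment operator
    ("SEMI",    r";"),             # statement terminator
    ("SKIP",    r"\s+"),           # whitespace, discarded
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(code: str):
    """Yield (token_name, text) pairs, left to right."""
    for match in PATTERN.finditer(code):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("int x = 5;")))
# [('KEYWORD', 'int'), ('IDENT', 'x'), ('ASSIGN', '='), ('NUMBER', '5'), ('SEMI', ';')]
```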
In this way, the token finally completed its transformation from a humanistic concept to a machine language and became the basic unit for computers to "read" instructions and information.
From being the silent grammatical cornerstone supporting the digital world to being later endowed with new values and consensus, the meaning of tokens continues to expand.
In 2017, with the rise of blockchain and the ICO boom, the once-obscure token gradually became known to the world in its "digital token" guise.
Although that boom gradually subsided and many projects quietly withdrew, the concept of token firmly remained.
It is no longer just a technical term but is mentioned again with the new identity of "a negotiable digital equity certificate".
It can be said that regardless of the background, the core of a token is always: to standardize complex things into the smallest units that the system can recognize, process, and transfer.
It is this consistent characteristic that, with the rise of large language models, has made tokens the most basic and important "language unit" in human-machine interaction today.
So, when AI faces human language, how does it use this "ruler" to learn to "understand" and "think"?
The Underlying Logic of AI Learning to Think
First, we need to be clear that when AI takes in a human instruction, it is not "reading" or "reasoning" as we imagine, but performing a precise surgical operation: "cutting".
This means that any sentence you input is first disassembled: the text is cut into a series of token fragments, and each fragment is then converted into numbers the computer can process.
In other words, all of the AI model's "thinking" and "reasoning" actually happens as complex operations over these numbers, and the results are then "translated" back into language people can understand.
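As a concrete illustration, this is roughly what the "cut, then digitize" step looks like with OpenAI's open-source tiktoken library (a sketch of the interface, not of any particular model's internals; the exact splits and ids depend on the encoding used):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Tokens are becoming a new economic unit.")
print(ids)              # a list of integers; this is all the model ever sees
print(len(ids))         # how many tokens this sentence "costs"
print(enc.decode(ids))  # losslessly recovers the original text
```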
This sounds simple, but the actual operation is extremely complex.
For example, the most common problem is ambiguity.
Take the Chinese sentence "羽毛球拍卖多少钱". When the AI model tries to understand it, should it cut after "羽毛球拍" (badminton racket), reading it as "how much does the badminton racket sell for", or after "羽毛球" (badminton), reading it as "how much is the badminton auction"?
The former asks the price of a piece of sports equipment, while the latter asks about an auction event. The semantics are completely different, and the characters alone give the AI no basis to decide.
Therefore, "what to cut and how to cut" the instructions becomes the most fundamental core problem for AI.
What's more troublesome is that if a word has never appeared in the training data, the model cannot recognize it at all: it can only mark it as "unknown" and skip it, leaving a hole in the system.
Therefore, how to enable the AI model to handle ambiguity and "recognize" word combinations it has never seen before has been a difficult problem that has plagued the field of computer language processing for many years.
The breakthrough came from a technical paper that had been forgotten for many years.
In 1994, the American programmer Philip Gage published an article in the C Users Journal, a magazine for C programmers, introducing a compression algorithm called BPE (Byte Pair Encoding).
Gage's idea was very simple: scan the text repeatedly, merge the most frequent pair of adjacent characters (such as "t" and "h") into a new symbol, and iterate round after round.
After enough rounds, common character sequences are folded into ever-shorter symbols, and the decompressor only needs the saved merge table to reverse the process, so the compressed output stays remarkably small.
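The merge loop is short enough to sketch in a few lines of Python (a simplified illustration of the idea, not Gage's original C implementation):

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    """Repeatedly fuse the most frequent adjacent pair into a new symbol."""
    seq = list(text)  # start from individual characters
    table = []        # the merge table a decompressor would need
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        table.append((a, b))
        merged, i = [], 0
        while i < len(seq):  # rewrite the sequence with the pair fused
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, table

seq, table = bpe_merges("the theory of the thing", num_merges=4)
print(table)  # high-frequency pairs such as ('t', 'h') fuse first
print(seq)    # the text, now written in fewer, larger symbols
```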
However, its compression ratio was not outstanding, and the industry cared little about shaving off a few KB of memory, so the algorithm attracted little attention at the time.
The paper was quickly forgotten, and stayed forgotten for 22 years.
It wasn't until 2016 that Rico Sennrich, a researcher at the University of Edinburgh, stumbled upon this old paper while studying the word-segmentation problem in machine translation.
He acutely realized that BPE's frequency-based merging strategy was exactly the solution word segmentation needed: no dictionary has to be defined in advance, the data is left to "speak" for itself, and high-frequency combinations snowball into tokens.
This way, even when facing rare words it has never seen, the model can break them down into finer-grained subwords or even bytes, completely sidestepping the "unknown" dilemma.
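Sennrich's use of the merge table for segmentation can be sketched the same way: replay the learned merges on a new word, and whatever does not merge simply stays as smaller pieces (a toy illustration with a hand-picked merge table, not the actual subword-nmt implementation):

```python
def segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into subword tokens by replaying learned merges in order."""
    seq = list(word)  # worst case: fall back to single characters
    for a, b in merges:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)  # this pair was learned, so fuse it
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# A tiny hand-picked merge table, as if learned from English text.
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

print(segment("the", merges))        # ['the']: a single familiar token
print(segment("thesaurus", merges))  # ['the', 's', 'a', 'u', 'r', 'u', 's']: rare word, smaller pieces
print(segment("zqxv", merges))       # ['z', 'q', 'x', 'v']: never "unknown"
```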
In 2019, when OpenAI released GPT-2, it borrowed this very idea.
The R&D team set the starting point of segmentation at the byte, the smallest unit of computer storage, unifying the representation of all languages at the lowest level so that the model can, in theory, process text in any language.
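The byte-level trick itself is just ordinary UTF-8: every character of every language is already some sequence of byte values 0 to 255, so a base vocabulary of only 256 byte symbols can represent any text (a sketch of the idea; GPT-2's actual tokenizer additionally remaps bytes to printable characters before merging):

```python
# Any string decomposes into bytes 0..255, so a 256-symbol base
# vocabulary covers every language with nothing left "unknown".
for text in ["the", "词元", "🦙"]:
    print(text, "->", list(text.encode("utf-8")))
# BPE merges then rebuild frequent byte sequences into larger tokens.
```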
A short article that had been buried for more than twenty years thus became part of the underlying logic driving a trillion-scale AI industry.
This result was probably unexpected even by Gage himself.
However, when this ability to "process all text" is combined with an efficiency - oriented algorithm, a new kind of "algorithmic hegemony" quietly emerges.
Algorithmic and Coding Hegemony
On the surface, the word-segmentation method AI uses today seems very "fair": the more a language is used, the more efficiently and completely it is processed; languages used less are cut into more fragments and are "harder" to process.
However, this "fairness" based on efficiency quietly divides the world's languages into two treatments: some languages are on the "fast track", while others are like walking on a gravel road.
To put it simply, since the core logic of the BPE algorithm is "frequency first", words in the most commonly used languages get combined into tokens more efficiently.
As the absolute mainstream of the Internet, English naturally enjoys the highest priority of expression, and other languages are ranked by their "digital visibility".
Therefore, an implicit "language tax" has effectively taken shape inside AI models: to express the same meaning, English uses the fewest tokens and costs the least; Chinese usually requires 1.5 to 2 times as many tokens; and for low-resource languages such as Zulu and Tibetan, the cost can be 5 to 10 times that of English.
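The gap is easy to observe directly. A quick comparison with the open-source tiktoken library (the exact counts vary by tokenizer and sentence; the ratios above are rough industry estimates, and the Chinese sample here is simply a translation of the English one):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer

samples = {
    "English": "Artificial intelligence is changing the world.",
    "Chinese": "人工智能正在改变世界。",
}
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens for {len(text)} characters")
# The same meaning typically costs noticeably more tokens outside English.
```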
This means that under per-token billing, communicating with AI in English is not only faster; with the same budget, it also buys far more computing power than other languages do.
This is not something new; it has always been the case in the information age.
From Morse code to keyboard design, almost every underlying change in information technology has defaulted to paving the way for English, forcing users of other languages to pay an additional "transcoding" cost.
Therefore, the efficiency gap of tokens is just a repetition of this historical rule in the AI era.
It is worth noting that once this unfairness at the "starting line" is written into the initial vocabulary of AI, it is almost impossible to correct.
Because word-segmentation rules are the foundation on which an AI model understands the world, and the taller the building grows, the harder the foundation is to replace.
Fortunately, with China's rapid progress in the field of large models, even models dominated by English corpora have begun to significantly optimize the processing efficiency of Chinese.
This is very obvious in the model iterations of OpenAI.
For example, the same Chinese sentence requires 38 tokens in GPT-3, 26 tokens in GPT-4, and only 15 tokens in GPT-5.
That is, across several generations of GPT, the number of tokens needed for the same Chinese content has fallen by more than 60%, a marked improvement in Chinese recognition efficiency.
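The trend is partly reproducible with tiktoken's public encodings, which roughly track successive model generations: r50k_base for the GPT-3 era, cl100k_base for the GPT-4 era, o200k_base for newer models (the GPT-5 figure above is quoted from the source and has no public encoding to check against, so treat this only as an illustration of the direction):

```python
import tiktoken

sentence = "通用人工智能将深刻改变人类社会的生产方式。"
for name in ["r50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(name, "->", len(enc.encode(sentence)), "tokens")
# Newer vocabularies include many more multi-character Chinese
# tokens, so the same sentence costs progressively fewer tokens.
```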
Domestic large models such as Tongyi Qianwen and DeepSeek built high-frequency Chinese words and idioms into their vocabularies as native tokens from the very start of their design, achieving more efficient, "native-level" processing of Chinese at the same model scale.
In other words, in the AI era, whoever controls the "right to segment semantics", that is, the right to define the basic units of language, will largely control the expression efficiency and cost advantage of that language in the digital world.
This right to define tokens in fact constitutes a kind of basic "currency-issuing right" for the digital age.
Its strategic significance is no less than that of mastering chip design and manufacturing.
This efficiency gap may look like a hurdle, but it is really more like an entry ticket: with enough computing power and data, there is no need to retrace others' old paths; you can lay your own solid foundation.
To truly turn this advantage of "defining the basic units of language" into industrial discourse power requires a complete stack of hard support, from energy and chips to computing power.
China happens to be at the starting line on this path.
China Creates Token Hard Currency
If we were to chart China's position in the global token economy as a chain, its starting point would be energy and its end point the global AI service market.
Imagine this scenario: wind turbines in the northwest Gobi convert wind into electricity, and the current flows along ultra-high-voltage lines into data centers; GPUs then convert that electrical energy into computing power, continuously producing tokens.
These digital units finally flow to all parts of the world through submarine cables and are exchanged for API call revenues denominated in US dollars.
In fact, China's scale in this chain is already large enough to form its own momentum.
Public data shows that as of March 2026, China's daily average token call volume has reached 140 trillion, a more than thousand - fold increase in two years.
Global monitoring over the same period also shows that the weekly call volume of Chinese large models has exceeded that of the United States for several consecutive weeks, by a margin of more than two to one, ranking first in the world.
So, why is China's token economy so strong?
It starts with cost, and the most critical variable is the electricity price.
In hydropower-rich regions such as Guizhou and Yunnan, and in provinces with abundant wind and solar resources such as Gansu and Xinjiang, industrial electricity prices have long been low; green electricity supplied specifically to computing centers can be as cheap as 0.15 yuan per kilowatt-hour in some places.
In contrast, in most parts of Europe and the United States, the industrial electricity price is generally several times higher than that in China.
For example, generating one million tokens takes roughly 15 to 20 kilowatt-hours of electricity. At northwest China's low-cost green electricity prices, that comes to only a few yuan; for the same computing task, industrial electricity in international markets usually runs between 60 and 200 US dollars per megawatt-hour.
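Running the quoted numbers as a back-of-the-envelope check (this sketch reads the 60-to-200-dollar figure as a per-megawatt-hour electricity price, and electricity is of course only one component of total serving cost):

```python
KWH_PER_MILLION_TOKENS = (15, 20)   # quoted energy estimate
CN_GREEN_PRICE = 0.15               # yuan per kWh, cheapest quoted cases
INTL_PRICE = (0.06, 0.20)           # $60-200 per MWh, converted to $/kWh

lo, hi = KWH_PER_MILLION_TOKENS
print(f"China:  {lo * CN_GREEN_PRICE:.2f}-{hi * CN_GREEN_PRICE:.2f} yuan per million tokens")
print(f"Abroad: ${lo * INTL_PRICE[0]:.2f}-${hi * INTL_PRICE[1]:.2f} per million tokens")
# About 2.25-3 yuan (well under half a US dollar) versus roughly
# $0.90-4.00: a several-fold gap on electricity alone.
```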
In comparison, China has built a cost moat from "electricity" to "tokens" with its advantages in energy and computing power costs.
More importantly, China has precisely matched a large surplus of green electricity that is hard to absorb fully with its exploding demand for computing power, forming a unique industrial closed loop.
In 2025, China's annual power generation exceeded 10 trillion kilowatt - hours, accounting for nearly one - third of the global total.
Among them, new energy sources such as wind and solar have experienced obvious "wind and solar curtailment" due to insufficient energy storage and limited transmission capacity.
As adjustable large-load users, data centers can raise their operating load during peaks of wind and solar generation, efficiently absorbing green electricity that would otherwise go to waste.
This not only lowers energy costs but also improves energy utilization, forming a systematic advantage that is difficult for other countries to replicate.
In recent years, the "Eastern Data and Western Computing" project has elevated this logic to the national strategic level, guiding data centers to be located in regions rich in renewable energy such as Guizhou, Inner Mongolia, and Ningxia.
This is equivalent to directly connecting the computing power centers to the "green electricity socket", efficiently converting the previously wasted wind and solar power into available AI computing power and continuously producing tokens.
Therefore, this AI competition, seemingly a contest of algorithms and models, is actually a new test of how deeply energy transformation and digital infrastructure can be integrated.
And China happens to stand right at that intersection.
Meanwhile, as AI moves from technological exploration to the depth of the industry, scenarios such as quality inspection and production scheduling in traditional manufacturing, risk control and compliance in financial services, and document processing in