Domestic "100,000-GPU" Clusters Begin to Take Shape
In today's era of rapid artificial intelligence development, computing power has become a core element of competitiveness, and the scale of a GPU fleet, together with the computing power it supports, is one of the most important determinants of large-model capability. It is generally held that 10,000 NVIDIA A100 chips represent the computing power threshold for developing large AI models.
In 2024, the construction of intelligent computing centers in China entered the fast lane, the clearest sign being the accelerated rollout of ten-thousand-GPU cluster projects. A ten-thousand-GPU cluster is a high-performance computing system built from 10,000 or more dedicated AI acceleration chips such as GPUs and TPUs. It deeply integrates cutting-edge technologies, including high-performance GPU computing, high-speed network communication, large-capacity parallel file storage, and intelligent computing platforms, fusing the underlying infrastructure into a single "computing power behemoth". With such clusters, models with hundreds of billions or even trillions of parameters can be trained efficiently, significantly shortening iteration cycles and accelerating the evolution of AI technology.
However, as the concept of AGI continues to heat up, the industry's thirst for computing power has grown even more intense. Ten-thousand-GPU clusters are increasingly struggling to meet the explosive growth in demand, and the "arms race" in computing power is intensifying. One-hundred-thousand-GPU clusters have now become the strategic high ground for the world's top large-model companies. International giants such as xAI, Meta, and OpenAI have all laid out plans, and domestic enterprises, unwilling to lag behind, are actively joining the competition.
Building a One-Hundred-Thousand-GPU Cluster Poses Enormous Challenges
Globally, leading technology companies such as OpenAI, Microsoft, xAI, and Meta are racing to build GPU clusters of over 100,000 cards. Behind these ambitious plans lies staggering capital investment: the cost of servers alone exceeds $4 billion. On top of that, space constraints in data centers and insufficient power supply act as roadblocks, slowing project progress.
In China, the GPU procurement cost alone for a ten-thousand-GPU cluster can reach billions of yuan, which is why, until recently, only a few large enterprises such as Alibaba and Baidu could deploy clusters at that scale. One can imagine how "money-burning" a one-hundred-thousand-GPU deployment would be.
Besides the capital cost, building a one-hundred-thousand-GPU cluster also faces many technical challenges.
First, power supply and heat dissipation are tested to the extreme. A one-hundred-thousand-H100 cluster requires approximately 150 MW for its critical IT equipment alone, far exceeding the capacity of a single data-center building. Power must be distributed across multiple buildings on a campus, while voltage fluctuation and stability issues are managed at the same time. The cooling system must match the enormous heat load: if the heat generated by densely packed GPUs cannot be removed promptly, equipment goes down, and the energy consumption and maintenance cost of an efficient cooling solution must be optimized in parallel. GPUs are highly sensitive hardware; even intra-day temperature fluctuations affect their failure rate, and the larger the scale, the higher the probability that something fails. When Meta trained Llama 3 on a cluster of 16,000 GPUs, a failure occurred on average every 3 hours.
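The figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only; it assumes roughly 700 W per H100 SXM board and independent, linearly scaling failure rates, neither of which comes from the article.

```python
# Rough sanity check of the power and reliability figures cited above.
# Assumptions (mine, not the article's): ~700 W per H100 SXM board;
# failures are independent, so cluster failure rate scales linearly with N.

H100_TDP_W = 700
GPUS = 100_000

gpu_power_mw = GPUS * H100_TDP_W / 1e6
print(f"GPU board power alone: {gpu_power_mw:.0f} MW")
# CPUs, memory, networking, storage, and distribution losses roughly
# double this, consistent with the ~150 MW critical-IT figure above.

# Meta's Llama 3 run: one failure every ~3 h across 16,000 GPUs
# implies a per-GPU mean time between failures of:
per_gpu_mtbf_h = 3 * 16_000              # 48,000 hours (~5.5 years)
cluster_interval_h = per_gpu_mtbf_h / GPUS
print(f"Expected failure interval at 100k GPUs: "
      f"{cluster_interval_h * 60:.0f} minutes")
```

At this scale a failure roughly every half hour is the expected baseline, which is why automated checkpointing and fault isolation dominate the operation of such clusters.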
In addition, unlike the largely serial workloads of traditional CPU clusters, large-model training requires all GPUs to compute in parallel simultaneously, which places far greater demands on the network. If a fat-tree topology is used to give every GPU full-bandwidth interconnection, the hardware cost of four layers of switching grows steeply. The usual answer is the "computing island" model: high bandwidth within an island to preserve communication efficiency, lower bandwidth between islands to control cost. This, however, requires precisely balancing how communication is assigned under different parallelism schemes, such as tensor parallelism and data parallelism, to avoid bandwidth bottlenecks caused by flaws in the topology. Especially once model scale exceeds one trillion parameters, front-end network traffic surges as sparsity techniques are applied, and latency and bandwidth must be traded off with great care.
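To see why full-bandwidth interconnection becomes prohibitive around this scale, consider the classic three-tier fat-tree: built from k-port switches, it supports at most k³/4 hosts using 5k²/4 switches. The sketch below is a generic illustration; the radix values are my assumptions, not figures from the article.

```python
# Host and switch counts for a classic 3-tier fat-tree built from
# k-port switches: k pods of k switches each, plus (k/2)^2 core
# switches, supporting k^3/4 hosts at full bisection bandwidth.

def fat_tree(k: int) -> tuple[int, int]:
    hosts = k**3 // 4
    switches = k * k + (k // 2) ** 2   # pod switches + core = 5k^2/4
    return hosts, switches

for k in (32, 64, 128):
    hosts, switches = fat_tree(k)
    print(f"k={k:3d}: up to {hosts:>7,} hosts, {switches:>6,} switches")
```

A radix of 64 tops out at 65,536 endpoints, short of 100,000, so a full-bandwidth network needs higher-radix switches or a fourth switching layer; computing islands sidestep this by oversubscribing only the inter-island links.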
Finally, compared with their American counterparts, Chinese large-model enterprises face an additional difficulty. For well-known reasons, domestic companies cannot, as Elon Musk did, simply adopt an all-NVIDIA solution; they must instead use heterogeneous chips, including domestic GPUs. This means that even with the same 100,000 cards, it is difficult for domestic enterprises to match the computing power of their American counterparts.
Computing power is the core of large-model development, but its growth has shifted from linear to planar. Building a one-hundred-thousand-GPU cluster is not only about increasing computing power; it also brings its own technical and operational challenges, and managing a one-hundred-thousand-GPU cluster is fundamentally different from managing a ten-thousand-GPU one.
Accelerated Rollout of Domestic "One-Hundred-Thousand-GPU" Clusters
"There's no need to worry about the chip issue. Using methods such as stacking and clustering, the computational results can be comparable to the most advanced level." These remarks by Ren Zhengfei, founder of Huawei, not only boosted society's confidence in China's AI development but also underscored the central role of cluster computing in AI R&D and application. From the former "entry ticket" of ten-thousand-GPU clusters to the new goal of one-hundred-thousand-GPU clusters, the construction of domestic intelligent computing centers keeps reaching new heights.
In September last year, the second phase of the "Computing Ocean Plan", a single-cluster construction program targeting ultra-large-scale computing power at the one-hundred-thousand-GPU level, was announced. The plan, whose name evokes "embracing all rivers, gathering sand into a tower", aims to build a large-scale single cluster for model training. The second phase was initiated by Beijing Parallel Technology Co., Ltd. (hereinafter Parallel Technology); partners attending the launch ceremony included Beijing Zhipu Huazhang Technology Co., Ltd., Beijing Mianbi Intelligence Technology Co., Ltd., the Wuhan Branch of China Mobile Communications Group Hubei Co., Ltd., the Wuhan Branch of China United Network Communications Co., Ltd., the Wuhan Branch of China Telecom Co., Ltd., the Information Center of Wuhan University, and Inner Mongolia Xindong Jitai Technology Co., Ltd. In Helinge'er, Inner Mongolia, the first-phase project of the plan, covering more than 50 mu (roughly 3.3 hectares), went into operation in May this year. It is planned for 4,000 high-power 20 kW intelligent computing cabinets, with capacity to support a single intelligent computing cluster of up to 60,000 cards. Less than 100 meters away, the second-phase project has already been planned; it will rely on a single large cluster under unified management and scheduling, capable of accommodating up to 100,000 cards of computing power.
By the end of July 2024, Gansu Yisuan Intelligent Technology Co., Ltd. had invested 307 million yuan in Qingyang to build China's first domestic ten-thousand-GPU inference cluster. In June this year, Gansu Yisuan and its ecosystem partners announced plans to invest 5.5 billion yuan to build a "domestic one-hundred-thousand-GPU computing power cluster" providing no less than 25,000P of computing power services, expected to be completed and put into use by December 30, 2027. The cluster, to be located in Qingyang, will use entirely domestic chips and independent architectures, deeply combining Qingyang's energy advantages with the technological strength of the Yangtze River Delta, forming a national linkage of "western computing power + eastern intelligence", creating an open computing power platform, and laying a "Chinese foundation" for large-model training and scientific computing.
ByteDance also has ambitious plans in intelligent computing. In 2024, its capital expenditure reached 80 billion yuan, approaching the combined total of BAT (about 100 billion yuan). In 2025 this figure is expected to double to 160 billion yuan, with 90 billion yuan for AI computing power procurement and 70 billion yuan for data-center infrastructure and supporting hardware. According to one third-party estimate, using a 400 TFLOPS (FP16) AI accelerator card as the baseline, ByteDance's current training demand is approximately 267,300 cards and its text-inference demand approximately 336,700 cards; future inference demand is expected to exceed 2.3 million cards.
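For intuition, those card counts can be converted back into aggregate compute using the article's 400 TFLOPS (FP16) per-card baseline. The conversion below is a simple sketch of mine, not part of the third-party estimate itself.

```python
# Convert the estimated card counts into aggregate FP16 compute,
# using the article's baseline of 400 TFLOPS (FP16) per card.

CARD_TFLOPS_FP16 = 400

def cards_to_eflops(cards: int) -> float:
    return cards * CARD_TFLOPS_FP16 / 1e6   # 1 EFLOPS = 1e6 TFLOPS

for label, cards in [("training", 267_300),
                     ("text inference", 336_700),
                     ("future inference", 2_300_000)]:
    print(f"{label}: {cards:,} cards = "
          f"{cards_to_eflops(cards):.1f} EFLOPS (FP16)")
```

By this measure, ByteDance's projected future inference demand alone would be well over an order of magnitude larger than China's entire 2023 intelligent computing supply cited later in this article.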
Domestic AI Chip Companies Stand to Benefit
In this upsurge, domestic AI chip companies capable of building one-hundred-thousand-GPU clusters will also benefit.
At the Huawei Developer Conference 2025 (HDC 2025) on June 20, Zhang Ping'an, executive director of Huawei and CEO of Huawei Cloud, announced the full launch of the new-generation Ascend AI cloud service based on the CloudMatrix384 super-node, providing powerful computing power for large-model applications. By cascading 432 super-nodes, a computing cluster of 160,000 cards can be built to meet the training needs of models with up to one hundred trillion parameters, breaking through the expansion limits of traditional architectures.
The service is built on the CloudMatrix384 super-node, which innovatively interconnects 384 Ascend NPUs and 192 Kunpeng CPUs through MatrixLink, a new high-speed, fully peer-to-peer network, forming one super "AI server" whose single-card inference throughput has jumped to 2,300 tokens/s.
The super-node architecture is particularly well suited to inference for Mixture of Experts (MoE) large models, enabling "one card per expert": a single super-node can run 384 experts in parallel, greatly improving efficiency. It also supports "one card per computing task", flexibly allocating resources, improving parallel task processing, reducing waiting time, and raising effective computing power utilization (MFU) by more than 50%. In addition, the super-node supports integrated deployment of training and inference, such as "inference by day, training by night", with computing power flexibly reallocated to help customers optimize resource utilization.
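The "one card per expert" idea can be illustrated generically with expert parallelism: each MoE expert is pinned to its own device, and each token is dispatched to the device hosting the expert its router picked. The toy router and dispatch mapping below are my own illustration, not Huawei's implementation.

```python
# Generic illustration of "one card per expert" expert parallelism:
# expert index == device index, so routing a token is a direct lookup
# rather than sharding a single expert across many cards.

NUM_EXPERTS = 384   # matches the 384 NPUs per super-node cited above

def route(token_id: int) -> int:
    """Toy stand-in for a learned MoE router: hash token -> expert."""
    return token_id % NUM_EXPERTS

def device_for(token_id: int) -> int:
    # One expert per card: the target device is the expert index itself.
    return route(token_id)

# Dispatch a batch and inspect per-device load.
loads = [0] * NUM_EXPERTS
for tok in range(1_000):
    loads[device_for(tok)] += 1

print(f"busiest card: {max(loads)} tokens, idlest card: {min(loads)} tokens")
```

In a real MoE system the router is learned and load balancing is the hard part; the point here is only the mapping that lets 384 experts run concurrently, one per card.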
In addition, Baidu's Baige 4.0 has achieved efficient management of one-hundred-thousand-GPU clusters through a series of product and technology innovations, including the HPN high-performance network, automated hybrid-parallel partitioning strategies, and self-developed collective communication libraries.
Tencent announced last year that its self-developed Xingmai high-performance computing network had been comprehensively upgraded. Xingmai 2.0 is equipped with fully self-developed network equipment and AI network interface cards, supports networking at a scale of over 100,000 cards, improves network communication efficiency by 60% over the previous generation, and raises large-model training efficiency by 20%.
Alibaba, for its part, has disclosed that Alibaba Cloud can achieve efficient coordination among chips, servers, and data centers, supports clusters scalable to 100,000 cards, and already serves half of China's large-model AI companies.
Computing Power Internet and the East-to-West Computing Initiative Unblock Market Bottlenecks
At present, the shortage of intelligent computing power in China is acute: large models' demand for computing power is growing far faster than the performance of any single AI chip. Industry reports show that in 2023, China's demand for intelligent computing power reached 123.6 EFLOPS while supply was only 57.9 EFLOPS, an obvious gap. Using cluster interconnection to make up for the performance shortfall of individual cards may be the most promising and effective way to alleviate the AI computing power shortage at this stage.
Once a one-hundred-thousand-GPU cluster is built, how to fully realize its value, putting it to best use in suitable scenarios such as AI training and big-data analysis while preventing resources from sitting idle, is an urgent problem. Building an intelligent computing center is only the beginning; what matters more is effective utilization afterward, that is, unblocking the market. Against this backdrop, the Computing Power Internet and the East-to-West Computing Initiative have been proposed and have attracted wide attention.
The Computing Power Internet is not a brand-new network. It builds on the existing Internet, connecting scattered computing power resources across regions. With standardized computing power identifiers and protocol interfaces, it forms an inter-domain resource interconnection network that enables intelligent perception, real-time discovery, and on-demand access to heterogeneous computing power network-wide. Put simply, it is a network serving the flow of computing power, intended to promote interconnection, revitalize existing resources, improve utilization, reduce costs, and give users a better experience. On May 17, the China Academy of Information and Communications Technology, together with the three major operators, launched construction of the "Computing Power Internet Test Network" and released "Computing Power Internet Architecture 1.0". The initiative aims to interconnect the operators' own computing power with scattered social computing power across the country, spanning general computing, intelligent computing, supercomputing, and public resources across cloud, edge, and device, so that users can conveniently find, schedule, and use computing power. In the future, users are expected to buy computing power flexibly by the "card-hour", just as they pay for electricity by the kilowatt-hour, paying only for what they use.
The East-to-West Computing Initiative builds a new computing power network system integrating data centers, cloud computing, and big data, steering eastern computing demand westward in an orderly way, optimizing the layout of data-center construction, and promoting coordination between east and west. In February 2022, the state launched national computing power hub nodes in eight regions, including Beijing-Tianjin-Hebei, the Yangtze River Delta, the Guangdong-Hong Kong-Macao Greater Bay Area, Chengdu-Chongqing, Inner Mongolia, Guizhou, Gansu, and Ningxia, and planned ten national data-center clusters, marking the initiative's full official launch. Its core purpose is to let western computing power resources more fully support eastern data operations and empower digital development: relieving the east's energy constraints on one hand, and opening a new development path for the west on the other.
Promoted in coordination, the Computing Power Internet and the East-to-West Computing Initiative are expected to unblock market bottlenecks, optimize the allocation of computing power resources, and sustain the healthy development of China's AI industry. The Computing Power Internet enables cross-regional, cross-industry circulation of computing power, improving utilization; the East-to-West Computing Initiative leverages the west's energy and land to cut computing costs and ease the pressure of data-center construction in the east. The two complement each other, jointly addressing the imbalance between supply and demand of intelligent computing power in China.
If 2024 was the first year of ten - thousand - GPU clusters in China, 2025 will see the arrival of one - hundred - thousand - GPU clusters.
This article is from the WeChat official account "Semiconductor Industry Insights" (ID: ICViews). The author is Peng Cheng, and it is published by 36Kr with authorization.