With more than 20 million installations and out-of-the-box compatibility with mainstream GPUs and AI frameworks worldwide, OpenCloudOS has emerged as a leading choice for the AI era.
Despite continuous, substantial hardware investment by enterprises, industry data shows that effective GPU utilization has long remained below 30%. Even when enterprises increase hardware procurement budgets tenfold, the actual gain in usable computing power often falls short of threefold. Structural waste is becoming increasingly evident.
The industry generally attributes this inefficiency to three causes: resource fragmentation, tidal load fluctuations, and conflicts between online and offline tasks. The deeper contradiction, however, lies not only in scheduling but in the fragmentation of infrastructure that the entire industry faces. On one hand, the scale of AI training and inference continues to explode; on the other, the underlying hardware forms, upper-layer model frameworks, compilation environments, and acceleration libraries remain in a state of "a hundred schools of thought contending," with no unified standards. This ecosystem fragmentation forces developers to constantly adapt, optimize, and migrate across different hardware and frameworks, further reducing overall cluster efficiency.
Against this backdrop, how to reshape the underlying software stack through standardization, and how to achieve unified orchestration and efficient scheduling across heterogeneous computing power, have become central topics of industry discussion. This is also why this year's OpenCloudOS Operating System Ecosystem Conference attracted so much attention.
1 Focusing on "Ease of Use" and "Security," Fully Compatible with North-South Software and Hardware Ecosystems
On December 6, the 2025 OpenCloudOS Operating System Ecosystem Conference was held in Beijing. Nearly 30 ecosystem enterprises, including AMD, Arm, Muxi, Hygon Information, and Tencent Cloud, shared their latest progress in technological innovation, best practices, and collaborative construction.
Since its founding in 2021, the OpenCloudOS community has adhered to a development path of full-link self-controllability, full-scenario compatibility, and a fully open-source ecosystem. Thanks to Tencent Cloud contributing its years of accumulated kernel technology, cloud-native capabilities, and large-scale server operation experience, the community has grown into one of China's leading open-source operating system ecosystems. As of this year, the OpenCloudOS installed base exceeds 20 million nodes, serving more than 62,000 enterprise users, with more than 97,500 software and hardware adaptations completed.
On the ecosystem side, the community has gathered more than 1,200 ecosystem partners, over 400 in-depth partners, and more than 180,000 developers. As more vendors join the community, the OpenCloudOS ecosystem footprint has extended from traditional data centers to new scenarios such as cloud-native computing, edge computing, high-performance computing, and AI training and inference.
Over the past few years, the community has established a compatibility certification system covering multiple architectures, including x86, Arm, RISC-V, and Loongson. Users can deploy underlying dependencies with a single click via standard yum/dnf commands, eliminating complex compilation and debugging work. This has made OpenCloudOS one of the most broadly adapted open-source operating systems in China. The community has also incubated more than a dozen derivative operating systems, such as TencentOS, Donghua's NTOS, and Red Flag Linux, forming a virtuous cycle of open-source collaboration and commercial adoption.
At the technical level, as AI workloads become fully cloud-native, the underlying infrastructure faces unprecedented complexity. Large-model images can easily reach dozens of gigabytes, sharply increasing pull and distribution costs. The AI software stack has long dependency chains and frequent updates, making environment configuration ever more difficult. The rapid diversification of hardware has turned driver installation, version compatibility, and performance tuning into the heaviest operations burdens for enterprises. The larger the node scale, the more pronounced these problems become. In terms of cost, delivery efficiency, and resource utilization alike, traditional operating systems and toolchains can no longer meet the needs of the AI era. These real-world pressures make building a new generation of AI-oriented operating system capabilities both necessary and urgent.
Accordingly, OpenCloudOS has carried out systematic technical upgrades around the native needs of AI, focusing on four directions: lightweight images, rapid distribution, automated operations, and ecosystem adaptation.
First, to counter the cost burden of ballooning AI image sizes, OpenCloudOS has introduced image-miniaturization capabilities. Its self-developed chisel tool automatically removes redundancy and slices software packages, and, combined with static and dynamic dependency analysis, significantly compresses AI image volume, cutting build and transmission costs.
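As a rough illustration of the slicing idea, the sketch below keeps only the files identified by dependency analysis. The keep-list, paths, and workflow are hypothetical, not the actual chisel implementation:

```python
# Sketch of image "slicing": retain only the files a workload needs.
# The keep-list would come from static plus dynamic dependency analysis.
import os
import shutil

def slice_rootfs(rootfs: str, out: str, keep: set[str]) -> None:
    """Copy only whitelisted files from the full rootfs into a slim one."""
    for rel in sorted(keep):
        src = os.path.join(rootfs, rel.lstrip("/"))
        dst = os.path.join(out, rel.lstrip("/"))
        if os.path.isfile(src):
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)

# Union of static analysis (e.g. shared libraries resolved from ELF
# DT_NEEDED entries) and dynamic analysis (files observed being opened
# while the workload runs); entries here are placeholders.
keep = {"/usr/bin/python3", "/usr/lib64/libpython3.11.so.1.0"}
slice_rootfs("/var/rootfs/full", "/var/rootfs/sliced", keep)
```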
Second, to address the long pull times of large-model images, OpenCloudOS has built an image-acceleration system. Based on stargz-snapshotter, it implements lazy loading; it introduces FUSE passthrough on the kernel side to reduce access overhead; and it speeds up model startup by optimizing the prefetch strategy. It also uses chunk-level indexing to deduplicate image files, further reducing network and storage overhead.
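Chunk-level deduplication, at its core, indexes blobs by content hash so identical chunks are stored once. The minimal sketch below shows only that idea, with a fixed 1 MiB chunk size chosen for illustration; OpenCloudOS's actual indexing scheme is not public here:

```python
# Core idea of chunk-level deduplication: index blobs by content hash so
# that identical chunks are stored exactly once.
import hashlib

CHUNK = 1 << 20  # 1 MiB, illustrative chunk size

def index_blob(path: str, store: dict[str, bytes]) -> list[str]:
    """Record a blob as a list of chunk digests, filling the chunk store."""
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # duplicate chunks stored once
            digests.append(digest)
    return digests

store: dict[str, bytes] = {}
manifest_a = index_blob("layer-a.tar", store)  # placeholder file names
manifest_b = index_blob("layer-b.tar", store)  # shared chunks add no storage
```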
For large-scale cluster deployment, OpenCloudOS has also strengthened image distribution. Enhanced P2P acceleration mechanisms such as shard-level concurrency, out-of-order downloading, and Range-request proxying let images synchronize quickly within a cluster, with support for rate-limiting policies and RDMA acceleration, significantly shortening large-scale distribution times.
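The building block behind sharded, out-of-order downloading is the HTTP Range request. Here is a minimal sketch, assuming a server that honors Range headers and a known content length; the shard size and worker count are placeholders:

```python
# Sharded, out-of-order download via HTTP Range requests.
import concurrent.futures
import urllib.request

def fetch_range(url: str, start: int, end: int) -> tuple[int, bytes]:
    """Fetch bytes [start, end] of url with an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def download(url: str, size: int, shard: int = 8 << 20) -> bytes:
    """Download url as concurrent shards, assembling them out of order."""
    buf = bytearray(size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch_range, url, off, min(off + shard, size) - 1)
                   for off in range(0, size, shard)]
        for fut in concurrent.futures.as_completed(futures):
            start, data = fut.result()           # shards finish in any order
            buf[start:start + len(data)] = data  # each lands at its own offset
    return bytes(buf)
```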
To tame the operational complexity brought by the surge in heterogeneous accelerator cards, OpenCloudOS provides automated hardware services that identify devices, match appropriate drivers, and support the coexistence of multiple driver versions, fundamentally lowering the O&M threshold for hardware such as GPUs in cloud-native environments.
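Device identification of this kind typically starts from PCI vendor and device IDs. The sketch below reads them from Linux sysfs and looks them up in a driver table; the table entries and package names are purely illustrative, not OpenCloudOS's actual matching database:

```python
# Sketch of automatic accelerator discovery on Linux via sysfs.
from pathlib import Path

DRIVER_TABLE = {              # (vendor_id, device_id) -> driver package
    ("0x10de", "0x20b0"): "nvidia-driver",  # hypothetical mapping
    ("0x1002", "0x740f"): "amdgpu-dkms",    # hypothetical mapping
}

def scan_accelerators():
    """Yield (PCI address, driver package) for every recognized device."""
    for dev in Path("/sys/bus/pci/devices").iterdir():
        vendor = (dev / "vendor").read_text().strip()
        device = (dev / "device").read_text().strip()
        pkg = DRIVER_TABLE.get((vendor, device))
        if pkg:
            yield dev.name, pkg

for addr, pkg in scan_accelerators():
    print(f"{addr}: install {pkg} (multiple versions may coexist)")
```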
Facing a massive and rapidly iterating AI software stack, OpenCloudOS has built an agent-driven automated adaptation pipeline covering version tracking, build testing, and container packaging end to end. It has already adapted more than a thousand AI software packages and can automatically enable acceleration paths according to the hardware backend, giving users an out-of-the-box, performance-optimized experience. OpenCloudOS also provides a complete upper-layer AI environment, including RPM repositories, PyPI mirrors, and a range of AI container images, allowing users to complete environment deployment with simple commands and avoid duplicated setup work.
Through this series of upgrades across the entire AI pipeline, OpenCloudOS has systematically built the closed loop of operating system capabilities that cloud-native AI applications require. From image construction, pulling, and distribution to hardware management and software ecosystem coverage, it gives enterprises an efficient, lightweight, automated, and continuously evolving AI infrastructure foundation.
Supporting this series of forward-looking technical evolutions takes more than "advanced capabilities" alone. The real test is whether these capabilities form a verifiable value loop in industrial scenarios. The cooperation between OpenCloudOS and enterprises such as Hygon, Donghua Software, and Zuoyebang is a model of such value realization.
For many of Hygon's first-release chip versions, the key software suites come from the OpenCloudOS community, achieving compatibility and adaptation from the moment of release. Donghua Software has launched two self-developed operating systems built on the OpenCloudOS foundation, resolving long-standing business-system problems such as redundant dependencies, long vulnerability-patching chains, and privilege overreach, and markedly improving system stability and security.
Zuoyebang had long faced the compounded challenges of resource fragmentation, infrastructure fragmentation, and framework heterogeneity. Through a unified system foundation, OpenCloudOS makes GPU behavior, driver chains, and framework versions consistent across regions, enabling the scheduler to pool computing resources from a truly global perspective. From bottom-layer adaptation to upper-layer framework integration, the multi-version AI ecosystem OpenCloudOS has built no longer forces enterprises to "bet" on a particular type of hardware or a single framework; instead, every kind of hardware can reach its optimal configuration within the same operating system ecosystem. This capability has become the key foundation for Zuoyebang to solve its computing power utilization problem and build a unified computing power pool.
2 Evolving Deeply Toward AI: The OpenCloudOS Infra Intelligent Base Is Officially Released
As large models and AI applications enter the stage of large-scale deployment, the industry's core contradiction is shifting from "insufficient model capability" to "excessive computing power complexity." The tension between explosive growth in computing demand and an inconsistent, fragmented software and hardware landscape is increasingly pronounced, forcing developers to sink large amounts of time and manpower into tedious work such as driver adaptation, environment deployment, and framework compatibility, seriously hampering industrial innovation.
Against this backdrop, at the OpenCloudOS Operating System Ecosystem Conference, the OpenCloudOS community, together with partners including Ascend, Hygon, AMD, Muxi, Kunlunxin, vLLM, SGLang, Zuoyebang, and Tencent Cloud, jointly launched the "OpenCloudOS Infra Intelligent Base," aiming to build a unified AI computing power foundation and an open technology system driven by industry partners.
The logic behind this release is clear: for AI to become truly engineered, scalable, and affordable across the industry, a unified, stable, highly compatible, and continuously evolving "AI computing power foundation" must be established at the operating system level.
The fundamental reason OpenCloudOS can bring so many partners to the same table is that it addresses a pain point they all share: the enormous duplicated cost imposed by a fragmented computing power ecosystem.
For chip manufacturers, the absence of unified adaptation standards and a common software stack means spending heavily on basic driver adaptation every time a new product launches. For framework developers, every combination of operating system, driver, and hardware demands repeated performance tuning and stability verification. For enterprise users, deploying an AI framework often means fighting through dozens of dependency, conflict, and configuration barriers. Through the intelligent base, OpenCloudOS provides a unified interface, unified integration, and a unified runtime environment, letting different vendors collaborate within one ecosystem and fundamentally reducing technical friction across the industry chain.
Building on this collaborative mechanism, the OpenCloudOS Infra Intelligent Base offers a full-stack AI infrastructure system with three core layers: out-of-the-box AI, an AI software support ecosystem, and an AI hardware support ecosystem. Based on OpenCloudOS 9, the community has completed deep integration and verification of the official drivers and compute stacks of several mainstream domestic and international AI acceleration chips. Where developers once spent hours or even days manually downloading, compiling, and debugging drivers, they can now install all underlying dependencies with a single yum install or dnf install command, dramatically reducing environment preparation costs.
Specifically, what capabilities can the OpenCloudOS Infra Intelligent Base provide?
At the software and framework layer, OpenCloudOS has used containerization to complete deep adaptation, dependency pruning, and performance optimization for nearly 20 mainstream AI frameworks and agent applications, packaging them as standardized images that can be pulled and used directly. Traditionally, deploying an AI framework could take dozens of steps; within the intelligent base it takes three: install container dependencies with a single click, start the prefabricated framework, and start the service, cutting deployment time from days or hours to minutes (see the sketch below). This keeps developers from being bogged down by environment issues and gives enterprises a replicable, scalable basis for deploying AI services at scale.
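A minimal sketch of that three-step flow, written here with the Docker SDK for Python rather than the actual OpenCloudOS tooling; the image name and port are placeholders, and the container runtime is assumed to be installed already (step one):

```python
# Three-step deployment sketch: runtime assumed present, pull a prefab
# framework image, then start it as a service.
import docker  # pip install docker

client = docker.from_env()

IMAGE = "opencloudos/ai-framework:latest"  # hypothetical prefab image

client.images.pull(IMAGE)                  # step 2: pull the prefab framework
service = client.containers.run(           # step 3: start the service
    IMAGE,
    detach=True,
    ports={"8000/tcp": 8000},
)
print(service.short_id, service.status)
```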
At the performance and scheduling level, the intelligent base also brings significant gains. Container image volume is reduced by up to 94%, lowering storage and transmission costs; image and model distribution speed approaches the hardware limit; and the self-developed FlexKV distributed KVCache system can cut first-token latency by about 70% under high concurrency. These optimizations, targeted at the characteristics of AI workloads, let OpenCloudOS not merely "run AI" but run it efficiently, stably, and at scale.
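The first-token saving comes from reusing cached key/value tensors for prompt prefixes, so prefill runs only on the uncached suffix. FlexKV's internals are not public here; the toy sketch below illustrates only the prefix-lookup idea, using an in-process dictionary where FlexKV is distributed:

```python
# Toy illustration of KVCache prefix reuse for faster first tokens.
cache: dict[tuple[int, ...], object] = {}  # prefix token IDs -> KV tensors

def longest_cached_prefix(tokens: list[int]):
    """Return (hit length, cached KV) for the longest cached prefix."""
    for n in range(len(tokens), 0, -1):
        key = tuple(tokens[:n])
        if key in cache:
            return n, cache[key]  # prefill can skip the first n tokens
    return 0, None                # cold start: full prefill required

hit, kv = longest_cached_prefix([101, 2054, 2003])  # placeholder token IDs
```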
OpenCloudOS has also extended these AI-ready capabilities to the cloud. The OpenCloudOS images available on Tencent Cloud's HAI platform come with CUDA components preinstalled, so users get an out-of-the-box AI development and inference environment without manual configuration, achieving seamless local-to-cloud collaboration. This lets enterprises build, verify, and launch AI services quickly, further shortening the engineering cycle.
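On such an image, a quick sanity check might look like the following, assuming PyTorch is part of the preinstalled environment (the article does not specify the exact framework set):

```python
# Verify the preinstalled CUDA stack is visible to the framework.
import torch

print(torch.cuda.is_available())   # True if driver and CUDA runtime work
print(torch.cuda.device_count())   # number of visible GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```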
3 Conclusion
Looking back at the conference as a whole, the past few years of OpenCloudOS's technical evolution and ecosystem expansion now show a clear direction: AI-era infrastructure is no longer a stack of single-point optimizations but a systematic project spanning chips, frameworks, and scenarios. Whether it is underlying capabilities such as image miniaturization, on-demand loading, and P2P acceleration, the intelligent base's unified support for diverse computing power, or the automated adaptation of more than a thousand AI packages and frameworks, these seemingly independent technical moves converge on one goal: to let developers, hardware vendors, and industry applications truly stand on the same usable, easy-to-use, stable, and controllable operating system foundation.
The significance of this conference goes beyond the release of new technical capabilities or ecosystem plans; it announces a new AI infrastructure paradigm: in an era of explosive computing power, diverse models, and rapid framework iteration, real innovation lies not in single-point performance gains but in improving the collaboration efficiency and systemic resilience of the whole industry chain.
OpenCloudOS is making that goal concrete: through a sustainable technical path, standardized ecosystem interfaces, and an open co-construction community mechanism, it is making AI infrastructure more inclusive, reliable, and deployable at scale.
This article is from the WeChat official account “InfoQ” (ID: infoqchina). Author: Dongmei. Republished by 36Kr with permission.