
OpenAI o1 triggers a new revolution in inference computing power, and Quijing Technology releases a new product to help enterprises achieve efficient inference

36Kr Brand | 2024-10-09 16:04
Compared with single-point optimization of the GPU, an inference all-in-one machine built on full-system optimization can cut the cost of deploying large models by more than ten times.

With the highly anticipated GPT-5 repeatedly delayed, OpenAI's newly released o1 model restored industry confidence at just the right moment, and in doing so opened a new competitive front for the large model field: now that reasoning models are ascendant, how can the industry start at the Infra level to cut the compute cost of the inference stage?

In mid-September, OpenAI's long-awaited "Strawberry" finally arrived under the new name "o1". It broke with the previous technical paradigm, lifting the model's reasoning ability to a new level by scaling up compute at the inference stage (the inference-time scaling law).

This technical path, which OpenAI calls a "new paradigm", has brought a series of changes to the large model field. Most importantly, it has pushed compute infrastructure for inference to center stage: to get the desired results, the industry must crack the "impossible triangle" of inference quality, efficiency, and cost.

From the perspective of AI Infra, this opens a new opportunity for startups. Compute-layer players represented by Quijing Technology have begun to promote an all-in-one product that, rather than trading training performance against inference performance, focuses squarely on optimizing cost in inference scenarios. This may prove to be a significant new chapter in the technical narrative of the large model field.

Moreover, whereas mainstream industry solutions mostly make single-point optimizations to GPU utilization, Quijing Technology's large model knowledge inference all-in-one machine adopts the industry's first full-system inference architecture. Through its "computing power conversion by storage" technology (in effect, trading storage for compute), it turns storage capacity into a supplement to compute and reduces the demand for raw compute; at the same time, following the idea of "heterogeneous collaboration", it tightly couples HBM, DRAM, and SSD with CPU, GPU, and NPU across the whole system, breaking through the GPU memory capacity limit and fully unleashing the storage and compute resources of the entire machine.

This innovative solution breaks through the theoretical optimization ceiling of previous approaches, integrates all of a machine's heterogeneous compute resources, raises inference throughput by more than 10 times, and greatly reduces the cost of deploying large models.

As a solution that combines high performance, low cost, and high efficiency, Quijing Technology's large model knowledge inference all-in-one machine will help push large models from "training freedom" at the R&D level to "inference freedom" at the deployment level.

From Training to Inference: The Paradigm Shift in Scaling Laws

The pre-training-stage scaling law of the GPT-1 through GPT-4 era relied on the joint growth of model parameters, training compute, and data scale. But as the high-quality data that naturally exists on the Internet is gradually exhausted (even assuming compute remains unconstrained), this triangle has hit a bottleneck and can no longer deliver fast, reliable gains in model performance.

The emergence of OpenAI o1, however, has at least partly lifted this fog. By combining techniques such as reinforcement learning, o1 internalizes chain-of-thought (CoT) ability into the model itself: it decomposes complex problems into a sequence of related, simpler ones and imitates the way humans solve problems step by step, improving the model's reasoning performance.

Behind this, o1's defining characteristic is that it consumes a large amount of inference compute to generate long chains of thought, both during training and in subsequent deployment. As a result, o1 not only performs well out of pre-training, but can also keep improving by spending more compute at inference time after training is done, the so-called inference-time scaling law.
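
To make the mechanism concrete: one widely known embodiment of inference-time scaling is self-consistency sampling, where the system generates several independent chains of thought and keeps the majority answer, buying accuracy with extra inference compute. The sketch below illustrates that general pattern only; `generate_cot` is a hypothetical stand-in for a model call, and nothing here reflects OpenAI's internal method.

```python
from collections import Counter

def generate_cot(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a sampling model call that returns a
    chain-of-thought completion ending in a line like 'Answer: 42'."""
    raise NotImplementedError("wire this to an actual model API")

def extract_answer(completion: str) -> str:
    # Take whatever follows the final 'Answer:' marker.
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency(prompt: str, n_samples: int = 16) -> str:
    """Inference-time scaling in miniature: spend n_samples times the
    compute to sample independent reasoning chains, then return the
    majority-vote answer, which is typically more reliable."""
    answers = [extract_answer(generate_cot(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```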

At this point, the second scaling-law curve on the AI field's journey toward AGI was officially established.

In this new stage, the focus of compute consumption shifts from the training layer to the inference layer. The inference layer directly determines how well a model performs once deployed, so securing inference compute has become the core requirement for the large model industry's sustainable development. Moreover, once this technical paradigm becomes the new industry norm, demand for inference compute across the whole pipeline, from R&D to application deployment, will grow explosively. For models to scale in real applications, the efficiency and cost of inference compute therefore become the key optimization targets.

This means the new paradigm of compute infrastructure centered on inference has officially arrived.

How Can the Costly Traditional Compute Cluster Model Be Reformed?

How important is the change at the inference-compute layer brought by OpenAI o1? Its tightly limited usage quota is a clear example. Whereas each paying ChatGPT Plus user could already use GPT-4o up to 4,480 times per week (equivalent to 80 messages every 3 hours), use of o1-preview and o1-mini was limited to just 30 and 50 times per week, respectively.

In addition, o1's response time has grown from the seconds of the GPT series to tens of seconds or longer, and the amount of text it must process has increased substantially; both imply higher compute costs. Even so, o1 drove the share prices of NVIDIA and Microsoft upward while it was still only a rumor.

This high inference cost makes it harder for enterprises deploying large models to balance quality, efficiency, and cost, forming the so-called "impossible triangle": achieving strong results with a more capable model, delivering low response latency in production, and doing so at low cost, all at once. This is precisely the pain point of large model adoption across industries today.

One likely reason is that compute clusters built under the traditional paradigm were designed primarily for training, which makes them expensive to operate in the inference stage.

Take NVIDIA's HGX platform, one of the compute solutions widely used for large model training. Because it is designed to support large-scale training, it typically combines multiple high-end GPUs with high-speed interconnects. For inference workloads, that expensive hardware may simply not be economical.

In other words, the same technical choices that make the traditional compute paradigm strong in training confer no particular advantage in inference, and can even become a drag.

Some in the industry have likened the expensive "GPU super node" of traditional compute builds to an "AI mainframe". The metaphor draws on a historical lesson: the big-data era only truly began when the industry shifted from expensive mainframes from companies such as IBM to low-cost clusters of x86 servers optimized as whole systems, an approach exemplified by Google.

By the same logic, enabling a broad range of enterprises to deploy large models effectively requires a solution with clear advantages in both cost and deployment efficiency. Finding the "x86 server" of the large model era has become both urgent and inevitable.

The Large Model Knowledge Inference All-in-One Machine: A Highly Feasible Solution

Among China's current AI Infra startups, this track is not short of attempted solutions. Quijing Technology, a new entrant that joined the race in early 2024 and closed an angel round in the first half of the year, offers a new option: the large model knowledge inference all-in-one machine.

Previously, the industry's technical efforts focused on squeezing more utilization out of the GPU, but against the new market demand this cannot close a gap of multiple orders of magnitude. The mismatch between compute performance and inference workload demand is even more severe where domestic GPUs still lag significantly behind NVIDIA's.

Quijing Technology's product upgrade therefore starts from the architecture. Its first-of-its-kind full-system inference architecture coordinates storage, CPU, GPU, NPU, and other devices in concert, breaking through the theoretical optimization ceiling of previous solutions, integrating all of a machine's heterogeneous compute resources, raising inference throughput by more than 10 times, and greatly reducing the cost of deploying large models.

Quijing Technology's "Large Model Knowledge Reasoning All-in-One Machine" product, through the way of software and hardware integration, supports the local deployment of large models above the 10-billion level, and opens API interfaces to support the flexible invocation of third parties. At the same time, it provides an enterprise user interface to provide "out-of-the-box" privatized large model deployment and reasoning services. Its core advantages are high performance, low cost, and high efficiency, basically solving most of the concerns of enterprises in landing large models.

This comes from Quijing Technology's distinctive technical view of the industry: optimizing large model inference cannot focus on the GPU alone, because disk, memory, and CPU can also supply "heterogeneous compute" for the model. Two core technical strategies follow from this view.

The first is Quijing Technology's industry-first "computing power conversion by storage" technology.

Early large model inference architectures treated each inference as an independent request, lacking the "memory" needed for efficient processing. Even with subsequent technical updates, large models have still mainly relied on "rote memorization".

To address this, Quijing Technology designed the innovative "Fusion Attention" mechanism to exploit storage capacity. Even when facing a new question, the system can extract reusable portions of historically relevant computation, fuse them with fresh context, and compute online. This significantly raises the reuse of historical computation results and thereby cuts the amount of new computation required.

In this way, especially in RAG scenarios, "computing power conversion by storage" can cut response latency by a factor of 20 and raise performance by a factor of 10.
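
The article describes "computing power conversion by storage" only at a high level, but the general idea it gestures at, reusing previously computed intermediate results instead of recomputing them, resembles prefix KV-cache reuse in mainstream serving stacks. Below is a minimal Python sketch of that general idea, assuming a toy in-memory store; it is an illustration, not Quijing Technology's actual implementation.

```python
import hashlib

class PrefixKVCache:
    """Toy prefix cache: keep per-prefix attention KV states in a store
    (a dict here, standing in for DRAM/SSD tiers) so that requests
    sharing a prefix, e.g. the same retrieved RAG passages, can skip
    recomputing it and only process the new tokens."""

    def __init__(self):
        self._store = {}  # prefix fingerprint -> cached KV states

    def _key(self, token_ids: tuple) -> str:
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    def lookup(self, token_ids: tuple):
        """Return (cached_kv, n_cached_tokens) for the longest cached
        prefix of token_ids; (None, 0) on a complete miss."""
        for end in range(len(token_ids), 0, -1):
            kv = self._store.get(self._key(token_ids[:end]))
            if kv is not None:
                return kv, end
        return None, 0

    def insert(self, token_ids: tuple, kv_states) -> None:
        self._store[self._key(token_ids)] = kv_states
```

In a RAG workload, where many queries share the same retrieved documents, this kind of reuse is precisely where latency and throughput gains of the claimed magnitude would have to come from.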

Building on this, Quijing Technology's pioneering "full-system heterogeneous collaboration" architecture is the other key technical pillar. It is the first inference framework to support a 1-million-token ultra-long context on a single GPU card, and the first to run a 200-billion-parameter MoE super-large model on a single GPU.
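
Running a 200-billion-parameter MoE model on a single GPU typically requires a placement policy that keeps dense, every-token weights in scarce VRAM while serving rarely activated expert weights from DRAM via the CPU. The sketch below illustrates such a policy in the abstract; the class, sizes, and thresholds are illustrative assumptions, not the framework's actual code.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    size_gb: float
    is_sparse_expert: bool  # MoE experts activate for only a few tokens each

def place_layers(layers: list, vram_gb: float) -> dict:
    """Illustrative heterogeneous placement: dense layers run every
    token, so they get VRAM priority; sparse MoE experts spill to
    CPU/DRAM, where the CPU computes them and streams results back."""
    placement, used = {}, 0.0
    for layer in sorted(layers, key=lambda l: l.is_sparse_expert):
        if not layer.is_sparse_expert and used + layer.size_gb <= vram_gb:
            placement[layer.name] = "gpu"
            used += layer.size_gb
        else:
            placement[layer.name] = "cpu"  # DRAM-backed, CPU-executed
    return placement

# Example: a 24 GB card, a dense backbone, and 32 large expert shards.
layers = [Layer("attention_stack", 10.0, False), Layer("shared_ffn", 8.0, False)]
layers += [Layer(f"expert_{i}", 5.0, True) for i in range(32)]
print(place_layers(layers, vram_gb=24.0))
```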

Quijing Technology, together with Tsinghua University, has open-sourced the personal edition of this heterogeneous collaborative inference framework on GitHub under the name KTransformers, where it has drawn wide attention and discussion in open-source communities such as Hugging Face. Industry partners have shown strong interest as well, and several well-known large model companies have proactively reached out to build large model inference projects with the company.

The full-system heterogeneous collaboration architecture in Quijing Technology's all-in-one machine is the full commercial edition: on top of the open-source version it delivers stronger collaborative performance and adds capabilities such as multi-card high-concurrency scheduling and RAG support for teams.

Looking back at the industry's development, AI Infra optimization has mainly targeted two processes: model training and inference. The training stage, covering parameter tuning, data pipelines, and similar work, is the "mainstream direction" pursued by many large companies and startups. Quijing Technology, however, anticipated that as model architectures converge on the Transformer and its variants, the technical moat and headroom in training optimization are shrinking, and that inference-side optimization therefore holds the greater potential.

It is fair to say that OpenAI o1 has opened a new technical paradigm for large models, and Quijing Technology has caught the corresponding technical trend and breakout opportunity at the AI Infra layer.

Facing the new technical cycle, the innovation in inference compute infrastructure represented by Quijing Technology stands a real chance of breaking the bind of ever-growing compute demand. With a solution that is high-performance, low-cost, and highly efficient, it will help push large models from "training freedom" at the R&D level to "inference freedom" at the deployment level.