DeepSeek forces vLLM to upgrade. Amid cut-throat competition in the chip market and the spread of MoE across numerous models, an exclusive response from vLLM's core maintainer: how to maintain a dominant position in inference with PyTorch.
The story of vLLM began with a group of enthusiastic students and researchers at the Sky Computing Lab of the University of California, Berkeley. In 2023, they open-sourced the core PagedAttention technology. In just over a year, vLLM's GitHub stars exceeded 40,000 and have since grown to the current 65,000. It has now become the preferred inference engine for technology companies worldwide.
Behind this success, Neural Magic played a crucial role. Founded by MIT researchers, this company stands out in the highly competitive AI optimization field with its unique strategy of "free platform + open-source tools". By making in-depth contributions to vLLM, Neural Magic has not only built a mature enterprise-level inference stack but also continuously promoted model optimization research and maintained a pre-optimized model library that can be directly integrated with vLLM.
It was this deep expertise and engineering strength in the vLLM open-source community that caught Red Hat's attention. In November 2024, Red Hat officially acquired Neural Magic, bringing in the core team, including vLLM core maintainer Michael Goin. Michael has over a decade of experience optimizing inference performance and maximizing CPU/GPU efficiency. In the vLLM community, he focuses on kernel tuning, model compression, and system optimization.
After Red Hat became an important participant, much changed in the field of large AI models. How did vLLM respond to these changes and challenges? How did Red Hat help vLLM maintain its competitive edge? We interviewed Michael Goin, chief engineer at Red Hat and a core contributor to vLLM, and Zhang Jiajue, chief architect of the Red Hat Asia-Pacific CTO Office and CTO of Greater China. They detailed vLLM's recent development and shared their thinking from this period.
Michael Goin, chief engineer at Red Hat and core contributor to vLLM
From Llama to DeepSeek
As the "kernel team" of the vLLM project, Michael's team has always focused on integrating and developing high-performance inference kernels, which has kept the entire project ahead through rapid iteration.
With the release of new models, the pace of vLLM development has continued to accelerate. After the release of DeepSeek R1 in particular, the team shifted its focus from optimizing the Llama series to fully investing in DeepSeek-related features.
To respond quickly to DeepSeek's new features, the development cycle of the entire 0.7.2 release was very tight. The release also added efficient support for Qwen 2.5 VL and introduced the Transformers backend, which lets users run any Hugging Face model directly. The subsequent 0.7.3 version was an even larger update: many contributors participated in a short period, and the development process was efficient and intense.
This version not only enabled DeepSeek optimizations such as multi-token prediction (MTP) and multi-head latent attention (MLA) but also extended support and tuning for AMD hardware. In addition, expert parallelism was uncommon before DeepSeek, so the team drove vLLM's evolution from tensor and pipeline parallelism to expert parallelism as well. Michael also systematically integrated a series of high-performance tools open-sourced by DeepSeek, such as DeepGEMM, DeepEP, and the expert-parallel load balancer, into the vLLM ecosystem.
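To make the expert-parallelism idea concrete, here is a minimal, hedged sketch in plain Python (not vLLM's actual implementation): a gating network scores experts per token, each token is routed to its top-k experts with normalized weights, and under expert parallelism each expert is statically assigned to one rank. Function names (`route_tokens`, `shard_experts`) are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_tokens(gate_logits, top_k=2):
    """For each token, pick the top_k experts and their renormalized weights.

    gate_logits: one list of per-expert scores per token.
    Returns, per token, a list of (expert_index, weight) pairs.
    """
    routes = []
    for logits in gate_logits:
        probs = softmax(logits)
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:top_k]
        total = sum(probs[i] for i in chosen)
        routes.append([(i, probs[i] / total) for i in chosen])
    return routes

def shard_experts(num_experts, num_ranks):
    """Expert parallelism: statically map each expert to one device rank.

    Real systems (e.g. DeepSeek's load balancer) remap experts dynamically
    to even out per-rank load; a round-robin map is the simplest baseline.
    """
    return {e: e % num_ranks for e in range(num_experts)}
```

In a real MoE layer, each rank then runs only the experts it owns, and an all-to-all exchange moves token activations to the ranks holding their chosen experts.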
For inference scenarios, the team has continuously expanded its high-performance kernel library, covering customized Triton, CUTLASS, CUDA, and HIP kernels, along with broad quantization support and many bespoke kernel implementations.
DeepSeek's complexity actually brought the team opportunities for optimization and generalization. Michael pointed out that the team turned technologies originally used mainly inside DeepSeek's own environment into sustainable, generalized implementations that can serve more MoE-based models. He emphasized that some of vLLM's evolution was driven by DeepSeek not because the DeepSeek model itself runs faster, but because the advanced technologies it open-sourced have lifted the entire ecosystem.
In this process, DeepSeek revealed a feasible path for efficiently deploying large models, and the vLLM team replicated and generalized these experiences into a more powerful inference framework. "We collaborated with DeepSeek to combine excellent algorithms with the underlying framework implementation, building a more efficient inference framework. This is truly a combination of strengths," Michael summarized.
In addition to leading the integration of DeepSeek V3, Michael also led the team to complete the adaptation and optimization of multiple models such as GPT-OSS, Qwen, and Kimi.
How a Framework Supports Various Hardware
Another core mission of the vLLM team is to build an open and efficient hardware inference ecosystem. They not only widely support various mainstream chips but also deeply participate in the architecture design and performance optimization of new hardware, promoting the entire community to evolve towards multi - hardware compatibility.
In the past few months, Michael has worked with NVIDIA on support for the Blackwell chip and on B200-related performance. Team members have also maintained close collaboration with AMD to ensure AMD hardware performs well in vLLM. Michael has worked closely with the Google TPU team for more than a year across multiple releases. Recently, he served as the final decision-maker in designing the overall support architecture for Moore Threads.
Taking the cooperation with Moore Threads as an example, we can see the depth of the Red Hat team's participation: in the very early stage of the project, Michael discussed the design direction of the support framework with the Moore Threads team. He led the high-level architecture, while community contributors on the team delved into the details and even traveled to Shanghai for face-to-face technical discussions. The two sides also created a Slack channel, forming a cross-company "online joint working group" to keep the support work moving continuously and efficiently.
The entire process reflects the team's rigorous investment in ecosystem building. They first pointed hardware partners toward an implementation direction; after Moore Threads completed the corresponding work, they conducted code reviews and iterative optimization together. For example, they helped Moore Threads refactor the initial support solution around the plug-in mechanism to make it more elegant and maintainable. On GitHub, the team has carefully reviewed a large number of revision suggestions, and Michael now has a long review backlog.
This in - depth collaboration ultimately benefits both parties. As Zhang Jiajue said: "For Moore Threads, they found an elegant way to get community support for their hardware, which means less future maintenance work. For the community, we have also promoted the development of an ecosystem that supports different hardware."
The Importance of PyTorch
In the era of heterogeneous computing, the core of vLLM's strategy for broadly supporting chips from NVIDIA, AMD, Google (TPU), and many Chinese manufacturers is to deeply embrace PyTorch as the "greatest common divisor" connecting the upper-layer framework and the underlying hardware.
From the perspective of the technology stack, PyTorch sits on top of the hardware, and vLLM sits on top of PyTorch. This means that as long as hardware manufacturers support PyTorch well, most of the work of adapting to vLLM is already done. Model definitions in vLLM are written almost entirely in PyTorch, with only a few key modules, such as the attention mechanism, left open for custom replacements.
PyTorch itself provides an SDPA attention implementation, and on top of that vLLM supports more than a dozen hardware-specific attention backends, such as NVIDIA's FlashAttention and FlashInfer, AMD's ROCm and Triton attention, Google TPU's Pallas attention, and Ascend NPU attention.
It is through this unified PyTorch abstraction layer that vLLM can integrate each vendor's acceleration work. As long as hardware suppliers provide PyTorch-compatible builds, roughly 90% of the work is done automatically. The remaining ~10% mainly involves customizing and optimizing the parts that are inefficient in stock PyTorch, such as fused MoE, quantized matrix multiplication, and specific attention implementations.
Michael explained that vLLM relies so deeply on PyTorch because almost every hardware supplier has ample reason to build on it: PyTorch is used not only for training but also for inference, and it is deeply integrated with most open-source software.
He added that PyTorch's main competitor is Google's JAX, but JAX is less open (its XLA compiler backend, for example, is not open, in his view), and its real-world adoption is far below PyTorch's. Since PyTorch is regarded as the best abstraction from machine learning down to the hardware layer, vLLM builds closely on its infrastructure and extends it for efficient large-language-model inference, which also partly explains why vLLM chose to join the PyTorch Foundation.
Zhang Jiajue also pointed out that PyTorch is so widely used that virtually every hardware manufacturer actively adapts to its ecosystem; Chinese chip makers likewise integrate and adapt through the PyTorch path.
In short, vLLM does not face the complex hardware technology stack directly; it relies on PyTorch's mature, open intermediate layer to collaborate with hardware manufacturers and the community. This both reduces the complexity of multi-hardware support and lets the entire ecosystem keep evolving and optimizing on a unified base.
Is NVIDIA's So-Called Moat Still Solid?
Naturally, we need to face a deeper question: If CUDA is the "engine" for GPU acceleration and PyTorch is the "framework" to call it, how can emerging hardware manufacturers catch up to achieve the same level of efficiency and ease of use as NVIDIA CUDA?
In Michael's view, this is a hard problem. The core difficulty is that even when functional compatibility is achieved at the PyTorch layer, efficiency often cannot match NVIDIA's long-polished CUDA ecosystem. "CUDA is not a language that transfers directly to other hardware," he pointed out. This is essentially a long-accumulated gap in both hardware abstraction and software ecosystem.
However, there are still paths.
Michael pointed out that at the hardware abstraction layer, a domain-specific language like Triton is one solution: write an algorithm once in Triton and it can run on multiple hardware platforms. But this model has limits: even if the software can ultimately target every hardware backend, kernel developers still need substantial manual debugging and kernel work, plus deep per-hardware tuning, to reach high efficiency.
Zhang Jiajue analyzed that there are multiple technical paths to CUDA-equivalent capability. Moore Threads, for example, chose full CUDA API compatibility. Alternatively, one can use a domain-specific language that compiles to different backends and runs across hardware; Triton is an emerging language for writing GPU operators. In essence, though, both remain models that require heavy manual optimization and adaptation.
However, a turning point is emerging. Michael keenly observed that new attention algorithms keep appearing, and other hardware suppliers may be able to leapfrog NVIDIA on these new techniques.
"These algorithms are very new, and suppliers may be able to provide faster and more native support than CUDA does. For example, the KDA algorithm proposed by Kimi was first supported via Triton. In the field of new algorithms, other manufacturers can sometimes respond more nimbly," Michael said.
As model suppliers keep exploring architectures more efficient than the standard Transformer, hardware manufacturers gain greater flexibility and room for rapid response. As Michael put it: it's like a sports competition where everyone is back on the same starting line.
Multi-Modal Support
As the software and hardware ecosystems continue to converge, vLLM has not stopped at single-modality inference. When the wave of multi-modal AI arrived, the team upgraded vLLM from a pure-text inference engine to a unified serving platform for full-modality generation and understanding. It is fair to say that multi-modal model architectures have now reshaped the architecture of vLLM itself.
"Whether it is text-to-image generation, document understanding, or other generation tasks, they all rely on large-model inference underneath, so they can all be served through vLLM," Michael pointed out.
To this end, the team completely refactored vLLM for v1. One key innovation is multi-modal prefix caching. Traditionally, vLLM reused the key-value cache of text tokens via PagedAttention; this mechanism has now been extended to inputs of any modality, such as images and audio. The team maintains a multi-modal cache that greatly improves the efficiency of handling repeated requests.
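The core idea of multi-modal prefix caching can be sketched in a few lines: key each block of the prompt by a chained content hash, so that identical prefixes, whether text chunks or raw image/audio bytes, reuse previously computed KV blocks. This is a hedged, simplified illustration (vLLM's real cache works on fixed-size token blocks with eviction); the class and method names are hypothetical.

```python
import hashlib

class MultiModalPrefixCache:
    """Toy prefix cache generalized beyond text: any prompt item
    (text chunk or raw modality bytes) is hashed into the prefix key."""

    def __init__(self):
        self._kv = {}    # chained content hash -> cached KV block
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prev_key, item):
        # Chain hashes so a block's key depends on the entire prefix
        # before it, not just the block's own content.
        payload = item.encode() if isinstance(item, str) else bytes(item)
        return hashlib.sha256(prev_key + payload).digest()

    def lookup_or_compute(self, prompt_items, compute_block):
        """Walk the prompt; reuse cached KV blocks, compute only the misses."""
        key = b""
        blocks = []
        for item in prompt_items:
            key = self._key(key, item)
            if key in self._kv:
                self.hits += 1
            else:
                self.misses += 1
                self._kv[key] = compute_block(item)
            blocks.append(self._kv[key])
        return blocks
```

A second request sharing the same leading text and image reuses those blocks and only computes KV for its new suffix, which is exactly why repeated multi-modal requests get much cheaper.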
To further support large-scale inference deployment, the team implemented encoder decoupling, separating the visual and audio encoders from the language-model backbone. This matches the structure of multi-modal models and provides great flexibility and resource utilization for ultra-large-scale inference scenarios.
In December this year, this evolution reached a milestone: vLLM-Omni, the project's first "full-modality" inference framework, was officially released. It turned the unified generation of text, images, audio, and video from a concept into deployable, production-grade code. Omni is not a thin wrapper over the original framework; it introduces a fully decoupled pipeline architecture in which different stages allocate resources on demand and are connected through unified scheduling. An omni-modality inference request generally passes through three kinds of components, the modality encoder, the LLM core, and the modality generator, which cooperate across different GPUs/nodes via pipeline scheduling.
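The three-component flow described above can be sketched as a chain of decoupled stages. This is a hedged, schematic illustration only: the stage functions and payload fields are invented for the example, and in a real deployment each stage would run on its own GPUs/nodes behind a scheduler rather than as in-process function calls.

```python
def modality_encoder(request):
    """Stage 1: turn raw inputs (text/image/audio) into embeddings."""
    return {"embeddings": [f"emb({x})" for x in request["inputs"]],
            "target": request["target_modality"]}

def llm_core(encoded):
    """Stage 2: the autoregressive core produces latent output tokens."""
    return {"latents": [f"tok<{e}>" for e in encoded["embeddings"]],
            "target": encoded["target"]}

def modality_generator(decoded):
    """Stage 3: render latents into the requested output modality."""
    return {decoded["target"]: " ".join(decoded["latents"])}

def run_pipeline(request,
                 stages=(modality_encoder, llm_core, modality_generator)):
    """Pipeline scheduling, reduced to its essence: each stage consumes the
    previous stage's output; decoupling lets each scale independently."""
    out = request
    for stage in stages:
        out = stage(out)
    return out
```

The point of the decoupling is visible even in this toy: because stages only exchange plain payloads, a deployment can give the encoder, core, and generator different amounts of hardware and batch them independently.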
This evolution has greatly expanded vLLM's application boundary. As an inference engine and server, vLLM now supports a wide range of workloads: not only text-generation models but also multi-modal understanding and generation, embedding models (for RAG and vector databases), and agentic programming (driving tools such as Claude Code). At the enterprise level, it serves tasks such as document understanding, OCR, recommendation systems, customer service, programming assistance, and defect detection. In reinforcement-learning training, too, the finally deployed inference model, reasoning model, or tool-calling model can be built on or embedded in vLLM.
"The core role of vLLM is an efficient inference engine and server," Michael summarized. "It is similar to a web server hosting all kinds of web applications (HTML or JavaScript pages). vLLM needs to carry a wide variety of models and applications and strive for excellent performance in every usage scenario, whether serving a thousand users or a hundred thousand."
From unifying the hardware abstraction layer to defining a full-modality inference architecture, vLLM is steadily advancing its vision: to become the most general and efficient inference infrastructure of the AI era.
How to Maintain Competitive Edge
As vLLM has matured over the past two and a half years, a trend has become increasingly obvious: more and more companies are contributing their modifications back upstream.
"This is because vLLM itself has improved enormously, and those improvements also benefit their privately developed versions, so they want to sync with upstream more frequently. They have become willing to upstream their customized changes and prefer to use upstream vLLM directly rather than maintain a heavily divergent private fork. We have seen this happen in multiple cases," Michael explained.
The core driving force for this virtuous cycle lies in "speed".
"Our upstream version has a unique advantage: we work with many leading model labs and companies, quickly collect their feedback, fix bugs, and put the fixes back into the community for everyone to see," Zhang Jiajue added. vLLM's collaboration list includes DeepSeek, Qwen, ByteDance, Tencent, LinkedIn, Amazon, Mistral, Azure, and Snowflake.
"We work to understand how they may use vLLM and what demands future model architectures will place on it. By building these capabilities, we ensure vLLM always remains competitive and keeps pace with industry development," Zhang Jiajue said.