
A chain of high-risk vulnerabilities in NVIDIA's inference server has been exposed, leaving cloud AI models directly open to attack.

QbitAI 2025-08-06 15:39
It's more reassuring when the good guys find the vulnerabilities first.

As one problem is solved, another arises.

Security research firm Wiz Research has disclosed a chain of high-risk vulnerabilities in the NVIDIA Triton Inference Server.

This set of vulnerabilities can be exploited in combination to achieve remote code execution (RCE). Attackers can read or modify data in shared memory, manipulate model outputs, and control the behavior of the entire inference backend.

Possible consequences include model theft, data leakage, response manipulation, and even full system compromise.

NVIDIA has released a patch, but all versions prior to 25.07 remain vulnerable; users should update the Triton Inference Server to the latest release.

One vulnerability can have far-reaching consequences

How serious is the impact of this vulnerability chain?

According to Wiz, this vulnerability chain may allow unauthenticated remote attackers to control the NVIDIA Triton Inference Server, which may lead to the following series of severe consequences:

First, there is Model Theft. Attackers can precisely locate the shared memory area and steal proprietary and expensive AI models.

Second, there is Data Breach. Once attackers control the memory during model runtime, they can read the model's input and output in real time and intercept sensitive data involved in model processing (such as user information or financial data).

Next, there is Response Manipulation. Attackers can not only read but also write. They can manipulate the output of AI models to produce incorrect, biased, or malicious responses.

Finally, there is full system compromise via pivoting. Attackers use the compromised server as a springboard to attack other systems on the organization's network.

It can be said that one Triton vulnerability is enough to destroy the four pillars of an AI platform: models, data, outputs, and systems.

What kind of vulnerabilities are so dangerous?

This vulnerability chain consists of three vulnerabilities:

CVE-2025-23320: When an attacker sends an oversized request that exceeds the shared memory limit, an exception is triggered, and the returned error message exposes the unique identifier (key) of the backend's internal IPC (inter-process communication) shared memory region.

CVE-2025-23319: Using the above identifier, attackers can perform an out-of-bounds write.

CVE-2025-23334: Using the same identifier, attackers can perform an out-of-bounds read.
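The first link in the chain, the error-message leak, follows a classic pattern: an overly chatty exception hands the attacker an internal secret. The sketch below is a hypothetical, simplified model of that failure mode, not Triton's actual code; the size limit, key name, and message text are all invented for illustration.

```python
import re

SHM_LIMIT = 64 * 1024 * 1024  # assumed per-request shared-memory limit
_INTERNAL_SHM_KEY = "/triton_python_backend_shm_region_3"  # "secret" IPC key (invented)

def handle_request(payload: bytes) -> str:
    """Rejects oversized requests -- but the error text is too chatty."""
    if len(payload) > SHM_LIMIT:
        # The internal region key leaks into a client-visible error message.
        raise ValueError(
            f"failed to grow shared memory region {_INTERNAL_SHM_KEY}: "
            f"request too large"
        )
    return "ok"

# Attacker: deliberately trigger the error and harvest the key from it.
try:
    handle_request(b"\x00" * (SHM_LIMIT + 1))
except ValueError as err:
    leaked_key = re.search(r"(/[\w/]+)", str(err)).group(1)

print(leaked_key)  # the "secret" identifier, recovered from the error text
```

Once a client can recover this key on demand, the two memory-corruption vulnerabilities below become reachable.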

These three vulnerabilities are interlinked, forming a complete attack chain:

First, attackers obtain the unique identifier of the Triton Python backend's internal shared memory through the error-message leak of CVE-2025-23320.

With this identifier in hand, attackers can use CVE-2025-23319 and CVE-2025-23334 to perform out-of-bounds write and read operations on the shared memory region.

Specifically, attackers abuse the shared memory API to read and write the backend's internal memory data structures without restrictions.

Finally, after obtaining read and write permissions to the backend shared memory, attackers can interfere with the normal behavior of the server and then achieve full control of the server.

Possible attack methods include but are not limited to:

Corrupt the data structures in the backend shared memory, especially those containing pointers (such as MemoryShm and SendMessageBase), to achieve out-of-bounds reads and writes.

Forge and manipulate messages in the IPC message queue, causing memory corruption or exploiting logic flaws.
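The first of these methods can be illustrated with Python's multiprocessing.shared_memory as a stand-in. The one-field "header" below is an invented layout, not the real MemoryShm or SendMessageBase structure; the point is only that a process which knows a region's name can rewrite a pointer-like field and redirect a trusting reader at attacker-staged bytes.

```python
import struct
from multiprocessing import shared_memory

# Assumed toy layout: the header is a single 8-byte offset ("pointer")
# into the region, followed by data. Real Triton structures differ.
HEADER = struct.Struct("<Q")

region = shared_memory.SharedMemory(create=True, size=4096)
try:
    region.buf[8:13] = b"hello"         # legitimate payload at offset 8
    region.buf[100:105] = b"EVIL!"      # attacker-staged bytes
    HEADER.pack_into(region.buf, 0, 8)  # header points at the real payload

    # Attacker attaches by the (leaked) name and rewrites the pointer field.
    attacker = shared_memory.SharedMemory(name=region.name)
    HEADER.pack_into(attacker.buf, 0, 100)
    attacker.close()

    # The backend trusts the header and follows the corrupted pointer.
    (offset,) = HEADER.unpack_from(region.buf, 0)
    dereferenced = bytes(region.buf[offset:offset + 5])
finally:
    region.close()
    region.unlink()

print(dereferenced)  # the reader is misdirected to the attacker's bytes
```

In a real exploit the redirected pointer would aim at structures far more valuable than a five-byte string, but the mechanism, corrupting trusted metadata inside shared memory, is the same.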

The "perfect" attack path from the initial information leakage to a full - scale system intrusion is largely related to Triton's architecture.

Universality is a double-edged sword

Although the vulnerabilities this time are concentrated in Triton's Python backend, that backend is not used exclusively by Python-framework models.

NVIDIA's Triton is a general-purpose inference platform designed to help developers simplify the deployment and operation of AI models across frameworks (such as PyTorch, TensorFlow, and ONNX).

To achieve this, Triton adopts a modular backend architecture, where each backend is responsible for executing models of the corresponding framework.

When an inference request arrives, Triton automatically identifies the framework to which the model belongs and sends the request to the corresponding backend for execution.

However, at different stages of inference, a model that mainly runs on one backend (such as the PyTorch backend) may internally call the Python backend to complete certain tasks.

In other words, even if the main model runs on TensorFlow or PyTorch, as long as there are customized steps in the process, the Python backend may be called for execution.

The Python backend is therefore used well beyond Python-framework models, across much of Triton's inference pipeline, which makes it a potential security weak point with an outsized reach.

In addition, the core logic of Triton's Python backend is implemented in C++.

When an inference request arrives, this C++ component communicates with a separate "stub" process, which is responsible for loading and executing the specific model code.

To enable smooth communication between the C++ logic and the stub process, the Python backend uses a complex inter-process communication (IPC) mechanism for transmitting inference data and coordinating internal operations.

This IPC is based on named shared memory (usually regions under the /dev/shm path), and each shared memory region has a unique system path identifier: the key mentioned above.

This design enables high-speed data exchange but introduces a key security risk: the secrecy of the shared memory name matters enormously. Once the name leaks, attackers can exploit it.
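The risk can be demonstrated in a few lines: named shared memory carries no access secret beyond the name itself, so any process that learns the name gets the same read/write view the backend has. This uses Python's multiprocessing.shared_memory purely for illustration; the "APPROVE"/"DECLINE" payloads are invented.

```python
from multiprocessing import shared_memory

# "Backend" creates a named region and writes a model response into it.
backend = shared_memory.SharedMemory(create=True, size=64)
backend.buf[:7] = b"APPROVE"

# "Attacker", armed only with the leaked name, attaches and tampers.
attacker = shared_memory.SharedMemory(name=backend.name)
attacker.buf[:7] = b"DECLINE"
attacker.close()

tampered = bytes(backend.buf[:7])  # the backend now serves tampered output
backend.close()
backend.unlink()
print(tampered)
```

The name functions as the entire capability: leak it, and read/write access follows for free. That is exactly why the error-message disclosure at the start of the chain is so consequential.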

In short, the very flexibility that makes the platform general-purpose also concentrates risk: one vulnerability can have far-reaching consequences.

Fortunately, destructive as the vulnerability chain is, it has so far only been demonstrated in the lab, with no known exploitation in real-world attacks.

After receiving the report from Wiz Research, NVIDIA quickly fixed these three vulnerabilities and released the updated Triton Inference Server version 25.07.

All one can say is that it's more reassuring when the good guys find the vulnerabilities first.

Reference Links:

[1]https://www.theregister.com/2025/08/05/nvidia_triton_bug_chain/

[2]https://www.wiz.io/blog/nvidia-triton-cve-2025-23319-vuln-chain-to-ai-server

[3]https://thehackernews.com/2025/08/nvidia-triton-bugs-let-unauthenticated.html

This article is from the WeChat official account “QbitAI”, author: henry. Republished by 36Kr with permission.