OpenClaw has become extremely popular, exposing 12 types of fatal hidden dangers. The safety benchmark for the MCP protocol has been released.
The MCP protocol is driving AI Agents to execute tasks autonomously, but security risks are soaring. Research has found that attackers can use 12 types of techniques, such as tool name confusion and false errors, to deceive Agents into performing malicious operations, and even top - tier models are not immune.
The research team from Beijing University of Posts and Telecommunications released the MSB security benchmark. Through real - environment testing, it reveals that the more powerful the model, the more vulnerable it is to attacks. The new indicator NRP balances security and practicality for the first time, providing a key yardstick for strengthening the defense of AI Agents.
Recently, open - source AI Agent projects like OpenClaw have become extremely popular in the developer community. With just one sentence, an Agent can automatically write code, search for information, operate local files, and even take over a computer for you.
Behind the amazing autonomy of these Agents lies the capabilities provided by tool invocation. MCP (Model Context Protocol) is the interface that unifies the AI tool ecosystem. Just as USB - C allows a computer to connect to various devices, MCP enables large models to call external tools such as file systems, browsers, and databases in a standardized way.
Facing such a large ecosystem, even OpenClaw, which features a native command - line interface, has accessed MCP through an adapter to gain broader tool capabilities.
However, as AI's "reach" extends further, danger also looms. What if the tools called by the Agent are poisoned by hackers? What if there are malicious instructions hidden in the error messages returned by the tools?
When a large model executes these instructions without any defense, your private data, local files, and even server permissions will fall into the hands of hackers.
To fill the gap in the security evaluation of the MCP ecosystem, a research team from institutions including Beijing University of Posts and Telecommunications has launched a security benchmark specifically for the MCP protocol: MSB (MCP Security Bench). The research found that attacks at each stage of MCP are effective. The more powerful the model, the more vulnerable it is to attacks. This paper has been accepted by ICLR 2026.
Paper link: https://openreview.net/pdf?id=irxxkFMrry
Code: https://github.com/dongsenzhang/MSB
Security Risks of MCP Behind Agents
Figure 1: MCP Attack Framework
MCP greatly expands the capabilities of Agents, but at the same time, it also greatly expands the attack surface. In the MCP system, the tool - calling process of an Agent usually consists of three stages:
1. Task Planning: The Agent selects appropriate tools based on the user's query, using tool names and descriptions.
2. Tool Invocation: The Agent sends a request to the selected tool and passes in the corresponding parameters to perform specific operations.
3. Response Handling: The Agent parses the tool's response and continues to reason or generate the final answer based on it.
Each stage can become a new attack entry. MSB covers the entire MCP tool - calling stage and is specifically used to evaluate the security of Agents based on MCP tool usage, with three core highlights:
MCP Attack Classification System
In the MCP workflow, an Agent interacts with tools through tool identifiers (names and descriptions), parameters, and tool responses, all of which can be attack vectors. MSB classifies attack types based on these attack vectors and interaction stages:
Tool Signature Attack: In the task - planning stage, attacks are carried out using tool names and descriptions, including:
Name Collision (NC): Forging malicious tools with names similar to official tools to induce the Agent to select them.
Preference Manipulation (PM): Injecting promotional statements into tool descriptions to induce the Agent to select them.
Prompt Injection (PI): Injecting malicious instructions into tool descriptions.
Tool Parameter Attack: In the tool - invocation stage, attacks are carried out using tool parameters, including:
Out - of - Scope Parameter (OP): Setting tool parameters beyond normal functionality, causing information leakage through parameter passing.
Tool Response Attack: In the response - handling stage, attacks are carried out using tool responses, including:
User Impersonation (UI): Impersonating the user to issue malicious instructions.
False Error (FE): Providing false tool execution error information and requiring the Agent to follow malicious instructions to successfully call the tool.
Tool Transfer (TT): Instructing the Agent to call malicious tools.
Retrieval Injection Attack: In the response - handling stage, attacks are carried out using external resources, including:
Retrieval Injection (RI): External resources embedded with malicious instructions disrupt the context through tool responses.
Mixed Attack: At multiple stages, multiple tool components are used simultaneously for attacks, including combinations of the above attacks.
Execution Suite Based on Real Environment
MSB rejects paper - based simulation evaluations. It is equipped with a real MCP server, covering 10 real - world scenarios, 405 real tools, and 2,000 attack instances. All instances run real - world tool executions through MCP, truly reflecting the actual operating environment to directly observe the degree of damage to the environmental state caused by attacks.
Indicator NRP Balancing Performance and Security
In Agent security evaluations, simply looking at the Attack Success Rate (ASR) is highly deceptive. If an Agent refuses to execute any tool calls to avoid risks, its ASR may be close to 0, but at the same time, it cannot complete user tasks and loses practical application value.
Therefore, MSB proposes the Net Resilient Performance (NRP) indicator:
NRP = PUA ⋅ (1 - ASR)
Among them, PUA (Performance Under Attack) is the proportion of user tasks completed by the Agent in an adversarial environment, and ASR is the attack success rate. NRP aims to evaluate the overall risk - resistance ability of an Agent to maintain performance while resisting attacks, providing a comprehensive quantitative standard for balancing performance and security.
Figure 2: NRP vs ASR, NRP vs PUA.
All Attack Methods Are Effective
Figure 3: Main experimental results.
The research team used MSB to conduct large - scale tests on 10 mainstream models such as GPT - 5, DeepSeek - V3.1, Claude 4 Sonnet, and Qwen3. All attack methods showed effectiveness, with an overall average ASR of 40.35%. Among them, the new - type attacks introduced by MCP are more aggressive. Compared with the existing PI and RI attacks in function calling, attacks based on MCP, such as UI and FE, have a higher success rate. Mixed attacks show synergistic enhancement, and the success rate of mixed attacks is higher than that of the single attacks that compose them.
The More Powerful the Model, the More Vulnerable
The relationship between different indicators reveals a counter - intuitive conclusion: The more powerful the model, the more vulnerable it is to attacks.
Figure 4: PUA vs ASR.
In MSB, completing an attack task still requires the Agent to call tools, such as using a file - reading tool to obtain personal information. LLMs with higher practicality show a higher ASR due to their better tool - calling and instruction - following abilities. This discovery reveals the huge practical risks of MCP security vulnerabilities.
Full - stage, Multi - tool Environment Infringement
Figure 5: ASR for different stages and tool configurations.
Further analysis from the perspective of the MCP workflow and tool configuration shows that Agents are vulnerable to attacks at all stages of MCP, and the security of the model is the lowest in the tool - invocation stage.
In addition, even in a multi - tool environment containing harmless tools, attacks are still effective. In real - world scenarios, Agents are usually provided with a tool package. Even if there are harmless tools, inducement methods such as NC, PM, and TT still lead to significant attack success.
Summary
The popularity of OpenClaw allows people to intuitively see the future of Agents: large models are no longer just answering questions but starting to actually do things. MSB is proposed in this context. It systematically reveals the potential attack surfaces in the MCP ecosystem and provides a reproducible and quantifiable systematic evaluation benchmark for Agent security research.
Past security research on large models mainly focused on language - level risks such as prompt injection, while MSB shows that when AI calls tools and interacts with real systems, the attack surface is expanding from the text space to the tool ecosystem. As Agents gradually become a new paradigm for AI applications, security may be a threshold that this technological leap must cross.
Reference materials:
https://openreview.net/pdf?id=irxxkFMrry
This article is from the WeChat official account "New Intelligence Yuan", author: New Intelligence Yuan, published by 36Kr with authorization.