StartseiteArtikel

Nanyang Technische Universität enthüllt den totalen Zusammenbruch der "Betriebssicherheit" von KI. Ein einfacher Trick reicht aus, um alle Modelle zu täuschen.

机器之心2025-10-17 15:12
Die Sicherheit von KI betrifft nicht nur den Inhalt, sondern auch den Betrieb und die Einhaltung der Aufgaben.

What exactly are we talking about when we discuss the issue of AI security?

Is it violence, bias, or ethical issues? While these are undoubtedly important, for enterprises that deploy AI in real - world business, there is a more critical but long - neglected safety red line being frequently crossed: Your meticulously crafted "legal consultation" chatbot is enthusiastically providing medical advice to users.

Is this just the model going off - topic? No, it is a form of insecurity.

In this article, researchers from institutions such as Nanyang Technological University first proposed a groundbreaking concept --- Operational Safety, aiming to completely reshape our understanding of the safety boundaries of AI in specific scenarios.

  • Paper title: OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
  • Paper link: https://arxiv.org/pdf/2509.26495
  • Paper code: https://github.com/declare-lab/OffTopicEval
  • Evaluation dataset: https://huggingface.co/datasets/declare-lab/OffTopicEval

The core idea of this article is eye - opening: When AI goes beyond its pre - set duty boundaries, its behavior itself is a form of insecurity.

The fundamental contribution of this paper is to elevate the discussion of AI security from the traditional "content filtering" to a new dimension of "duty loyalty". An AI that fails to adhere to its job responsibilities poses a huge and uncontrollable risk in applications, no matter how "clean" its output is. Operational safety should exist as a necessary but not sufficient condition for general safety.

OffTopicEval: The First Measure of "Operational Safety"

To put this new concept into practice and quantify risks, the team developed the first evaluation benchmark for operational safety --- OffTopicEval. It doesn't care about how much the model knows or how powerful its capabilities are, but rather whether the model can know when to say no.

They built 21 chatbots in different scenarios, strictly defined their duties and boundaries, and then carefully constructed direct out of domain (OOD) question test (obvious out - of - domain questions), adaptive OOD questions (questions disguised as in - domain but actually out - of - domain, which humans can easily identify), and in - domain questions designed to measure whether the model can appropriately refuse rather than always refuse. In total, it includes over 210,000 + OOD data and 3,000 + in - domain data, covering three language families with completely different grammatical structures: English, Chinese, and Hindi.

Revealing the Harsh Reality through Evaluation

Through testing six major mainstream model families such as GPT, LLama, and Qwen, the evaluation results revealed an alarming problem: Almost all models failed in this required course of "operational safety". For example:

  • Vulnerable under disguise: Facing simply disguised out - of - boundary questions, the defense ability of the models almost collapsed. The average refusal rate of all models for OOD questions dropped by nearly 44%. For models like Gemma - 3 (27B) and Qwen - 3 (235B), the decline in the refusal rate even exceeded 70%.
  • Cross - language flaws: This problem persists across different languages, indicating a fundamental flaw in current large models.

They also found that after being deceived once, the model seemed to give up all resistance, and the refusal rate for simple OOD questions dropped by more than 50%!

Simply put, a bank customer - service robot you trained carefully will start providing investment advice and enjoy it as long as the user changes the way of asking questions. This would be an unimaginable potential threat in industries with strict requirements.

Restoring AI's Professional Ethics

This paper not only reveals such a problem but also provides practical solutions and their failed attempts. They tried prompt - based steering, activation steering, and parameter steering. However, both activation steering and parameter steering methods were difficult to improve the model's ability to adhere to boundaries.

In prompt - based steering, they proposed two lightweight, retraining - free prompting methods:

  1. P - ground: After the user asks a question, add an instruction to the model, forcing it to first forget the question and focus on the system prompt before answering.
  2. Q - ground: Let the model rewrite the user's question into the most core and concise form, and then respond based on this question.

In the experiment, they wrote very simple prompts based on these two ideas, and the results were immediate. The P - ground method increased the operational safety score of Llama - 3.3 (70B) by 41% and that of Qwen - 3 (30B) by 27%. This proves that lightweight methods can significantly enhance the "professional ethics" of the model.

Summary

This paper is the first to elevate the off - topic problem from a simple functional defect recognized by the public to a strategic level of security. It sent a clear signal to the entire industry:

  1. AI security is not just content security: An AI that fails to adhere to boundaries is unreliable and insecure in business.
  2. "Going beyond boundaries" itself is a risk: We must establish new evaluation and alignment paradigms to reward models that understand their limitations and dare to refuse out - of - boundary requests.
  3. Operational safety is a prerequisite for deployment: For all developers who hope to use AI agents in serious scenarios, operational safety will become a required test before deployment.

From this perspective, this paper is not just about proposing an evaluation tool; it is more like a manifesto, calling on the entire community to re - examine and re - define AI security for practical applications, ensuring that the AI we build is not only powerful but also trustworthy and dutiful.

The first author of this paper, Lei Jingdi, is a Ph.D. student at Nanyang Technological University. His research focuses on large language models, especially in areas such as model reasoning, post - training, and alignment. The corresponding author, Soujanya Poria, is an associate professor at the School of Electrical and Electronic Engineering at Nanyang Technological University. Other co - authors of the paper are from Walled AI Labs, the Infocomm Media Development Authority (IMDA) of Singapore, and Lambda Labs.

This article is from the WeChat official account "Machine Intelligence", and is published by 36Kr with permission.