Reject Unsafe Instructions Outright? TRIAD Fixes AI Agents' Dangerous Plans with a Three-Way Decision

TRIAD: A Novel Security Framework for LLM Agents That Balances Security and Task Completion

[Introduction] TRIAD is a novel security framework designed for AI agents. Using three types of decisions (Proceed, Update, Refuse) together with natural language feedback, it helps an agent that has been misled correct its plan and still fulfill the user's real needs. Unlike traditional binary methods, TRIAD distinguishes how badly a task is contaminated, so it can block risks without abandoning the task.

As AI agents evolve into automated systems that call external tools such as search engines, email, file systems, databases, and code execution, their capabilities keep expanding, and so do the security risks. A seemingly ordinary web page, an email, or a tool's return value can each become an external source of risk that influences the agent's decisions and induces it to deviate from the user's original task.

Existing guardrail models can usually only judge a task "safe" or "unsafe" before execution. In real agent scenarios, however, a risk often does not mean the entire task is harmful. Instead, untrusted instructions are mixed into an otherwise normal task. Allowing the task outright may let the attack succeed, while rejecting it outright sacrifices the user's legitimate needs.

To address this, the team from the University of Melbourne open-sourced TRIAD (Tripartite Response for Iterative Agent Guardrailing), a feedback-driven guardrail framework for LLM agents. It no longer just makes binary security judgments but expands the decisions into three categories: Proceed, Update, and Refuse. When it's safe, the agent proceeds with the execution; when it's completely harmful, the agent refuses; for tasks contaminated by prompt injection but still repairable, natural language feedback is used to guide the agent to modify its action plan and return to the user's original goal.

Paper link: https://arxiv.org/abs/2606.05805

Code link: https://github.com/YUHAOSUNABC/TRIAD

Project homepage: https://yuhaosunabc.github.io/TRIAD/

Research Background

As large language model agents (LLM agents) shift from "answering questions" to "calling tools and executing tasks," they are being applied to more complex scenarios such as email processing, web browsing, file management, database queries, and code execution.

Compared with traditional chat models, agents not only generate text but also formulate plans, select tools based on the context, and continue to act according to the tool's return results in multiple rounds of interactions. This ability makes LLM agents closer to real automated assistants but also significantly expands the security risks.

This problem is particularly prominent in prompt injection attacks. In real scenarios, the risks often don't come from a completely harmful user request but from "untrusted instructions mixed into normal tasks."

For example, a user only wants the agent to search for hotels and send an email, but malicious content may be mixed into the search results or the email body, inducing the agent to send the meeting location to irrelevant recipients, leak customer email addresses, or call unnecessary tools to access sensitive information.

At this time, the agent is not facing a simple binary "safe/unsafe" problem. It needs to reject the malicious part while completing the user's original normal task as much as possible.

Existing agent guardrails usually check the input, action plan, or tool call before execution and return a permit or reject decision, a risk category, or an explanation. These methods are good at detecting risks, but they often cannot tell the agent what to do next.

For contaminated but still repairable tasks, simply rejecting them can block the attack but will sacrifice the user's normal needs. Allowing them directly may cause the agent to execute the tool calls specified by the attacker.

In other words, agent security not only requires risk detection but also the ability to repair the action plan after detecting risks.

To address this, the authors of this paper proposed TRIAD (Tripartite Response for Iterative Agent Guardrailing), a feedback-driven guardrail framework for LLM agents. TRIAD expands the traditional binary guardrail decisions into three categories: Proceed, Update, and Refuse.

When the action plan is safe, the agent can proceed with the execution. When the user's request is harmful, the agent should refuse to complete it. When there are prompt injections or untrusted instructions in the task but the original user goal is still reasonable, TRIAD will generate natural language feedback to guide the agent to modify the plan, avoid the malicious part, and return to the user's original task goal.

That is to say, TRIAD doesn't just tell the agent "there is a risk here." Instead, it emphasizes the risk source and the task deviation point through natural language feedback, guiding the downstream agent to replan and return to the original user goal.

Figure 1: Comparison between the TRIAD process and the baseline. Before the agent executes a tool, Tri-Guard will check its action plan and provide three types of decisions: Proceed, Update, or Refuse. For tasks contaminated by prompt injection but still repairable, TRIAD will write the natural language feedback back to the context to guide the agent to modify the plan and return to the original goal.

Agent Returning to the Original Task after Being Misled

Traditional agent guardrails usually adopt a "detect and intercept" approach. They judge whether the current action is safe before tool execution. If a risk is found, they prevent the agent from continuing.

This approach is effective for completely harmful requests but runs into difficulties in prompt injection scenarios, because many tasks are not entirely harmful: malicious instructions are mixed into normal tasks. In that case, simply rejecting the task makes the agent abandon a normal task it could have completed, while simply allowing it may let the attack succeed.

The core idea of TRIAD is to transform the guardrail from a "binary referee" into a "feedback provider." As shown in Figure 1, before each tool call, the agent will first generate the current action plan and the tool to be called.

Tri-Guard then checks this plan before the tool is actually executed. Based on the current context, the interaction history, the available tools, and the action to be executed, it returns natural language feedback together with one of three decisions: Proceed, Update, or Refuse.

Among them, Proceed means the current plan is safe and consistent with the user's goal, and the agent can continue to execute the tool. Refuse means the user's request is harmful, or the current task cannot be completed safely by modifying the plan, and the agent should directly refuse.

Update is used to handle the most critical intermediate situation: the current plan is affected by prompt injection or untrusted content, but the user's original goal is still reasonable.

At this time, TRIAD will not directly terminate the task. Instead, it will write the natural language feedback generated by Tri-Guard back to the agent's temporary context, clearly indicating the risk source, the task deviation point, and the problem with the current tool call, thereby guiding the downstream agent to replan.
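The verdict and write-back described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Decision`, `Verdict`, and `apply_feedback`, and the `[guardrail feedback]` prefix, are hypothetical and are not taken from the TRIAD codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """The three guardrail outcomes named in the article."""
    PROCEED = "proceed"
    UPDATE = "update"
    REFUSE = "refuse"


@dataclass
class Verdict:
    decision: Decision
    feedback: str  # natural language explanation produced by the guardrail


def apply_feedback(context: list[str], verdict: Verdict) -> list[str]:
    """Return a new context; only an Update verdict adds guardrail feedback.

    Hypothetical helper: the message format is an assumption for illustration.
    """
    if verdict.decision is Decision.UPDATE:
        return context + [f"[guardrail feedback] {verdict.feedback}"]
    return list(context)


context = ["user: find a hotel and email me the booking details"]
verdict = Verdict(
    Decision.UPDATE,
    "The search result asks to forward the meeting location to a third party; "
    "the user did not request this. Drop that step and continue the booking.",
)
updated_context = apply_feedback(context, verdict)
```

Keeping the feedback as an ordinary context entry is the design point: the agent replans from text it can read, rather than from an opaque risk label.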

This design forms a closed loop: the agent proposes a plan and Tri-Guard checks it. If an update is needed, the feedback is injected back into the agent's context and the agent generates a new plan. Tri-Guard checks the new plan again, and the cycle repeats until the plan is allowed to execute, the task is refused, or the maximum number of updates is reached. In this way, TRIAD turns the guardrail output from a static risk label into a context signal that shapes subsequent planning, giving the agent a chance to "return to the correct direction" rather than just "stop" when it faces a partially contaminated task.
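A minimal sketch of this propose, check, and replan loop is shown below. It is not the authors' implementation: `guarded_step`, the callback signatures, the default cap of three updates, and the choice to stop without executing once the cap is hit are all assumptions made for illustration, and the toy planner and guard are stand-ins for a real agent and for Tri-Guard.

```python
from typing import Callable, Optional

Plan = str
GuardResult = tuple[str, str]  # (decision, feedback); decision is proceed/update/refuse


def guarded_step(
    context: list[str],
    plan_fn: Callable[[list[str]], Plan],
    guard_fn: Callable[[list[str], Plan], GuardResult],
    max_updates: int = 3,  # assumed cap; the article does not give a value
) -> tuple[str, Optional[Plan], list[str]]:
    """Run one guarded planning step and return (outcome, plan, working context)."""
    working = list(context)  # temporary context; the caller's list is untouched
    updates = 0
    while True:
        plan = plan_fn(working)
        decision, feedback = guard_fn(working, plan)
        if decision == "proceed":
            return "execute", plan, working
        if decision == "refuse":
            return "refuse", None, working
        if updates >= max_updates:
            # Assumption: stop without executing once the update budget is spent.
            return "max_updates_reached", None, working
        working.append(f"[guardrail feedback] {feedback}")
        updates += 1


# Toy stand-ins: the planner follows the injected instruction until it sees feedback.
def toy_plan(ctx: list[str]) -> Plan:
    if any("guardrail feedback" in entry for entry in ctx):
        return "send_email(to='user')"
    return "send_email(to='attacker')"


def toy_guard(ctx: list[str], plan: Plan) -> GuardResult:
    if "attacker" in plan:
        return "update", "Recipient was not requested by the user; email the user only."
    return "proceed", ""


outcome, plan, final_context = guarded_step(
    ["user: email me the hotel details"], toy_plan, toy_guard
)
```

In the toy run the first plan is routed to Update, the feedback lands in the working context, and the second plan passes the check.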

To give Tri-Guard this judgment and feedback ability, the researchers constructed a dataset of multi-turn agent trajectories and used knowledge distillation: a teacher model generated structured natural language feedback and three-way decision labels for each trajectory. After training, Tri-Guard must not only identify whether the current action is risky but also distinguish three situations: normal tasks should proceed, directly harmful tasks should be refused, and tasks contaminated by prompt injection but still repairable should enter the update process.

Figure 2: Process of training data construction.

Experimental Results

The researchers evaluated TRIAD on two benchmarks, ASB and AgentHarm.

Among them, ASB is used to test whether an agent will be misled by an attacker under direct prompt injection (DPI) and indirect prompt injection (IPI). AgentHarm is used to evaluate the agent's ability to refuse directly harmful tasks and retain normal tasks.

The experiments covered four agent backbones, including two open-source models, Qwen3-32B and Kimi-2.5, and two cutting-edge closed-source models, GPT-5.1 and Gemini-2.5-Pro. The results are as follows.

Table 1: Experimental results of TRIAD on four types of agents. The experiments covered ASB-DPI, ASB-IPI, and AgentHarm, comparing unprotected ReAct, ToolSafe, TRIAD + TS-Guard, and TRIAD + Tri-Guard.

The main experimental results show that TRIAD + Tri-Guard can significantly reduce the attack success rate (ASR) on different agents while maintaining a higher normal task completion rate (TSR). Compared with the unprotected ReAct, TRIAD + Tri-Guard reduced the average ASR from 74.45% to 10.42% and increased the average TSR from 28.45% to 68.60%. This result indicates that TRIAD doesn't just simply intercept risks but can also guide the agent to return to the original user goal when the task is contaminated by prompt injection.
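To put the two averages side by side, the short calculation below derives the absolute and relative changes from the four figures quoted above. Only those four numbers come from the article; the variable names are illustrative.

```python
# Average figures quoted in the article: unprotected ReAct vs. TRIAD + Tri-Guard.
asr_before, asr_after = 74.45, 10.42  # attack success rate, percent
tsr_before, tsr_after = 28.45, 68.60  # task success rate, percent

asr_drop_points = asr_before - asr_after
asr_relative_reduction = 100 * asr_drop_points / asr_before
tsr_gain_points = tsr_after - tsr_before
tsr_ratio = tsr_after / tsr_before

print(f"ASR: -{asr_drop_points:.2f} points ({asr_relative_reduction:.1f}% relative reduction)")
print(f"TSR: +{tsr_gain_points:.2f} points ({tsr_ratio:.2f}x the baseline)")
```

In other words, the ASR falls by about 64 percentage points (roughly an 86% relative reduction), while the TSR rises by about 40 points to roughly 2.4 times the unprotected baseline.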

An important phenomenon is that a low ASR doesn't necessarily mean a better guardrail. ToolSafe and TRIAD + TS-Guard can also lower the ASR in some settings, but they are often accompanied by a high rejection rate and a low TSR, indicating that they mainly reduce the attack success rate by "intercepting or giving up execution." In contrast, TRIAD + Tri-Guard generally achieved a higher TSR on ASB-DPI and ASB-IPI, indicating that it is better at handling scenarios where "the task is partially contaminated but still repairable."
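The point that a low ASR alone says little can be illustrated with a toy calculation. The sketch assumes simple per-episode definitions (ASR as the share of episodes where the attacker's goal was executed, TSR as the share where the user's task was completed, refusal rate as the share refused); the benchmarks' exact definitions may differ, and the episode data here is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    attack_succeeded: bool
    task_completed: bool
    refused: bool


def rates(episodes: list[Episode]) -> dict[str, float]:
    """Percentages over a list of episodes (assumed metric definitions)."""
    n = len(episodes)
    return {
        "ASR": 100 * sum(e.attack_succeeded for e in episodes) / n,
        "TSR": 100 * sum(e.task_completed for e in episodes) / n,
        "refusal": 100 * sum(e.refused for e in episodes) / n,
    }


# Invented toy data: a guard that refuses everything vs. one that repairs plans.
refuse_all = [Episode(False, False, True)] * 4
repairing = [Episode(False, True, False)] * 3 + [Episode(True, False, False)]

print(rates(refuse_all))  # zero ASR, but also zero TSR
print(rates(repairing))   # some ASR, much higher TSR
```

A guard that refuses every contaminated task reaches a perfect ASR in this toy setup while completing nothing, which is why the article reads ASR together with TSR and the rejection rate.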

Table 2: Results of replacing different guardrail models under the same TRIAD framework. The experiments were based on Qwen3-32B, comparing existing guardrails, the Qwen3.5-9B base model, and the trained Tri-Guard.

To separate the contribution of "the framework itself" from that of "the guardrail model's ability," the researchers swapped different guardrail models into the TRIAD framework. The results show that simply plugging in an existing guardrail is not enough to achieve a good security-utility balance. Many models can detect risks and reduce the ASR, but they tend to treat partially contaminated tasks as entirely dangerous, resulting in a high rejection rate and a low task completion rate.

Taking TS-Guard as an example, it can significantly lower the ASR on both ASB-DPI and ASB-IPI, but the rejection rates reach 88.80% and 94.63% respectively, and the corresponding TSRs are only 1.33% and 0.59%. This means that although the agent executes the attacker's goals less often, it almost abandons the user's original normal tasks.

In contrast, Tri-Guard has a slightly higher ASR but achieved TSRs of 60.83% and 61.59% under DPI and IPI respectively, and the rejection rate is significantly lower. This indicates that the effectiveness of TRIAD doesn't just come from "adding an extra guardrail" but from Tri-Guard's learning of the three types of decisions: Proceed, Update, and Refuse.

Table 3: Comparison of the average performance between Tri-Guard and the pre-trained Qwen3.5-9B base model. The results are the averages of four agents.

Table 3 further illustrates the role of trajectory-feedback training. The pre-trained Qwen3.5-9B base model already has a strong security tendency and can keep the ASR very low. Its problem is that it is too conservative: it often refuses prompt injection tasks that could have been repaired, so normal tasks are left incomplete.

After training, Tri-Guard adjusted the decision boundary from "refusing when a risk is found" to "updating if it can be repaired." Although Tri-Guard's average ASR is slightly higher than that of the base model, it increased the TSR from 26.30% to 64.52% on ASB-DPI and from 26.53% to 72.68% on ASB-IPI. At the same time, the rejection rate also decreased significantly.

This indicates that the trained Tri-Guard is more in line with the core goal of TRIAD: not maximizing rejection but retaining the user's normal tasks as much as possible while reducing the attack success rate.

Figure 3: Changes in the guardrail decision distribution before and after training. Compared with the Qwen3.5-9B base model, Tri-Guard is more likely to route action plans contaminated by prompt injection attacks (PIA) to Update rather than directly to Refuse.

The pie charts show the same effect at the level of decision distributions. For normal action plans, Tri-Guard still maintains a high Proceed ratio, indicating that it does not over-interfere with normal tasks. For action plans contaminated by prompt injection, Tri-Guard chooses Update far more often instead of refusing directly as the base model does. For directly harmful tasks, Tri-Guard retains the ability to refuse.

This is the key change of TRIAD compared with traditional guardrails. It doesn't direct all
