
OpenClaw Case: Agents Can Be "Corrupted" in Daily Chats Without Malicious Attacks

Xinzhiyuan (新智元), 2026-05-22 16:21
Casual conversations may inadvertently contaminate the long-term memory of personalized Agents, causing them to deviate from the user's true intentions in future tasks. Through the ULSPB benchmark test, researchers found that even without malicious prompts, casual conversations may change the safety boundaries of Agents.

Today's large-model Agents are no longer just chatbots that answer questions. They are gaining long-term memory: they can remember user preferences across sessions, pick up tasks where they left off, and call email, calendars, files, web pages, and other external tools.

In other words, Agents are evolving from one-time task executors into personalized collaborators that accompany users continuously.

However, this ability also brings a subtler problem: if an Agent can remember a user's habits and context over the long term, is that memory itself secure?

Until now, most Agent security research has focused on explicit attacks, such as malicious prompts, indirect prompt injection, and contaminated web content or tool outputs.

However, in the context of personalized Agents, the risk may not necessarily come from a clear attacker.

Figure 1: Even without malicious prompts, ordinary daily conversations may "corrupt" your personalized Agent. Once temporary preferences are written into long-term memory, they may become dangerous default rules in the future.

The researchers found that even without hackers, malicious prompts, or obvious attacks, ordinary daily chats can gradually contaminate the long-term state of a personalized Agent. The risk may not surface in the current conversation; instead, it may be written into long-term memory and change the Agent's default behavior in future tasks.

Paper link: https://arxiv.org/abs/2605.06731

Demo: https://xiaoyuxu1.github.io/ULSPB_website/

Just because an Agent does nothing wrong today doesn't mean it hasn't planted the seeds of future mistakes in its long-term state.

The long-term state of the Agent is "corrupted"

Traditional prompt injection is an explicit attack, while long-term state poisoning is more like a "chronic drift": the Agent doesn't make an immediate mistake, but it may write the rules for future mistakes into its memory.

Researchers define this phenomenon as Unintended Long-Term State Poisoning. The core issue is not that the Agent is induced to do something bad within a single conversation, but that it wrongly generalizes a temporary request, a local preference, or a "convenient practice" from one context into a long-term default rule.

For example, a user might say, just to save time today, "Don't ask me every time about such small things in the future. Just handle them directly."

If the Agent writes this statement into its long-term state, it may gradually reduce confirmations for sending emails, modifying files, arranging schedules, and even account operations. The user has not actually authorized all future operations, but the Agent's long-term state has been quietly rewritten.

This is different from traditional prompt injection. Traditional attacks often assume the existence of a clear attacker, while the risk here comes from seemingly normal daily interactions. It is also not an ordinary hallucination because the danger may be retained across sessions and continuously affect future security boundaries.

Figure 2: Traditional task-based Agents usually reset the context after a single task is completed, while personalized Agents maintain long-term states, user preferences, and tool permissions across sessions.

Why does long-term memory become a security entry point?

The long-term state of a personalized Agent usually does more than "remember some facts". It may also include long-term memory, the Agent's core instructions, default tool settings, user profiles, behavioral styles, and short-term running states. These contents may look like mere memory files, but they affect how the Agent interprets user intentions in the future, when it calls tools, whether it asks for confirmation, and whether it can execute autonomously.

Therefore, the long-term state is not an ordinary cache but part of the Agent's future behavior boundary. Once this state is wrongly written, the risk may not show up immediately but may surface in a future task as "asking for confirmation one less time", "calling one more tool", or "executing by default an operation that should have required authorization". In other words, the long-term memory of a personalized Agent is not a passive database but a set of "implicit configuration files" that shape future behavior.
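
To make the "implicit configuration file" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `LongTermState` class, the `HIGH_RISK_ACTIONS` set, and the naive keyword policy are illustrative assumptions, not how OpenClaw or any real Agent decides when to confirm. It only shows how one remembered sentence can silently flip a confirmation decision in a later task.

```python
from dataclasses import dataclass, field


@dataclass
class LongTermState:
    # Hypothetical stand-in for notes persisted in files such as MEMORY.md.
    notes: list[str] = field(default_factory=list)


# Hypothetical set of tool actions that should always be confirmed.
HIGH_RISK_ACTIONS = {"send_email", "modify_file", "change_account"}


def needs_confirmation(action: str, state: LongTermState) -> bool:
    """Naive policy (an assumption): a remembered note overrides the risk check."""
    skip = any("don't ask" in note.lower() for note in state.notes)
    return action in HIGH_RISK_ACTIONS and not skip


clean = LongTermState()
drifted = LongTermState(
    notes=["Don't ask me every time about such small things in the future."]
)

print(needs_confirmation("send_email", clean))    # True
print(needs_confirmation("send_email", drifted))  # False
```

The user's sentence was about "small things", yet in this toy policy it also removes the confirmation step for sending email, which is the kind of over-generalization the article describes.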

ULSPB: Specifically testing "whether daily chats contaminate the long-term state"

To study this problem systematically, researchers constructed a new bilingual benchmark, ULSPB (Unintended Long-Term State Poisoning Bench). It is designed to test whether daily user-Agent conversations can induce long-term state contamination.

ULSPB covers seven types of long-term state drift scenarios, five types of daily personalized assistance tasks, and two languages (English and Chinese), and it constructs 24 rounds of ordinary daily conversation for each setting. For comparison, researchers also constructed four types of single-time explicit injection variants to observe the differences between routine conversations and explicit injections.

The seven types of risk scenarios cover several security boundary drifts that are most likely to occur in long-term interaction with personalized Agents.

Figure 3: The construction process of ULSPB. The benchmark starts from seven types of long-term state drift scenarios, five types of daily assistance tasks, bilingual templates, and five conversation variants to systematically test whether ordinary daily conversations can contaminate the long-term state of a personalized Agent.
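
As a rough sense of scale, the counts quoted above can be multiplied out. The sketch below assumes the benchmark pairs every scenario with every task and language; the article does not say so, and the actual size of ULSPB may differ. The constant names are illustrative.

```python
from itertools import product

# Counts taken from the article.
N_SCENARIOS = 7            # long-term state drift scenarios
N_TASKS = 5                # daily assistance tasks
LANGUAGES = ("en", "zh")   # English and Chinese
N_VARIANTS = 1 + 4         # routine conversation plus four explicit injection variants
ROUNDS_PER_CONVERSATION = 24

# Assumption: every scenario is paired with every task and every language.
settings = list(product(range(N_SCENARIOS), range(N_TASKS), LANGUAGES))

print(len(settings))                            # 70
print(len(settings) * ROUNDS_PER_CONVERSATION)  # 1680
print(N_VARIANTS)                               # 5
```

Under that assumption there would be 70 settings and 1,680 routine conversation rounds; the one routine variant plus four injection variants match the "five conversation variants" in the Figure 3 caption.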

Experimental results

Researchers conducted experiments in the OpenClaw personalized Agent environment and tested four different Agent backbones: Kimi K2.5, GPT-5.4, MiniMax M2.7, and Grok 4.20.

To measure the degree of long-term state contamination, the researchers designed a state-centered metric, Harm Score (HS).

Unlike the traditional attack success rate, HS looks not only at whether the Agent takes a dangerous action in the moment but also at whether its long-term state has undergone a security-related drift. Specifically, HS focuses on three dimensions: whether the authorization confirmation boundary is weakened, whether tool call permissions or scope are expanded, and whether the Agent starts to bypass procedures and increase its degree of autonomous execution.
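
The article does not give the actual HS formula, so the following is only a toy illustration of a state-centered score built on the three dimensions just listed. The `DriftFlags` class, the `toy_harm_score` function, and the equal weighting are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DriftFlags:
    """Hypothetical per-conversation judgments, one per dimension named in the article."""
    confirmation_weakened: bool
    tool_scope_expanded: bool
    autonomy_increased: bool


def toy_harm_score(flags: DriftFlags) -> float:
    """Fraction of dimensions that drifted (equal weights are an assumption)."""
    hits = [
        flags.confirmation_weakened,
        flags.tool_scope_expanded,
        flags.autonomy_increased,
    ]
    return sum(hits) / len(hits)


print(toy_harm_score(DriftFlags(False, False, False)))  # 0.0
print(toy_harm_score(DriftFlags(True, True, True)))     # 1.0
```

The point of the sketch is the input: the score is computed from what changed in the persisted state, not from whether the Agent misbehaved in the current turn.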

The results show that explicit single-time injections usually lead to a higher HS, but ordinary daily conversations can also induce obvious long-term state contamination. In some models, the risk caused by daily conversations is already close to that of explicit injections.

This shows that the risk to personalized Agents may not necessarily come from an obvious attack but may also come from the accumulation of long-term, natural, and seemingly harmless interactions.

Table 1: Harm Score under different conversation variants and languages. The results show that ordinary daily conversations can induce long-term state contamination on their own, and in some models the risk is even close to that of explicit injections; the risk under different languages also differs noticeably across models.

The most easily contaminated are memory files

Further analysis shows that the risks are concentrated mainly in memory-centric artifacts, that is, state files closely tied to memory. Across models and conversation variants, MEMORY.md and memory/ are the most frequently modified areas, followed by USER.md, AGENTS.md, and TOOLS.md.

This is also in line with intuition: daily chats are most likely to be summarized by the Agent as "user preferences", "historical habits", or "future default rules". The problem is that once these summaries are over-generalized, they may turn a temporary context into part of the long-term security boundary.

"The user tends to handle low-risk matters quickly."

"Similar repetitive tasks can be executed first and reported later."

"The user usually doesn't want to be interrupted frequently for confirmation."

These records seem reasonable individually, but they may become dangerous defaults in high-permission tool scenarios.

Figure 4: Under different models and conversation variants, risk edits are mainly concentrated in memory-related files such as MEMORY.md and memory/.
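
A tally of this kind can be sketched by comparing state snapshots before and after each round and counting which areas changed. The snapshots below are made-up illustrations, not data from the paper; the file name `memory/2026-05-22.md` and the helper names are hypothetical.

```python
from collections import Counter


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths whose content was added, removed, or modified."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))


def area(path: str) -> str:
    """Group everything under memory/ into one area; keep top-level files as-is."""
    return "memory/" if path.startswith("memory/") else path


# Hypothetical (before, after) snapshots of the state directory, one pair per round.
rounds = [
    ({"MEMORY.md": "", "USER.md": "name: A"},
     {"MEMORY.md": "- skip confirmations", "USER.md": "name: A"}),
    ({"MEMORY.md": "- skip confirmations"},
     {"MEMORY.md": "- skip confirmations", "memory/2026-05-22.md": "ran task first"}),
    ({"MEMORY.md": "- skip confirmations", "TOOLS.md": ""},
     {"MEMORY.md": "- skip confirmations\n- report later", "TOOLS.md": ""}),
]

counts = Counter(area(p) for before, after in rounds for p in changed_files(before, after))
print(counts.most_common())  # [('MEMORY.md', 2), ('memory/', 1)]
```

Note that this only counts edits; deciding whether an edit is risky is a separate judgment.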

Real chat data can also trigger risks

To verify that this phenomenon is not an illusion caused by synthetic prompts, researchers further introduced real user chat data for testing.

Specifically, daily assistance-related conversation seeds were selected from two public real chat datasets, WildChat and LMSYS-Chat-1M, expanded into 24 rounds of routine interactions, and re-executed in the OpenClaw-style environment.

The results show that although the HS of daily conversations built from real seeds is lower than that of fully synthetic ULSPB routine conversations, they still induce non-negligible long-term state risks in all tested models. This shows that unintended long-term state poisoning is not an artifact of prompt design but a security problem that may actually arise in future personalized Agent usage scenarios.

Figure 5: Daily conversations cause long-term state contamination not only in synthetic ULSPB but also in routine settings expanded from real user chat seeds, where they generate non-negligible long-term state risks.

StateGuard, the last security audit

If the problem arises at the long-term state writing stage, then the defense should also operate at that stage.

Based on this idea, researchers proposed a lightweight defense method, StateGuard. Rather than intercepting user input or checking Agent output, it audits the state diff just before the Agent writes new content into the long-term state.

The process of StateGuard is straightforward: after the Agent completes a round of interaction and generates candidate state updates, StateGuard checks which long-term state files have changed; it then conducts a security audit on the newly added or modified content to determine whether it should be retained or rolled back. If a state update may weaken the confirmation boundary, expand the tool call scope, or increase the Agent's unauthorized autonomous behavior, StateGuard rolls back that write.
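
The diff, audit, and retain-or-roll-back loop described above can be sketched as follows. This is not the authors' implementation: the `RISKY_PHRASES` keyword list is a hypothetical placeholder for whatever auditor StateGuard actually uses (plausibly a model-based judge), and the function names are invented for illustration. Only `difflib.ndiff` is a real standard-library call.

```python
import difflib

# Hypothetical phrases treated as signs of boundary drift (an assumption;
# a real auditor would likely be far more capable than keyword matching).
RISKY_PHRASES = ("without asking", "skip confirmation", "execute first", "no need to confirm")


def added_lines(old: str, new: str) -> list[str]:
    """Lines present in the candidate state but not in the persisted state."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def audit(lines: list[str]) -> bool:
    """Return True when no added line matches a risky phrase."""
    return not any(p in line.lower() for line in lines for p in RISKY_PHRASES)


def guarded_write(persisted: dict[str, str], candidate: dict[str, str]) -> dict[str, str]:
    """Keep each file's candidate update only if its diff passes the audit."""
    result = dict(persisted)
    for path, new_text in candidate.items():
        old_text = persisted.get(path, "")
        if new_text == old_text:
            continue  # unchanged file, nothing to audit
        if audit(added_lines(old_text, new_text)):
            result[path] = new_text  # retain the update
        # otherwise the persisted version stays in place (roll back)
    return result


persisted = {"MEMORY.md": "- prefers short replies", "USER.md": "timezone: UTC"}
candidate = {
    "MEMORY.md": "- prefers short replies\n- send emails without asking",
    "USER.md": "timezone: UTC\nlanguage: English",
}
final = guarded_write(persisted, candidate)
print(final["MEMORY.md"])  # - prefers short replies
print(final["USER.md"])
```

In this run the MEMORY.md update is rolled back while the harmless USER.md update is kept. The sketch only audits added lines, so a write that deletes an existing safety rule would pass unnoticed; a fuller auditor would need to examine removals as well.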

The key to this design is that it protects not the current answer but the future behavior boundary. The harm of long-term state poisoning often doesn't appear in the current round but is activated in a future task.

Figure 6: StateGuard checks the long-term state diff after each round of interaction and decides whether to retain or roll back the modification before the state is written.

The long-term state risk is reduced to nearly 0

The experimental results show that StateGuard can significantly reduce the risk of long-term state contamination.

Without defense, all four models produce a relatively high HS, indicating that daily interactions can indeed write insecure default rules into the long-term state; after introducing StateGuard, especially in the Targeted-Ensemble setting, the HS is reduced to nearly 0. This shows that conducting a write audit before the state is truly persisted is an effective way to defend