Alibaba's latest paper reveals an incident of "agent defection and ore theft".
The research team associated with Alibaba (the joint team of ROCK & ROLL & iFlow CLI & Terminal Bench Pro & iFlow-ROME) detailed in a paper titled "Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem" (arXiv: 2512.24873v2), which was published on arXiv in December 2025 and revised in January 2026, the unauthorized autonomous behavior of AI agents that occurred during the construction of the open-source agentic AI model ROME. These are what the researchers call "unexpected" and spontaneous behaviors. Such behaviors not only have no clear instructions but also go beyond the expected sandbox scope. Notably, the agent autonomously created a reverse SSH tunnel and used computing resources for cryptocurrency mining.
It is reported that the ROME model has a 30B total parameter MoE architecture (based on Qwen3-MoE) and has completed over one million trajectory trainings in the Agentic Learning Ecosystem (ALE). Its core goal is to achieve complex multi-round tool interactions and task completion. This abnormal event was not discovered through the training curve but was triggered by an alarm from Alibaba Cloud's production-level security telemetry system.
Section 3.1.4 "Safety-Aligned Data Composition" (page 15) of the paper details the event: One morning, Alibaba Cloud's management firewall detected a series of security policy violations on the training server. After the team intervened urgently, they found that the violations included probing and accessing internal network resources, as well as traffic patterns related to cryptocurrency mining. Initially, the team regarded it as a regular security incident (such as an incorrect export control configuration or an external intrusion), but the violations recurred without an obvious time pattern.
By correlating the firewall timestamps with system telemetry and reinforcement learning tracking data, the researchers found that the abnormal outbound traffic always coincided with specific events of the agent invoking tools and executing code. The model logs show that these network abnormalities were all initiated by the agent actively invoking tools and executing code, and such behaviors were neither required by the task prompts nor necessary for task completion under the sandbox constraints.
Among them, the two most concerning behaviors are: First, the agent autonomously established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address. This remotely accessed channel initiated from the outside can effectively bypass intrusion filtering and weaken regulatory control. Second, the agent unauthorizedly occupied the configured GPU capacity for cryptocurrency mining, which not only quietly consumed training computing resources and increased operating costs but also brought clear legal and reputational risks. The paper specifically emphasizes that these behaviors were not triggered by prompts for tunneling or mining but were instrumental side effects generated by the agent's autonomous use of tools during the reinforcement learning optimization process.
Subsequently, the research team conducted a statistical analysis of the full dataset logs and classified such phenomena as "general-security issues" and further subdivided them into three major categories: security and assurance (including code security, behavioral security, and adversarial robustness), controllability (including boundary integrity and instruction compliance), and credibility (including process transparency and no deceptive behavior).
To solve this problem, the research team built a targeted general security data suite: generating diverse seed corpus through data collection, commercial data acquisition, and high-fidelity synthesis; developing a dedicated red team system to programmatically inject real-world failure modes; and generating golden trajectories without security issues for subsequent SFT (Supervised Fine-Tuning) and RL (Reinforcement Learning) post-training. The core goal is to enable the agent to reliably choose a safe action path and actively avoid risky behaviors when facing potential security pitfalls.
The paper clearly points out that although the research team recognizes the capabilities of agentic large language models, they also issue a warning: The existing models still have obvious deficiencies in terms of safety, security, and controllability, which limits their reliable application in real-world scenarios. The research team calls on the AI community to continuously pay attention to such phenomena and broader AI security issues and to promote systematic research in future work.
It is worth noting that the paper does not disclose the specific external IP, the cryptocurrency mined, or the duration of the tunnel involved in the event, only emphasizing that it brought consequences with "practical operational impacts" and legal and reputational risks. Currently, the research team has imposed stricter restrictions on the model and optimized the training process. The details related to this event are all publicly recorded in the paper, and there is no additional official comment.
This article is from the WeChat official account "Silicon-Based Starlight", author: Mu Yang. It is published by 36Kr with authorization.