Why has the Doubao Phone become an internet sensation with its Super Agent? Let's hear what AI scholars have to say.
The AI on your phone has never felt so human.
Over the past week, the phone sweeping through tech circles hasn't come from any major hardware manufacturer; it is associated with ByteDance's Doubao.
The engineering prototype equipped with the Doubao Mobile Assistant has gone viral online, allowing many people to truly feel for the first time that Agents are within reach. On e-commerce platforms like Taobao, the price of this phone has been hyped up to nearly 5,000 yuan.
The Doubao Mobile Assistant, released earlier this month, is still a technical preview. Unlike most AI assistants, which exist as standalone apps, it embeds the AI Agent into the underlying layers of the system, giving the phone a comprehensive breakthrough in edge-side AI capabilities and bringing a brand-new interaction mode and multimodal experience. In the view of many tech practitioners, the Doubao Mobile Assistant has pushed the perception of AI tools to a new height: it is no longer an auxiliary tool or a bolt-on app but a "super butler" deeply integrated with the mobile operating system.
After all, with just one sentence, the Doubao Mobile Assistant can truly execute complex cross-app instructions. In addition to the common capabilities of Agents on other phones, such as ordering meals, keeping accounts, and modifying settings, the Doubao Mobile Assistant can tackle relatively vague and complex long-chain requirements.
The Doubao Mobile Assistant can complete multi-requirement, long-chain tasks such as "marking restaurants on the map, searching for museums, and booking tickets on travel platforms" without interruption.
Such performance makes people exclaim, "Is it a bit too intelligent?"
Meanwhile, the ongoing discussions about the Doubao Mobile Assistant have also led to some different views and questions: "Will 'AI operating the phone' really become the norm for people using phones in the future? What did the Doubao Mobile Assistant do right to create such an AI phone?"
After digging into the technical foundations behind the Doubao Mobile Assistant and speaking with four academic experts, we now have a more comprehensive and clearer picture of how it reconstructs the interaction paradigm and pushes system-level GUI Agents toward reality.
Why is it so difficult to install a system-level Agent on a phone?
In the past two years, both emerging AI hardware startups and mainstream mobile phone manufacturers at home and abroad have shown an obvious trend: exploring a deeper integration of native AI capabilities into device systems. One of the most important forms is the introduction of AI Agents.
Among them, the GUI Agent, an AI system driven by a multimodal vision model, can follow natural-language instructions to understand screen content, reason autonomously, and perform human-like interactions on the UI, such as reading information, clicking buttons, and entering text, thereby completing specific tasks.
As the capabilities of GUI Agents continue to strengthen on the edge side, system-level GUI Agents characterized by higher integration and deeper system permissions have gradually become the core goal for the next stage. This requires not only efficient task execution but also the ability to understand context and coordinate the flow between multiple apps.
However, achieving such a system-level implementation is not easy. From an academic and engineering standpoint, it generally requires overcoming obstacles at the following four layers:
Firstly, the perception layer: The Agent needs to identify all interactive elements on the screen, such as icons, buttons, and text boxes, within milliseconds. It also needs to have the ability to resist dynamic interference because app interfaces are complex, and pop-up ads, floating layers, and dynamically loaded content can generate visual noise. The GUI Agent needs to have "pixel-level" precise positioning ability and understand the "functional semantics" behind the icons.
Secondly, the planning layer: It mainly involves the information flow across apps, including multiple steps such as app switching, context memory extraction, and clipboard operations. During the execution process, there may also be unexpected situations such as network congestion, login failures, and unexpected pop-ups. Once the traditional script (workflow) is broken, it may not be able to continue. The GUI Agent needs to maintain the logical coherence across multiple apps and have self-reflection ability, such as finding an alternative way when the path is blocked.
Thirdly, the decision-making layer: The GUI Agent must have strong generalization ability. It cannot only work on the interfaces it has seen but also be able to perform similar operations in similar apps it has never seen before. At the same time, in addition to clicking, mobile phone operations also include other fine-grained operations such as long-pressing, swiping, and zooming, which puts forward higher requirements for the Agent's feedback loop and means that the decision-making process must be more timely and accurate.
Fourthly, the system layer: First is response speed; users will not tolerate long pauses while the Agent thinks. Second are permission barriers: under strict sandbox mechanisms such as Android's, it is not easy to read screen information or operate other apps. The GUI Agent needs to break through the data silos inside the operating system while guaranteeing data privacy, security, and low latency.
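To make these four layers concrete, here is a minimal sketch of how they might be wired into a single perceive-plan-execute loop. All names here are illustrative assumptions, not the Doubao Mobile Assistant's actual architecture.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class UIElement:
    label: str                          # functional semantics, e.g. "confirm payment"
    bounds: tuple[int, int, int, int]   # pixel coordinates of the element on screen

@dataclass
class GUIAgentLoop:
    # Perception layer: screenshot -> interactive elements
    perceive: Callable[[bytes], list[UIElement]]
    # Planning + decision layers: goal, elements, history -> next action, or None when done
    plan: Callable[[str, list[UIElement], list[str]], Optional[str]]
    # System layer: inject the chosen gesture (tap, swipe, type) with system-level permissions
    execute: Callable[[str], None]
    history: list[str] = field(default_factory=list)

    def step(self, goal: str, screenshot: bytes) -> bool:
        elements = self.perceive(screenshot)              # what is on screen right now
        action = self.plan(goal, elements, self.history)  # decide the next move
        if action is None:
            return True                                   # task judged complete
        self.execute(action)                              # act on the real device
        self.history.append(action)                       # state for long chains and recovery
        return False
```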
These four layers of obstacles together constitute the core challenges in implementing system-level GUI Agents. Speaking of the difficulties of system-level cross-app operation, Liu Bang, an associate professor at the University of Montreal and Mila, pointed to interface understanding and element positioning at the perception layer, as well as long-chain task planning and state management at the planning layer. Real user tasks often take dozens of steps, span multiple apps, and can run into pop-ups, network delays, permission requests, verification codes, and asynchronous loading. The Agent must remember what it has already done, its current state, and what may happen next, and it must be able to handle failures and exceptions.
Zhang Chi, assistant professor and head of the General Artificial Intelligence (AGI) Lab at Westlake University, pointed to two capabilities crucial for productizing GUI Agents: context memory and reasoning speed. Zhang Weinan, a professor and doctoral supervisor at the School of Computer Science, Shanghai Jiao Tong University, believes that major AI companies today typically focus on one or a few apps and cannot obtain the broadest data access and control permissions; as a result, they can neither align with the user's context nor perform the full range of operations a user could.
Shen Yongliang, a researcher and doctoral supervisor under the Hundred-Talent Program at Zhejiang University, summarized several difficulties, including long-chain planning, reasoning speed, and how lightweight models manage short-term and long-term memory. These are also core bottlenecks of wide concern in the academic community today.
For a full-chain reconstruction project spanning AI technology, terminal hardware, operating systems, and ecosystem collaboration, immaturity in any single link can stall the Agent's path to true productization. Over the past two years, academia and industry have begun focusing on unlocking the capabilities of Agent carriers, including research on general GUI Agents such as AppAgent, Mobile-Agent, and UI-TARS, Rabbit-style general Agents that rely on visual recognition and accessibility control, and system-level Agents built by phone manufacturers at the OS layer.
Through these attempts, AI has begun to operate the phone screen and complete certain tasks the way humans do, but many problems remain: inconsistent permission access across apps, low success rates on long-chain complex tasks, long waits, and an inability to handle unexpected UI situations. All of these limit the stability and practicality of system-level GUI Agents.
The Doubao Mobile Assistant draws on the strengths of these approaches and takes the path of "GUI Agent + system-level permissions." On one hand, through deep integration with the phone's system, it obtains Android system-level permissions, paired with stricter usage restrictions: these permissions are invoked only after the user actively grants authorization. This allows the Doubao Mobile Assistant to simulate user taps, swipes, typing, and cross-app operations. On the other hand, drawing on visual multimodal capabilities, that is, recognizing the on-screen UI, understanding the interface content, parsing the user's intent, and executing a plan, it independently decides "where to click next, what to input, and which app to jump to." In Liu Bang's words, this amounts to a "ghost finger + brain + decision-making system."
Zhang Chi emphasized the Doubao Mobile Assistant's system-level integration: by continuously strengthening its base capabilities and combining various technical approaches (such as calls to system function interfaces), it delivers a better GUI Agent experience. Zhang Weinan said the Doubao Mobile Assistant breaks down the barriers between apps through the GUI Agent and has made significant progress in aligning with the user's context and operation space. "As the first AI phone jointly designed by a phone manufacturer and a large-model company, its design logic is more disruptive than the AI retrofits of traditional phone manufacturers."
Shen Yongliang likewise highlighted the Doubao Mobile Assistant's native GUI visual operation. Through deep cooperation with the phone manufacturer, it has obtained system-level operation permissions and can send instructions directly to the system kernel to simulate human finger taps and swipes. This visual operation rooted in the system's underlying layer is fundamentally different from third-party apps that rely on accessibility services: it generalizes well, executes more stably, and behaves more like a human operator. It strikes a good balance between reasoning speed and task completion rate and has considerable long-context processing ability.
Overall, the Doubao Mobile Assistant is building a general Agent layer that integrates "visual understanding, large model reasoning, and system-level native execution," enabling generalizable UI operations across different apps and interface forms.
From multiple dimensions such as compatibility, cross-app automated execution, long-chain task processing, and multi-task scheduling, the Doubao Mobile Assistant has demonstrated capabilities superior to traditional script-based automation or accessibility interface solutions. These provide a more robust foundation for the realization of higher-order system-level GUI Agents.
UI-TARS: The Self-developed System-level GUI Agent Engine behind the Doubao Mobile Assistant
I believe you have been bombarded with various demonstrations of the Doubao Mobile Assistant. Whether it's booking air tickets across apps, automatically comparing prices, or smoothly completing a whole set of complex processes on the phone, these capabilities indicate that the phone is no longer just a tool waiting for your clicks but has begun to have the ability to actively complete tasks.
Behind these capabilities is UI-TARS, a model ByteDance developed in-house and released in successive open-source versions over the course of 2025. The Doubao Mobile Assistant reportedly runs a closed-source version of UI-TARS, which not only outperforms the open-source releases but has also been heavily optimized for mobile use.
UI-TARS dates back to January of this year, when it laid down ByteDance's foundational framework for GUI Agents. In April, the team released an upgraded version, UI-TARS-1.5, which integrates advanced reasoning capabilities from reinforcement learning, letting the model think and deduce before it acts. UI-TARS-2, launched in September, pushed the system to a new stage.
UI-TARS includes a data flywheel mechanism for scalable data generation, a stable multi-round reinforcement learning framework, a hybrid GUI environment that integrates the file system and the terminal, and a unified sandbox platform that supports large-scale rollouts.
Firstly, it alleviates the problem of data scarcity. Large-scale pre-training and reinforcement learning are by now quite mature in fields such as dialogue and reasoning, but GUI tasks that require long chains of operations are hard to scale directly. Unlike text and code, GUI scenarios do not allow massive data to be scraped easily; instead, complete operation trajectories must be recorded, covering each step's reasoning, the click, the interface change, and the feedback. Such data is not only hard to obtain and expensive but also especially difficult to collect at scale.
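For a sense of why such data resists scaling, here is a hypothetical sketch of what a single training example has to contain; the field names are assumptions, not UI-TARS's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    thought: str        # the model's reasoning before acting
    action: str         # e.g. "tap(540, 1200)" or "type('Shanghai')"
    screenshot: bytes   # the interface state observed at this step
    feedback: str       # what changed after the action (new screen, error, pop-up)

@dataclass
class Trajectory:
    instruction: str             # the user's natural-language task
    steps: list[TrajectoryStep]  # every step of the long chain, not just a final answer
    success: bool                # final task outcome
```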
UI-TARS has designed a scalable data flywheel mechanism to continuously improve the model's capabilities and data quality through repeated training. In each cycle, the latest model will generate new agent trajectories, which will then be filtered and assigned to the most suitable training stage. High-quality outputs will be promoted to later stages (such as SFT), while lower-quality outputs will be recycled to earlier stages (such as CT). Through multiple iterations, this dynamic reallocation method ensures that each training stage uses the data most suitable for it, forming a self-reinforcing closed loop: better models generate better data, and better data, in turn, trains stronger models.
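A minimal sketch of one flywheel cycle, reusing the Trajectory structure sketched above and assuming a generic quality scorer in place of whatever filtering UI-TARS actually applies:

```python
from typing import Callable

def flywheel_cycle(
    trajectories: list[Trajectory],        # fresh rollouts from the latest model
    score: Callable[[Trajectory], float],  # hypothetical quality/reward scorer
    sft_pool: list[Trajectory],            # later stage: supervised fine-tuning data
    ct_pool: list[Trajectory],             # earlier stage: continual-training data
    threshold: float = 0.8,
) -> None:
    # Route each trajectory to the stage it is best suited for: high-quality outputs
    # are promoted to SFT, the rest are recycled into CT for the next iteration.
    for traj in trajectories:
        (sft_pool if score(traj) >= threshold else ct_pool).append(traj)
```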
Secondly, it needs to solve the problem of scalable multi-round reinforcement learning. It is difficult to conduct reinforcement learning in an interactive environment because agents can hardly know in time whether they are doing things right: rewards usually come slowly and sometimes not at all; the training process is also prone to instability.
To break through this bottleneck, UI-TARS has built a training framework specifically for long-chain scenarios, which includes using asynchronous rollouts with state-keeping capabilities to maintain context consistency; avoiding training bottlenecks caused by long-tail trajectories through streaming updates; and combining reward shaping, adaptive advantage estimation, and value pre-training with an enhanced proximal policy optimization (PPO) algorithm to further improve the training effect.
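The enhanced PPO variant is not spelled out publicly, but as a reference point, here is a hedged sketch of a standard clipped PPO surrogate together with one simple form of reward shaping, written in PyTorch; the shaping scheme is illustrative rather than UI-TARS's actual design.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Ratio between the updated policy and the policy that produced the rollouts.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate: discourages updates that drift too far from the rollout policy,
    # one of the standard levers for keeping long-horizon RL training stable.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    return -surrogate.mean()

def shape_rewards(milestones_hit: list[bool], task_succeeded: bool,
                  milestone_bonus: float = 0.1, success_bonus: float = 1.0) -> list[float]:
    # Give a small reward whenever an intermediate milestone is reached (e.g. the right
    # app was opened), plus a terminal bonus on success, so feedback does not arrive
    # only at the very end of a long trajectory.
    rewards = [milestone_bonus if hit else 0.0 for hit in milestones_hit]
    rewards[-1] += success_bonus if task_succeeded else 0.0
    return rewards
```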
Thirdly, it breaks through the limitations of pure GUI operations. Many real-world tasks cannot be completed solely by clicking on the interface. For example, data processing, software development, and system management often require more efficient methods, such as directly operating the file system, using the terminal, or invoking external tools. If an agent can only rely on GUI interactions, its ability will be very limited. Therefore, a truly advanced GUI Agent must be able to seamlessly integrate graphical operations with these system resources, enabling it to perform more real and complex workflows.
To this end, UI-TARS has built a hybrid, GUI-centered environment that lets agents not only act on the screen but also invoke the file system, the terminal, and other external tools, and thus solve a wider range of real-world tasks. In UI-TARS's training system, the agent's action space accordingly expands from simple clicks, inputs, and scrolling to a higher-dimensional action set that freely combines GUI operations with system commands: it can drag files in the file manager, or directly process text, extract archives, and run scripts through shell commands. This is a key step for system-level GUI Agents moving toward real-world applications.
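As a toy illustration of such a hybrid action space (not the actual UI-TARS interface), the sketch below dispatches GUI gestures and shell commands through one executor, with adb input injection standing in for system-level gesture control:

```python
from dataclasses import dataclass
from typing import Union
import subprocess

@dataclass
class Tap:
    x: int
    y: int

@dataclass
class TypeText:
    text: str   # note: adb "input text" needs escaping for spaces; kept simple here

@dataclass
class Shell:
    command: str   # e.g. "unzip report.zip -d report/"

Action = Union[Tap, TypeText, Shell]

def execute(action: Action) -> str:
    """Route GUI gestures and system commands through a single action interface."""
    if isinstance(action, Tap):
        # adb input injection stands in for system-level gesture control.
        subprocess.run(["adb", "shell", "input", "tap", str(action.x), str(action.y)], check=True)
        return "tapped"
    if isinstance(action, TypeText):
        subprocess.run(["adb", "shell", "input", "text", action.text], check=True)
        return "typed"
    # Shell commands let the agent bypass the GUI for file and text processing.
    result = subprocess.run(action.command, shell=True, capture_output=True, text=True)
    return result.stdout
```

The point of a single dispatcher is that the planner can freely interleave on-screen gestures with file or terminal operations inside one trajectory, which is exactly the kind of higher-dimensional action set described above.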
Finally, even with rich interaction capabilities, deploying a large-scale RL environment remains an engineering bottleneck. The system needs to run millions of interactions across browsers, virtual machines, and emulators while keeping results reproducible, recovering from errors, and leaving the training process undisturbed. In reality, such environments are often slow, expensive, and prone to crashes; running large-scale RL stably over the long term is itself an extremely difficult engineering task.
To support large-scale training and evaluation, UI-TARS has built a unified sandbox platform, one of whose core innovations is a shared file system: This allows the GUI Agent to perform continuous cross-tool operations, such as downloading a file through the browser and immediately processing it with shell commands, in the same container instance. This sandbox not only maintains the stability and reproducibility required for complex tasks but also supports high-throughput training on distributed computing resources while providing a consistent environment for data annotation, evaluation, and reasoning.
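A hedged sketch of what the shared file system enables within one container instance; the mount path is hypothetical, and curl stands in for the browser-based download:

```python
import subprocess
from pathlib import Path

SHARED = Path("/workspace/shared")   # hypothetical mount visible to every tool in the container

def download_then_process(url: str, archive_name: str) -> list[str]:
    archive = SHARED / archive_name
    # Step 1: the "browser" tool saves the file into the shared file system
    # (curl is used here as a stand-in for a browser download).
    subprocess.run(["curl", "-L", "-o", str(archive), url], check=True)
    # Step 2: the shell tool immediately works on the very same file,
    # with no copying between tool-specific sandboxes.
    out_dir = SHARED / "extracted"
    subprocess.run(["unzip", "-o", str(archive), "-d", str(out_dir)], check=True)
    return sorted(p.name for p in out_dir.iterdir())
```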
Relying on these four technologies, UI-TARS provides genuinely deployable foundational capabilities for system-level GUI Agents, enabling the Doubao Mobile Assistant to stably execute cross-app, long-chain complex tasks on a real mobile operating system and make the leap from dialogue intelligence to operational intelligence.