
Unveiling the "Doubao Phone," billed as "the world's first truly AI-powered phone": its core technology was open-sourced long ago, and the GUI Agent behind it has been two years in the making

QbitAI 2025-12-09 16:56
UI-TARS continues to evolve

The first batch of 30,000 units of the popular "Doubao Phone" was snapped up instantly, and its price doubled on the second-hand market. Now, more technical details have been confirmed.

As it turns out, behind the "Doubao Phone Assistant Technical Preview" lies nearly two years of ByteDance's long-term strategic groundwork in system-level GUI Agents.

In the official demonstration, the assistant, installed on the nubia M153 engineering prototype, operates the phone on the user's behalf and automatically executes tasks across applications.

For example, you can issue multiple instructions at once, asking it to complete complex tasks such as applying for leave on Feishu, submitting a business travel application, and booking high-speed train tickets for the trip:

According to the latest information obtained by QbitAI, this graphical-interface operation capability is based on ByteDance's self-developed UI-TARS model.

Developers should be familiar with this series of models. When the first-generation model was open-sourced, it sparked plenty of discussion and was benchmarked as outperforming OpenAI's Operator, which at the time had only been previewed (UI-TARS was released before Operator's official launch).

The "Doubao Phone" uses the closed - source version of UI - TARS, which not only has better performance than its open - source version but also has been highly optimized for mobile use.

In other words, the core technical direction behind the Doubao Phone Assistant has actually been open source for a long time.

PS: The officially released Operator later required a $200-per-month Pro subscription to use...

Continuous Evolution and Application of the UI-TARS Model

As early as January this year, ByteDance's Seed team and Tsinghua University jointly open-sourced the first-generation UI-TARS, laying the foundation for system-level AI Agents. Since then, the team has kept digging into this field and iterating on its capabilities.

The team pointed out that a native Agent needs to have four core capabilities: perception, action, reasoning, and memory.

Therefore, the first-generation UI-TARS made four key innovations around these capabilities.

1) Enhance the accuracy of GUI perception through a large-scale GUI screenshot dataset and five perception tasks (element description, marked-area perception, etc.).

2) Design a unified cross-platform action space, and integrate annotated trajectories with open-source data to improve the accuracy of action grounding (see the action-space sketch after this list).

3) Incorporate 6 million high-quality GUI tutorials and various reasoning modes (task decomposition, reflection, etc.) to inject System-2 deep-thinking ability.

4) Automatically collect interaction trajectories across hundreds of virtual machines, solve data bottlenecks through multi-stage filtering, reflection-based refinement, and direct preference optimization (DPO), and iteratively optimize the model.
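To make point 2 concrete, here is a minimal sketch of what a unified cross-platform action space could look like: one abstract action type that gets lowered into platform-specific commands. The action names, fields, and the adb lowering are illustrative assumptions, not UI-TARS's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    """One shared action vocabulary, reused across desktop, web, and mobile."""
    kind: str                                  # e.g. "click", "type", "press_key"
    point: Optional[Tuple[int, int]] = None    # screen coordinates for "click"
    text: Optional[str] = None                 # payload for "type"
    key: Optional[str] = None                  # key name for "press_key"

def to_android(action: GUIAction) -> str:
    """Lower a shared action into a concrete Android command (here via adb)."""
    if action.kind == "click":
        x, y = action.point
        return f"adb shell input tap {x} {y}"
    if action.kind == "type":
        return f"adb shell input text '{action.text}'"
    if action.kind == "press_key":
        return f"adb shell input keyevent {action.key}"
    raise NotImplementedError(action.kind)

print(to_android(GUIAction(kind="click", point=(540, 1200))))
```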

On GUI Agent benchmarks, the first-generation UI-TARS achieved breakthrough performance and took multiple SOTA results.

Just three months later, the team launched a new open-source version, UI-TARS-1.5.

While retaining the previous architecture, UI-TARS-1.5 adds a reinforcement-learning-driven reasoning mechanism, letting the model reason through a thinking process before acting, which significantly improves performance and scalability at inference time.
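In agent terms, "reasoning before acting" can be pictured as a loop in which every action is preceded by an explicit thought. A minimal sketch, where `model_step` and `execute` are hypothetical stand-ins for the model call and the device controller:

```python
def run_episode(model_step, execute, screenshot, instruction, max_steps=30):
    """Run one task: think, act, observe, repeat until the model says 'finish'."""
    history = []
    for _ in range(max_steps):
        # The model first emits a free-form thought, then a concrete action.
        thought, action = model_step(screenshot, instruction, history)
        history.append((thought, action))
        if action == "finish":
            break
        screenshot = execute(action)   # apply the action, observe the new screen
    return history
```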

In multiple standard benchmark tests, UI-TARS-1.5 made significant progress compared with the previous generation.

It set a new SOTA on the GUI grounding task:

Meanwhile, the team introduced a new twist in testing: letting UI-TARS-1.5 play games.

The team pointed out that, unlike fields such as mathematics or programming, games tend to require intuitive, common-sense reasoning and strategic forward planning, making them well suited as benchmark tasks.

They selected 14 games from poki.com for testing. Under standardized scoring, UI-TARS-1.5 outperformed OpenAI CUA and Claude 3.7.
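The article does not spell out what "standardized scoring" means here; one plausible scheme, assumed purely for illustration, is min-max normalization of each game's raw score against random-play and human baselines, so that heterogeneous games share a common 0-100 scale:

```python
def normalized_score(raw, random_baseline, human_score):
    """Map a raw game score to [0, 100]: 0 = random play, 100 = human level."""
    frac = (raw - random_baseline) / (human_score - random_baseline)
    return 100.0 * max(0.0, min(1.0, frac))

# Toy numbers for two games: (raw score, random baseline, human score).
results = [(120.0, 10.0, 200.0), (45.0, 5.0, 60.0)]
average = sum(normalized_score(*r) for r in results) / len(results)
print(f"average standardized score: {average:.2f}")
```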

In September this year, the release of UI-TARS-2 pushed agent capabilities to a new height and provided the key technical support for the Doubao Phone Assistant.

UI-TARS-2 aims to enable agents to interact with graphical interfaces truly autonomously.

It further tackles four problems faced by previous-generation models and existing GUI Agents: data scalability, stability of multi-round reinforcement learning (RL), the limitations of pure GUI operations, and environment stability.

UI-TARS-2 centers on multi-round reinforcement learning and achieves breakthroughs through four core techniques:

First, the team designed a scalable Data Flywheel. Through cyclic iteration of continual pre-training, supervised fine-tuning, rejection sampling, and multi-round RL, the model and its training data co-evolve: high-quality trajectories flow into the supervised fine-tuning set, while low-quality trajectories are fed back into the continual pre-training set, forming a self-reinforcing closed loop.
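The routing of trajectories is the part the article does describe, and it is easy to sketch. The training stages themselves are elided below, and every function name is a placeholder, not ByteDance's actual pipeline:

```python
def split_by_quality(trajectories):
    """Rejection sampling: keep successful trajectories, set the rest aside."""
    good = [t for t in trajectories if t["success"]]
    bad = [t for t in trajectories if not t["success"]]
    return good, bad

def flywheel_round(model, sft_data, pretrain_data, collect):
    # 1) Continual pre-training and 2) supervised fine-tuning would update
    #    `model` on pretrain_data / sft_data here (elided).
    trajectories = collect(model)             # 3) roll out the current model
    good, bad = split_by_quality(trajectories)
    sft_data.extend(good)                     # high-quality -> SFT set
    pretrain_data.extend(bad)                 # low-quality -> pre-training set
    # 4) Multi-round RL on the refreshed data would follow (elided).
    return model, sft_data, pretrain_data
```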

Second, the team designed a training framework for stable optimization over long horizons. With stateful asynchronous rollout, streaming updates, and an enhanced PPO, it solves the problem of optimizing long-cycle tasks.
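As a generic illustration of that pattern (not ByteDance's actual framework), asynchronous rollout with streaming updates means workers keep long episodes alive and ship partial trajectory segments, while the learner consumes them without waiting for episodes to finish:

```python
import queue
import threading

segments: queue.Queue = queue.Queue()

def rollout_worker(env_id: int, episode_len: int = 50, chunk: int = 10):
    buffer = []
    for step in range(episode_len):
        buffer.append((env_id, step))   # stand-in for (state, action, reward)
        if len(buffer) == chunk:
            segments.put(buffer)        # stream a partial trajectory segment
            buffer = []

def learner(num_updates: int):
    for i in range(num_updates):
        seg = segments.get()            # an enhanced-PPO update would run here
        print(f"update {i}: {len(seg)} transitions from env {seg[0][0]}")

workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
learner(num_updates=5)
for w in workers:
    w.join()
```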

In addition, to break out of pure GUI operations, UI-TARS-2 constructs a hybrid GUI-centered environment.

By connecting the file system, terminal commands, and external tools through an SDK, it integrates GUI operations with system-level resources, so the agent is no longer limited to "simulating mouse and keyboard clicks".
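A minimal sketch of what such hybrid dispatch could look like: GUI actions and system-level tools behind one interface. The tool names and fields are assumptions for illustration:

```python
import subprocess

def dispatch(action: dict) -> str:
    """Route one agent action to the GUI driver or a system-level tool."""
    if action["tool"] == "gui_click":
        x, y = action["point"]
        return f"clicked ({x}, {y})"     # a real driver would inject the event
    if action["tool"] == "terminal":
        result = subprocess.run(action["cmd"], shell=True,
                                capture_output=True, text=True)
        return result.stdout             # terminal output flows back to the agent
    if action["tool"] == "read_file":
        with open(action["path"]) as f:
            return f.read()              # direct file access, bypassing the GUI
    raise ValueError(f"unknown tool: {action['tool']}")

print(dispatch({"tool": "terminal", "cmd": "echo hello"}))
```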

Finally, the team developed a unified sandbox platform that manages heterogeneous environments, such as cloud VMs and browser game sandboxes, behind standardized APIs, supporting large-scale training and evaluation across millions of interactions.
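Behind "standardized APIs" one can imagine a small common interface that every backend implements; the method set below is an assumption drawn from the description above, not the platform's real interface:

```python
from abc import ABC, abstractmethod

class Sandbox(ABC):
    """One interface for heterogeneous backends: cloud VMs, browser games, etc."""

    @abstractmethod
    def reset(self) -> bytes:
        """Start (or restore) the environment; return an initial screenshot."""

    @abstractmethod
    def step(self, action: dict) -> bytes:
        """Apply one action; return the new screenshot."""

    @abstractmethod
    def close(self) -> None:
        """Release the underlying VM or browser session."""

class BrowserGameSandbox(Sandbox):
    def reset(self) -> bytes:
        return b"<png bytes>"            # would launch the game page here

    def step(self, action: dict) -> bytes:
        return b"<png bytes>"            # would inject the input event here

    def close(self) -> None:
        pass                             # would tear the session down here
```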

With a 532M-parameter vision encoder and an MoE LLM with 23B activated parameters, UI-TARS-2 shows across-the-board improvements in multiple scenarios.

Its average standardized score across a suite of 15 games is 59.77, approaching human level, and on LMGame-Bench it is comparable to frontier models such as OpenAI o3.