A post-2000s LLM intern "strips down" the Doubao Phone: a thousand-word hands-on test reveals the truth.

新智元 · 2025-12-10 14:49
Doubao Phone: OS-level AI assistant, cross-APP operation, igniting a new wave of AI phones.

[Introduction] What features are hidden in the wildly popular "Doubao Phone"? In a widely shared post, an LLM engineer uncovered its technical secrets through black-box testing and deductions from published papers.

An AI phone has become a sensation across the internet.

With a single spoken instruction, it can compare prices and place orders across different apps, reply to WeChat messages, book airline tickets, and plan travel routes, all within a few seconds.

Taylor Ogan, a well-known overseas entrepreneur, exclaimed, "This is another DeepSeek moment! This is the world's first real smart phone."

This is, of course, the "Doubao Phone", which has recently been hard to find in stock.

The Bilibili blogger "Six-point Chaochao" was greatly impressed after trying it and praised it as "the most impressive product of this year".

Even more striking, the "Doubao Phone" can keep operating smoothly in the background while the phone is locked.

In the test by "Dianwan Technology AK", the "Doubao Phone" not only easily passed the "big test" on Bilibili but also completed it remarkably quickly:

It answered 1 question in 3 seconds and 100 questions in 5 minutes!

This raises the question: what kind of cutting-edge technology made the "Doubao Phone" a worldwide sensation overnight?

While browsing Xiaohongshu, we came across a very interesting post titled "I didn't reverse-engineer the 'Doubao Phone', but I want to say something."

Original Xiaohongshu post link: http://xhslink.com/o/93GCQttMFgO

Updated blog link: https://www.notion.so/GUI-Agent-2c17a860b5e680e3b6e4efece19d1457

A Popular Post Deciphers the "Doubao Phone" from an Engineering Perspective

The author of the post, "Xiaoshi", is currently an intern engineer working on large models, and he shared his impressions from a purely academic perspective.

Based on hands-on black-box testing and reasoning from arXiv papers, he offered a reasonably rigorous explanation from an engineering perspective.

Right from the start, he hit the core of the "Doubao Phone":

This is not just an app. ByteDance has built an OS-level shadow system at the Android Framework layer.

Next, the blogger shared his insights from the following seven aspects.

1. Two Modes: System 1 (Intuition) vs. System 2 (Reasoning)

ByteDance has split the Agent into two stacks: one is the standard mode, and the other is the Pro mode.

This is not just a difference in model size but two completely different pipelines, similar to System 1 and System 2 in human cognition.

Here, the author set a "trap" in the test:

Display a full-screen screenshot of the JD.com homepage and give Doubao the instruction "Click the search button". Because the screen shows only a static picture, the search button in it is not a real control.

Standard Mode (Fast): Naive Simulation

It mainly relies on a shallow vision-language model (VLM) and responds extremely quickly, with a perceived latency of less than 500 ms.

He speculated that it might use a distilled version of Doubao-1.5-UI-TARS, with a short prompt and compressed input/output tokens to reduce latency.

Its drawback is a typical "intuitive" reaction: it naively clicks the button shown in the picture.

Pro Mode (Slow but Robust): Deep Reasoning + Tool Invocation

In the same test, the Pro mode showed a clear "pause + thinking" process: it refused to click and suggested switching to a browser.

He speculated that this mode might follow the full-version route of Doubao-1.5-UI-TARS and has undergone more post-training alignment.

It also suggests that a Planner has stepped in and is capable of self-reflection.

Moreover, complex multi-hop retrieval and direct invocation of System APIs can be observed only in the Pro mode.

Additional information: According to our latest understanding, the Doubao Phone Assistant uses the closed-source version of UI-TARS 2.0, which performs significantly better than the open-source version and is specially optimized for mobile phone usage scenarios.

2. Hybrid Perception Router

The interference of environmental noise is the core challenge for the current implementation of Agents.

Dynamic routing between XML and vision is the most direct solution Doubao provides, in both the standard and Pro versions of UI-TARS.

On the homepages of Gaode/Baidu Maps, which are crowded with complex icons and road conditions, the blogger asked Doubao to "click the construction icon next to the darkest red and most congested road section".

This tests the execution of a complex instruction in an OpenGL-rendered interface.

To our delight, the AI elegantly completed this task.

In such scenarios, the Android "accessibility tree" is often empty or only contains a SurfaceView container without any child node information.

Since the tree offers nothing to click on, completing the task confirms the existence of a visual route: the VLM is capable of pixel-level "open-vocabulary positioning".

It truly understood "dark red, next to, construction icon", which includes complex information such as color semantics, spatial relationships, and object detection.

Therefore, he speculated that this might constitute a "dynamic routing" selection: standard UIs are handled through XML, while non-standard UIs are handled through vision (screenshot-based, but power-hungry).
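The speculated routing decision can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `UiNode`, `count_labeled_nodes`, `choose_route`, and the threshold of three labeled nodes are hypothetical names and values, not anything taken from Doubao or UI-TARS.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UiNode:
    """Hypothetical, simplified stand-in for one accessibility-tree node."""
    class_name: str
    text: str = ""
    children: List["UiNode"] = field(default_factory=list)


def count_labeled_nodes(node: UiNode) -> int:
    """Count nodes in the subtree that carry non-empty text."""
    own = 1 if node.text.strip() else 0
    return own + sum(count_labeled_nodes(child) for child in node.children)


def choose_route(root: Optional[UiNode], min_labeled: int = 3) -> str:
    """Pick "xml" when the tree looks informative, otherwise "vision".

    The threshold is an illustrative assumption, not a measured value.
    """
    if root is None or count_labeled_nodes(root) < min_labeled:
        return "vision"  # e.g. a bare SurfaceView on a map screen
    return "xml"
```

Under this sketch, a map screen whose tree holds only a `SurfaceView` container falls through to the vision route, while an ordinary page with several labeled widgets stays on the cheaper XML route.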

3. OS-level Virtualization: Parallel Runtime

Many users who have tried it firsthand will know this well:

You can let Doubao compare prices and shop while you keep browsing videos and answering calls without any interruption.

The Agent can run long-running tasks in the background without being interrupted, even when the phone switches to other apps.

The blogger speculated that the Agent might run on a "shadow screen", achieving "input isolation": you can make a call on the physical screen while the Agent runs on the logical screen.

This "dual parallel universe" structure completely solves the pain point of the Agent seizing the foreground and causing the phone to freeze.

4. Heuristic Engineering: Prompting "Wait"

After each operation, regardless of how quickly the current page renders, the Agent inserts a fixed delay of 1000 ms to 5000 ms, imposed through the system prompt.

This design is similar to the "waiting polling" in Cursor CLI.

From an engineering perspective, this approach is to counter the common asynchronous loading/skeleton screens in apps, sacrificing time for "success rate", which is a compromise but effective.
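A minimal Python sketch of this "act, then always wait" heuristic follows. It assumes the delay is enforced in an agent loop; the article only says the delay is introduced via the system prompt, so `clamp_delay_ms` and `act_then_wait` are hypothetical helpers, and only the 1000 to 5000 ms bounds come from the blogger's observation.

```python
import time
from typing import Callable


def clamp_delay_ms(requested_ms: int, low: int = 1000, high: int = 5000) -> int:
    """Keep the post-action delay inside the observed 1000-5000 ms window."""
    return max(low, min(high, requested_ms))


def act_then_wait(
    action: Callable[[], None],
    delay_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run one UI action, then wait a fixed time regardless of render speed.

    Returns the delay actually applied, in milliseconds.
    """
    action()
    waited_ms = clamp_delay_ms(delay_ms)
    sleep(waited_ms / 1000)  # fixed pause; no check of the page state
    return waited_ms
```

The `sleep` parameter is injectable only so the behavior can be checked without real waiting. The trade-off matches the one described above: a fast page still costs at least one second, but a skeleton screen is less likely to be read as the final content.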

5. "Physical Isolation" in Privacy Design: Activity Hierarchy

Turning to the privacy issue that concerns most people: some worry that the Doubao Agent might record the screen 24/7 for monitoring. However, the blogger's test found the following:

The visual pipeline is filtered.

If Doubao were really using the VLM to analyze the screen, the phone would probably be too hot to use.

He enabled the Picture-in-Picture mode on Bilibili and then let the Agent operate on the main screen. When he took a screenshot in the middle, he found that the AI only captured the interface of the main app and did not capture the floating window at all.

This proves that it does not read the output stream of the physical screen but targets and captures based on the "activity hierarchy". That is to say, at the physical level, Doubao isolates video calls and the secure keyboards of financial apps, which is a carefully designed security feature.

The blogger believes that the code logic of the Doubao Phone Assistant is a safe and reliable design, which includes isolation mechanisms, circuit-breaker ("fuse") strategies, and local processing.

Although the code can be transparent, what about the people who write and manage the code? This concern is understandable.

However, this problem is inherently difficult to solve completely. In the blogger's view, if the Agent can solve 80% of daily trivial matters on one's behalf, it is acceptable to hand over desensitized data that does not involve core privacy.

6. Memory and Tool Usage: Speculations about the MCP Protocol

In the Pro mode, data invocation is accurate.

Tool Invocation Architecture

In the test, when the blogger gave the vague instruction "What are the mathematical characteristics of the verification code", the Agent did not run brute-force OCR on the full screen. Instead, the Client sent a request to the Server, and the authorization part of the overall system might form a RAG-MCP.

List Memory (Sliding Window)

When scrolling through a long list (ListView), the Agent's behavior is very similar to that of the E2E testing framework Playwright: scroll the screen → perform DOM Diff → extract incremental information → splice.

This method solves the problem of cross-screen context.
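The scroll, diff, extract, and splice loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the Agent's actual code: `read_visible` and `scroll` are hypothetical callbacks supplied by the caller, `collect_list` is a made-up name, and list items are assumed to be unique strings so that a set difference can play the role of the "DOM Diff".

```python
from typing import Callable, List, Set


def collect_list(
    read_visible: Callable[[], List[str]],
    scroll: Callable[[], None],
    max_scrolls: int = 20,
) -> List[str]:
    """Merge the items seen across successive screens of a long list."""
    seen: Set[str] = set()
    merged: List[str] = []
    for _ in range(max_scrolls + 1):
        added = 0
        for item in read_visible():
            if item not in seen:  # diff against what is already remembered
                seen.add(item)
                merged.append(item)  # splice the increment onto the result
                added += 1
        if added == 0:  # nothing new: assume the end of the list was reached
            break
        scroll()
    return merged
```

Stopping when a screen contributes nothing new is a simple end-of-list signal; a real list with repeated rows would need a position-based key rather than the item text.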

7. Resilience

In the last test, the blogger asked the Agent to read the latest email in Outlook, but it failed.

At this time, instead of reporting an error and exiting, the Agent automatically downgraded to read the second email, tried to extract the preview information of the first email on the list page, and then made a combined report.

This shows that its planner focuses on the "task goal" rather than the specified operation sequence. This ability of dynamic planning is what reasoning should be about.
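A goal-oriented fallback of this kind can be sketched in a few lines of Python. The function `run_with_fallbacks` and its report format are hypothetical and written only to illustrate the behavior observed in the Outlook test; how the real planner chooses its alternatives is not known from the post.

```python
from typing import Callable, List


def run_with_fallbacks(
    primary: Callable[[], str],
    fallbacks: List[Callable[[], str]],
) -> List[str]:
    """Try the primary step; if it fails, gather what the fallbacks can recover.

    Returns a combined report instead of aborting on the first error.
    """
    try:
        return [primary()]
    except Exception as exc:
        report = [f"primary step failed: {exc}"]
    for fallback in fallbacks:
        try:
            report.append(fallback())
        except Exception as exc:
            report.append(f"fallback failed: {exc}")
    return report
```

In the Outlook scenario, the primary step would be opening the latest email, and the fallbacks would be reading the second email and extracting the first email's preview from the list page.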

After the experience, the blogger shared his honest impression: "It really made me feel that 'reasoning' has stepped out of the papers."

"When I saw the Agent think for a moment after failing in Outlook and then turn to read the email list preview, it was a wonderful feeling.

It is no longer a simple script that mechanically executes click(x, y) but begins to show a certain degree of resilience."

He said that for researchers, this phone is more like a SOTA-level demo from the industrial world. It is not perfect, but it really works.

All in all, the "Doubao Phone" has made many compromises in terms of speed, but from an architectural perspective, it may be the most reliable solution for mobile phones at present.

From the blogger's analysis, we got a key glimpse into the engineering implementation behind the "Doubao Phone".

When we dug deeper into ByteDance's open-source repositories, we found that the GUI operation ability of the Doubao Phone Assistant has been opened to the industry through the open-source version of the UI-TARS model.

Open-source link: https://github.com/bytedance/UI-TARS

To put it simply, UI-TARS integrates screen visual understanding, logical reasoning, interface element positioning, and operations into one model.

It can perform various complex operations such as collecting information, processing documents, booking tickets, and comparing prices, and can even think and act in games.

It is worth mentioning that UI-TARS has been updated at an extremely fast pace, with three iterations this year alone:

  • January 2025, the first-generation UI-TARS;
  • April 2025, UI-TARS-1.5;
  • September 2025, UI-TARS-2.