The AI phone has reached a watershed: between Doubao Phone and Qianwen, Gemini has chosen a third path.
At the Samsung Galaxy S26 launch event held at the end of last month, Samsung and Google officially announced that the Gemini-based Screen Automation feature would make its debut on the Galaxy S26.
In simple terms, Gemini can directly operate apps on the phone screen: it opens apps, reads the screen, taps and swipes, and enters text, completing a whole chain of UI operations while leaving the final confirmation step to the user.
Image source: Samsung
Yes, it sounds just like the Doubao Phone Assistant on the Nubia M153 (commonly known as the "Doubao Phone" in the market). Both can perform "proxy" operations on the phone on behalf of humans, fulfilling needs such as ordering takeout, hailing a cab, and online shopping with just one sentence.
Judging from the feedback from overseas media and forums, this feature has finally been launched in the recent beta update.
However, we also found that Google didn't simply copy the Doubao Phone Assistant's approach. Although the technical implementation is likewise based on a GUI agent, Gemini opens a local virtual sandbox on Android. Meanwhile, Google has deliberately limited the first batch of apps Gemini can operate to a small whitelist.
This approach is obviously different from that of domestic manufacturers. Comparing it with ByteDance's Doubao Phone Assistant and Alibaba's Qianwen, Google has chosen a path that seems both radical and conservative.
Let AI operate the system, not take over the phone
Just looking at the surface of the function, Gemini's "Screen Automation" can easily be understood as another version of the "Doubao Phone Assistant." It can also order takeout, hail a cab, and place orders for you, seemingly acting like an AI agent that can operate the phone on behalf of humans.
But if you dig deeper, you'll find that Google's solution is actually completely different.
The logic of the Doubao Phone Assistant is very simple: the AI reads the screen pixels, recognizes buttons and input boxes like human eyes, and then simulates finger clicks. The biggest advantage of this method is its universality - in theory, it can operate any app because the AI only sees the screen.
Gemini is obviously more "conservative." When actually performing tasks, Gemini doesn't directly operate apps on your phone's desktop. Instead, it opens a local virtual sandbox window in the Android system and lets the AI run the target app in this environment.
The whole process is visible. Users can terminate the task at any time or take over the operation at any step.
Image source: Android Central
In terms of product positioning, Gemini's "Screen Automation" is not a universal agent that can freely control the phone, but an automation feature strictly constrained by the system.
Google has also deliberately limited the first batch of apps that support automation. Currently the feature mainly covers ride-hailing and food ordering, supporting only Lyft, Uber, Grubhub, DoorDash, Uber Eats, and Starbucks.
Google has also limited the user scope. Currently, besides the Samsung Galaxy S26 series in the beta version, Google has only planned support for the Pixel 10 series. Meanwhile, free Gemini users get 5 uses per day, Plus members 12, Pro members 20, and Ultra members 120.
This restraint stems partly from compute costs and partly from users' wariness of AI "messing with the phone," especially in the European and American markets. So Google has implemented permission isolation, requires users to confirm key steps manually, and lets users interrupt the AI's operations in real time.
But ultimately, this is just a transitional stage. Google's ambition is definitely not limited to making Gemini only able to operate a few specific apps.
Image source: Google
Many people notice Gemini's GUI operation ability but overlook what's happening at the Android system level.
Just before the Samsung Galaxy S26 series launch event, Google published a blog post titled "Intelligent Operating System: Making AI Agents More Helpful for Android Apps" and launched a new app-capability interface, AppFunctions, which lets apps declare to the system the functions that AI can call.
For example, a takeout app can tell the system that it supports functions such as searching for restaurants, adding items, and submitting orders. When the user tells Gemini, "Help me order a pizza," the AI doesn't necessarily need to click through the interface step by step. It can directly call these functions to complete the task.
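The declare-then-call mechanism can be illustrated with a small sketch. Note that the real AppFunctions API is an Android (Kotlin) interface; the registry below and names like `search_restaurants` and `submit_order` are hypothetical stand-ins for the functions a takeout app might declare:

```python
# Illustrative sketch of an AppFunctions-style registry (hypothetical names).
# An app declares callable functions to the system; the agent then invokes
# them directly instead of tapping through the app's UI.

class FunctionRegistry:
    """System-side registry of functions that apps have declared to the agent."""
    def __init__(self):
        self._functions = {}  # (app, function name) -> callable

    def declare(self, app, name, fn):
        self._functions[(app, name)] = fn

    def call(self, app, name, **kwargs):
        fn = self._functions.get((app, name))
        if fn is None:
            raise KeyError(f"{app} has not declared {name}")
        return fn(**kwargs)

registry = FunctionRegistry()

# A food-delivery app declares its capabilities to the system.
registry.declare("pizza_app", "search_restaurants",
                 lambda query: [f"{query} Place"])
registry.declare("pizza_app", "submit_order",
                 lambda restaurant, item: f"ordered {item} from {restaurant}")

# The agent fulfils "help me order a pizza" via direct calls, no UI clicks.
restaurants = registry.call("pizza_app", "search_restaurants", query="pizza")
receipt = registry.call("pizza_app", "submit_order",
                        restaurant=restaurants[0], item="pizza")
print(receipt)  # ordered pizza from pizza Place
```

The key design point is that the app, not the AI, decides which capabilities are exposed, which keeps the agent inside boundaries the developer has consented to.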
If you understand this mechanism as AI's "function call," things become very clear. In Google's design, the AI agent actually has two paths to perform tasks. One is to directly call app capabilities through system interfaces, and the other is to perform GUI automation by recognizing the screen interface.
The former is more efficient and stable; the latter exists for compatibility with apps that haven't adopted the new interfaces.
This means that Gemini's future device automation ability is essentially not just "AI operating the phone by looking at the screen," but a hybrid architecture of system APIs and GUI.
Application example of AppFunctions. Image source: Lei Technology
This difference may sound a bit technical, but the product logic behind it is actually very simple. Compared with the Doubao Phone Assistant, which makes AI use the phone like a human, what Google wants to do is make AI schedule apps like a system.
When the AI only reads the screen pixels, it always stands outside the system and can only imitate human operation logic. But once the AI is integrated into the operating system, it can directly coordinate the capabilities between apps.
From this perspective, the real goal of Gemini Screen Automation may not be scenarios like ordering takeout and hailing a cab. What Google really wants to establish is a new Android operating logic and ecosystem. From this, we can also understand to some extent why Google wants to join hands with Qualcomm to promote "Android computers" (non-Chromebooks).
It also explains why Google's Gemini solution seems both radical and conservative.
The radical part is that it tries to turn AI into the scheduling center of Android. The conservative part is that Google doesn't plan to let AI take over the entire phone at will. Instead, it promotes this change step by step through system interfaces, permission control, and app whitelists.
Compared with the imagination of a "universal AI agent," this path is obviously slower and more restrained. But for an operating system with billions of devices, Google may not have much room for radical trial and error.
Doubao goes left, Qianwen goes right, and Gemini takes the middle path
Compared with Google's approach on mobile phones, the Doubao Phone Assistant, which was unveiled at the end of last year, chose the simplest and most radical way: making AI use the phone like a human.
In this solution, the AI reads the screen pixels, recognizes buttons, input boxes, and page structures, and then simulates finger clicks to complete operations. Whether it's ordering takeout, comparing prices for shopping, or placing an order for payment, the AI performs step by step on the phone interface.
The biggest advantage of this method is its universality. Since the AI only sees the screen, it doesn't need any app interfaces or platform authorization. In theory, as long as an app can be operated by a human, the AI can perform the same operations.
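The perceive-and-act loop described above can be sketched as a toy simulation. Everything here is a stand-in, not Doubao's actual implementation: a real agent captures screen pixels and asks a vision-language model which element to tap, while this sketch uses a dictionary as the "screen" and keyword matching as the "model":

```python
# Toy sketch of a GUI-agent loop: screenshot -> recognize elements -> act.
# All components are simulated stand-ins for illustration only.

def take_screenshot(screen):
    # A real agent captures pixels; here the "screen" is just a dict
    # mapping element IDs to their visible labels.
    return dict(screen)

def choose_action(goal, elements):
    # A real agent queries a vision-language model; this stand-in picks
    # the first element whose label appears in the goal sentence.
    for element_id, label in elements.items():
        if label.lower() in goal.lower():
            return ("tap", element_id)
    return ("stop", None)

def run_agent(goal, screen, max_steps=5):
    actions = []
    for _ in range(max_steps):
        elements = take_screenshot(screen)
        action, target = choose_action(goal, elements)
        if action == "stop":
            break
        actions.append((action, target))
        screen.pop(target)  # simulate the UI changing after the tap
    return actions

screen = {"btn_search": "Search", "btn_order": "Order", "btn_pay": "Pay"}
steps = run_agent("search for pizza and order it", screen)
print(steps)  # [('tap', 'btn_search'), ('tap', 'btn_order')]
```

The loop structure is what matters: because the agent only ever sees the current screen, it works on any app a human could operate, which is exactly the universality the text describes.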
This is why many people feel that the Doubao Phone Assistant is like a "real AI phone" when they experience it for the first time.
Image source: Doubao
But the problems are also obvious. When the AI can read the entire screen and operate all apps, permission and security issues are inevitable. At the same time, many Internet platforms don't welcome this kind of automated behavior because it bypasses the platform's own entry and recommendation systems.
In simple terms, Doubao's approach is very straightforward technically, but it will naturally cause friction with the app ecosystem.
In contrast, Alibaba's Qianwen takes another approach, leveraging Alibaba's own service ecosystem to make AI a scheduling center. In this system, a user's single sentence will be broken down into specific tasks, and then services such as Taobao, Alipay, Gaode, and Fliggy will be called respectively to complete them.
For example, searching for products, placing an order for payment, and planning a route all directly call real business capabilities instead of simulating interface operations. Since all operations occur within the ecosystem, the AI doesn't need to bypass app permissions and won't trigger platform risk control. Also, because it directly calls service interfaces, the execution efficiency is often higher.
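The scheduling pattern can be sketched as a two-stage pipeline: decompose the user's sentence into tasks, then dispatch each task to the matching service. The service names, operations, and keyword-based planner below are illustrative stand-ins, not Qianwen's actual architecture (a real system would use an LLM planner and genuine service APIs):

```python
# Sketch of ecosystem scheduling: one request decomposed into service calls.
# Services and operations are hypothetical stand-ins for illustration.

SERVICES = {
    "shopping": lambda op, **kw: f"shopping.{op}({kw.get('query', '')})",
    "payment":  lambda op, **kw: f"payment.{op}()",
    "maps":     lambda op, **kw: f"maps.{op}({kw.get('to', '')})",
}

def decompose(request):
    # A real system uses an LLM planner; this stand-in maps keywords to tasks.
    tasks = []
    if "buy" in request:
        tasks.append(("shopping", "search", {"query": "sneakers"}))
        tasks.append(("payment", "pay", {}))
    if "route" in request:
        tasks.append(("maps", "plan_route", {"to": "office"}))
    return tasks

def execute(request):
    # Each task calls a real business capability directly -- no UI simulation.
    return [SERVICES[svc](op, **args) for svc, op, args in decompose(request)]

print(execute("buy sneakers and plan a route"))
# ['shopping.search(sneakers)', 'payment.pay()', 'maps.plan_route(office)']
```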
Image source: Lei Technology
But the problem is equally clear: the ecosystem boundary. The services Qianwen can schedule are essentially still Alibaba's own apps. Once a user's needs involve other platforms, its capabilities fall off sharply.
From this perspective, Doubao and Qianwen actually represent two very typical AI agent paths. The former tries to make AI take over the phone itself, pursuing universal capabilities. The latter integrates the ecosystem to make AI take over the service process, pursuing business depth.
And Google's Gemini stands between the two to some extent. At the current stage, Gemini still retains the GUI automation ability, which means it can also operate apps by recognizing the interface like Doubao when necessary. But at the same time, Google has introduced new app capability interfaces in the Android system, allowing apps to actively open the functions that can be called by AI to the system.
If an app supports these interfaces, Gemini doesn't need to click through the interface step by step. Instead, it can directly call the app's capabilities to complete the task. In other words, Google's solution is actually a hybrid path:
Give priority to system interfaces and use GUI automation as a backup.
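That fallback logic can be sketched in a few lines. The function names and the notion of a "declared" set are hypothetical illustrations, not Gemini's actual interfaces:

```python
# Sketch of the hybrid dispatch: try a declared app function first, and
# fall back to GUI automation when the app exposes no such function.
# All names are illustrative stand-ins, not Gemini's real interfaces.

DECLARED = {("doordash", "submit_order")}  # apps that adopted the interface

def call_app_function(app, fn, **kwargs):
    # Direct capability call via a system interface: fast and stable.
    return f"api:{app}.{fn}"

def gui_automation(app, fn, **kwargs):
    # Screen-based automation: slower, but works on any app.
    return f"gui:{app}.{fn}"

def perform(app, fn, **kwargs):
    if (app, fn) in DECLARED:
        return call_app_function(app, fn, **kwargs)  # preferred path
    return gui_automation(app, fn, **kwargs)         # universal fallback

print(perform("doordash", "submit_order"))    # api:doordash.submit_order
print(perform("legacy_app", "submit_order"))  # gui:legacy_app.submit_order
```

As more apps declare their functions, traffic naturally shifts from the fragile GUI path to the stable API path without the agent losing coverage in the meantime.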
In the short term, this approach is obviously not as amazing as Doubao's and can't quickly integrate a mature ecosystem like Qianwen's. But its advantage is that it avoids direct conflicts with the app ecosystem and retains sufficient universality.
Conclusion
Taking a broader view, it's not difficult to understand why the three paths have diverged like this.
ByteDance doesn't have an operating system or a local life ecosystem, so it can only let AI directly take over the phone. Alibaba has a huge service system, so it lets AI schedule its own business network. And what Google truly has is the Android operating system, which covers billions of devices.
Therefore, Gemini's goal from the start was not to be a more powerful phone assistant, but to integrate AI into the system, gradually transforming Android from a "platform for running apps" into an "intelligent system for scheduling apps." From this perspective, Gemini's restraint is not conservatism but more like an inevitable choice for a platform-level company.
This article is from "Lei Technology" and is published by 36Kr with permission.