What is the ultimate outcome for AI phones: "screen reading" or "conversation"?
Recently, two AI showcases have gone viral in tech circles, one after the other.
Across the ocean, on January 12th local time, Apple and Google announced a partnership to integrate Google's Gemini into Siri. Notably, Apple's approach is not to let Gemini operate apps on the phone directly. Instead, when a user speaks, Siri first understands the intent and then calls the corresponding applications. In other words, the AI acts merely as a "dispatcher." This approach is very much in Apple's style.
Back in China, the situation is much livelier. ByteDance's Doubao AI phone went viral: the AI can hail a taxi, shop, and book tickets for you, like a genuine "universal assistant." This approach is very much in the Internet-company style.
You see, both are AI phones, yet the implementations are completely different. Behind them lie two distinct technological routes:
One route is to enable AI and apps to "communicate" with each other and directly call the application capabilities through standard interfaces, which is called A2A (Agent-to-Agent). This route requires everyone to sit down and formulate rules together. It progresses slowly but is more reliable.
The other route is to give AI a "universal key," allowing it to "read the screen" and simulate operations on apps through system permissions, which is called GUI (Graphical User Interface). This route is simple and direct, progressing quickly, but there may be risks.
Behind this is not just a technological choice. In essence, it is a bet by different companies on the future dominance based on their own interests and ecological niches. Which model can win users may well determine how we interact with our devices in the next decade.
Two Solutions, Two Logics
To understand this game, we first need to clarify the logic behind these two routes.
The GUI route focuses on "speed."
Its implementation method initially involved AI assistants leveraging a feature called "Accessibility Service" in the Android system. This permission was originally designed for visually impaired people, enabling them to operate their phones through voice commands. Now, AI can "read" the text and icons on the screen through this permission and then simulate human finger taps and swipes to operate various apps. Soon after, a more "advanced" route emerged in the market. That is, AI assistants obtained the system signature permission provided by phone manufacturers, enabling smoother and more seamless simulated operations through process injection.
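The screen-reading loop described above can be sketched in a few lines of Python. Every function here is a hypothetical stand-in for platform capabilities such as Android's AccessibilityService; none of these names are real APIs.

```python
# Illustrative sketch of a GUI-agent step: read the screen, pick a
# target, simulate a tap. All functions are hypothetical stand-ins for
# platform capabilities (e.g. Android's AccessibilityService).

def read_screen():
    """Stand-in for screen capture + recognition: returns visible UI elements."""
    return [
        {"text": "Confirm order", "x": 540, "y": 1700},
        {"text": "Cancel", "x": 540, "y": 1850},
    ]

def plan_next_tap(elements, goal):
    """Stand-in for the model deciding which element matches the goal."""
    for el in elements:
        if goal.lower() in el["text"].lower():
            return (el["x"], el["y"])
    return None  # target not found, e.g. after a UI redesign

def simulate_tap(x, y):
    """Stand-in for injecting a tap via accessibility permissions."""
    return f"tapped ({x}, {y})"

# One step of the loop: look, decide, act.
elements = read_screen()
target = plan_next_tap(elements, "confirm order")
result = simulate_tap(*target) if target else "task stuck"
print(result)  # → tapped (540, 1700)
```

The fragility discussed below falls out of this structure: if an app update moves or renames the "Confirm order" button, `plan_next_tap` returns nothing and the whole task stalls.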
The advantages of this approach are obvious - it bypasses all app manufacturers and directly integrates AI capabilities into the existing application ecosystem. For manufacturers eager to gain a foothold in the AI wave, this is the fastest verification path.
"When users get used to operating all apps through an AI assistant, this assistant becomes a new traffic entry point. The commercial value behind this is quite attractive," said Lin Liang, an investor who focuses on Internet companies.
However, for users, the experience of using GUI at this stage may be "inconsistent."
"GUI highly depends on the stability of the application interface," said Chen Gang, an application developer. "If an app updates its interface design, for example, if the position of a button changes, it may cause the AI to 'tap the wrong' position, and the entire task process may get stuck."
Chen Gang pointed out that when the task chain becomes longer, this instability will be amplified. Data shows that for an operation consisting of five steps, even if the success rate of each step is as high as 90%, the final success rate of the entire task may drop sharply to 59%.
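The arithmetic behind that figure is simple compounding: per-step reliability is multiplied across every step in the chain.

```python
# Compounding failure across a multi-step GUI task: even at 90%
# per-step reliability, a five-step chain succeeds only ~59% of the time.
per_step = 0.90
steps = 5
overall = per_step ** steps
print(f"{overall:.0%}")  # → 59%
```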
In addition to the uncertainty in the user experience, many users are worried about security and privacy risks. In the GUI mode, AI needs to "read the screen" to understand the screen content and then decide the next operation, which means it needs to obtain real-time screen information. Although manufacturers promise that the data will be encrypted or not uploaded, users inevitably have concerns: under what circumstances is the user's data collected, how is it used, and who is responsible?
A2A, on the other hand, is a completely different concept. Instead of letting AI "look" at the screen, it establishes a set of universal "communication languages" - standard API interfaces - between AI and various applications.
This may sound a bit abstract. You can imagine this scenario: You say to your phone, "Help me hail a taxi to the airport." After the system agent understands your request, it directly tells the corresponding agent, "The user wants to go to the airport. Please provide taxi-hailing services." After receiving the request, the agent of the travel app completes the task within its own permission scope.
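The taxi scenario above can be sketched as two cooperating agents passing a structured message over a defined interface. The message schema, agent names, and matching logic here are illustrative assumptions, not any published A2A standard.

```python
# Hedged sketch of the A2A exchange: the system agent parses the user's
# request into a structured intent and hands it to the travel app's
# agent. Schema and names are illustrative assumptions only.

def system_agent(user_utterance):
    """Stand-in for the on-device model: maps speech to an intent."""
    text = user_utterance.lower()
    if "taxi" in text and "airport" in text:
        return {"intent": "hail_taxi", "destination": "airport"}
    return {"intent": "unknown"}

def travel_app_agent(intent):
    """Stand-in for the travel app's agent: acts within its own scope."""
    if intent["intent"] == "hail_taxi":
        return f"car booked to {intent['destination']}"
    return "cannot handle this intent"

request = system_agent("Help me hail a taxi to the airport")
print(travel_app_agent(request))  # → car booked to airport
```

The key point is that the AI never sees or touches the travel app's screen; it only exchanges structured data through an interface both sides agreed on.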
The core of A2A is "cooperation." There is a key design in this route called "dual authorization": obtaining authorization from both the user and the application provider.
In this way, the rights and responsibilities become clear. Users can set different permission levels for different apps. For example, they can allow AI to read the food delivery app for price comparison but prohibit it from reading the banking app; for high-risk operations such as transfers, additional confirmation from the user is required each time. Since data flow occurs through clear interfaces and can be traced, even if a problem occurs, it can be investigated.
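The permission model described above can be sketched as a small policy check: every call must clear both the user's per-app setting and the app provider's consent, with high-risk actions forcing a fresh confirmation. The policy table and action names are illustrative assumptions.

```python
# Minimal sketch of "dual authorization": both the user's per-app
# policy and the app provider's consent gate each call, and high-risk
# actions require explicit per-use confirmation. Illustrative only.

USER_POLICY = {
    "food_delivery": "read",  # AI may read prices for comparison
    "banking": "deny",        # AI may not touch this app at all
}
HIGH_RISK = {"transfer"}      # always needs a fresh user confirmation

def authorize(app, action, app_consents, user_confirmed=False):
    if not app_consents:                  # app provider must opt in
        return "denied: app has not granted access"
    level = USER_POLICY.get(app, "deny")  # user-side permission
    if level == "deny":
        return "denied: user policy"
    if action in HIGH_RISK and not user_confirmed:
        return "pending: ask user to confirm"
    return "allowed"

print(authorize("food_delivery", "read_prices", app_consents=True))
# → allowed
print(authorize("banking", "read_balance", app_consents=True))
# → denied: user policy
print(authorize("food_delivery", "transfer", app_consents=True))
# → pending: ask user to confirm
```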
Then why don't all manufacturers choose A2A?
Because the coordination cost is very high. A2A requires operating system manufacturers and application developers to jointly promote a set of standardized protocols. Without sufficient application support, the value of A2A cannot be realized; without a clear value, developers lack the motivation to adapt.
Therefore, the A2A route is destined to be a "long-term battle." It is "slow" in achieving ecological consensus and building infrastructure.
Now the logic of the two routes is clear: GUI carries certain risks but is efficient, letting manufacturers verify the feasibility of AI phones quickly and at minimal cost. A2A is reliable but slow, requiring more coordination and investment; once it succeeds, however, it establishes a more secure system.
Some people may ask, can't the GUI route achieve hierarchical authorization through technical means? In theory, it can, but doing so would also lose the "rapid deployment" advantage compared to A2A, while also incurring higher technical costs.
The broadly recognized view in the industry today is that the GUI route is useful for exploration, because it fully exposes both the convenience and the risks of intelligent agents, but the endgame still rests on A2A, because only an approach that is both secure and convenient is viable in the long run. Looking beyond the Chinese market, how are global tech giants choosing?
The Calculations of the Giants Behind Different Routes
On the mobile side, almost all overseas giants have chosen A2A and are promoting API integration.
Apple is the most straightforward. It has upgraded the "App Intents" framework, requiring all applications that want to access AI functions to provide API interfaces according to the standards set by Apple.
Google's path is more complex. On the one hand, it is promoting the "AppFunctions API" to standardize the interaction of intelligent agents. On the other hand, it is vigorously promoting the adaptation of various applications, which is a slow process.
Microsoft has developed its own multi-agent dialogue framework called "AutoGen" to explore how different AI intelligent agents can better collaborate.
Although OpenAI and Anthropic do not directly produce mobile phones, the "function calling" and "tool use" functions they promote are actually the technological predecessors of A2A. According to the data released by Anthropic, the number of active MCP services has increased from more than 2,000 in March 2025 to more than 10,000 in December - this growth rate is quite astonishing.
Why do operating system hegemons like Apple and Google, as well as AI leaders like Microsoft and OpenAI, all choose the slow interface route?
Because they are the established players and the biggest beneficiaries of the current order.
The core interests of Apple and Google are to maintain the platform and stabilize developers. Simply using GUI without the authorization of all three parties, this "plug-in" route, in essence, challenges their dominance. Therefore, they will inevitably choose the "controllable" A2A solution, firmly grasping AI capabilities in their own hands as a new tool to strengthen ecological control.
Microsoft holds two aces, Windows and Office. The core of its AI strategy is to improve productivity and serve enterprise customers. For these customers, security and stability are the top priorities, and they cannot accept the uncertainty and security risks of GUI.
As an "arms dealer" in AI technology, OpenAI aims to have its models "called" by as many applications as possible. Therefore, it must provide stable and reliable API interfaces rather than GUI tools with uncertain results.
Have overseas giants completely abandoned GUI? Not exactly.
Google's Gemini and Microsoft's Copilot have launched a "screen sharing" function on mobile phones - allowing users to share their screens with AI. AI can "look" at the screen and answer questions, but it will not perform operations itself.
The GUI attempts of overseas giants are mainly on the PC side, and they are strictly limited to controlled environments (such as browsers, sandboxes, and virtual machines).
OpenAI has restricted agents with GUI operation capabilities to the Atlas browser, clearly prohibiting them from running code, downloading files, or accessing local applications. Anthropic released the Computer Use API at the end of 2024, but the related functions are still only available for developers to test in virtual environments.
Microsoft's approach is the most representative. After its Recall function caused privacy disputes due to high-frequency screen capture, it directly separated the two actions of "looking" and "doing" - Copilot Vision can only "look" at the applications shared by users and provide suggestions, but cannot perform operations; Copilot Actions, which has operation capabilities, must be carried out in a separate sandbox desktop.
Therefore, considering "maintaining the existing order," overseas giants firmly adhere to the A2A route. Their GUI attempts remain at the "beta version" and have not been widely promoted to ordinary users.
In contrast, the market landscape in China is more complex. Among the giants, there are both "challengers" and "incumbents," so the choices are more diverse.
ByteDance is taking the high-permission GUI route. Through in-depth cooperation between its Doubao large model and ZTE Nubia, it has launched an "AI phone" integrated with a system-level AI assistant, hoping to bypass the existing ecological barriers and compete for the next-generation traffic entry point.
Alibaba, Huawei, and OPPO have all laid out the A2A route.
Alibaba's actions are very straightforward. Through its self-built and controllable API system, it deeply integrates Tongyi Qianwen, its super brain, into core businesses such as Taobao, Alipay, and Gaode.
In HarmonyOS 6 released at the end of 2025, Huawei achieved A2A collaboration between its Xiaoyi intelligent agent and more than a dozen Hongmeng native applications through the "intention framework."
OPPO has also joined hands with leading applications such as Alipay to jointly explore industry standards for A2A.
However, behind these seemingly similar choices are their respective business considerations.
For Alibaba, this approach is "both offensive and defensive." On the one hand, as China's leading e-commerce platform, its core interest is to protect its huge transaction ecosystem with controllable APIs. On the other hand, it doesn't stop at defense. Instead, it creates an entry point through Tongyi Qianwen, enabling users to complete more transactions and services within the Alibaba ecosystem.
Of course, Huawei and OPPO don't want to be just hardware manufacturers and risk being "pipelined." Therefore, on top of the A2A route, they are also taking a "hybrid ecosystem" route centered around their own operating systems or large AI models. In this system, there are both standard API calls and more underlying system-level intelligent agents. The ultimate goal is to gain control of the ecosystem and upgrade from a "device provider" to one of the "rule-makers" of the future ecosystem.
In short, most domestic and foreign manufacturers have chosen A2A. The difference is that overseas giants use it to strengthen their existing control, while domestic manufacturers use it to gain a voice. They are not only participating in the standard setting of A2A but also establishing a hybrid ecosystem centered around themselves through their own OS, large models, or ecological advantages.
Why Do Mainstream Manufacturers Prefer A2A?
The choices are determined by the different positions of players in the game. However, from the choices of these mainstream manufacturers, we can draw a conclusion: Although the GUI route can quickly verify the feasibility of AI phones, A2A is gaining more and more favor from mainstream manufacturers.
Is it because A2A is safer and more stable? Not entirely. The reason it is regarded as the future can be viewed from three dimensions: technological evolution, regulatory compliance, and business cost.
From the technological perspective, A2A is more in line with the essence of AI division of labor and cooperation.
The GUI route requires large models to simultaneously undertake the tasks of "perceiving the screen (eyes), planning tasks (brain), and simulating operations (hands)," which is burdensome, inefficient, and error-prone. The A2A route allows AI to return to its most proficient "brain" role, focusing on understanding and task scheduling, while the specific execution is entrusted to application intelligent agents optimized in each vertical field. This "division of labor" model is not only more efficient and reliable but also lays the foundation for more complex intelligent agent cooperation in the future.
From the regulatory perspective, A2A is a safer and more compliant choice.
The "screen reading" behavior of GUI is facing increasingly strict privacy regulations globally. In December 2025, the state of Texas in the United States sued several smart TV manufacturers, including Samsung, accusing them of illegally collecting user data through high-frequency screen capture. This has sounded the alarm for all manufacturers using similar technologies.
As for A2A, since data flow occurs through clear interfaces and is guaranteed by the "dual authorization" mechanism, it establishes a compliance "firewall" for manufacturers.
Finally, and most importantly, from the business cost perspective, A2A is a more economical choice. Although the GUI solution seems "fast," its long-term operating cost is high.
Chen Gang made an analogy:
The GUI model is like hiring a security guard who needs to stare at the monitoring screen 24 hours a day, constantly looking at and analyzing the images. This consumes a large amount of "mental power" (cloud computing resources).
The A2A model is like establishing an efficient internal communication system. When a certain department's cooperation is needed, a simple structured instruction can be sent. This only consumes "communication fees" (API call fees).
For mobile phone manufacturers, if hundreds of millions of users use AI to read the screen every day, the computing power and bandwidth expenses will be a huge cost. This business model is almost unsustainable in the context of large-scale commercialization.
Therefore, from the perspectives of technology, regulation, and business cost, A2A is a better choice. More importantly, once this ecosystem is established, it will bring new business opportunities. This is also the reason that most excites industry insiders.
First, the protocol layer and middleware will become the core. In the PC era, there was Windows; in the mobile Internet era, there were iOS and Android. In the AI era, protocol standards such as A2A and MCP are like the "operating systems" and "development languages" of the new era. Whoever can master the standards may become the next platform-level giant.
Second, "intelligent agent factories" and vertical agent service providers will experience explosive growth. Based on standard protocols, developing exclusive intelligent