Goodbye, GUI: an "LLM-friendly" computer-use interface from a Chinese Academy of Sciences team is here.
Large-model agents are supposed to operate your computer for you automatically. The ideal is rosy; the reality is harsh.
Almost all existing LLM agents suffer from two core pain points:
- Low success rate: on even moderately complex tasks, the agent frequently derails, getting stuck at some step with no idea what to do next.
- Low efficiency: finishing even a simple task takes dozens of rounds of back-and-forth with the system, which is slow and frustrating.
Where exactly does the problem lie? Is it that the current large models are not smart enough?
The latest research from a team at the Institute of Software, Chinese Academy of Sciences gives an unexpected answer: the real bottleneck is the graphical user interface (GUI) we have been using, and have grown thoroughly familiar with, for more than 40 years.
Turning the "imperative" GUI into a "declarative" one
Yes, the very GUI that took off in the 1980s and transformed human-computer interaction. It has always been tailored for humans, and its design philosophy runs directly counter to how LLMs actually work.
The team identifies the GUI's core problem: an application's functions cannot be accessed directly; they can only be reached through navigation and interaction.
For example, functional controls are hidden behind layers of menus, tabs, and dialog boxes; to reach a control, you must click through menus, drop-downs, and so on until it appears on screen. Moreover, many controls (such as scroll bars and text selection) require repeated adjustment and feedback checking, forming a high-frequency "observe-operate" loop.
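In agent terms, that loop looks roughly like the sketch below: a minimal, generic GUI-agent loop, not the paper's system. The capture_screen, llm_next_action, and execute callables are hypothetical placeholders; the point is that every screen change costs another model call, which is where the dozens of rounds come from.

```python
from typing import Any, Callable

def run_gui_agent(
    task: str,
    capture_screen: Callable[[], Any],            # observe: grab the current screen state
    llm_next_action: Callable[[str, Any], dict],  # one LLM call per step
    execute: Callable[[dict], None],              # operate: click / type / scroll
    max_steps: int = 50,
) -> bool:
    """Generic imperative "observe-operate" loop of a GUI agent (illustrative only)."""
    for _ in range(max_steps):
        observation = capture_screen()
        action = llm_next_action(task, observation)
        if action.get("kind") == "done":
            return True                           # the model believes the task is finished
        execute(action)
    return False                                  # step budget exhausted; the task failed
```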
The team argues, pointedly, that the GUI's imperative design rests on four key assumptions about human users:
- Good eyesight: humans excel at visual recognition and can quickly locate buttons, icons, and menus.
- Fast actions: humans can run the "observe-operate" loop quickly and effortlessly.
- Small working memory: humans can hold only a little in short-term memory, so interfaces should stay simple and show few options at a time.
- Reluctance to memorize: learning and recalling precise rules (such as programming-language syntax) is costly for humans, but they are good at "multiple-choice questions".
These assumptions, however, are completely mismatched with what LLMs can do:
- LLMs have poor eyesight: their visual ability is limited, and accurately identifying on-screen information is hard for them.
- LLMs react slowly: a single inference can take seconds or even minutes, so every extra round is costly.
- LLMs have excellent memory: a large context window lets them handle huge amounts of information, and a long list of options is no problem at all.
- LLMs are formatting experts: emitting precise, structured instructions is exactly their strength.
As a result, when driving a GUI, the LLM is forced to play both the "brain" (policy) and the "hands" (mechanism). It must plan the task semantically while also handling cumbersome, unfamiliar low-level operations, which is inefficient, imposes a heavy cognitive burden, and invites errors.
This "imperative" interaction method is like taking a taxi to a place, but you cannot directly tell the driver the destination. Instead, you must instruct him step by step on how to drive: "Turn left 200 meters ahead, then go straight for 50 meters, and turn right at the traffic light...". Once you say a wrong step or the driver misunderstands, all previous efforts will be wasted. This is exactly the dilemma that current LLM agents are facing.
So, is it possible that when LLM "takes a taxi", it only needs to tell the final destination, and the remaining route planning and specific driving operations can be automatically completed by an "experienced driver"?
This is the core idea of this research: Convert the interface from "imperative" to "declarative". To this end, based on the accessibility mechanism of the GUI and the operating system, the research team proposed a new abstraction - the declarative interface (GOI).
The essence of GOI is "policy-mechanism separation":
- Policy: what to do, i.e., the high-level semantic planning and sequencing of functions. For the task "set the background of all slides to blue", the functions "blue" and "apply to all" must be invoked in order. This is what the LLM is good at.
- Mechanism: how to do it, i.e., the low-level navigation and interaction: "click the 'Design' tab -> click 'Format Background' -> click 'Solid Fill' -> ...", or dragging a scroll bar back and forth until the view lands in the right place. This is what the LLM is bad at, but it can be automated.
GOI takes over the cumbersome, error-prone "mechanism" side and exposes to the LLM just three simple, direct declarative primitives: access, state, and observation.
The LLM no longer has to issue every micro-operation like a nervous novice driver. It acts more like a commander in full control: it issues high-level instructions through GOI, such as "access 'blue' and 'apply to all'" or "set the scroll bar to 80%", and GOI carries out all the intermediate GUI navigation and interaction.
The LLM is thus freed from the quagmire of the GUI and can focus on what it does best: semantic understanding and task planning. Crucially, none of this requires modifying the application's source code or relying on application-provided APIs.
How does GOI decouple "policy" from "mechanism"?
The implementation of GOI is divided into two stages: offline modeling and online execution.
Step 1: offline "map drawing". In the offline stage, GOI automatically explores the accessible controls of the target application (such as Word), compares which interface elements appear and disappear around each click, and builds a complete "UI Navigation Graph".
But challenges follow: complex applications are full of cycles and "merged nodes" (the same control is reachable along multiple paths), and different paths may trigger different functions of the same control.
GOI's trick is a set of algorithms for cycle removal and cost-based "selective externalization" that convert this tangled graph into a "forest" with clear, unambiguous paths. Whichever function the LLM wants to access, there is exactly one well-defined path to it.
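The article does not spell out these algorithms, but the basic goal, giving every control one canonical path, can be illustrated with a cheapest-path spanning forest. The sketch below is an assumption-laden illustration, not the paper's actual cycle-removal or selective-externalization procedure; the node IDs, cost model, and data layout are made up.

```python
import heapq

def navigation_forest(roots, edges):
    """Illustrative sketch (not the paper's algorithm): keep one cheapest parent per control.

    roots: top-level entry points (e.g., ribbon tabs), as string IDs.
    edges: {node: [(child, cost), ...]} meaning "interacting with node reveals child".
    Returns {node: parent}: a forest in which every reachable control has a unique
    path back to a root, which removes cycles and disambiguates merged nodes.
    """
    parent = {r: None for r in roots}
    best = {r: 0 for r in roots}        # cheapest known cost to reach each node
    heap = [(0, r) for r in roots]
    heapq.heapify(heap)
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue                    # stale queue entry
        for child, step_cost in edges.get(node, []):
            new_cost = cost + step_cost
            if child not in best or new_cost < best[child]:
                best[child] = new_cost
                parent[child] = node    # unique parent: the cheapest way in
                heapq.heappush(heap, (new_cost, child))
    return parent
```

With such a forest in hand, navigating to any control online reduces to replaying the unique parent chain from a root down to that control.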
Step 2: online execution. When a task runs, the LLM no longer has to emit a fine-grained sequence of GUI navigation and interaction steps.
Instead, GOI supplies a compressed, text-based "map" that fits comfortably in the LLM's context window. To execute a task, the LLM simply calls GOI's three declarative primitives:
- Access: through the visit interface, declare the ID of the target function control; GOI computes the path and performs the navigation automatically.
- State: through interfaces such as set_scrollbar_pos(), select_lines(), or select_controls(), declare the final state the control should reach, e.g., put the scroll bar at the 80% position directly instead of simulating a drag.
- Observation: through interfaces such as get_texts(), read a control's structured information directly, with no pixel-level screen recognition required.
These interfaces do not depend on any application-specific API; they are implemented directly on top of the general accessibility layer of the GUI and the operating system.
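Put together, using the three primitives for the earlier "all slides to blue" example might look like the sketch below. The primitive names (visit, set_scrollbar_pos, select_lines, get_texts) come from the article; the Protocol wrapper, the control-ID scheme, and the exact signatures are assumptions for illustration only.

```python
from typing import Protocol, Sequence

class GOI(Protocol):
    """Hypothetical typed view of GOI's three declarative primitives."""
    def visit(self, control_id: str) -> None: ...                               # Access
    def set_scrollbar_pos(self, control_id: str, fraction: float) -> None: ...  # State
    def select_lines(self, control_id: str, start: int, end: int) -> None: ...  # State
    def get_texts(self, control_id: str) -> Sequence[str]: ...                  # Observation

def set_all_backgrounds_blue(goi: GOI) -> None:
    """The slide example, expressed purely as declarative calls."""
    goi.visit("design.format_background.solid_fill.blue")  # GOI clicks through the menus itself
    goi.visit("design.format_background.apply_to_all")

def read_slide_titles_near_end(goi: GOI) -> Sequence[str]:
    goi.set_scrollbar_pos("slide_panel.vscroll", 0.8)      # jump straight to 80%, no dragging
    return goi.get_texts("slide_panel.titles")             # structured text, no pixel parsing
```

A conventional GUI agent would need a long screenshot-click-observe transcript to achieve the same effect; here the whole intent fits in a handful of declarative calls.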
Experimental results: From "mechanistic" errors to "strategic" errors
To verify GOI's real capability, the team ran a comprehensive evaluation on the OSWorld-W benchmark, which covers Word, Excel, and PowerPoint.
The results show an overwhelming improvement: under the core setting with the GPT-5 reasoning model, the success rate jumped from 44% to 74%.
Moreover, in more than 61% of the successful tasks, the agent fulfilled the user's core intent in one shot, with a single LLM call.
What's more interesting is the failure analysis.
For the GUI baseline, 53.3% of failures stem from mechanistic errors: mislocating or misidentifying controls from visual information, planning the navigation wrongly, or interacting with controls incorrectly. That is like a person failing because they don't know the way.
With GOI, 81% of failures are concentrated at the strategic level: misunderstanding the task semantically, misreading image content, or misjudging what a control does.
In other words, GOI frees the LLM from the cumbersome mechanism layer and largely removes mechanistic causes of failure. The model makes far fewer "silly mistakes" and stands or falls on its own semantic understanding; when it fails now, it is because it picked the wrong destination, not because it didn't know the way.
The team says GOI points to a clear direction for designing interaction paradigms better suited to large models.
The work not only offers a way to improve existing agents' performance but also raises a question:
Should future operating systems and applications natively provide such an "LLM-friendly" declarative interface, paving the way for more capable and general AI agents?
Paper address: https://arxiv.org/abs/2510.04607
This article is from the WeChat official account "QbitAI". Author: the Institute of Software, Chinese Academy of Sciences team. Republished by 36Kr with permission.