The University of Hong Kong teams up with Moonshot AI and others to open-source OpenCUA: anyone can build their own customized computer-use agent.
A paper from the XLANG Lab at the University of Hong Kong, Moonshot AI, and several other institutions has just been posted to arXiv. It proposes a fully open-source framework for building and scaling CUAs (computer-use agents). Specifically, the framework includes:
- An annotation tool for capturing human demonstrations of computer use
- AgentNet, the first large-scale dataset covering 3 operating systems and over 200 applications and websites
- A workflow for converting demonstrations into state-action pairs with long chain-of-thought reasoning
Using this framework, they also built a flagship model, OpenCUA-32B, which achieved a success rate of 34.8% on OSWorld-Verified, setting a new open-source SOTA and even surpassing GPT-4o on this benchmark.
Even better, they have fully released the relevant code, data, and models!
Paper title: OpenCUA: Open Foundations for Computer-Use Agents
Paper link: https://arxiv.org/abs/2508.09123
Project page: https://opencua.xlang.ai/ (including tools, models, and datasets)
Notably, this work has six co-first authors. The project lead is Tao Yu, an assistant professor of computer science at the University of Hong Kong. Zhilin Yang, founder and CEO of Moonshot AI, and Diyi Yang, an assistant professor of computer science at Stanford University, are also on the author list.
Now, let's take a detailed look at this research.
The OpenCUA Framework
The following figure shows an overview of the OpenCUA framework.
Specifically, the OpenCUA framework includes the following components. The AgentNet Tool (top-left) captures cross-operating-system user interactions as screen videos and action traces. The raw demonstrations are then processed into state-action trajectories with reasoning and history (top-right). The AgentNet dataset and benchmarks (bottom-right) cover diverse tasks and provide offline evaluation against gold-standard actions. Finally, after training, the OpenCUA models (bottom-left) can perform computer-use tasks in real-world environments.
AgentNet Data Collection
The goal of OpenCUA is to scale desktop computer-use data across different computing environments and user scenarios. Naturally, the first step was to collect demonstrations that reflect natural user behavior while imposing as few restrictions as possible on how users interact with their computers, so that data collection can scale.
To this end, they developed the AgentNet Tool and used it to collect the AgentNet dataset, the first large-scale desktop agent task dataset.
AgentNet Tool
The AgentNet Tool is a cross-platform annotation application that records user interactions on Windows, macOS, and Ubuntu. It captures screen video, mouse/keyboard actions, and relevant metadata, enabling the collection of real computer-use demonstrations at scale.
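As a rough illustration of what such recording involves, here is a minimal sketch of demonstration capture, assuming the pynput library for input hooks; the event schema, output file, and fixed recording window are illustrative assumptions, not the actual AgentNet Tool implementation.

```python
# Minimal sketch of demonstration capture (hypothetical, not AgentNet Tool code).
# Assumes pynput for global mouse/keyboard hooks; screen frames would be
# recorded in parallel (e.g. as video) and referenced by timestamp.
import json
import time

from pynput import keyboard, mouse

events = []  # raw, timestamped interaction log


def log(kind, **payload):
    events.append({"t": time.time(), "kind": kind, **payload})


def on_click(x, y, button, pressed):
    log("click", x=x, y=y, button=str(button), pressed=pressed)


def on_scroll(x, y, dx, dy):
    log("scroll", x=x, y=y, dx=dx, dy=dy)


def on_press(key):
    log("key", key=str(key))


# The listeners run in background threads while the user performs the task.
with mouse.Listener(on_click=on_click, on_scroll=on_scroll), \
     keyboard.Listener(on_press=on_press):
    time.sleep(10)  # record a short demonstration for illustration

with open("demo_events.json", "w") as f:
    json.dump(events, f, indent=2)
```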
Annotation and verification of the AgentNet Tool
The team processed the raw user demonstrations into clean, trainable state-action trajectories. The resulting trajectories contain inner-monologue-style reasoning and action history, making them suitable for training vision-language models.
The raw demonstrations include high-frequency screen recordings and fine-grained interaction signals (mouse movements, clicks, scrolling, key presses, and so on). A typical task can produce tens of thousands of low-level action records, far too dense to train on efficiently. To solve this problem, the team proposed two techniques:
1. Action Reduction
This is a rule-based method developed by the team that reduces dense action signals to fewer, more meaningful operations while retaining the necessary information:
- Atomic operations are compressed into high-level operations;
- Mouse movements are treated as preconditions of clicks/drags, and only the start and end positions are kept;
- Scroll events are merged by direction, with the number of wheel clicks accumulated;
- Consecutive key presses are merged into text-input strings, and shortcut combinations (such as Ctrl+C) are abstracted into hotkey actions;
- Common multi-step gestures (such as dragging and double-clicking) are likewise consolidated into a single action.
The reduced action sequence is aligned with the pyautogui action space (see Table 1 for details); a simplified sketch of this reduction is shown below.
Table 1: Human operations and corresponding agent action functions
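To make these rules concrete, here is a simplified sketch of how raw events might be collapsed into pyautogui-style actions. The event schema and merging heuristics are assumptions made for illustration, not the paper's actual reduction code.

```python
# Simplified, rule-based action reduction (illustrative sketch).
# Each raw event is a dict like {"kind": "click", "x": ..., "y": ...},
# {"kind": "scroll", "dy": ...}, or {"kind": "key", "key": ..., "modifiers": [...]}.

def reduce_actions(raw_events):
    """Collapse dense raw events into a short list of pyautogui-style actions."""
    actions = []
    i = 0
    while i < len(raw_events):
        ev = raw_events[i]
        if ev["kind"] == "click":
            # Mouse movement before a click is only a precondition: keep the
            # click coordinates and drop the intermediate move events.
            actions.append(f'pyautogui.click({ev["x"]}, {ev["y"]})')
            i += 1
        elif ev["kind"] == "scroll":
            # Merge consecutive scrolls in the same direction and accumulate
            # the number of wheel clicks.
            total = ev["dy"]
            while (i + 1 < len(raw_events)
                   and raw_events[i + 1]["kind"] == "scroll"
                   and (raw_events[i + 1]["dy"] > 0) == (total > 0)):
                i += 1
                total += raw_events[i]["dy"]
            actions.append(f"pyautogui.scroll({total})")
            i += 1
        elif ev["kind"] == "key":
            if ev.get("modifiers"):
                # Modifier combinations (e.g. Ctrl+C) become one hotkey action.
                combo = ", ".join(repr(k) for k in ev["modifiers"] + [ev["key"]])
                actions.append(f"pyautogui.hotkey({combo})")
                i += 1
            else:
                # Consecutive plain key presses become one text-typing action.
                text = ""
                while (i < len(raw_events)
                       and raw_events[i]["kind"] == "key"
                       and not raw_events[i].get("modifiers")):
                    text += raw_events[i]["key"]
                    i += 1
                actions.append(f"pyautogui.write({text!r})")
        else:
            i += 1  # drop bare mouse moves and other low-level noise
    return actions
```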
2. State-Action Matching
To pair each action a_i with a representative state s_i, the team extracted key frames from the screen recordings to capture the system state before the action occurred. However, if the key frames are directly aligned with the mouse click timestamps, future information may be leaked (for example, the mouse is already hovering over a button, making the prediction too easy).
To avoid this problem, for mouse clicks they trace back to before the mouse starts moving toward the target and then search forward for the last frame showing an obvious visual change, using it as the action's starting state. After the task is completed, a terminal frame and a corresponding "end" action are appended.
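The key-frame selection might look roughly like the sketch below. The frame-difference threshold and function interfaces are assumptions made for this example, not the team's actual implementation.

```python
# Illustrative key-frame selection for state-action matching (assumed logic).
import numpy as np


def frame_changed(prev_frame, next_frame, threshold=0.02):
    """True if the mean pixel difference between two frames exceeds a threshold."""
    diff = np.abs(prev_frame.astype(np.float32) - next_frame.astype(np.float32))
    return diff.mean() / 255.0 > threshold


def select_state_frame(frames, frame_times, move_start_time):
    """Pick the observation frame for a click action.

    Only frames captured before the mouse started moving toward the target are
    considered, so the chosen state does not leak the hovering cursor position;
    among them, the last frame that follows an obvious visual change is used.
    """
    candidates = [i for i, t in enumerate(frame_times) if t < move_start_time]
    if not candidates:
        return 0
    chosen = candidates[0]
    for i in candidates[1:]:
        if frame_changed(frames[i - 1], frames[i]):
            chosen = i
    return chosen
```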
AgentNet Dataset and Test Benchmarks
In the end, they obtained the AgentNet dataset and the AgentNetBench benchmark.
The dataset covers diverse open-domain tasks from over 140 applications and more than 190 websites. The tasks involve multi-application workflows, professional tool operation, and the use of less common features. The benchmark provides task instructions, step history, and multiple gold-standard actions for each step, enabling efficient offline evaluation.
Figure 4: Domain distribution of tasks in the AgentNet dataset
The dataset contains a total of 22,625 manually annotated computer-use tasks, of which about 12,000 are from Windows, 5,000 from macOS, and 5,000 from Ubuntu, with screen resolutions ranging from 720p to 4K. The average trajectory length is 18.6 steps, reflecting the complexity of the tasks themselves.
As shown in Table 2, compared with existing GUI datasets, AgentNet is the first desktop-level trajectory dataset that combines authenticity, complexity, diversity, and multimodality.
Table 2: Comparison between the AgentNet dataset and existing GUI datasets
To enable evaluation that is stable, fast, and independent of environment configuration, they also built AgentNetBench, an offline evaluation benchmark for computer-use agents.
This benchmark consists of 100 representative tasks selected from the AgentNet dataset, covering the Windows and macOS platforms, and the task content spans multiple application domains.
According to the team, each task was manually reviewed to clarify its goal and remove redundant operations. Notably, since computer-use tasks naturally admit multiple reasonable action paths, they also manually provided multiple valid action options for each step, making the evaluation more flexible and realistic.
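To illustrate how step-level scoring against multiple gold-standard actions could work, here is a minimal sketch. The matching rules (a pixel tolerance for clicks, exact string match for typed text and hotkeys) are assumptions for the example, not AgentNetBench's exact scoring code.

```python
# Illustrative step-level scoring with multiple gold-standard actions.

def action_matches(pred, gold, pixel_tol=30):
    """Check whether a predicted action matches one gold-standard action."""
    if pred["type"] != gold["type"]:
        return False
    if pred["type"] == "click":
        dx, dy = pred["x"] - gold["x"], pred["y"] - gold["y"]
        return (dx * dx + dy * dy) ** 0.5 <= pixel_tol
    if pred["type"] in ("write", "hotkey"):
        return pred["value"] == gold["value"]
    return True  # e.g. terminate: matching the type is enough here


def step_score(pred, gold_options):
    """A step counts as correct if the prediction matches ANY valid gold action."""
    return any(action_matches(pred, g) for g in gold_options)


# Example: a click within 30 px of a valid target counts as correct,
# even though saving via a hotkey would also have been accepted.
pred = {"type": "click", "x": 412, "y": 306}
golds = [{"type": "click", "x": 420, "y": 300},
         {"type": "hotkey", "value": "ctrl+s"}]
print(step_score(pred, golds))  # True
```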
The OpenCUA Model
Building on this dataset, the team developed the OpenCUA agent models, which combine reflective chain-of-thought reasoning, multi-image history, and cross-domain data. The models can perform computer-use tasks in real desktop environments across multiple operating systems.
Notably, they also designed a novel processing pipeline to augment each task step with a reflective long chain of thought (reflective long CoT): a "generator" and a "reflector" iteratively produce and verify each reasoning component linking the observation to the ground-truth action.
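Conceptually, that generator-reflector loop can be sketched as follows; `generator` and `reflector` stand in for model calls, and their interfaces and the retry budget are assumptions rather than the paper's actual pipeline code.

```python
# Illustrative generator-reflector loop for synthesizing reflective long CoT.

def synthesize_reflective_cot(observation, history, gold_action,
                              generator, reflector, max_rounds=3):
    """Iteratively draft and verify the reasoning for one task step."""
    cot = generator(observation=observation, history=history,
                    target_action=gold_action)
    for _ in range(max_rounds):
        # The reflector checks that the reasoning is grounded in the screenshot
        # and history and actually justifies the ground-truth action.
        feedback = reflector(observation=observation, history=history,
                             target_action=gold_action, reasoning=cot)
        if feedback["consistent"]:
            break
        # Otherwise, regenerate with the reflector's critique folded in.
        cot = generator(observation=observation, history=history,
                        target_action=gold_action,
                        critique=feedback["critique"])
    return cot
```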
Experimental Results and Analysis
The experiments were based on several open-source vision-language models: KimiVL-A3B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen2.5-VL-32B-Instruct.
Among them, KimiVL-A3B uses a Mixture-of-Experts (MoE) architecture with 16B total parameters, of which 3B are activated during training and inference; it already has some computer-operation capabilities, such as object localization and task planning.
Qwen2-VL and Qwen2.5-VL are general-purpose vision-language models (VLMs). Qwen2.5-VL performs better on digital-agent tasks, especially in high-resolution settings.
The team performed supervised fine-tuning on these models to obtain multiple OpenCUA variants: OpenCUA-A3B, OpenCUA-Qwen2-7B, OpenCUA-7B, and OpenCUA-32B.
They then evaluated these models on several benchmarks, covering online agent evaluation, offline agent evaluation, and GUI grounding evaluation.
Online Agent Evaluation
- OSWorld-Verified: OSWorld originally collected 369 manually constructed tasks covering a large number of applications, together with corresponding environment configurations and evaluation scripts. The OSWorld team recently verified these tasks, fixed those that could not be tested due to expired dependencies, evaluation errors, or unclear instructions, and released the improved benchmark as OSWorld-Verified. The evaluation results were obtained through the public evaluation platform deployed by the OSWorld team on AWS infrastructure and are listed in Table 3.
- WindowsAgentArena (WAA): This benchmark includes 154 Windows-centric tasks, covering native Windows applications as well as several open-source programs that also appear in OSWorld, and it effectively reflects agents' online performance on Windows.
Table 3: Evaluation results of OSWorld - Verified
The results show that OpenCUA-32B achieved the best performance among all open-source models, with an average success rate of 34.8%, well ahead of previous baselines. It also significantly narrows the gap with closed-source agents and even surpasses OpenAI CUA. This result demonstrates the scalability and performance advantages of the OpenCUA training recipe.
Offline Agent Evaluation
The offline evaluation used AgentNetBench, the team's offline CUA benchmark described above, which contains 100 representative tasks covering multiple domains on Windows and macOS. The results are shown in the table below.
Table 4: Performance of each CUA on AgentNetBench
OpenCUA-32B has the best overall performance, although OpenAI CUA holds a clear advantage in the success rate of function actions.
GUI Grounding Evaluation
The team also evaluated the models' ability to ground natural language instructions to specific operations in a graphical user interface (GUI). Three benchmarks were used: OSWorld-G, ScreenSpot-V2, and ScreenSpot-Pro.
OSWorld-G includes 564 samples that systematically cover text matching, UI element recognition, layout understanding, and fine-grained manipulation, and it annotates the types of UI elements required to solve each task. ScreenSpot-V2 includes screenshots from mobile, desktop, and web, evaluating GUI understanding in cross-platform scenarios. ScreenSpot-Pro focuses on high-resolution desktop environments, with particular emphasis on professional applications.