Google joins the CUA battlefield and releases Gemini 2.5 Computer Use: Enabling AI to directly operate browsers

MachineHeart (机器之心), 2025-10-08 15:00
A rival to OpenAI's Computer-Using Agent (CUA)?

Google's Computer Use model is here!

Early this morning, Google DeepMind announced Gemini 2.5 Computer Use, a computer use model built on Gemini 2.5.

Given that Google released Chrome DevTools MCP just a few days ago, the arrival of Gemini 2.5 Computer Use is not particularly surprising. Simply put, like OpenAI's Computer-Using Agent (CUA), this DeepMind model lets AI control the user's browser directly. Drawing on its visual understanding and reasoning abilities, the model can perform actions such as clicking, scrolling, and typing in the browser on the user's behalf.

Let's first take a look at two official demonstrations.

Prompt: From https://tinyurl.com/pet-care-signup, get all details for any pet with a California residency and add them as a guest in my spa CRM at https://pet-luxe-spa.web.app/. Then, set up a follow up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment.

Prompt: My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to sticky-note-jam.web.app and ensure notes are clearly in the right sections. Drag them there if not.

As the demonstrations show, whether the task is collecting information online and acting on it or organizing messy notes, Gemini 2.5 Computer Use completes it accurately and quickly.

On relevant benchmarks, Gemini 2.5 Computer Use also reaches state-of-the-art (SOTA) performance:

It is also faster than several comparable models:

Developers can currently access these capabilities through the Gemini API in Google AI Studio and Vertex AI. Users can also try the model in a demonstration environment hosted by Browserbase (sessions are limited to five minutes, and users cannot take over midway): https://gemini.browserbase.com/

MachineHeart made several attempts in this demonstration environment. Overall, Gemini 2.5 Computer Use is highly accurate on simple tasks but prone to failure on slightly more complex ones.

For example, the model succeeded at a simple task such as "finding the John Wick page on Wikipedia".

However, as soon as the task became a bit more complex, the model failed. One example was "finding the John Wick page on Wikipedia, summarizing its information, and providing a Chinese version." Tasks such as "opening the official website of the Nobel Prize and providing the schedule for this year's Nobel Prize announcements" and the following task were also not completed successfully.

Prompt: Browse jiqizhixin.com, find reports about Gemini in the past six months, organize them into a Markdown file, and summarize them.

In addition, DeepMind has released the model card for Gemini 2.5 Computer Use:

https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Computer-Use-Model-Card.pdf

How Gemini 2.5 Computer Use Works

The model's core capabilities are exposed through the newly added computer_use tool in the Gemini API, which developers need to run in a loop.

Its input should include:

The user's request;

A screenshot of the current environment;

The history of recently executed actions.

The input can also specify functions to exclude from the UI actions supported by default, as well as additional custom functions.
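As a rough illustration of the inputs described above, the sketch below bundles them into a single request object. It does not use the real Gemini SDK: `ComputerUseRequest`, `DEFAULT_ACTIONS`, and every field and action name are hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field

# Hypothetical default UI actions; the real tool defines its own set.
DEFAULT_ACTIONS = ("click", "scroll", "type_text", "drag_and_drop")


@dataclass
class ComputerUseRequest:
    """One turn of input for a computer-use style model (illustrative only)."""
    user_request: str                                    # the user's request
    screenshot_png: bytes                                # screenshot of the current environment
    action_history: list = field(default_factory=list)  # recently executed actions
    excluded_actions: tuple = ()                         # default actions to switch off
    custom_functions: tuple = ()                         # extra developer-defined functions

    def allowed_actions(self) -> list:
        """Default actions minus exclusions, followed by custom functions."""
        kept = [a for a in DEFAULT_ACTIONS if a not in self.excluded_actions]
        return kept + list(self.custom_functions)


request = ComputerUseRequest(
    user_request="Find the John Wick page on Wikipedia",
    screenshot_png=b"<png bytes>",
    excluded_actions=("drag_and_drop",),
    custom_functions=("open_app",),
)
print(request.allowed_actions())
```

In this sketch, excluding `drag_and_drop` and adding a custom `open_app` function leaves the model with four callable actions for the turn.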

The Workflow of the Gemini 2.5 Computer Use Model

After analyzing these inputs, the model generates a response, usually a function call representing a UI action (such as clicking or typing). For some actions (such as making a purchase), the model also requests user confirmation. The client then executes the action.

After the actions are executed, the system will return the latest screenshot and the current URL as a function response to the model, restarting the loop.

This iterative process will continue until the task is completed, an error occurs, or it is terminated due to a security mechanism or the user's decision.
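The loop described above can be sketched roughly as follows. This is a minimal simulation, not Google's implementation: `FakeBrowser`, `scripted_model`, `run_agent`, the `Action` type, and the `"done"` sentinel are all hypothetical, and a real client would call the Gemini API and drive an actual browser instead.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str   # e.g. "navigate", "type_text", or the sentinel "done"
    args: dict


class FakeBrowser:
    """Stand-in for a real browser driver; records actions instead of executing them."""
    def __init__(self):
        self.url = "about:blank"
        self.log = []

    def execute(self, action):
        self.log.append(action.name)
        if action.name == "navigate":
            self.url = action.args["url"]

    def screenshot(self):
        return f"<screenshot of {self.url}>".encode()


def scripted_model(script):
    """Return a fake model that replays a fixed list of actions, then says 'done'."""
    steps = iter(script)

    def model(user_request, screenshot, url, history):
        return next(steps, Action("done", {}))

    return model


def run_agent(user_request, model, browser, max_steps=10):
    history = []
    for _ in range(max_steps):
        # 1. Send the request, latest screenshot, current URL, and history to the model.
        action = model(user_request, browser.screenshot(), browser.url, history)
        # 2. Stop when the model signals that the task is complete.
        if action.name == "done":
            return "completed", history
        # 3. The client executes the proposed UI action.
        try:
            browser.execute(action)
        except Exception as exc:
            return f"error: {exc}", history
        # 4. Record the action; the next iteration captures a fresh screenshot.
        history.append(action)
    return "step limit reached", history


browser = FakeBrowser()
model = scripted_model([
    Action("navigate", {"url": "https://en.wikipedia.org"}),
    Action("type_text", {"text": "John Wick"}),
])
status, history = run_agent("Find the John Wick page on Wikipedia", model, browser)
print(status, [a.name for a in history], browser.url)
```

The `max_steps` cap is a design choice of this sketch rather than something the article describes; it simply keeps a misbehaving model from looping forever.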

Google says that the current Gemini 2.5 Computer Use model is mainly optimized for web browsers, but it also shows strong potential in mobile UI control. However, it has not been optimized for desktop operating system-level control.

Security Mechanism Design

Google also described the design of the model's security mechanisms in a blog post.

Google said, "Building agents responsibly is the only way to make AI beneficial to everyone. AI agents that can directly operate computers bring unique risks, including malicious use by users, unexpected model behavior, and prompt injection and fraud in the web environment. Therefore, we attach great importance to security protection in the design."

In the Gemini 2.5 Computer Use model, Google has built security mechanisms directly into the training phase to address three main types of risks (see the model card for details).

In addition, Google also provides developers with security control options to prevent the model from automatically performing potentially high-risk or harmful operations, such as:

Compromising system integrity;

Endangering security;

Bypassing captchas;

Controlling medical devices.

The control measures implemented by Google include:

Per-step Safety Service: During the inference phase, an independent security service evaluates each action that the model intends to perform.

System Instructions: Developers can specify that the agent must either refuse specific high-risk operations or request user confirmation before performing them.
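To make the two controls above more concrete, here is a minimal sketch of a client-side gate that checks each proposed action before it runs. It is an assumption-laden illustration, not Google's safety service: `BLOCKED`, `NEEDS_CONFIRMATION`, `safety_check`, `gate`, and the action names are all hypothetical.

```python
# Hypothetical policy; a real deployment would define its own categories.
BLOCKED = {"bypass_captcha", "control_medical_device"}
NEEDS_CONFIRMATION = {"purchase", "send_payment"}


def safety_check(action_name):
    """Stand-in for an independent per-step safety service."""
    if action_name in BLOCKED:
        return "block"
    if action_name in NEEDS_CONFIRMATION:
        return "confirm"
    return "allow"


def gate(action_name, ask_user):
    """Decide whether the client may execute an action the model proposed."""
    verdict = safety_check(action_name)
    if verdict == "block":
        return False
    if verdict == "confirm":
        return bool(ask_user(action_name))  # run only with explicit user consent
    return True


print(gate("click", ask_user=lambda name: False))           # True
print(gate("purchase", ask_user=lambda name: False))        # False
print(gate("purchase", ask_user=lambda name: True))         # True
print(gate("bypass_captcha", ask_user=lambda name: True))   # False
```

Note that a blocked action stays blocked even if the user would approve it, while a confirmation-only action runs solely when the user explicitly agrees.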

Conclusion

Google DeepMind has made a high-profile entrance with Gemini 2.5 Computer Use, not only demonstrating leading performance on multiple benchmarks but also pushing competition in the field of AI agents to fever pitch.

From OpenAI to Anthropic, and now Google, technology giants are competing to define the future of how we interact with computers. The current model still seems inexperienced when facing complex real-world tasks, but that is typical of a technology just before its breakthrough. What we see today is not just a new model but also a clear signal: the dominance of the keyboard and mouse is being challenged, and an era in which the digital world is driven directly by natural language is fast approaching.

Reference Links

https://blog.google/technology/google-deepmind/gemini-computer-use-model/

https://x.com/GoogleAIStudio/status/1975648565222691279

https://x.com/GoogleDeepMind/status/1975648789911224793

This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014). Author: Someone Concerned about CUA. Republished by 36Kr with authorization.