OpenAI's new agent has been outperformed by a 24-person startup team from China. In hands-on tests, it lost badly on both cost and quality. Overseas users say that Chinese agents hold a generational lead.
Early this morning, OpenAI launched a new feature called ChatGPT Agent. This feature enables its AI assistant to complete multi-step tasks by controlling its own web browser, marking OpenAI's official entry into the field of "agentic AI" — systems that can autonomously take multi-step actions on behalf of users.
The update reportedly combines the capabilities of OpenAI's earlier Operator tool and its Deep Research feature with ChatGPT's conversational strengths. ChatGPT can now browse websites, run code, and create documents, while users keep control over the process. As with Operator, the Agent feature asks for user permission before performing certain operations with real-world consequences, such as making purchases. Users can interrupt a task, take over browser control, or stop the operation entirely at any time. The system also includes a "Watch Mode" for tasks, such as sending emails, that require full user supervision.
When using the Agent, users will see all the operations performed by the AI in its dedicated private sandbox within a window on the ChatGPT interface. The sandbox has its own virtual operating system and a web browser with access to the real internet, but it does not control the user's personal device. According to OpenAI, "ChatGPT uses its own virtual computer to perform these tasks, smoothly switching between reasoning and action to handle complex workflows from start to finish, all based on your instructions."
A still frame from the promotional demo video of ChatGPT Agent, showing the system searching for flights.
OpenAI stated that users can have the Agent handle various needs, such as matching and purchasing a set of clothing for a specific occasion, creating a PowerPoint presentation, planning meals, or updating a financial spreadsheet with new data. The system combines a web browser, terminal access, and application programming interface (API) connections to complete these tasks, including "ChatGPT Connectors" that can integrate with applications like Gmail and GitHub.
OpenAI has announced that ChatGPT Agent is now rolling out to Pro, Plus, and Team users. Enterprise and education users will gain access in the coming weeks. Because the Agent surpasses Operator in functionality, the earlier Operator preview website will remain available for a few more weeks before it is shut down.
Official Evaluation: Achieved State-of-the-Art Performance
In a public evaluation report, OpenAI said that ChatGPT Agent achieved state-of-the-art performance in the company's own benchmark tests. On "Humanity's Last Exam" (a test that evaluates an AI's performance on expert-level questions), the Agent reached an accuracy of 41.6%, compared with 24.9% for OpenAI's o3 model when using tools. On "FrontierMath" (one of the most difficult math benchmarks currently available), the Agent reached an accuracy of 27.4% when using tools, while the o3 model reached 19.3% when using Python.
The company also claimed that ChatGPT Agent outperforms humans in data science tasks such as data analysis and modeling. In the DSBench benchmark test used to measure this ability, the system scored 89.9% in data analysis tasks, compared to 64.1% for humans; in data modeling tasks, it scored 85.5%, compared to 65.0% for humans. Additionally, the system scored 68.9% in OpenAI's BrowseComp test (used to evaluate the ability to find hard-to-locate web information) and 45.5% in the SpreadsheetBench test (used to evaluate spreadsheet editing ability), both higher than other AI models from OpenAI.
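To put these gaps side by side, the reported scores can be compared in percentage points. The short sketch below uses only the figures quoted above; the variable and function names are illustrative and are not taken from OpenAI's report.

```python
# Scores as quoted in the article: (ChatGPT Agent, comparison baseline), in percent.
# The baseline is o3 for the first two rows and human performance for DSBench.
reported_scores = {
    "Humanity's Last Exam (vs. o3 with tools)": (41.6, 24.9),
    "FrontierMath (vs. o3 with Python)": (27.4, 19.3),
    "DSBench data analysis (vs. humans)": (89.9, 64.1),
    "DSBench data modeling (vs. humans)": (85.5, 65.0),
}


def point_gap(agent_score: float, baseline_score: float) -> float:
    """Return the lead of the Agent over the baseline in percentage points."""
    return round(agent_score - baseline_score, 1)


gaps = {name: point_gap(agent, base) for name, (agent, base) in reported_scores.items()}

for name, gap in gaps.items():
    print(f"{name}: +{gap} percentage points")
```

By this measure, the largest reported lead is on DSBench data analysis and the smallest is on FrontierMath. These are differences between self-reported scores, not independent measurements.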
Some users shared the results of using ChatGPT Agent to create a financial analysis report for NVIDIA, saying, "ChatGPT Agent is amazing! It achieved this level in just a few minutes! However, in terms of calculations, it's still quite far behind a newly hired junior investment banking analyst."
It is worth noting that although OpenAI says the Agent can create PowerPoint presentations, the company acknowledged that slide generation is still in the testing phase and that its output may look "relatively basic" in formatting and polish. Some users who tried it said that ChatGPT Agent produced, in just 9 minutes, slides that would be usable after minor edits.
According to a user, OpenAI's Agent mode can also improve the output presentation slides through reinforcement learning. However, "Manus had this function a long time ago."
Actual Results: Obvious Limitations and Blind Spots in Capability
What OpenAI says is one thing, but in reality, the effectiveness of the newly launched ChatGPT Agent in completing multi-step tasks seems to vary greatly depending on the specific situation.
Some users pointed out that ChatGPT Agent performed worse than o3 on PaperBench, SWE-Bench Verified, OpenAI PRs, and OpenAI Research Engineer interview questions.
Another user, sharing a case in which ChatGPT Agent was asked to "analyze a dataset on Kaggle and convert it into PPT and Excel," said, "Although it didn't make any operational errors, some of the data seemed off." Only after he flagged this did the system identify that the data was wrong and work out the cause.
According to foreign media reports, the underlying AI model is not a complete problem-solving intelligence but something closer to a sophisticated imitator. It shows some flexibility in combining what it has seen, but it also has many blind spots. Moreover, OpenAI trained the Agent and its components on examples of computer and tool use, so it may struggle with tasks that fall outside the scope of the examples in its training data.
For example, the ChatGPT Agent system card shows that the agent may fail on complex tasks that require chaining multiple steps together in a novel way. In a "Cyber Range" evaluation, ChatGPT Agent was asked to carry out an end-to-end operation in a simulated network environment modeled on a small online retailer. Working on its own, it could not complete the task. It successfully executed the initial reconnaissance steps, such as identifying servers on the network, but it had difficulty progressing further and could not chain together the steps needed to reach the final goal. Even when given hints, the Agent still failed (arguably a good thing in this case, since it means the Agent could not perform automated hacking), which points to clear limits on its ability to solve complex problems beyond its familiar training examples.
A developer said that in most of his AI usage scenarios, there's no need to choose ChatGPT Agent at all. "o3 can fully meet the needs and is very cost-effective. There's no need to launch a whole virtual machine with a browser and command-line interface." Moreover, he pointed out that OpenAI has packaged a lot of complex technologies into a consumer-friendly product, but achieving this high level of user-friendliness comes at the cost of sacrificing customization and composability, which currently limits its capabilities.
"For research tasks, I'll still use Claude Code. It's a more powerful professional tool." Claude Code is an application that runs on the user's own computer and is more flexible to work with: it can directly access all files, and users can customize how it operates without restrictions. In contrast, ChatGPT Agent lives inside ChatGPT and can only work in a preset way. "So it's useful, but it's not a product for daily use yet."
Overseas Netizens' "Certification": Inferior to the AI Agents Released by Chinese Teams
"ChatGPT Agent seems to be a real competitor to Manus." After OpenAI launched the system, many overseas users immediately compared it with AI agent products built by Chinese entrepreneurs, such as Manus AI and Genspark. Genspark is a general-purpose AI agent from MainFunc, a company co-founded by Jing Kun (Eric Jing), a former vice president of Baidu Group and former CEO of Xiaodu Technology, and Zhu Kaihua, the former CTO of Xiaodu Technology. Genspark was initially positioned as an AI search engine and later pivoted to a Super Agent capable of independent reasoning, task planning, and tool invocation to complete complex multi-step tasks. Its ARR (Annual Recurring Revenue) exceeded $10 million just 9 days after launch.
Shubham Saboo, a veteran AI product lead, commented publicly, "ChatGPT Agent is overhyped. Genspark and Manus AI have long been far ahead in generating well-researched AI presentations and handling spreadsheets."
This morning, Eric Jing, the co-founder and CEO of MainFunc, said on X that his team ran the same prompts used in OpenAI's launch presentation and succeeded on the first try. By his account, the task took a fraction of the time, cost a fraction of the price, and produced output several times higher in quality. Saboo not only reposted the video of the comparison but also said bluntly, "Genspark Super Agent can really beat OpenAI's ChatGPT in one go."
"I never thought this day would come. As a small startup with only 24 people, we're leading by so much... even ahead of OpenAI," Eric Jing said excitedly. He also posted the full replay of his test task in the comments: https://www.genspark.ai/autopilotagent_viewer?id=ec2525b1-a16e-4f69-a568-d16b4b687aaf
In response, one overseas user wrote, "You guys are amazing. A small team can be so successful." Another user pointed out, "Based on the usage cases of some of our customers, Genspark is indeed faster on some tasks, while only the Agent Mode can work on other tasks (we also tested Manus, Skywork, and Flowith)." The same user also had very high praise for Genspark: "The slides you (Genspark) make are definitely far ahead of the rest. Other products can't even come close."
Reference Links:
https://openai.com/zh-Hans-CN/index/introducing-chatgpt-agent/
https://arstechnica.com/information-technology/2025/07/chatgpts-new-ai-agent-can-browse-the-web-and-create-powerpoint-slideshows/
This article is from the WeChat official account "AI Frontline," compiled by Hua Wei, and republished by 36Kr with authorization.