
OpenAI Core Team Members Reveal for the First Time: How We Built ChatGPT Agent

Friends of 36Kr · 2025-07-24 13:06
OpenAI has launched the multimodal ChatGPT Agent, which is trained with reinforcement learning, supports tasks lasting up to an hour, and interacts with users under layered security safeguards.
  1. ChatGPT Agent integrates Deep Research and Operator. It combines text understanding with visual interaction, enabling it to perform tasks ranging from web browsing to code execution and demonstrating multimodal collaborative processing.
  2. The core training method is reinforcement learning. Through a task-reward mechanism, the model learns tool-use strategies on its own, moving beyond the limits of "preset action programming" while achieving high data efficiency and task generalization.
  3. ChatGPT Agent can execute tasks lasting up to an hour and supports multi-turn conversations and mid-task adjustments from users. For example, it can generate financial models, write research reports, and search for products, realizing a new paradigm of collaboration between AI and humans.
  4. The development team is small but efficient. It tightly integrates research and engineering, collaborating across functions and iterating rapidly around user scenarios, a typical example of how OpenAI combines engineering and product work.
  5. To ensure security, OpenAI has deployed multiple layers of protection, including red-team testing, real-time behavior monitoring, and permission confirmation mechanisms, to prevent the model from performing harmful operations or being misused.
  6. OpenAI aims to build a general super-agent that can handle almost all tasks that humans perform on computers.

On July 23, Sequoia Capital held a dialogue session with members of the OpenAI team for an in-depth discussion of the technological innovation and future potential of ChatGPT Agent. The dialogue was co-hosted by two Sequoia Capital partners, Sonya Huang and Lauren Reeder. OpenAI team members Isa Fulford, Casey Chu, and Edward Sun, who took part in the ChatGPT Agent release event, joined the discussion.

During the dialogue, they described how ChatGPT Agent was developed and how it combines the strengths of Deep Research and Operator to execute cross-domain tasks efficiently. They also discussed its security measures and wide range of application scenarios.

According to OpenAI's vision, ChatGPT Agent will have stronger independent judgment, provide customized services based on each user's habits and needs, and support multiple communication methods such as voice, text, and images. In the future, OpenAI aims to create a general super-agent that can handle almost all tasks that humans perform on computers.

The following is a condensed version of the dialogue:

Host: Today, we will discuss the evolution of AI agents with Isa Fulford, Casey Chu, and Edward Sun from the OpenAI team. You have developed the new ChatGPT Agent. Please introduce its core functions and major breakthroughs.

Fulford: Thank you for having us. ChatGPT Agent is the result of a collaboration between the Deep Research and Operator teams. This AI agent can perform complex, multi-step tasks that may take up to an hour. We have given it a virtual computer environment that integrates text browsing, visual browsing, terminal access, and API integration. All these tools share state, much as the applications on a person's computer share the same file system.

This design allows ChatGPT Agent to handle a variety of complex tasks flexibly, significantly improving its efficiency and capabilities. We are particularly pleased with its performance in multi-turn conversations, where it can keep working on a task and refining the result. In the future, we hope to further enhance personalization and memory, enabling ChatGPT Agent to carry out tasks without the user having to initiate them.

01 Birth and Evolution

Host: Could you share the origin story of this project? How did it start?

Casey Chu: This project originated from the combination of two products, Deep Research and Operator. In January 2025, we released Operator, which can perform Internet tasks such as online shopping.

Two weeks later, we launched Deep Research, which focuses on browsing and synthesizing online information to generate detailed research reports with citations. When formulating the future development roadmap, we realized that these two products could complement each other.

Operator is good at handling visual interactions, such as clicking on web elements, while Deep Research is better at processing text information, such as reading long articles. User feedback showed that they hoped Deep Research could access paid content, and Operator already had this ability. Therefore, combining the two was a natural choice.

Edward Sun: Our team achieved a significant leap in capabilities by unifying the architectures of Deep Research and Operator. All tools share state, allowing users to switch smoothly between text analysis, visual browsing, and code execution. Instead of pre-programming tool-use patterns, we let the model discover the best strategies on its own through reinforcement learning on thousands of virtual machines.

This approach enables ChatGPT Agent to collaborate with users for hours, asking clarification questions and accepting corrections during the task, greatly expanding the interaction methods with AI agents. We also faced challenges such as security and task complexity. For example, date selection is still a difficult task for AI. The small team achieved a breakthrough through careful data screening, indicating that AI development has entered a new stage where product insights are as important as computing power.

Fulford: ChatGPT Agent can perform complex tasks that would take humans a lot of time. We provided it with a virtual computer environment containing various tools: a text browser (similar to the Deep Research tool) for efficiently obtaining online information; a visual browser (similar to the Operator tool) capable of interacting with graphical user interfaces, supporting operations such as clicking, filling in forms, scrolling, and dragging; and a terminal tool for running code, analyzing files, generating spreadsheets or slide presentations.

In addition, through API integration, ChatGPT Agent can access services such as GitHub, Google Drive, and SharePoint. All tools share states, similar to how applications on a human computer share the file system. This design enables ChatGPT Agent to flexibly handle complex tasks and provide strong support for users.
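The shared-state idea described above can be illustrated with a minimal sketch. Everything here is hypothetical: `SharedWorkspace`, `TextBrowser`, and `Terminal` are illustrative names, not OpenAI's implementation, and the "download" simply stores text that is passed in rather than fetching anything. The point is only that two tools holding the same workspace can hand files to each other.

```python
from dataclasses import dataclass, field


@dataclass
class SharedWorkspace:
    """Hypothetical stand-in for the virtual computer's shared file system."""
    files: dict[str, str] = field(default_factory=dict)


class TextBrowser:
    """Illustrative tool that saves downloaded content into the workspace."""

    def __init__(self, workspace: SharedWorkspace) -> None:
        self.workspace = workspace

    def download(self, url: str, content: str, path: str) -> None:
        # A real browser would fetch `url`; this sketch stores the given content.
        self.workspace.files[path] = content


class Terminal:
    """Illustrative tool that reads files other tools have written."""

    def __init__(self, workspace: SharedWorkspace) -> None:
        self.workspace = workspace

    def count_lines(self, path: str) -> int:
        # Works only because both tools hold the same workspace object.
        return len(self.workspace.files[path].splitlines())


workspace = SharedWorkspace()
browser = TextBrowser(workspace)
terminal = Terminal(workspace)

browser.download(
    "https://example.com/prices.csv",
    "item,price\nshoes,40\nhat,15\n",
    "prices.csv",
)
line_count = terminal.count_lines("prices.csv")
```

If each tool had its own private storage, the terminal would have no way to see the file the browser saved; passing one workspace to every tool is what makes the hand-off possible.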

Host: Could you elaborate on the combination process? How did you achieve the "1 + 1 > 2" effect?

Casey Chu: Our team developed Operator and Deep Research separately. Operator is good at handling visual interactions, such as clicking or filling in forms on the web, but not good at reading long articles. Deep Research is good at efficiently browsing and synthesizing text but has difficulty with highly interactive visual elements. We noticed that users were trying Deep Research-style tasks on Operator, such as "research a trip and then book it".

Therefore, combining the two was a natural choice. We not only integrated these two tools but also added a terminal tool, an image-generation tool, and API-calling functions, enabling ChatGPT Agent to perform a wider range of tasks. For example, the terminal tool can run commands for calculations, the image-generation tool can add visual elements to slide presentations, and API calls can generate PowerPoint presentations.

Edward Sun: This combination significantly enhances the capabilities of ChatGPT Agent. For example, it can use the text browser to efficiently search for information, then switch to the visual browser to view pictures or interactive elements, and even run code in the terminal to generate outputs. All tools share states, allowing ChatGPT Agent to operate different applications seamlessly like a human.

Our team member Eric analyzed user prompts on Operator and found that many tasks involved Deep Research-style needs, such as "research a trip and then book it", which further confirmed the need to combine the two.

02 Multi-scenario Task Capabilities

Host: What are the specific application scenarios of ChatGPT Agent? How do users use it?

Fulford: We deliberately gave the agent an open-ended name, "ChatGPT Agent", to encourage users to explore its potential. We trained it to handle Deep Research tasks, such as generating detailed reports; Operator tasks, such as booking flights or shopping online; and data analysis tasks, such as creating spreadsheets or slide presentations. Because it is so flexible, we expect users to discover more uses that we did not anticipate.

For example, Deep Research users unexpectedly found it useful for code search. We hope ChatGPT Agent can play a role in both consumer and enterprise scenarios, such as helping professional users generate detailed reports or planning activities for individual users. Whether a consumer waits 30 minutes for a detailed report or an enterprise user relies on it at work, it can handle the task.

Casey Chu: I personally use it to process data in Google Docs and generate slide presentations to display the data. Another interesting case is that I used it to research new developments in the field of ancient DNA. Since the information in this field is scattered and lacks comprehensive reference materials, ChatGPT Agent can collect information from the Internet and synthesize it into reports or slide presentations, greatly simplifying my work.

Edward Sun: I use it for online shopping, especially in scenarios that require visual browsing, such as viewing product pictures or selecting styles through search filters. It is also very useful for event planning, such as arranging itineraries or events. My favorite shopping task is buying clothes because many websites require a visual browser to handle search filters or view product appearances.

Host: You previously showed a cool case. Could you share it?

Fulford: Sure! A colleague asked ChatGPT Agent to estimate OpenAI's valuation based on online information and to build a financial model, including spreadsheets, a summary analysis, and a slide presentation of the results. The task took 28 minutes, demonstrating its ability to handle long-running tasks. ChatGPT Agent's predictions were quite bold, and the quality of the slide presentation was impressive!

Casey Chu: This case illustrates a new paradigm: users can walk away after assigning a task, and ChatGPT Agent will return a detailed report some time later. As ChatGPT Agent becomes more autonomous, tasks may run even longer, and this is a good example.

Host: 28 minutes is quite long! Do you have tasks that run even longer? How do you ensure that ChatGPT Agent doesn't go off track during a long run?

Edward Sun: I recently ran a task that lasted an hour, which may be the longest task we've seen. To ensure stability, we developed tools that extend ChatGPT Agent's context length, enabling it to keep track of task progress and complete complex tasks step by step.

In addition, we designed a flexible human-machine interaction mechanism that allows users to correct ChatGPT Agent, provide additional instructions, or request status updates at any time. For example, users can ask it to summarize its current progress or add an instruction such as "I only want blue sneakers".

Fulford: This collaboration model mimics the way people communicate through Slack. ChatGPT Agent will ask for permissions or clarification when necessary, such as seeking user consent when performing destructive operations or needing to log in.

Our interface also allows users to monitor ChatGPT Agent's operations in real time and even take over the virtual computer environment after the task is completed, for example to log into accounts or enter credit card information. This experience of "watching a colleague work and taking over at any time" is very intuitive and strengthens users' sense of control over ChatGPT Agent.

03 Training and Breakthroughs

Host: From a technical perspective, how was ChatGPT Agent trained?

Casey Chu: We used reinforcement learning (RL) and gave the model tools such as a text browser, a GUI browser, a terminal, and an image-generation tool in a virtual machine environment.

We designed complex tasks so that ChatGPT Agent could discover the best tool-use strategies through trial and error, and we rewarded it based on the quality and efficiency of task completion. For example, ChatGPT Agent might first use the text browser to search for restaurant information, then use the GUI browser to view photos of dishes and check reservation availability, or download data from a website and process it in the terminal. Because the tools share state, ChatGPT Agent can switch between them seamlessly and complete diverse tasks.
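To make "rewarded based on the quality and efficiency of task completion" concrete, here is a minimal sketch of one way such a reward could be shaped. This is an assumption for illustration only: the interview does not describe OpenAI's actual reward function, and the name `episode_reward`, the step budget, and the `efficiency_weight` value are all hypothetical.

```python
def episode_reward(
    quality: float,
    steps_used: int,
    step_budget: int,
    efficiency_weight: float = 0.2,
) -> float:
    """Hypothetical reward: graded quality minus a penalty for budget used.

    quality is a grader's score in [0, 1]; the penalty grows with the
    fraction of the step budget consumed and is capped at the full budget.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be in [0, 1]")
    if step_budget <= 0:
        raise ValueError("step_budget must be positive")
    budget_fraction = min(steps_used / step_budget, 1.0)
    return quality - efficiency_weight * budget_fraction


# Two episodes with the same output quality but different step counts.
fast = episode_reward(quality=0.9, steps_used=10, step_budget=100)
slow = episode_reward(quality=0.9, steps_used=80, step_budget=100)
```

With equal quality, the episode that used fewer steps earns the higher reward, so a policy trained on this signal is nudged toward whichever tool sequence reaches a good answer faster, without anyone specifying that sequence in advance.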

Fulford: Unlike previous approaches to tool use, all tools share state, similar to how a person uses multiple applications on one computer. This design enables ChatGPT Agent to handle tasks that involve the Internet, the file system, and code efficiently. We didn't specify tool-use rules in advance but let the model discover the best strategies through reinforcement learning, and the results are almost magical. Reinforcement learning requires far less data than pre-training, and we teach the model new skills through carefully selected, high-quality datasets.

Edward Sun: Reinforcement learning is very data-efficient. We only need a small amount of high-quality data to teach new skills. For example, we created a diverse set of tasks, including finding niche information and writing long reports. As long as we can evaluate the quality of the output, reinforcement learning can effectively improve performance. To make the Operator function work well, we spent a lot of time over the past two or three years teaching the model to understand visual elements and page interactions, which laid the foundation for the current ChatGPT Agent.

Host: Is this reinforcement learning method the standard method for OpenAI to train AI agents?

Fulford: We believe this method has great potential. This release is the Minimum Viable Product (MVP) of our teams' collaboration, but it has already shown strong capabilities. For example, the slide-generation function is already very good, thanks to the efforts of many team members. We believe we can improve it further with the same techniques, although other techniques may also need to be introduced.

Casey Chu: This method is amazing. The same reinforcement learning algorithm applies to Deep Research, Operator, and now the computer-using ChatGPT Agent. We achieved these results in a short time, and there is still much room for improvement.

Host: Is there any special training method for reinforcement learning in terms of interactivity?

Edward Sun: We mainly focus on end-to-end performance, from the user's prompt to task completion. ChatGPT Agent interacts well with users partly because we included diverse task trajectories in training. Users can intervene at any time with clarifications or corrections, and it adjusts its behavior based on that feedback.

Host: The early World of Bits project (a general AI training platform developed by OpenAI) tried to use reinforcement learning to control the mouse path, but the problem was too complex. What has changed to make this problem solvable now?

Edward Sun: The development of ChatGPT Agent can be traced back to the World of Bits project in 2017. We jokingly call it "World of Bits 2". The biggest change is the increase in training scale. Whether in pre-training or reinforcement learning, the amount of compute may have increased by hundreds of thousands of times. The growth in data and computing power is what has allowed us to reach our goals.

04 How to Prevent "Runaway"

Host: When ChatGPT Agent performs external operations, how do you ensure its security and reliability?

Fulford: Since ChatGPT Agent can interact with the external world, such as accessing websites or calling APIs, security is a core concern.

Compared with the read-only mode of Deep Research, ChatGPT Agent may pose greater risks. For example, it may take unexpected, destructive actions while completing a task, such as buying 100 different options to make sure the user is satisfied. To address this, we have implemented multi-level security measures, including internal and external red-team testing, a real-time monitoring system (similar to antivirus software), and protocols for responding quickly to new threats. We pay special attention to serious issues such as biological risks, for example, preventing ChatGPT Agent from being used to create biological weapons.

Casey Chu: The Internet is full of risks, including phishing attacks and fraud. Our model has been trained for security and can identify some risks, but sometimes it may be too eager to complete a task and get deceived. We developed a real-time monitoring system to check ChatGPT Agent's behavior. If the monitor detects a suspicious operation (such as accessing an abnormal website), it immediately pauses the task.

In addition, we have protocols for responding quickly to new threats, similar to updating antivirus software. Thanks to the work of the company's biosecurity team on risk mitigation, we conducted weeks of red-team testing to ensure that the model will not be used for harmful purposes.

Fulford: Security training is a cross-team effort involving the security, governance, legal, research, and engineering teams. We have implemented protective measures at each level and will continue to iterate to address new threats. For example, we ensure that ChatGPT Agent seeks user permission before performing sensitive operations (such as logging into a bank account).
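Two of the safeguards described in this section, a monitor that pauses the task on a suspicious operation and a permission check before sensitive operations, can be sketched as a single gate that every action passes through. This is a toy illustration under stated assumptions, not OpenAI's system: the `Action` type, the `SENSITIVE_KINDS` and `BLOCKED_DOMAINS` sets, and the `review_action` function are all hypothetical, and a real monitor would rely on far richer signals than a fixed blocklist.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str    # e.g. "visit", "login", "purchase"
    target: str  # e.g. a domain name


# Assumed, illustrative policy data; not taken from the interview.
SENSITIVE_KINDS = {"login", "purchase"}
BLOCKED_DOMAINS = {"phishing.example"}


def review_action(action: Action, ask_user: Callable[[Action], bool]) -> str:
    """Return "pause", "denied", or "allow" for a proposed action."""
    # Monitor layer: a suspicious target pauses the task outright.
    if action.target in BLOCKED_DOMAINS:
        return "pause"
    # Permission layer: sensitive actions need explicit user consent.
    if action.kind in SENSITIVE_KINDS and not ask_user(action):
        return "denied"
    return "allow"


decisions = [
    review_action(Action("visit", "news.example"), ask_user=lambda a: False),
    review_action(Action("visit", "phishing.example"), ask_user=lambda a: True),
    review_action(Action("login", "bank.example"), ask_user=lambda a: False),
    review_action(Action("login", "bank.example"), ask_user=lambda a: True),
]
```

The monitor check runs first, so user consent cannot override a pause; ordinary browsing never prompts the user, which keeps confirmations reserved for the operations that matter.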

05 Team Collaboration Behind the Scenes

Host: