Sequoia Capital Interviews the OpenAI Team: First Reveal of the R&D Secrets of ChatGPT Agent
On July 23, Sequoia Capital hosted a conversation with members of OpenAI's ChatGPT Agent team to explore the product's technological innovations and future potential. The conversation was co-hosted by Sequoia Capital partners Sonya Huang and Lauren Reeder. Isa Fulford, Casey Chu, and Edward Sun, the OpenAI team members who took part in the ChatGPT Agent launch event, participated.
In the conversation, they shared the development journey of ChatGPT Agent and explained how it combines the strengths of Deep Research and Operator to execute cross-domain tasks efficiently. They also discussed ChatGPT Agent's security measures and its wide range of application scenarios.
According to OpenAI's vision, ChatGPT Agent will exercise stronger independent judgment, provide customized services based on each user's habits and needs, and support multiple modes of communication such as voice, text, and images. In the future, OpenAI aims to create a general super-intelligent agent capable of handling almost any task a human can do on a computer.
The following is a condensed version of the dialogue:
Host: Today, we will discuss the evolution of AI agents with Isa Fulford, Casey Chu, and Edward Sun from the OpenAI team. You have developed the new ChatGPT Agent. Please introduce its core functions and major breakthroughs.
Fulford: Thank you for inviting us. ChatGPT Agent is the result of a collaboration between the Deep Research and Operator teams. This AI agent can execute complex, multi-step tasks that can take up to an hour. We have equipped it with a virtual computer environment that integrates text browsing, visual browsing, terminal access, and API integration. All these tools share state, much as the applications on a person's computer share a file system.
This design allows ChatGPT Agent to handle a variety of complex tasks flexibly, significantly enhancing its efficiency and capabilities. We are particularly pleased with the model's performance in multi-turn conversations: it can keep working on a task and improve the result. In the future, we hope to further enhance personalization and memory so that ChatGPT Agent can execute tasks without the user having to initiate them.
1 Birth and Evolution
Host: Could you share the origin story of this project? How did it start?
Casey Chu: This project originated from the combination of two products, Deep Research and Operator. In January 2025, we released Operator, which can perform Internet tasks such as online shopping.
Two weeks later, we launched Deep Research, which focuses on browsing and synthesizing online information to generate detailed research reports with citations. When formulating the future development roadmap, we realized that these two products could complement each other.
Operator is good at handling visual interactions, such as clicking on web elements, while Deep Research is better at handling text, such as reading long articles. Users told us they wanted Deep Research to be able to access paid content, which Operator could already do. Therefore, combining the two was a natural choice.
Edward Sun: Our team achieved a huge leap in capabilities by unifying the architectures of Deep Research and Operator. All tools share state, allowing smooth switching between text analysis, visual browsing, and code execution. We did not pre-program how the tools should be used; instead, we let the model discover the best strategies on its own through reinforcement learning on thousands of virtual machines.
This approach enables ChatGPT Agent to collaborate with users for hours, asking clarifying questions and accepting corrections during the task, which greatly expands the ways users can interact with AI agents. We also faced challenges such as security and task complexity. For example, date selection is still a difficult problem for AI. A small team achieved a breakthrough through careful data screening, indicating that AI development has entered a new stage where product insights are as important as computing power.
Fulford: ChatGPT Agent can execute complex tasks that would take humans a lot of time. We provided it with a virtual computer environment containing various tools: a text browser (similar to the Deep Research tool) for efficiently obtaining online information; a visual browser (similar to the Operator tool) capable of interacting with the graphical user interface, supporting operations such as clicking, filling in forms, scrolling, and dragging; and a terminal tool for running code, analyzing files, generating spreadsheets or slides, etc.
In addition, through API integration, ChatGPT Agent can access services such as GitHub, Google Drive, and SharePoint. All tools share the state, similar to how applications on a human computer share the file system. This design allows ChatGPT Agent to flexibly handle complex tasks and provide strong support to users.
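The shared-state idea can be illustrated with a minimal sketch. It is not OpenAI's implementation: the `Workspace`, `TextBrowser`, and `Terminal` classes below are hypothetical names, and the "file system" is just an in-memory dictionary. The point is only that one tool writes something and a different tool can read it without any copying step in between.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical shared state: a tiny in-memory 'file system'."""
    files: dict[str, str] = field(default_factory=dict)


class TextBrowser:
    """Hypothetical tool that saves page text into the shared workspace."""

    def __init__(self, workspace: Workspace) -> None:
        self.workspace = workspace

    def save_page(self, name: str, text: str) -> None:
        # Stand-in for fetching a page; the caller supplies the text.
        self.workspace.files[name] = text


class Terminal:
    """Hypothetical tool that processes files written by other tools."""

    def __init__(self, workspace: Workspace) -> None:
        self.workspace = workspace

    def word_count(self, name: str) -> int:
        # Reads a file that another tool placed in the workspace.
        return len(self.workspace.files[name].split())


# Both tools are constructed around the same workspace object.
workspace = Workspace()
browser = TextBrowser(workspace)
terminal = Terminal(workspace)

browser.save_page("notes.txt", "ancient DNA research summary")
count = terminal.word_count("notes.txt")
print(count)
```

Passing one workspace object to every tool is the simplest way to model the "applications sharing a file system" analogy used in the interview; a real agent environment would presumably use an actual virtual machine disk instead.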
Host: Could you elaborate on the combination process? How did you achieve the "1 + 1 > 2" effect?
Casey Chu: Our team developed Operator and Deep Research separately. Operator is good at handling visual interactions, such as clicking or filling in forms on the web, but not good at reading long articles. Deep Research is good at efficiently browsing and synthesizing text information but has difficulty handling highly interactive visual elements. We noticed that users were trying Deep Research-type tasks on Operator, such as "research a trip and then book".
Therefore, combining the two was a natural choice. We not only integrated these two tools but also added a terminal tool, an image generation tool, and API call functions, enabling ChatGPT Agent to perform a wider range of tasks. For example, the terminal tool can run commands for calculations, the image generation tool can add visual elements to slides, and API calls can generate PowerPoint presentations.
Edward Sun: This combination significantly enhanced the capabilities of ChatGPT Agent. For example, it can efficiently search for information with the text browser, then switch to the visual browser to view pictures or interactive elements, and even run code in the terminal to generate outputs. All tools share the state, allowing ChatGPT Agent to operate different applications seamlessly like a human.
Our team member Eric analyzed the user prompts on Operator and found that many tasks involved Deep Research-type requirements, such as "research a trip and then book", which further confirmed the need for the combination.
2 Multi-Scenario Task Capabilities
Host: What are the specific application scenarios of ChatGPT Agent? How do users use it?
Fulford: We deliberately designed an open-ended agent and named it "ChatGPT Agent" to encourage users to explore its potential. We trained it to handle Deep Research tasks, such as generating detailed reports; Operator tasks, such as booking flights or online shopping; and data analysis tasks, such as creating spreadsheets or slides. Because it is so flexible, we expect users to discover more unexpected uses.
For example, Deep Research users unexpectedly discovered the code search function. We hope that ChatGPT Agent can play a role in both consumer and enterprise scenarios, such as helping professional users generate detailed reports or planning activities for individual users. Whether it's a consumer waiting 30 minutes for a detailed report or an enterprise user using it at work, it can handle the tasks.
Casey Chu: Personally, I used it to process data in Google Docs and generate slides to display the data. Another interesting case is that I used it to research new developments in the field of ancient DNA. Since the information in this field is scattered and there is a lack of comprehensive reference materials, ChatGPT Agent can collect information from the Internet and synthesize it into reports or slides, greatly simplifying my work.
Edward Sun: I used it for online shopping, especially in scenarios that require visual browsing, such as viewing product pictures or selecting styles through search filters. It is also very useful for event planning, such as arranging itineraries or activities. My favorite shopping task is buying clothes because many websites require a visual browser to handle search filters or view product appearances.
Host: You previously showed a cool case. Could you share it?
Fulford: Sure! A colleague asked ChatGPT Agent to estimate OpenAI's valuation based on online information and generate a financial model, including spreadsheets, a summary analysis, and slides to present the results. This task took 28 minutes, demonstrating its ability to handle long-running tasks. ChatGPT Agent's predictions were quite bold, and the quality of the slides was impressive!
Casey Chu: This case opened up a new paradigm: users can leave after submitting a task, and ChatGPT Agent will return a detailed report after some time. As ChatGPT Agent becomes more autonomous, the task time may be longer, and this is a good example.
Host: 28 minutes is already a long time! Do you have longer-running tasks? How do you ensure that ChatGPT Agent doesn't go off track during long-running operation?
Edward Sun: I recently ran a task that lasted for an hour, which might be the longest-running task we've seen. To ensure stability, we developed tools to extend ChatGPT Agent's context length, enabling it to record task progress and complete complex tasks step by step.
In addition, we designed a flexible human-machine interaction mechanism. Users can correct ChatGPT Agent at any time, provide additional instructions, or request status updates. For example, users can ask it to summarize the current progress or add instructions like "I only want blue sneakers".
Fulford: This collaborative model mimics the way people communicate through Slack. ChatGPT Agent will ask for permission or clarification when needed, such as seeking user consent when performing a destructive operation or needing to log in.
Our interface also allows users to monitor ChatGPT Agent's operations in real time and even take over the virtual computer environment after the task is completed, such as logging into an account or entering credit card information. This "observing a colleague's operation and taking over at any time" experience is very intuitive, enhancing users' sense of control over ChatGPT Agent.
3 Training and Breakthroughs
Host: From a technical perspective, how was ChatGPT Agent trained?
Casey Chu: We used reinforcement learning (RL) technology and provided it with tools such as a text browser, a GUI browser, a terminal, and an image generation tool in a virtual machine environment.
We designed complex tasks, let ChatGPT Agent discover the best tool-use strategies through trial and error, and rewarded it based on the quality and efficiency of task completion. For example, ChatGPT Agent might first use the text browser to search for restaurant information, then use the GUI browser to view dish pictures and booking availability, or download data from the website and process it in the terminal. This shared-state tool design allows ChatGPT Agent to switch tools seamlessly and complete diverse tasks.
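The interview does not say how quality and efficiency are combined into a reward, so the following is only a sketch under stated assumptions: `episode_reward`, its parameters, and the 0.2 weight are all hypothetical. It scores an episode by its graded quality and subtracts a small penalty that grows with the fraction of the step budget consumed.

```python
def episode_reward(
    quality: float,
    steps_used: int,
    step_budget: int,
    efficiency_weight: float = 0.2,  # hypothetical weight, not from the source
) -> float:
    """Toy reward: graded output quality minus a penalty for using more steps."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be in [0, 1]")
    if step_budget <= 0:
        raise ValueError("step_budget must be positive")
    # Fraction of the budget consumed, capped at 1.0.
    budget_fraction = min(steps_used / step_budget, 1.0)
    return quality - efficiency_weight * budget_fraction


# Two episodes with the same output quality but different step counts.
fast = episode_reward(quality=0.9, steps_used=10, step_budget=100)
slow = episode_reward(quality=0.9, steps_used=100, step_budget=100)
print(fast > slow)
```

With equal quality, the episode that finishes in fewer steps receives the higher reward, which is one simple way a training signal could favor efficient tool use.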
Fulford: Unlike previous approaches to tool use, all tools share state, similar to how humans use multiple applications on a computer. This design enables ChatGPT Agent to efficiently handle tasks that involve interacting with the Internet, the file system, and code. We did not specify tool-use rules in advance but let the model discover the best strategies by itself through reinforcement learning, and the effect is almost magical. Reinforcement learning requires much less data than pre-training. We taught the model new skills through a carefully selected high-quality dataset.
Edward Sun: Reinforcement learning is very data-efficient. We only need a small amount of high-quality data to teach new skills. For example, we created a diverse set of tasks, including finding niche information and writing long reports. As long as the output quality can be evaluated, reinforcement learning can effectively improve performance. To make Operator work well, we spent a lot of time over the past two or three years enabling the model to understand visual elements and page interactions, laying the foundation for the current ChatGPT Agent.
Host: Is this reinforcement learning method the standard method for OpenAI to train AI agents?
Fulford: We believe this method has great potential. This release is the Minimum Viable Product (MVP) after our team's collaboration, but it has already shown strong capabilities. For example, the slide generation function is already very good, thanks to the efforts of many team members. We believe that the same technology can be used for further improvement, but other technologies may also need to be introduced.
Casey Chu: This method is remarkable. The same reinforcement learning algorithm works for Deep Research, Operator, and now the computer-using ChatGPT Agent. We achieved these results in a short time, and there is still much room for improvement in the future.
Host: Is there any special training method for reinforcement learning in terms of interactivity?
Edward Sun: We mainly focus on end-to-end performance, from user prompts to task completion. ChatGPT Agent performs well in interacting with users, partly because we included diverse task trajectories in the training. Users can intervene at any time, providing clarification or corrections, and it can adjust its behavior based on the feedback.
Host: The early World of Bits project (a general AI training platform developed by OpenAI) tried to use reinforcement learning to control the mouse path, but the problem was too complex. What has changed to make this problem solvable now?
Edward Sun: The development of ChatGPT Agent can be traced back to the World of Bits project in 2017. We jokingly call it "World of Bits 2". The biggest change is the increase in training scale. Whether in pre-training or reinforcement learning, the amount of compute may have increased by a factor of hundreds of thousands. The increase in data scale and computing power has enabled us to achieve our goals.
4 How to Prevent "Runaway"
Host: When ChatGPT Agent performs external operations, how do you ensure its security and reliability?
Fulford: Since ChatGPT Agent can interact with the external world, such as accessing websites or calling APIs, security is a core concern.
Compared with the read-only mode of Deep Research, ChatGPT Agent may pose greater risks, such as performing unexpected destructive operations while completing a task, like buying 100 different options to ensure user satisfaction. Therefore, we implemented multi-level security measures, including internal and external red-team testing, a real-time monitoring system (similar to antivirus software), and protocols to quickly respond to new threats. We pay special attention to serious issues such as biological risks, for example, preventing ChatGPT Agent from being used to create biological weapons.
Casey Chu: The Internet is full of risks, including phishing attacks and fraud. Our model has been trained for security and can identify some risks, but sometimes it may be too eager to complete the task and get deceived. We developed a real-time monitoring system to check ChatGPT Agent's behavior. If it detects a suspicious operation (such as accessing an abnormal website), it will immediately pause the task.
In addition, we have protocols to quickly respond to new threats, similar to updating antivirus software. Thanks to the mitigation work of the company's biological risk team, we conducted several weeks of red-team testing to ensure that the model will not be used for harmful purposes.
Fulford: Security training is a cross-team effort involving security, governance, legal, research, and engineering teams. We implemented protective measures at every level and will continue to iterate to deal with new threats. For example, we ensure that ChatGPT Agent will seek user permission before performing sensitive operations (such as logging into a bank account).
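The two safeguards described above (pausing on a suspicious operation and asking permission before a sensitive one) can be sketched as a small policy check. This is an illustration only, not OpenAI's monitoring system: the `Decision` enum, the `review_action` function, and both example sets are hypothetical, and a real monitor would rely on far richer signals than a fixed list.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK_USER = "ask_user"
    PAUSE = "pause"


# Hypothetical examples; a real system would not use static lists.
SUSPICIOUS_DOMAINS = {"phishing.example"}
SENSITIVE_ACTIONS = {"login", "purchase"}


def review_action(action: str, domain: str) -> Decision:
    """Decide whether a proposed agent action may proceed."""
    # A suspicious site pauses the task, whatever the action is.
    if domain in SUSPICIOUS_DOMAINS:
        return Decision.PAUSE
    # Sensitive actions need explicit user permission first.
    if action in SENSITIVE_ACTIONS:
        return Decision.ASK_USER
    return Decision.ALLOW


print(review_action("read", "news.example"))
print(review_action("login", "bank.example"))
print(review_action("read", "phishing.example"))
```

Checking the suspicious-site rule first means a login attempt on a flagged domain pauses the task rather than merely prompting the user, which matches the stricter of the two behaviors described in the interview.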
5 Team Collaboration Behind the Scenes
Host: How did the development team collaborate? What was the team size?
Fulford: Our team was formed by merging the research and application teams of Deep Research and Operator, and the total headcount is not large. The Deep Research team initially had only 3-4 people and the Operator team had about 6-8, plus an excellent engineering and product design team led by Yash Kumar. The research and application teams worked closely together, from defining product functions to model training, all oriented toward user scenarios. This small-team collaboration enabled us to achieve significant results in a short time.
Casey Chu: The boundary between the research and application teams is not strict. Application engineers participate in model training, and researchers also participate in model deployment. This cross-functional cooperation keeps the project dynamic, and the team atmosphere is very good. Fulford and I are old friends, and that rapport also helps the team work together.
Edward Sun: A small team can achieve great things. We completed this project in a few months. The research and application teams defined the product functions together from the beginning to ensure that the product is driven by user needs. Although ChatGPT Agent has not fully achieved all its goals, this framework allows us to iterate quickly.
Host: What was the biggest challenge