OpenAI's new "Deep Research" feature debuts, outscoring DeepSeek R1 on Humanity's Last Exam.
On the morning of February 3, Beijing time, OpenAI officially launched Deep Research, an agent feature built for in-depth research.
A professional research report that once took an experienced industry analyst days or even weeks can now be completed in just 5 to 30 minutes with this feature. Acting like an "AI researcher", it can independently analyze complex specialized information, search and synthesize hundreds of online sources in real time, and produce a complete, professional-grade report.
The feature is powered by a specially tailored version of the upcoming OpenAI o3 model, optimized for web browsing and data analysis. It uses reasoning to search, interpret, and analyze large volumes of text, images, and PDF files on the internet, and it adjusts its research direction as it encounters new information.
Notably, in evaluating the agent, OpenAI compared it directly with DeepSeek R1. On Humanity's Last Exam (HLE), the model behind Deep Research reached 26.6% accuracy on expert-level questions, surpassing the previous record of 18.2%.
By contrast, DeepSeek's R1 model scored 9.4%.
HLE was developed jointly by experts from many fields around the world to evaluate AI across a wide range of disciplines, and it is regarded as a frontier benchmark of AI's academic ability. It contains more than 3,000 multiple-choice and short-answer questions spanning more than 100 subjects, from linguistics to rocket science and from classics to ecology.
The direct comparison also suggests that DeepSeek is putting real pressure on OpenAI.
Drawing on OpenAI's Deep Research introduction and its technical livestream, Tencent Technology has sorted out the technical points in this release that deserve the most attention.
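To make the size of these gaps concrete, the short Python sketch below computes the percentage-point difference and the ratio between the HLE scores quoted above. The three figures come from this article; the dictionary and the helper name `gap_vs` are illustrative assumptions, not part of any OpenAI tooling.

```python
# HLE accuracy figures (percent) as quoted in this article.
hle_accuracy = {
    "Deep Research": 26.6,
    "Previous record": 18.2,
    "DeepSeek R1": 9.4,
}


def gap_vs(baseline: str, model: str = "Deep Research") -> tuple[float, float]:
    """Return (percentage-point gap, ratio) of `model` over `baseline`."""
    points = round(hle_accuracy[model] - hle_accuracy[baseline], 1)
    ratio = round(hle_accuracy[model] / hle_accuracy[baseline], 2)
    return points, ratio


for name in ("Previous record", "DeepSeek R1"):
    points, ratio = gap_vs(name)
    print(f"Deep Research vs {name}: +{points} points, {ratio}x")
```

On these numbers, Deep Research is 8.4 points (about 1.46 times) above the previous record and 17.2 points (about 2.83 times) above DeepSeek R1.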
A Professional Researcher That Excels at Finding Obscure Information and Synthesizing It from Across the Web
Deep Research is designed for people who do intensive knowledge work in fields such as finance, science, policy, and engineering and who need thorough, accurate, and reliable research. It is also suited to shoppers who want highly personalized recommendations on purchases that usually require careful research, such as cars, home appliances, and furniture.
1. Every Deep Research output comes with clear citations and a summary of its reasoning, making it easy for users to check and verify the information.
2. It is particularly good at finding obscure, non-intuitive information, and it lets users offload and speed up complex, time-consuming online research with a single query.
3. Deep Research can independently discover, reason about, and synthesize insights from across the web. It was trained with the same reinforcement learning methods as OpenAI o1 (OpenAI's first reasoning model), on real tasks that require using a browser and Python tools.
Although o1 performs well in technical fields such as programming and mathematics, many real-world challenges require gathering extensive context from diverse online sources.
Deep Research extends these reasoning capabilities to bridge that gap, allowing it to handle the kinds of problems people face at work and in daily life.
In ChatGPT, users can select the "Deep Research" option in the message box and enter a question. Users can explain their needs to ChatGPT, or they can attach files or spreadsheets to add background information to the question. Once it starts running, the sidebar will show a summary of the steps taken and the sources used.
Deep Research may take 5 to 30 minutes to complete its work, depending on the complexity of the task and the amount of information required. In the meantime, users can step away or work on other tasks; they receive a notification once the research is complete. The final output is presented as a report in the chat.
In the next few weeks, OpenAI will also add embedded images, data visualizations and other analysis results to such reports to provide more clarity and background information.
Compared with Deep Research, GPT-4o is better suited to real-time, multimodal conversations.
For multifaceted, domain-specific questions that call for in-depth exploration and detailed analysis, Deep Research can conduct extensive research and cite a source for each claim. The result is not a quick summary but a detailed, well-documented, and verified answer that can be used directly as a work product.
End-to-End Reinforcement Learning Is the Key, with Multiple Modules Working Together
Deep Research was trained with end-to-end reinforcement learning on complex web browsing and reasoning tasks across multiple fields.
Through this training, it learned to plan and execute multi-step trajectories to find the data it needs, backtracking and reacting to real-time information when necessary.
The model can also browse files uploaded by users, plot and iterate on graphs with Python tools, embed both generated graphs and images from websites in its answers, and cite specific sentences or passages from its sources.
This learning approach removes the need in traditional machine learning to divide training manually into stages, enabling the model to think and make decisions the way a human researcher does.
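The "plan, execute, backtrack" behavior described above can be illustrated with a toy search over a hand-written set of pages. This is only a sketch of the behavior, not of how the model is trained or implemented: the `PAGES` dictionary, its contents, and the `research` function are all hypothetical, and a plain depth-first search stands in for the model's learned decisions.

```python
from typing import Optional

# Toy "web": each page has some text and outgoing links. Purely illustrative data.
PAGES = {
    "home": {"text": "welcome", "links": ["blog", "reports"]},
    "blog": {"text": "opinions only", "links": ["archive"]},
    "archive": {"text": "old posts", "links": []},
    "reports": {"text": "annual figures", "links": ["report-2024"]},
    "report-2024": {"text": "revenue: 42", "links": []},
}


def research(start: str, target: str, max_steps: int = 20) -> tuple[Optional[list[str]], list[str]]:
    """Depth-first browse that backtracks from dead ends.

    Returns (path to the first page whose text contains `target`, step log),
    or (None, step log) if nothing is found within `max_steps` page visits.
    """
    log: list[str] = []
    visited: set[str] = set()
    stack = [[start]]  # each entry is a candidate path still to be explored
    steps = 0
    while stack and steps < max_steps:
        path = stack.pop()
        page = path[-1]
        if page in visited:
            continue
        visited.add(page)
        steps += 1
        log.append(f"visit {page}")
        if target in PAGES[page]["text"]:
            return path, log
        links = [link for link in PAGES[page]["links"] if link not in visited]
        if not links:
            log.append(f"dead end at {page}, backtrack")
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(links):
            stack.append(path + [link])
    return None, log


path, log = research("home", "revenue")
print(path)
print("\n".join(log))
```

Here the search first follows the blog branch, hits a dead end at the archive page, backtracks, and then finds the target under the reports branch. The step log plays a role loosely comparable to the sidebar summary of steps and sources mentioned earlier.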
At the architectural level, Deep Research consists of four core modules that work together to form a complete intelligent research system.
First, the information discovery module acts as the system's "explorer".
It can pinpoint valuable information across platforms such as academic databases, research institution websites, and professional forums. Beyond strong retrieval, it has an advanced filtering mechanism that quickly selects high-quality research material using criteria such as keywords, semantic relevance, timeliness, and credibility.
Second, the information integration module plays the role of "integrator".
It organizes scattered information from different channels into a systematic body of knowledge. Whether it is processing text reports, analyzing data charts, or interpreting specialized images, this module can grasp the logical relationships within the information and extract the key points.
For example, in a research task on science and technology, it can combine information on technical principles, application cases, and development trends into a complete technical analysis report.
Third, the reasoning module gives the system human-like thinking ability.
It uses logical reasoning and knowledge graph technology to analyze the collected information in depth and draw inferences from it. Faced with a complex scientific problem, the reasoning module can build a rigorous argument from known facts; in market analysis, it weighs historical data, market dynamics, and the policy environment to make reasonable predictions. More importantly, the module can correct itself, adjusting its reasoning path promptly as new information is discovered.
Fourth, the output module is the system's "presenter", responsible for turning research results into a professional form.
It can generate reports, papers, or analytical charts in a standardized format according to user needs. In doing so, the system follows academic norms strictly and provides an accurate source citation for each conclusion, ensuring that the results are reliable and professional.
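The division of labor among the four modules described above can be sketched as a simple pipeline. This is a minimal illustration under stated assumptions, not OpenAI's implementation: every name here (`Source`, `discover`, `integrate`, `reason`, `write_report`, `run_pipeline`), the credibility scores, the example.org URLs, and the toy statements are made up for the example, and each stage is reduced to a trivial stand-in.

```python
from dataclasses import dataclass


@dataclass
class Source:
    url: str            # placeholder URL, not a real page
    text: str           # made-up statement
    credibility: float  # hypothetical score between 0.0 and 1.0


def discover(query: str, corpus: list[Source], min_credibility: float = 0.5) -> list[Source]:
    """Discovery: keep sources that mention the query and pass a credibility filter."""
    return [
        s for s in corpus
        if query.lower() in s.text.lower() and s.credibility >= min_credibility
    ]


def integrate(sources: list[Source]) -> list[tuple[str, str]]:
    """Integration: turn sources into (claim, url) pairs, most credible first."""
    ranked = sorted(sources, key=lambda s: s.credibility, reverse=True)
    return [(s.text, s.url) for s in ranked]


def reason(claims: list[tuple[str, str]]) -> str:
    """Reasoning: a trivial stand-in that grades support by the number of sources."""
    if not claims:
        return "no evidence found"
    return "supported by multiple sources" if len(claims) >= 2 else "supported by a single source"


def write_report(query: str, claims: list[tuple[str, str]], verdict: str) -> str:
    """Output: format the findings as a short report with numbered citations."""
    lines = [f"Question: {query}", f"Assessment: {verdict}"]
    lines += [f"[{i}] {text} ({url})" for i, (text, url) in enumerate(claims, 1)]
    return "\n".join(lines)


def run_pipeline(query: str, corpus: list[Source]) -> str:
    claims = integrate(discover(query, corpus))
    return write_report(query, claims, reason(claims))


corpus = [
    Source("https://example.org/a", "Widget sales rose last quarter.", 0.9),
    Source("https://example.org/b", "Rumor: the widget line is cancelled.", 0.2),
    Source("https://example.org/c", "Analysts expect widget demand to stay flat.", 0.7),
]
print(run_pipeline("widget", corpus))
```

In this toy run, the low-credibility rumor is filtered out at the discovery stage, and the report cites the two remaining sources in order of credibility.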
The way these modules cooperate resembles a multi-agent system. Depending on the complexity of the task, Deep Research can spend 5 to 30 minutes or even longer on in-depth research, showing its working process in the sidebar. Users can switch to other tasks in the meantime and receive a push notification once the model has finished thinking. This design lets the product make fuller use of its capabilities while preserving the user experience.
Note: The more the model browses and the more deeply it thinks about what it reads, the better it performs, which is why giving it time to think matters.
26.6% Accuracy on the HLE Test
Built on this technical foundation, Deep Research has set new highs on several public evaluations focused on real-world problems.
Note: Scores of Deep Research and other models on Humanity's Last Exam
As noted above, the model behind Deep Research reached 26.6% accuracy on the expert-level questions of Humanity's Last Exam (HLE), a new high, while DeepSeek's R1 model scored 9.4%. HLE was developed by experts from many fields as a frontier benchmark of AI's academic ability, with more than 3,000 multiple-choice and short-answer questions across more than 100 subjects.
Compared with OpenAI's o1 model, the model behind Deep Research made its largest gains in chemistry, humanities and social sciences, and mathematics. It showed a human-like approach by effectively seeking out specialized information when needed.
Note: Deep Research scores on the GAIA benchmark
On the GAIA benchmark, the model behind Deep Research reached a new state of the art (SOTA) and ranked first on the external leaderboard.
GAIA is a public benchmark designed to evaluate AI on real-world problems. It contains questions at three difficulty levels covering a wide range of practical scenarios. Completing these tasks requires reasoning, multimodal capability, web browsing, and proficiency with tools.
In the internal evaluation of expert-level tasks in multiple fields, Deep Research was rated by domain experts as being able to automate several hours of complex, manual investigation work.
Deep Research unlocks many new capabilities, but it is still at an early stage and has limitations. According to internal evaluations, it may still hallucinate facts or draw incorrect inferences in its answers, although at a notably lower rate than existing ChatGPT models.
It may also struggle to distinguish authoritative information from rumors, and its confidence calibration is weak, so it often fails to convey uncertainty accurately. At launch, reports and citations may contain minor formatting errors, and tasks may take longer to start. OpenAI expects these issues to improve quickly with more usage and time.
Pro Users Can Use It Up to 100 Times per Month
Deep Research in ChatGPT is currently very compute-intensive. The longer a query takes to research, the more inference compute it requires. For now, OpenAI has launched a version optimized for Pro users, with up to 100 queries per month.
Next, Plus and Team users will gain access, followed by enterprise users. Currently, OpenAI is still working to provide access to users in the UK, Switzerland and the European Economic Area.
All paid users will soon receive a significant increase in the rate limit of Deep Research. OpenAI plans to launch a faster and more cost-effective version in the future, which is driven by a smaller model but can still provide high-quality results.
In the next few weeks and months, OpenAI will focus on improving the technical infrastructure, closely monitoring the performance of the current version, and conducting more rigorous tests. This is in line with OpenAI's iterative deployment principle. If all safety checks continue to meet the release standards, it is expected that Deep Research will be launched to Plus users in about a month.
Deep Research is currently available on the ChatGPT web version and is planned to expand to the mobile and desktop apps within a month. For now, it can access the open web and files uploaded by users. In the future, users will be able to connect more specialized data sources, extending its reach to subscription-based or internal resources and making its output richer and more personalized.
In the longer term, the combination of Deep Research and Operator will provide users with more powerful asynchronous research and real-world execution capabilities.
Deep Research can conduct asynchronous online research, while Operator can take real-world actions. The combination of the two will enable ChatGPT to perform increasingly complex tasks.
This article is from the WeChat public account "Tencent Technology", author: Xiaojing Wuji, and 36Kr is authorized to publish it.