Tsinghua team proposes DeepDive: New breakthrough in deep search agents
Equipping large language models (LLMs) with browsing tools can significantly enhance their potential as deep search agents for solving complex real-world tasks.
However, because of their limited long-horizon reasoning with browsing tools and the lack of sufficiently challenging supervised data, open-source LLMs still perform poorly in such scenarios.
To advance the development of deep search agents, a research team from Tsinghua University and Northeastern University proposed DeepDive, a method that builds agents with complex long-horizon reasoning and web browsing capabilities by combining automated data synthesis from knowledge graphs (KGs) with end-to-end multi-turn reinforcement learning (RL).
Paper link: https://arxiv.org/abs/2509.10446
Experiments show that DeepDive-32B, trained with this method, reaches 14.8% accuracy on BrowseComp, demonstrating that test-time scaling of tool calls and parallel sampling is effective for deep search.
Figure | Left: DeepDive-32B outperforms open-source deep search models and proprietary models on BrowseComp; Middle: DeepDive strengthens the model's deep search ability by maximizing tool calls, improving its performance on BrowseComp; Right: Multi-turn RL continuously improves DeepDive-32B's performance on four deep search benchmarks.
In addition to the above method and data, the research team also open-sourced a supplementary study on semi-automated, independent and identically distributed (i.i.d.) deep search question-answering synthesis. Using only the data from that study, DeepDive-32B's accuracy on BrowseComp improves further to 22.2%.
Notably, the automatically generated knowledge-graph data and the semi-automated i.i.d. data have also helped the open-source GLM-4.5 model series achieve excellent results on BrowseComp.
Finally, all DeepDive datasets, models, and code have been open-sourced on GitHub: https://github.com/THUDM/DeepDive
How was DeepDive developed?
Deep search agents need to reason and retrieve across hundreds of online sources to locate complex, hard-to-find information. However, a significant gap remains between open-source models and proprietary LLMs such as OpenAI DeepResearch in this area.
The research team attributes this gap to the lack of accessible data resources and the absence of a multi-turn RL training mechanism. On the data side, most existing question-answering datasets contain relatively simple questions that fail to capture genuinely hard cases. On the training side, how to effectively combine long-horizon reasoning with the use of deep search tools remains an open problem. Moreover, existing search or browsing agents with integrated browsing tools are designed mainly for straightforward search tasks.
DeepDive aims to strengthen the long-horizon information-seeking ability of deep search agents, and it makes progress through two technical components: data construction and RL. The team developed a strategy for automatically generating hard-to-find questions from open knowledge graphs and used end-to-end multi-turn RL to enhance the language model's long-horizon reasoning with deep search.
On the data side, training a deep search agent requires data that goes beyond conventional multi-hop question answering.
Knowledge graphs provide a structured, semantically rich environment that naturally supports multi-hop reasoning, making them particularly well suited to generating the supervision needed to train deep search agents. The team addressed the lack of difficulty in existing question-answering datasets by automatically generating deep search QA data from knowledge graphs.
Because knowledge graphs naturally support multi-hop connections and each entity carries distinct attributes, the team deliberately blurred some attributes of each entity when constructing questions, creating a form of "fuzzy entity."
They then performed random walks on the knowledge graph to extract long multi-hop paths and used LLMs to further obfuscate key clues, making the question-answer pairs more challenging. Data produced by this synthesis pipeline effectively elicits the long-horizon reasoning and deep search abilities of LLMs.
Figure | Automated question-answering data synthesis from knowledge graphs for DeepDive. Deep search question-answer pairs are constructed automatically by performing random walks on the knowledge graph and then obfuscating them with LLMs.
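To make the construction more concrete, here is a minimal Python sketch of this style of KG-based synthesis. The toy graph, the attribute-based blurring, and the question template are illustrative placeholders only; in DeepDive the obfuscation is performed by an LLM rather than string templates, and the real graph and prompts are far richer.

```python
import random

# Toy knowledge graph: entity -> list of (relation, neighbor) edges, plus
# per-entity attributes. Structure and names are illustrative, not the
# paper's actual data format.
KG_EDGES = {
    "Marie Curie": [("born_in", "Warsaw"), ("awarded", "Nobel Prize in Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Nobel Prize in Physics": [("first_awarded", "1901")],
}
KG_ATTRS = {
    "Marie Curie": {"occupation": "physicist", "birth_year": "1867"},
    "Warsaw": {"type": "city"},
}

def random_walk(start, hops):
    """Sample a multi-hop path (entity, relation, entity, ...) from the KG."""
    path, node = [start], start
    for _ in range(hops):
        edges = KG_EDGES.get(node)
        if not edges:
            break
        relation, node = random.choice(edges)
        path += [relation, node]
    return path

def blur_entity(entity):
    """Replace an entity mention with a vague description built from a subset
    of its attributes (a "fuzzy entity"). In DeepDive this step is done by an
    LLM; the template here is only a stand-in."""
    attrs = KG_ATTRS.get(entity, {})
    if not attrs:
        return "a certain entity"
    key, value = random.choice(list(attrs.items()))
    return f"an entity whose {key} is {value}"

def synthesize_question(path):
    """Turn a walk into a question about the final entity, hiding the start."""
    start, answer = path[0], path[-1]
    relations = path[1::2]
    question = (f"Starting from {blur_entity(start)}, follow the relations "
                f"{' -> '.join(relations)}. What entity do you reach?")
    return {"question": question, "answer": answer}

print(synthesize_question(random_walk("Marie Curie", hops=2)))
```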
For training, the team adopted end-to-end multi-turn RL to integrate reasoning with the use of search tools. They apply a multi-turn GRPO algorithm in which the LLM interacts with the web environment and receives rewards based on its final answers to the constructed question-answering dataset.
Experiments show that the RL-trained model uses tools more efficiently at inference time than the baselines, demonstrating that tool calling scales at test time and thereby strengthening long-horizon reasoning and deep search ability.
Figure | Overview of multi-turn RL for training DeepDive's reasoning and deep search abilities.
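For readers unfamiliar with GRPO, its core idea is a group-relative advantage: each trajectory's reward is normalized against the other rollouts sampled for the same question. The sketch below shows only that normalization and omits the clipped policy ratio and KL penalty that appear in the full objective.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used by GRPO: normalize each rollout's
    reward by the mean and standard deviation of its group. A minimal sketch;
    the full GRPO loss (clipping, KL regularization) is not shown here."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: 4 rollouts for one question, only one of which found the answer.
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```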
To further improve rollout efficiency and guarantee the validity of positive samples, the team also introduced an early-termination mechanism: if the model makes a formatting error at any step, trajectory generation is stopped immediately and a reward of 0 is assigned. This ensures that every trajectory receiving a positive reward is error-free and fully reliable, significantly improving the robustness of multi-turn tool use.
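Put together, the trajectory-level reward described above can be sketched as follows. The trace format and the string-matching judge are simplified assumptions; the actual format checks and answer judging in DeepDive are richer.

```python
def trajectory_reward(steps, gold_answer,
                      judge=lambda pred, gold: pred.strip().lower() == gold.strip().lower()):
    """Score one multi-turn trajectory under the early-termination rule: the
    first malformed step ends the rollout with reward 0, and only a trajectory
    that finishes with a correct final answer earns reward 1. `steps` is a
    simplified trace; DeepDive's real checks and judge are more involved."""
    for step in steps:
        if not step.get("format_ok", False):
            return 0.0                      # format error -> terminate, reward 0
        if step.get("type") == "answer":
            return 1.0 if judge(step["content"], gold_answer) else 0.0
    return 0.0                              # ran out of turns without answering

# Example trace: two valid tool calls followed by a correct final answer.
trace = [
    {"type": "tool_call", "format_ok": True},
    {"type": "tool_call", "format_ok": True},
    {"type": "answer",    "format_ok": True, "content": "Poland"},
]
print(trajectory_reward(trace, gold_answer="Poland"))   # -> 1.0
```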
How does it perform?
The research team evaluated DeepDive on four public, challenging deep search benchmarks, including BrowseComp and BrowseComp-ZH, and compared it against a wide range of models. The results are as follows:
Table | Evaluation results on deep search question-answering benchmarks, reported as accuracy (%). * indicates performance reported by existing studies. † indicates that browsing is implemented via function calls.
Figure | Training reward (a), evaluation accuracy on BrowseComp-266 (b), and the average number of tool calls during training and evaluation (c), showing how RL gradually cultivates deeper search strategies.
Figure | Generalization of DeepDive to simple search benchmarks. † indicates that browsing is implemented via function calls.
These results show that challenging supervision and multi-turn RL together lay the foundation for tool use: performance improves as the tool-call budget and the number of parallel samples increase, and skills learned on hard problems transfer to simpler scenarios.
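As an illustration of the parallel-sampling side of this test-time scaling, the sketch below aggregates answers from several independently sampled deep-search trajectories by simple majority voting; the paper's exact selection strategy may differ (for example, confidence-weighted selection).

```python
from collections import Counter

def aggregate_parallel_samples(answers):
    """Aggregate final answers from k independently sampled trajectories by
    majority vote. Voting is shown as one plausible aggregation, not
    necessarily the scheme used in DeepDive."""
    votes = Counter(a.strip().lower() for a in answers if a)
    if not votes:
        return None
    answer, _ = votes.most_common(1)[0]
    return answer

# Example: 4 parallel samples, 3 of which agree.
print(aggregate_parallel_samples(["Poland", "poland", "Warsaw", "Poland"]))
```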
Limitations and future directions
Of course, DeepDive is not perfect and still has some limitations.
For example, the data produced by the two challenging deep search question-answering synthesis methods is still easier than datasets such as BrowseComp, which partly explains why DeepDive-32B's performance on BrowseComp remains well below that of advanced browsing-enabled models such as o3.
In addition, training mainly on high-difficulty data causes DeepDive-32B to "over-search." Determining the optimal number of training steps and designing a more appropriate reward mechanism for the RL stage will therefore be important directions for future exploration.
This article is from the WeChat official account "Academic Headlines" (ID: SciTouTiao), author: Xiaoyu, published by 36Kr with permission.