A team from Tsinghua University has proposed DeepDive: a new breakthrough for deep search agents.
Equipping large language models (LLMs) with browsing tools can significantly enhance their potential as deep search agents for solving complex real-world tasks.
However, open-source LLMs still perform poorly in such scenarios: their long-horizon reasoning ability with browsing tools is limited, and sufficiently challenging supervision data is scarce.
To advance deep search agents, a research team from Tsinghua University and Northeastern University proposed DeepDive. The method builds agents with long-horizon reasoning and web-browsing capabilities by combining automated data synthesis from knowledge graphs (KGs) with end-to-end multi-turn reinforcement learning (RL).
Link to the study: https://arxiv.org/abs/2509.10446
Experiments show that DeepDive-32B, trained with this method, reaches 14.8% accuracy on BrowseComp, demonstrating that test-time scaling via tool calls and parallel sampling is effective for deep search.
Figure | Left: DeepDive-32B outperforms open-source deep search models and several proprietary models on BrowseComp. Middle: DeepDive strengthens the model's deep search ability by scaling up tool calls, which improves its BrowseComp performance. Right: Multi-turn RL steadily improves DeepDive-32B across four deep search benchmarks.
Beyond the methods and data above, the team also open-sourced a semi-automated pipeline for synthesizing i.i.d. deep search QA pairs. Using this data alone, the accuracy of DeepDive-32B on BrowseComp rises to 22.2%.
Both the automatically generated knowledge-graph data and the semi-automated i.i.d. data helped open-source models of the GLM-4.5 series achieve strong results on BrowseComp.
Finally, all DeepDive datasets, models, and code have been open-sourced on GitHub.
(Repository: https://github.com/THUDM/DeepDive)
How was DeepDive developed?
Deep search agents must reason and search across hundreds of online sources to locate complex, hard-to-access information. However, a significant gap remains between open models and proprietary LLMs such as OpenAI Deep Research when acting as deep search agents.
The team attributes this gap to two factors: a shortage of sufficiently hard training data and the lack of a multi-turn RL training recipe. On the data side, most existing QA datasets contain relatively simple questions that do not reflect truly hard cases. On the training side, how to effectively combine long-horizon reasoning with deep search tool use remains an open problem, and existing search or browsing agents with integrated tools are designed mainly for straightforward lookup tasks.
DeepDive aims to improve the long-horizon information seeking of deep search agents through two technical components: data construction and RL. The team developed a strategy that automatically generates hard-to-find search questions from open knowledge graphs and applies end-to-end multi-turn RL to strengthen the long-horizon reasoning of language models during deep search.
On the data side, training data for deep search agents must go beyond the limitations of traditional multi-hop QA.
Knowledge graphs naturally provide a structured, semantically rich environment that supports multi-hop inference, making them particularly well suited for generating supervision data for deep search agents. The team addresses the lack of difficulty in existing QA datasets by automatically generating a deep search QA dataset from knowledge graphs.
Since knowledge graphs naturally support multi-hop connections and each entity carries multiple attributes, the team deliberately obscures some attributes of each entity when constructing questions, producing "obscured entities".
They then perform random walks on the knowledge graph to extract long multi-hop paths and use LLMs to further blur the key clues, making the QA pairs more challenging. This synthesis process yields data that effectively elicits the long-horizon reasoning and deep search abilities of LLMs.
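The random-walk construction can be sketched in a few lines. Everything here is illustrative: the toy graph, entity names, and question template are assumptions for the sketch, not the paper's actual pipeline (which walks large open knowledge graphs and uses an LLM to obscure attributes and clues).

```python
import random

# Toy knowledge graph as (head, relation, tail) triples -- purely
# illustrative; the real pipeline walks much larger open knowledge graphs.
TRIPLES = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "field", "mathematics"),
    ("London", "capital_of", "United Kingdom"),
    ("United Kingdom", "currency", "pound sterling"),
]

def neighbors(entity):
    """Outgoing edges of an entity."""
    return [(rel, tail) for head, rel, tail in TRIPLES if head == entity]

def random_walk(start, hops, rng):
    """Extract a multi-hop path by repeatedly following a random edge."""
    path, node = [], start
    for _ in range(hops):
        edges = neighbors(node)
        if not edges:
            break
        rel, node = rng.choice(edges)
        path.append((rel, node))
    return path

def synthesize_qa(start, hops, rng):
    """Chain the walked relations into a question; the final entity on the
    path is the gold answer. A real pipeline would additionally obscure the
    start entity's attributes and blur the clues with an LLM."""
    path = random_walk(start, hops, rng)
    if len(path) < hops:
        return None  # dead end: discard and re-sample
    relation_chain = " -> ".join(rel for rel, _ in path)
    question = (f"Starting from the entity '{start}', follow the relations "
                f"[{relation_chain}]. Which entity do you reach?")
    return question, path[-1][1]
```

With the toy graph above, a 2-hop walk from "London" deterministically follows capital_of and then currency, so the synthesized answer is "pound sterling"; walks that hit a dead end are discarded and re-sampled.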
Figure | Automated synthesis of deep search QA pairs from knowledge graphs for DeepDive: perform a random walk on the knowledge graph, automatically construct deep search QA pairs, and then obfuscate them with an LLM.
For training, the team uses end-to-end multi-turn RL to integrate reasoning with search tool use. They apply a multi-turn GRPO algorithm in which the LLM interacts with the web environment and is rewarded based on its final answers on the synthesized QA dataset.
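The core of GRPO-style training is a group-normalized advantage computed from outcome rewards, with no learned value model. A minimal sketch, assuming a binary final-answer reward; the answer strings and group size below are invented for illustration:

```python
from statistics import mean, pstdev

def outcome_reward(final_answer, gold_answer):
    """Binary outcome reward on the final answer only (sketch; DeepDive
    additionally zeroes out trajectories containing format errors)."""
    return 1.0 if final_answer.strip().lower() == gold_answer.strip().lower() else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: each rollout's
    reward is standardized against the other rollouts sampled for the
    same question, so no separate value model is needed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled trajectories for one synthesized question (gold: "Paris")
finals = ["Paris", "London", "Paris", "Berlin"]
rewards = [outcome_reward(a, "Paris") for a in finals]
advantages = grpo_advantages(rewards)
# rewards == [1.0, 0.0, 1.0, 0.0]; correct rollouts receive positive
# advantage, incorrect ones negative, which drives the policy update
```

The normalization means a question where every rollout fails (or every rollout succeeds) contributes near-zero advantage, which is why sufficiently hard synthesized questions matter for the RL phase.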
Experiments show that the RL-trained model uses tools more efficiently at inference time than baseline methods, demonstrating that tool calls scale at test time and thereby effectively improve long-horizon reasoning and deep search ability.
Figure | Overview of multi-turn RL for training DeepDive's reasoning and deep search abilities.
To further improve rollout efficiency and guarantee the quality of positive examples, they introduce an early-exit mechanism: if the model makes a format error at any step, trajectory generation terminates immediately and a reward of 0 is assigned. This ensures that all positively rewarded trajectories are free of format errors, significantly improving the robustness of multi-turn tool use.
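The early-exit rule amounts to a guard inside the rollout loop. A minimal sketch, where `policy_step`, `env_step`, and the action schema are hypothetical stand-ins for the LLM, the web environment, and DeepDive's actual action format:

```python
def is_well_formed(action):
    """Minimal format check (illustrative): an action must be a dict of a
    known type carrying its required field."""
    if not isinstance(action, dict):
        return False
    if action.get("type") == "search":
        return "query" in action
    if action.get("type") == "answer":
        return "final_answer" in action
    return False

def rollout_with_early_exit(policy_step, env_step, question, max_turns=32):
    """Sketch of the early-exit rule: terminate on the first malformed
    action with reward 0, so every positively rewarded trajectory is
    guaranteed format-error free. `policy_step` and `env_step` are
    hypothetical callables standing in for the LLM and web environment."""
    trajectory = []
    for _ in range(max_turns):
        action = policy_step(question, trajectory)
        if not is_well_formed(action):
            return trajectory, 0.0           # early exit: format error
        trajectory.append(action)
        if action["type"] == "answer":
            return trajectory, None          # reward decided by answer check
        trajectory.append(env_step(action))  # observation from the web env
    return trajectory, 0.0                   # turn budget exhausted
```

Cutting a rollout at the first malformed step also saves web-environment calls, which is where the efficiency gain comes from.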
How good is the performance?
The team evaluates DeepDive on four challenging public deep search benchmarks, including BrowseComp and BrowseComp-ZH, and compares it against a range of models. The results are as follows:
Table | Evaluation results on deep search QA benchmarks. Accuracy (%) is reported. * denotes performance reported by existing studies. † denotes that browsing is implemented via function calls.
Figure | Training reward (a), evaluation accuracy on BrowseComp-266 (b), and the average number of tool calls during training and evaluation (c), showing how RL gradually develops deeper search strategies.
Figure | Generalization of DeepDive to simpler search benchmarks. † denotes that browsing is implemented via function calls.
These results show that challenging supervision and multi-turn RL together lay the foundation for tool use: model performance improves as the tool-call budget and the number of parallel samples grow, and the abilities learned on hard problems transfer to simpler scenarios.
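Parallel sampling at test time can be as simple as running the agent several times and voting over the final answers. A minimal sketch, where `run_agent` is a hypothetical callable and simple majority voting is an assumption (DeepDive's actual answer-selection strategy may differ):

```python
from collections import Counter

def vote_answer(run_agent, question, n=8):
    """Test-time scaling sketch: sample n independent agent runs (shown
    sequentially here for simplicity) and take a majority vote over the
    normalized final answers."""
    answers = [run_agent(question) for _ in range(n)]
    tally = Counter(a.strip().lower() for a in answers)
    best, _count = tally.most_common(1)[0]
    return best
```

The intuition matches the reported scaling behavior: if each run is right more often than any single wrong answer is produced, aggregating over more samples pushes accuracy up.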
Limitations and future directions
DeepDive is, of course, not perfect and still has limitations.
For example, the QA data produced by the two synthesis methods is still less difficult than benchmarks such as BrowseComp, which partly explains why DeepDive-32B performs well below advanced browsing-capable models like o3 on BrowseComp.
In addition, training mainly on challenging data leads DeepDive-32B to exhibit "over-searching". Determining the optimal number of training steps and designing a reward mechanism better suited to the RL phase are therefore important directions for future research.
This article is from the WeChat account "Academic Headlines" (ID: SciTouTiao), author: Xiaoyu; republished by 36Kr with permission.